USC Computer Science Technical Reports, no. 959 (2015)
Efficient Algorithms for Answering Reverse Spatial-Keyword Nearest Neighbor Queries

Ying Lu†, Gao Cong‡, Jiaheng Lu♯, Cyrus Shahabi†
† Integrated Media Systems Center, University of Southern California, Los Angeles, CA 90089
‡ School of Computer Engineering, Nanyang Technological University, Singapore, 639798
♯ School of Information, Renmin University of China, Beijing, 10087
† {ylu720, shahabi}@usc.edu  ‡ {gaocong}@ntu.edu.sg  ♯ {jiahenglu}@ruc.edu.cn

ABSTRACT

With the proliferation of local services and GPS-enabled mobile phones, reverse spatial-keyword nearest neighbor queries are becoming an important type of query. Given a service object (e.g., a shop) q as the query, which has a location and a text description, we return the customers for which q is one of the top-k spatial-keyword relevant service objects.

The existing algorithms for answering reverse nearest neighbor queries cannot be used for processing reverse spatial-keyword nearest neighbor queries due to the additional text information. To design efficient algorithms, for the first time we theoretically analyze an ideal case, which minimizes the object/index-node accesses, for processing reverse spatial-keyword nearest neighbor queries. Under the derived theoretical guidelines, we design novel search algorithms for efficiently answering the queries. Empirical studies show that the proposed algorithms offer scalability and are orders of magnitude faster than existing methods for reverse spatial-keyword nearest neighbor queries.

Keywords

Reverse k nearest neighbor, Spatial-keyword query.

1. INTRODUCTION

The Internet is acquiring a spatial dimension, with content (e.g., points of interest and Web pages) increasingly being geo-positioned and accessed by mobile users. Therefore, the reverse spatial-keyword nearest neighbor query [11], which considers the fusion of spatial information and textual description, is becoming an important type of query in the local services of search engines (e.g., Google Maps) and many other websites (e.g., travel planning websites).
Reverse spatial-keyword nearest neighbor queries come in two flavors: Bichromatic Reverse Spatial-Keyword nearest neighbor (BRSKkNN) queries and Monochromatic Reverse Spatial-Keyword nearest neighbor (MRSKkNN) queries. BRSKkNN queries involve two types of objects (e.g., customers and shops), while MRSKkNN queries involve one type of objects (e.g., shops).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ... $15.00.

Figure 1: Examples of BRSKkNN and MRSKkNN queries. (a) Distribution of customers (c1, ..., c4) and shops (s1, ..., s4, q); (b) preferences of the customers in (a); (c) locations and text descriptions of the shops in (a).

(b) Preferences of customers in (a):

  Customer   x    y     Specified Keywords
  c1         4    1     (laptop, 1)
  c2         3    4     (camera, 1)
  c3         10   5.5   (laptop, 1)
  c4         7    6     (sportswear, 1)

(c) Locations and text descriptions of shops in (a):

  Shop   x   y   Textual Description
  s1     6   0   (laptop, 8), (stationery, 7)
  s2     8   4   (laptop, 4), (stationery, 8)
  s3     2   8   (camera, 8), (sportswear, 8)
  s4     6   8   (laptop, 12), (camera, 4)
  q      9   7   (laptop, 1), (camera, 1), (sportswear, 8)

Next, we take the BRSKkNN query as an example to explain reverse spatial-keyword nearest neighbor queries. Let C and S be the customer set and the service set, respectively. Each customer object in C has a location and a set of keywords representing the preference of the customer; each service object in S has a location and a textual description. Given a service object q as the BRSKkNN query, the result is the set of customers in C that have the query object q as one of their top-k most spatial-keyword relevant objects among the objects in the service set S. Here, the spatial-keyword relevance is measured by both the spatial proximity to the query location and the text relevance to the query keywords. We proceed to illustrate the BRSKkNN query with an example.
Figure 1 displays the spatial layout and textual descriptions for a set S of shops and a set C of customers. Points c1, ..., c4 (in C) shown in Fig. 1(a) represent customers whose locations and keyword preferences are given in Fig. 1(b), and points s1, ..., s4, q (in S) represent shops with locations and texts given in Fig. 1(c), where the number following a word is its weight, intuitively representing the relevance of the keyword to the shop. Given the shop q as the query object, and k = 1, the result of the traditional Bichromatic Reverse kNN query (without considering the textual component, e.g., [3,21]) is {c3}, while with the additional text information, the result of the BRSKkNN query is {c4}. Note that 1) c4 becomes a result of the BRSKkNN query q since q is the most spatial-keyword relevant shop for c4: the textual relevance of c4 to the textual description of query shop q is high, though q is not the nearest neighbor of c4 in terms of spatial distance alone; 2) with keyword preferences, c3 is not a result of the BRSKkNN query q due to the low relevance of the keyword component of c3 to the textual description of shop q (the most spatial-keyword relevant shop for c3 is s4).

In addition to the traditional applications of reverse k nearest neighbor queries, such as service location planning and profile-based marketing, BRSKkNN queries have further interesting applications. For example, a service provider may want to compose a textual description for its presence in a local service (or choose both its location and its text description) such that it appears in the results of as many top-k spatial-keyword queries as possible, so that customers are more likely to be interested in it than in its competitors.

The core problem in efficiently answering RkNN queries is: which objects or index nodes should be visited, and in what order, to minimize the number of index-node/object accesses and thus the I/O cost? None of the existing studies effectively investigates this key problem.
The algorithm of [11] is, to our knowledge, the state-of-the-art solution for processing the MRSKkNN query (and can be extended to process the BRSKkNN query); it prioritizes traversing the top-k spatial-keyword relevant service objects of customer objects. Similarly, the algorithm of [2] for answering RkNN queries without text prioritizes visiting the top-k nearest neighbors of customers. Both assume that visiting the union of these top-k objects will minimize the I/O cost. However, if we consider the problem globally (rather than focusing on a single node E_c), i.e., finding all the result nodes and pruning all the non-result nodes, it may suffice to visit a much smaller set of service objects than the set of customer nodes visited by the algorithm of [11]. We illustrate this with the earlier example. The method of [11] needs to visit two service objects, s1 and s3 (the most relevant service objects of c1 and c2, respectively), to prune both customers c1 and c2. However, we find that we can prune both customers c1 and c2 by visiting only s4, since s4 is more relevant than q for both c1 and c2, though s4 is not their most relevant service object.

To this end, we analyze an ideal case that aims to minimize the index-node accesses for processing the BRSKkNN query. We derive practical guidelines for the following questions, which are crucial for the performance of algorithms answering the BRSKkNN query: Which customer (resp. service) index nodes or objects must be visited? Which customer (resp. service) index nodes should have higher priority and be traversed first so that we can avoid visiting many other nodes? Under the guidelines, we design an efficient solution for answering the BRSKkNN query.

The contributions of this paper are summarized as follows.

1. We analyze the problem of answering the BRSKkNN query and define an ideal case for it such that the index-node accesses are minimized (Section 4). We are not aware of any similar analysis for BRSKkNN problems.

2. Based on the analysis, we develop practical guidelines for designing algorithms for BRSKkNN.
Under the guidelines, we design an efficient search algorithm that exploits a novel search strategy: it visits the must-be-visited nodes, and it prioritizes the visiting order of the other nodes, aiming to reduce node accesses.

3. We also develop a new method for estimating bounds on the spatial-keyword relevance between nodes, which is an essential component of the proposed solution.

4. Results of empirical studies demonstrate the scalability and efficiency of the proposed algorithms: 1) the proposed algorithm for BRSKkNN outperforms the baseline algorithm [11] by 1-2 orders of magnitude, and 2) our algorithm for MRSKkNN outperforms the algorithm of [11] by an order of magnitude.

2. PROBLEM DEFINITION

2.1 RSKkNN queries

Let C be the customer database in which each customer c in C is represented by a pair (c.p, c.kw), where c.p is a point location (represented by longitude and latitude information) and c.kw is a set of keywords. Similarly, let S be the service object database. Each service object s in S is defined as a pair (s.p, s.doc), where s.p is a point location and s.doc is a textual description (e.g., the list of services offered by s).

Before we define the BRSKkNN query and the MRSKkNN query, we first give the definition of the spatial-keyword relevance [4,11-13] in Eqn (1), which is used to measure the preference of customer c for shop s in terms of both spatial proximity and textual relevance:

  D_ST(c,s) = α(1 − Dist(c.p, s.p)/maxD) + (1 − α) · Rel(s.doc | c.kw)/maxR,   (1)

where the parameter α ∈ [0,1] adjusts the relative importance of spatial proximity and textual relevance at query time. The normalization constants maxD and maxR denote the maximum spatial distance and the maximum textual relevance between a customer in C and a shop in S, respectively. Dist(c.p, s.p) is the Euclidean distance between locations c.p and s.p. The text relevance Rel(s.doc | c.kw) of customer c to service object s is computed by an information retrieval model; we use the Okapi BM25 model [14], a popular information retrieval model, detailed in Sec. 4.2.3.
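To make Eqn (1) concrete, the following sketch computes D_ST for objects given as coordinate pairs and term-weight dictionaries. The function name `d_st` and the argument layout are ours, and the dot-product `rel` is a simplified stand-in for the Okapi BM25 relevance used in the paper; only the structure of Eqn (1) is taken from the text.

```python
import math

def d_st(c_p, c_kw, s_p, s_doc, alpha, max_d, max_r):
    """Spatial-keyword relevance D_ST(c, s), following Eqn (1).

    c_kw: dict term -> weight (customer preference).
    s_doc: dict term -> weight (shop text description).
    NOTE: the dot product below is a simplified stand-in for BM25.
    """
    dist = math.hypot(c_p[0] - s_p[0], c_p[1] - s_p[1])  # Euclidean Dist(c.p, s.p)
    rel = sum(w * s_doc.get(t, 0.0) for t, w in c_kw.items())  # Rel(s.doc | c.kw)
    return alpha * (1 - dist / max_d) + (1 - alpha) * rel / max_r
```

Larger D_ST means the shop is preferred: the spatial term rewards proximity, the text term rewards keyword overlap, and α trades the two off.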
Given a service object q as the query and a positive integer k, the Bichromatic Reverse Spatial-Keyword Nearest Neighbor (BRSKkNN) query is defined as finding all the customers in C that have q as one of their top-k most spatial-keyword relevant objects; formally:

  BRSKkNN(q, k, C, S) = {c ∈ C | q ∈ TopkSK(c, k, S)}   (2)

The Monochromatic Reverse Spatial-Keyword Nearest Neighbor (MRSKkNN) query is defined on one type of objects. Suppose w.l.o.g. that MRSKkNN is defined on the service objects S. Intuitively, the MRSKkNN query q finds the set of service objects that have q as one of their top-k spatial-keyword relevant service objects. Formally, given a service object q, a MRSKkNN query is defined as:

  MRSKkNN(q, k, S) = {s ∈ S | q ∈ TopkSK(s, k, S)}   (3)

Next, we illustrate the BRkNN query and the BRSKkNN query with examples. Fig. 1 displays the spatial layout and textual descriptions for a set S of shops and a set C of customers. Points c1, ..., c4 (in C) shown in Fig. 1(a) represent customers whose locations and keyword preferences are given in Fig. 1(b), and points s1, ..., s4, q (in S) represent shops with locations and texts given in Fig. 1(c), where the number following a word is its weight calculated using the TF-IDF measure, intuitively representing the relevance of the keyword to the shop. Given the shop q as the query object, k = 1 and α = 0.5, the result of the traditional BRkNN query is {c3}. However, with the additional text information, the result of the BRSKkNN query is {c4}. Note that 1) customer c4 becomes a result of the BRSKkNN query q since q is the most spatial-keyword relevant shop for c4: the textual relevance of c4 to the textual description of query shop q is high, though q is not the nearest neighbor of c4 in terms of spatial distance alone; 2) with keyword preferences, c3 is not a result of the BRSKkNN query q due to the low relevance of the keyword component of c3 to the textual description of shop q (the most spatial-keyword relevant shop for c3 is s2).
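Eqn (2) can be evaluated by exhaustive search, which is what the proposed index-based algorithms are designed to avoid. The sketch below does exactly that over the Figure 1 data. The function names, the dot-product relevance (standing in for BM25), and the normalization constants maxD = maxR = 20 are our assumptions; under them the brute force reproduces the example's BRSKkNN answer {c4} for k = 1, α = 0.5.

```python
import math

def top_k_sk(c, k, services, alpha, max_d, max_r):
    """TopkSK(c, k, S): the k service objects most spatial-keyword relevant to c."""
    def score(s):
        dist = math.hypot(c["p"][0] - s["p"][0], c["p"][1] - s["p"][1])
        # Simplified dot-product stand-in for the paper's BM25 relevance.
        rel = sum(w * s["doc"].get(t, 0.0) for t, w in c["kw"].items())
        return alpha * (1 - dist / max_d) + (1 - alpha) * rel / max_r
    return sorted(services, key=score, reverse=True)[:k]

def brskknn(q, k, customers, services, alpha=0.5, max_d=20.0, max_r=20.0):
    """Exhaustive BRSKkNN(q, k, C, S) per Eqn (2): keep each customer whose
    top-k service list contains the query object q itself."""
    return [c for c in customers
            if any(s is q for s in top_k_sk(c, k, services, alpha, max_d, max_r))]
```

Each customer costs a full scan of S, i.e., O(|C| · |S|) relevance evaluations overall, which motivates the pruning guidelines developed in Section 4.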
3. RELATED WORK AND BASELINES

3.1 RkNN queries

Both Bichromatic and Monochromatic RkNN queries have been extensively studied as important database queries [2,5,7,10,15,16,20,21]. Existing methods for answering RkNN queries can be divided into three categories. The first category [3,10,15,16,21] is based on the filter-refinement paradigm and generally follows two steps: a set of candidate objects is computed in the filter step, and false positives are then eliminated from the candidates to find the final results in the refinement step. The second category [5,7] is based on pre-processing: the kNN of each object in the database is precomputed, and at query time, for each object, the precomputed k-th NN is compared with the query object q to determine whether the object is a result. These methods do not support a dynamic parameter k. The third category of algorithms [2,17] answers the query in a branch-and-bound fashion by computing kNN lower and upper bounds for each index node/object E to decide whether E is a result; these algorithms visit the nodes/objects near E to compute the bounds. The existing solutions developed for RkNN queries without text cannot be used to process RSKkNN queries because they exploit spatial geometric properties to prune the search space without considering the textual information.

3.2 MRSKkNN and BRSKkNN queries

Lu et al. [11] proposed a branch-and-bound algorithm for processing the MRSKkNN query, which we use as the baseline algorithm for the MRSKkNN query, denoted by MoBase. For each index node/object E, MoBase computes lower and upper bounds on the spatial-keyword similarity between E and its k-th most similar object to decide whether E can be pruned. MoBase visits the nodes/objects near E to compute the bounds. MoBase can be extended to process the BRSKkNN query.

Before discussing the extension, we briefly review the IR-tree [4], which is used as the spatial-keyword index in both the baseline methods and our algorithms for MRSKkNN and BRSKkNN queries.
3.2.1 IR-tree

The IR-tree is essentially an R-tree [8] extended with inverted files. Each leaf node in the IR-tree contains entries of the form (o, o.λ), where o refers to an object in the dataset and o.λ is the bounding rectangle of o. Each non-leaf node R in the IR-tree contains a number of entries of the form (cp, mbr), where cp points to a child node of R and mbr is the minimum bounding rectangle (MBR) of all rectangles in the entries of the child node pointed to by cp.

Figure 2: IR-trees of the objects in Fig. 1. (a) The service IR-tree: root SRoot with leaf entries E_s1 = {s1, s2} and E_s2 = {s3, s4}, each node with an attached inverted file. (b) The customer IR-tree: root CRoot with leaf entries E_c1 = {c1, c2} and E_c2 = {c3, c4}, each node with an attached inverted file.

Table 1: The inverted files of the extended IR-tree in Fig. 2(a)

  Term        SRoot                        E_s1                   E_s2
  laptop      ⟨E_s1,8,4⟩, ⟨E_s2,12,12⟩     ⟨s1,8,8⟩, ⟨s2,4,4⟩     ⟨s4,12,12⟩
  stationery  ⟨E_s1,8,7⟩                   ⟨s1,7,7⟩, ⟨s2,8,8⟩
  sportswear  ⟨E_s2,8,8⟩                                          ⟨s3,8,8⟩
  camera      ⟨E_s2,8,4⟩                                          ⟨s3,8,8⟩, ⟨s4,4,4⟩

Each node contains a pointer to an inverted file that describes the objects in the subtree rooted at the node. The inverted file for a node X contains a set of posting lists, each of which relates to a term t. Each posting list is a sequence of pairs ⟨cp, maxt_i⟩, where cp is a child of X and maxt_i is the maximum weight of term t_i among the objects in the subtree rooted at cp. The weight enables the derivation of an upper bound on the text relevance to a query of any object contained in the subtree rooted at cp [4].

We extend the IR-tree in two ways to accommodate the needs of the proposed algorithms. First, we extend each entry of the IR-tree with an additional element, the number of objects under the entry. Fig. 2 illustrates the extended IR-trees for the service objects and customers in Fig. 1. In Fig. 2(a), at entry E_s1 (resp. E_s2) of node SRoot, the number of objects is 2 (resp. 2). Second, for each pair ⟨cp, maxt_i⟩, we append the smallest weight of term t_i among the objects in the subtree rooted at cp, denoted mint_i.
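The per-term maximum and minimum weights stored in an extended IR-tree entry can be computed bottom-up from the documents in the entry's subtree. A minimal sketch (the function name is ours; the min is taken over the objects that actually contain the term, as in Table 1):

```python
def entry_term_bounds(docs):
    """Per-term (max, min) weights for one extended IR-tree entry.

    docs: list of dicts term -> weight, one per object in the subtree.
    Returns (max_weights, min_weights), i.e., the maxt_i and mint_i
    values appended to each posting-list pair.
    """
    max_w, min_w = {}, {}
    for doc in docs:
        for term, w in doc.items():
            # A term absent from a document simply does not constrain min_w,
            # matching the ⟨E_s1, 8, 4⟩ entry for "laptop" in Table 1.
            max_w[term] = max(max_w.get(term, w), w)
            min_w[term] = min(min_w.get(term, w), w)
    return max_w, min_w
```

Applied to E_s1 = {s1, s2}, this yields the max/min pairs of Table 1's SRoot row, i.e., the union and intersection vectors discussed next.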
As a specific example in Table 1, the maximum weight of the term laptop in entry E_s1 of node SRoot is 8, which is the maximal weight of the term over the two documents (s1 and s2) in node E_s1 (a leaf node); the minimum weight of laptop in entry E_s1 is 4, which is the minimum weight of laptop in node E_s1.

For ease of presentation, at an entry cp we define two text vectors, a union vector and an intersection vector, in which each element corresponds to a distinct term. The weight of term t_i in the union vector is maxt_i, and the weight of t_i in the intersection vector is mint_i. For example, at entry E_s1 of node SRoot in Fig. 2(a), the union vector is ⟨laptop: 8, stationery: 8⟩ and the intersection vector is ⟨laptop: 4, stationery: 7⟩.

3.2.2 Baseline method for BRSKkNN (BiBase)

We use two extended IR-trees to organize the two types of objects, customer and service objects. For the baseline method for the BRSKkNN query, denoted by BiBase, we extend the MoBase algorithm. The BiBase algorithm visits the top-k spatial-keyword relevant service objects of each customer entry E_c to determine whether E_c contains result customers. Specifically, we traverse the customer index and the service index simultaneously, starting from their roots. For each visited customer entry E_c, we estimate the relevance between E_c and all the service entries traversed so far, and use the estimates to update the lower and upper bounds of the spatial-keyword relevance between E_c and its k-th most relevant service object. The bounds are used to decide whether E_c can be pruned or contains results: i) E_c is pruned if the maximum relevance between E_c and the query service object q is smaller than its lower bound; ii) all the customers in E_c are reported as results if the minimum relevance between E_c and q is larger than its upper bound.

Remark. The BiBase algorithm is based on the assumption that the top-k spatial-keyword relevant service objects of each customer entry E_c are discriminative in deciding whether E_c contains results or can be pruned.
However, if we consider the problem globally, rather than focusing on the top-k most relevant objects of each customer entry, it may suffice to visit a much smaller set of service objects than the set visited by MoBase. For the example of Fig. 1, MoBase needs to visit two service objects, s1 and s3 (the most relevant service objects of c1 and c2, respectively), to prune both c1 and c2. However, we can prune both c1 and c2 by visiting only s4, since s4 is more relevant than q for both c1 and c2, though s4 is not their most relevant service object.

4. GUIDE-BASED ALGORITHM

The ideal cases for MRSKkNN and BRSKkNN queries are similar and are straightforward to derive from each other. Without loss of generality, we analyze an ideal case for BRSKkNN queries in Sec. 4.1, present practical guidelines derived from the ideal case, and design efficient guide-based algorithms, denoted by BiGuide, for answering BRSKkNN queries in Sec. 4.2. Finally, we discuss the extension, denoted by MoGuide, for answering MRSKkNN queries in Sec. 4.3.

4.1 Ideal-case analysis

Many algorithms have been developed for answering RkNN queries [3,7,16,21] (see Sec. 3). They focus on reducing the accesses to index nodes, thus reducing the I/O cost and computational cost, which is the key problem for algorithms processing RkNN queries. However, none of them tries to analyze an optimal solution in which the accesses to index nodes are minimized.

Next, we first define an ideal case for processing BRSKkNN queries. The ideal-case analysis will identify which entries must be visited and which entries should be given priority to be visited first to reduce the cost. We assume that the customers and service objects are organized in two separate extended IR-trees. Note that the analysis and the proposed algorithms remain applicable if other indexes (e.g., [9,11]) are in place.

Given a BRSKkNN query q, the minimum set of (customer and service) objects and index nodes that need to be visited to answer q is called the ideal search region (ISR) of q.
A search algorithm that only visits the objects and index nodes in the ISR of q is called an ideal case for processing query q.

Let S be the service object database and C the customer database. Let C_r be the set of customers that are results of q, i.e., for any customer c ∈ C_r, the query service object q is the (t+1)-th (t < k) nearest object of c. Let C_p be the set of customers that are not results of q, i.e., for any customer c ∈ C_p, the query object q is NOT one of the top-k nearest objects of c. Next, we first define the ISR for the BRSKkNN query when the objects are not indexed, and then we analyze the case in the presence of index trees.

DEFINITION 1. Given a service object s and a customer object c, we say that s contributes to c (or c is contributed by s) iff the spatial-keyword relevance between s and c is the same as or greater than the relevance between the query service object q and c, i.e., D_ST(c,s) ≥ D_ST(c,q).

In the absence of an index, all the customers must be visited, because we need to access every customer c ∈ C to determine whether c is a result. The analysis for the service objects is more complicated. Two sets of service objects must be visited:

1) S_1 = ∪_{c∈C_r} ToptSK(c,t,S): the set of top-t (t < k, where t is the number of service objects that contribute to c) service objects of every customer in the result set C_r. A customer c can be confirmed to be a result iff fewer than k service objects contribute to c; thus the set of top-t (t < k) nearest service objects of a result c in C_r must be visited in order to identify c as a result.

2) S_2: the minimum contributing set (MCS). To determine that the non-result customers C_p are not results, we ideally visit a minimum set of service objects. Recall that in order to know that a customer c is not a result, we need to find at least k service objects that contribute to c. The service objects in S_1 may also contribute to some customers in the non-result set C_p.
We define the MCS as the minimum set S′ of service objects in S such that each customer c in the non-result set C_p is contributed by at least k service objects in S′ ∪ S_1. Intuitively, the MCS is the minimum set of service objects that must be visited to determine that the customers in C_p are not results.

The example shown in Fig. 3 illustrates the two parts, S_1 and S_2, for a set of customer objects for the BRSKkNN query. Suppose k = 2, the result customer set is C_r = {c4}, and the non-result customers are C_p = {c1, c2, c3}. We have S_1 = {s1}, since s1 is the top-1 service object of the result customer c4. And we have S_2 = {s4}, since {s4} is the minimum set of service objects such that at least k (= 2) service objects in S_1 ∪ S_2 contribute to each non-result customer in C_p.

Figure 3: Illustration of the two parts S_1 and S_2 for BRSKkNN queries (service objects s1, ..., s4 and customer objects c1, ..., c4). An edge from s to c represents that s contributes to c. The number t_i following a customer c_i is the number of service objects that contribute to c_i.

Based on the above analysis, we proceed to analyze the ISR of the BRSKkNN query q in the presence of index trees. We first consider the ISR for customer index nodes. We say an index node contains customer c if the subtree rooted at the node contains c. If an index node contains any result customer (in C_r), we must visit the node, since the result customer must be visited. Ideally, we do not visit the index nodes that contain no result customers. Hence, the ISR for the customer index consists of all the customer nodes that contain result customers in C_r.

For the service index, the ISR w.r.t. the BRSKkNN query q consists of two parts: 1) for each customer c in the result set C_r, the service index nodes whose maximum spatial-keyword relevance to c is equal to or larger than the relevance of the query object q to c must be visited to identify c as a result; 2) the minimum contributing set (MCS) of service index nodes.
Similar to the analysis without an index, we want to identify the minimum set of index nodes, denoted by MCS, such that we can identify the set of non-result customers C_p by visiting the nodes in the MCS and the nodes in part 1) (i.e., for each c in C_p we identify k service objects that contribute to c).

The problem of finding the MCS is computationally intractable, since the minimum Set-Cover problem, which is NP-hard, can be reduced to it. Moreover, finding the MCS requires an "oracle" that knows which non-result customers are contributed by a service object. Hence, finding the MCS is infeasible in practice.

Remark. The purpose of our analysis is not to develop an algorithm achieving the ideal case. Nevertheless, the analysis of the ideal case indicates what types of index entries must be visited, what types of entries can be selectively visited, and the principle for scheduling the order of visiting them. These offer practical guidelines for developing efficient algorithms, which have not been explored by the existing studies on BRSKkNN queries.

4.2 Search Algorithm for BRSKkNN

Based on the analysis in Section 4.1, we derive three guidelines:

Guideline 1. For each result customer c in C_r, we must identify all its top-t (t < k) most spatial-keyword relevant service objects to identify c as a result customer. Thus, for each c in C_r we must visit all the service index nodes whose maximum spatial-keyword relevance to c is the same as or more than the relevance between c and q. Note that it is non-trivial to estimate the maximum relevance; this is covered in Section 4.2.3. Consider the data in Fig. 1 and the IR-trees for the data shown in Fig. 2. The result customer is c4, and the service index nodes SRoot and E_s2 must be visited, since the maximum relevance between c4 and E_s2 (resp. SRoot) is larger than the relevance between c4 and q.

Guideline 2. Customer index nodes that contain result customers must be visited. For example, the customer index nodes CRoot and E_c2 must be traversed, since they contain the result customer c4.
Guideline 3. We need to visit service objects to prune all the non-result customers. Ideally, we minimize the number of service-object accesses, i.e., visit only the objects in the MCS, as discussed in Section 4.1. It is desirable to visit service objects that are at least as spatial-keyword relevant as the query service object q for a large number of non-result customers, since this reduces the accesses to service entries. For the example in Figure 1, ideally an algorithm visits only the service object s4 in the MCS, and then all the non-result customers {c1, c2, c3} can be pruned.

Under these guidelines, we develop a novel algorithm for answering the BRSKkNN query. We design different search strategies to process potential result customers and potential non-result customers. The search algorithm works in two steps: the Preliminary Diagnose (PD) step and the Confirmed Diagnose (CD) step.

In the PD step, we aim to 1) identify customer entries that are likely to contain result customers (these will be further checked and confirmed in the CD step) (Guideline 2), 2) identify and visit the service entries that must be visited (Guideline 1), and 3) prune as many non-result customer entries (i.e., entries that do not contain any result customer) as possible by visiting a minimum set of service objects (Guideline 3).

In the CD step, we aim to 1) find the top-t (t < k) service objects of the result customer entries from the PD step to confirm them as results (Guideline 1), and 2) selectively visit a minimum set of service entries such that they contribute to the non-result customer entries (from the PD step), thus pruning the non-result customers (Guideline 3).

4.2.1 Algorithm PD

Before presenting Algorithm PD, we introduce several definitions and lemmas.

DEFINITION 2 (LB∆((E_c,E_s)−(E_c,q))).
Given a customer entry E_c and a service entry E_s, the lower bound gap between 1) the spatial-keyword relevance of a customer object c in E_c to a service object s in E_s and 2) the relevance of c to the query service object q is defined as min{D_ST(c,s) − D_ST(c,q) | ∀c ∈ E_c, ∀s ∈ E_s}. It is denoted by LB∆((E_c,E_s)−(E_c,q)).

DEFINITION 3 (UB∆((E_c,E_s)−(E_c,q))). Given a customer entry E_c and a service entry E_s, the upper bound gap between 1) the spatial-keyword relevance of a customer object c in E_c to a service object s in E_s and 2) the relevance of c to the query service object q is defined as max{D_ST(c,s) − D_ST(c,q) | ∀c ∈ E_c, ∀s ∈ E_s}. It is denoted by UB∆((E_c,E_s)−(E_c,q)).

The two bounds play a very important role in the proposed algorithm. It is nontrivial to estimate them, and we present the estimation techniques in Section 4.2.3.

DEFINITION 4. Given a customer entry E_c, its contribution number is the number of service objects that contribute to each customer in E_c. Given a service entry E_s and a customer entry E_c, we say that E_s "contributes to" E_c if every service object in E_s is at least as relevant as q to every customer in E_c; E_s "cannot contribute to" E_c if every service object in E_s is less relevant than q to every customer in E_c; otherwise, we cannot determine whether E_s contributes to E_c, and we say that E_s "may contribute to" E_c.

The three relationships between a service entry E_s and a customer entry E_c can be further captured by the following corollary.

COROLLARY 1. E_s "contributes to" E_c iff LB∆((E_c,E_s)−(E_c,q)) ≥ 0; E_s "cannot contribute to" E_c iff UB∆((E_c,E_s)−(E_c,q)) < 0; otherwise, E_s "may contribute to" E_c.

DEFINITION 5. Let T be a set of service entries that have no ancestor-descendant relationship.
For a customer entry E_c, the lower bound of the contribution number of E_c, denoted LCN_{E_c}, is defined as:

  LCN_{E_c} = Σ_{E_s ∈ T : LB∆((E_c,E_s)−(E_c,q)) ≥ 0} |E_s|,   (4)

where |E_s| is the number of service objects contained in E_s.

This is a valid lower bound because, when LB∆((E_c,E_s)−(E_c,q)) ≥ 0, E_s "contributes to" E_c and all service objects in E_s contribute to each customer in E_c.

LEMMA 1. Given a customer entry E_c, if LCN_{E_c} ≥ k, then E_c does not contain any result customer and can be pruned.

PROOF. There exist at least k service objects that contribute to any customer c ∈ E_c. Thus we can safely prune E_c.

According to Corollary 1, E_s cannot contribute to E_c if UB∆((E_c,E_s)−(E_c,q)) < 0. When UB∆((E_c,E_s)−(E_c,q)) ≥ 0, E_s may contribute to E_c. Thus, we can derive an upper bound on the contribution number of E_c.

DEFINITION 6. Let T be a set of service entries that have no ancestor-descendant relationship. Given a customer entry E_c, the upper bound of the contribution number of E_c, denoted UCN_{E_c}, is defined as:

  UCN_{E_c} = Σ_{E_s ∈ T : UB∆((E_c,E_s)−(E_c,q)) ≥ 0} |E_s|   (5)

LEMMA 2. Given a customer entry E_c, if its upper bound on the contribution number satisfies UCN_{E_c} < k, then all the customers in E_c belong to the results.

PROOF. Fewer than k service objects can contribute to any customer object in E_c, and thus all objects in E_c will be reported as part of the answer.

Next we introduce Algorithm PD. The algorithm works in an iterative manner. In each iteration, it employs the k most spatial-keyword relevant service objects of q to process the customer entries in the customer index tree, to see whether a customer entry contains result customers. Intuitively, service objects that are most spatial-keyword relevant to q are likely to be the top-t (t < k) spatial-keyword relevant service objects of a result customer, and they are likely to be effective in pruning non-result customers. This choice fulfills aims 2 and 3 of the PD step outlined in Section 4.2.
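Eqns (4) and (5) together with Lemmas 1 and 2 give a three-way test for a customer entry. A minimal sketch, assuming the bound gaps of Definitions 2 and 3 have already been estimated (the function and label names are ours):

```python
def classify_entry(bound_gaps, sizes, k):
    """Apply Lemmas 1 and 2 to a customer entry E_c.

    bound_gaps: one (LB_gap, UB_gap) pair per service entry E_s in T, where
    the gaps bound D_ST(c, s) - D_ST(c, q) over all c in E_c, s in E_s.
    sizes: |E_s| for each service entry, in the same order.
    """
    lcn = sum(n for (lb, _), n in zip(bound_gaps, sizes) if lb >= 0)  # Eqn (4)
    ucn = sum(n for (_, ub), n in zip(bound_gaps, sizes) if ub >= 0)  # Eqn (5)
    if lcn >= k:
        return "pruned"        # Lemma 1: >= k sure contributors, no results in E_c
    if ucn < k:
        return "all_results"   # Lemma 2: fewer than k possible contributors
    return "undecided"         # E_c must be refined further
```

Note that LCN ≤ UCN always holds, since LB∆ ≥ 0 implies UB∆ ≥ 0, so at most one of the two lemmas can fire.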
The pseudocode of PD is outlined in Algorithm 1. PD maintains the following variables: 1) a max-priority queue U_s of service entries to be visited, where the key of an element E_s ∈ U_s is the maximum spatial-keyword relevance between service entry E_s and the query service object q; 2) a list L_c of customer entries to be processed in the PD step; 3) a list L_ca of customer entries that will be processed in the CD step; 4) two lists of service objects, kNew and nkOld, which store the service objects relevant to q found in the current iteration and in all iterations of the PD step so far, respectively.

The algorithm invokes procedure FindNextkNN, which incrementally finds the top-k spatial-keyword relevant service objects of the query service object q using the algorithm of [4]. The returned k objects are stored in kNew, and the max-priority queue U_s keeps track of the service entries that have not been visited. The upper bound on the spatial-keyword relevance of each node to the query q is used as the key for the priority queue U_s (Line 3). Then we use the k service objects in

Algorithm 1 PD (SR: the root of the shop index tree, CR: the root of the customer index tree, q: query service object)
Output: result: the set of BRSKkNN result objects.
1:  U_s ← InitPriorityQueue(SR); L_c ← InitList(CR)
2:  while L_c ≠ ∅ do
3:    (U_s, kNew) ← FindNextkNN(U_s, q)
4:    nkOld ← nkOld ∪ kNew; L_n = ∅
5:    for each entry E_c in L_c do
6:      L_c ← L_c \ E_c
7:      case ← PreferCase(E_c, kNew, q)
8:      if (case = CannotContribute) then L_ca ← L_ca ∪ {E_c}
9:      else if (case = MayContribute) then
10:       if (E_c is an index node) then
11:         for each child entry CE_c of E_c do
12:           LCN_{CE_c} = LCN_{E_c}; UCN_{CE_c} = LCN_{E_c}
13:           L_n ← L_n ∪ {CE_c}
14:       else L_ca ← L_ca ∪ {E_c}
15:   L_c ← L_c ∪ {L_n}
16: result ← CD(EnQueue(U_s, nkOld), L_ca, q)

Procedure PreferCase(E_c: customer entry, kNew, q)
17: for each service s in kNew do
18:   if LB∆((E_c,s)−(E_c,q)) ≥ 0 then
19:     LCN_{E_c} ← LCN_{E_c} + 1; UCN_{E_c} ← UCN_{E_c} + 1
20:     if LCN_{E_c} = k then return CanContribute
21: if ∀s ∈ kNew, UB∆((E_c,s)−(E_c,q)) < 0 then return CannotContribute
22: return MayContribute

kNew to process each customer entry E_c in L_c (Line 5) by invoking PreferCase (Line 7). The result of PreferCase will be one of the following three cases:

• CanContribute: The k service objects in kNew can contribute to E_c. In this case, k service objects are equally or more spatial-keyword relevant to E_c than the query object q is, so the customer entry can be pruned according to Lemma 1 (Line 20).

• CannotContribute: The k service objects in kNew cannot contribute to E_c. When we cannot prune E_c and E_c satisfies one of the two heuristics (to be presented), we consider E_c a candidate result entry, which will be further checked for results in the CD step (Line 8).

• MayContribute: The k service objects in kNew may contribute to E_c. Consider a customer entry E_c that does not belong to the first two cases. If E_c is an index node, then we initialize the contribution numbers of its child entries CE_c with those of E_c (Line 12), and add CE_c into L_n. Intuitively, we can estimate tighter bounds of the contribution number for the child entries, which cover smaller spatial regions and fewer words; they will be processed with the next k service objects in the next while-loop.
If E_c is a customer object, we add E_c into L_ca, which will be processed in the CD step (Line 14).

In the next while-loop, Algorithm PD incrementally finds the next k most spatial-keyword relevant service objects to query q, and uses them to repeat the above procedure until L_c is empty. Finally, the retrieved spatial-keyword relevant service objects in nkOld are enqueued into priority queue U_s (which contains unvisited service entries). We pass U_s and the set of candidate customer entries L_ca to the CD step.

In procedure PreferCase, if a service object s can contribute to customer entry E_c (i.e., LB∆((E_c,s)−(E_c,q)) ≥ 0), we increase the contribution number of E_c by 1 (Line 19). When the contribution number of E_c reaches k, PreferCase returns case 1 (Line 20). If customer entry E_c satisfies the following heuristic (Line 21), then PreferCase returns case 2. Otherwise, it returns case 3.

Figure 4: Heuristic illustration for customer entry E_c: (a) when E_c (i.e., c_1) contains results; (b) when E_c does not contain results.

Heuristic: If none of the k service objects in kNew can contribute to E_c, i.e., for each s among the k objects, UB∆((E_c,s)−(E_c,q)) < 0 (Line 21), then E_c is moved to the CD step. A customer entry E_c satisfying the heuristic falls into one of two possibilities. 1) E_c is likely to be a result entry. For example, in Figure 4(a), let k=2 and consider spatial information only for intuitive illustration. Both s_1 and s_2, which are the 2NN of q, cannot contribute to customer c_1, and thus c_1 satisfies the heuristic. We move c_1 to the CD step, which will find that q is the top-1 service object (the nearest neighbor) of c_1 and thus identify c_1 as the RkNN result of q. In this way, we avoid visiting the remaining service objects s_3, ..., s_6. 2) E_c is not a result entry, but it cannot be pruned effectively by the service objects near query object q. For example, in Figure 4(b), let k=2 and assume that currently kNew = {s_3, s_4}.
We move E_c to the CD step since the service objects s_1, ..., s_4 close to q cannot be used to prune E_c; in the CD step, some other service objects (e.g., s_5 and s_6 near E_c) will be visited to prune E_c.

Example 1: We use an example to illustrate Algorithm PD. Recall the data in Fig. 1 and the index trees in Fig. 2. Consider the query service object q, k=1, and α=0.6. The trace of Algorithm PD is shown in Table 2. Step 1: we find the most spatial-keyword relevant object s_4 to query q. Step 2: next we use s_4 to check whether the customer entries in L_c can be pruned. We prune E_c1 since s_4 can contribute to E_c1 and PreferCase(E_c1, s_4, q) returns case 1 (Line 20). We traverse down E_c2 since PreferCase(E_c2, s_4, q) returns case 3 (Line 22, Lines 9-13): E_c2 does not prefer s_4 to q (Lines 18-20), nor does E_c2 satisfy the heuristic (Line 21). Step 3: similarly, in the next while-loop, we incrementally find the next object s_3 near q. Step 4: we use s_3 to check whether c_3 and c_4 can be pruned, and we move both c_3 and c_4 to L_ca since s_3 cannot contribute to them and they satisfy the heuristic (Line 21). Then c_3 and c_4 are passed to the CD step.

Table 2: Trace of the PD step for BRSKkNN in Example 1
Step | Actions | kNew | nkOld | U_s | L_c | L_ca
1 | FindNextNN | s_4 | s_4 | s_3, E_s1 | E_c1, E_c2 | ∅
2 | Prune E_c1; visit E_c2 | s_4 | s_4 | s_3, E_s1 | c_3, c_4 | ∅
3 | FindNextNN | s_3 | s_4, s_3 | E_s1 | c_3, c_4 | ∅
4 | Move c_3 and c_4 to L_ca | s_3 | s_4, s_3 | E_s1 | ∅ | c_3, c_4

4.2.2 Algorithm CD

The customer entries in L_ca generated by the PD step are very likely to contain result customers. The CD step progressively computes their lower and upper bounds of contribution number by visiting service entries in a branch-and-bound manner, to determine whether the customer entries in L_ca are results. The challenge here is how to visit as few service entries as possible.
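The three-way classification of the PD step (procedure PreferCase), as traced in Example 1 above, can be sketched as follows. This is my own simplification: the gap-bound functions are assumed given, and `lcn` carries any count inherited from the parent entry.

```python
# Sketch of PreferCase: count services in kNew that certainly
# contribute to E_c (lower-bound gap >= 0); if none may contribute
# (all upper-bound gaps < 0), apply the heuristic of Line 21.

def prefer_case(kNew, k, lb_gap, ub_gap, lcn=0):
    for s in kNew:
        if lb_gap(s) >= 0:                    # s certainly contributes
            lcn += 1
            if lcn >= k:
                return "CanContribute", lcn   # prune E_c (Lemma 1)
    if all(ub_gap(s) < 0 for s in kNew):      # heuristic (Line 21)
        return "CannotContribute", lcn        # hand E_c to the CD step
    return "MayContribute", lcn               # refine E_c's children

# Toy gap values (LB gap, UB gap) per service object:
gaps = {"s1": (0.2, 0.4), "s2": (0.1, 0.3), "s3": (-0.6, -0.2)}
case, _ = prefer_case(["s1", "s2"], 2,
                      lambda s: gaps[s][0], lambda s: gaps[s][1])
```

With both s1 and s2 certainly contributing and k = 2, the entry is classified CanContribute and pruned.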
To achieve this, Algorithm CD selectively visits a set of service and customer entries and accesses them in an order based on two priority queues:

1) It maintains a max-priority queue U_s2 on the service entries to be traversed. The structure of U_s2 is different from the queue U_s used in the PD step (both are for service entries). Each element E_s in U_s2 is associated with a set E_s.C of customer entries that may be contributed to by E_s. The rationale for maintaining E_s.C for E_s is that E_s and its descendant entries need not be visited for customer entries that are not in E_s.C, according to Corollary 1. In other words, Algorithm CD only needs to consider the customer entries in E_s.C when processing E_s. The key of an element E_s in U_s2 is the total contribution number of the customer entries in E_s.C, because service entries contributing to more customer entries are likely to prune more non-result customers (under Guideline 3). For example, in Figure 5(a), E_s1.C = {E_c2} and E_s2.C = {E_c1, E_c2}. Hence the key of E_s1 is |E_c2| = 2 and the key of E_s2 is |E_c1| + |E_c2| = 3. Since the key of E_s2 is larger than that of E_s1, E_s2 has a higher priority to be visited than E_s1. In this way, we improve the pruning power of the algorithm: i) if we visit E_s2 first, we can prune both customer entries E_c1 and E_c2 (i.e., service object s_3 can prune E_c1 and s_4 can prune E_c2); but ii) if we visit E_s1 first, we can only prune E_c1 (i.e., neither service object s_1 nor s_2 in E_s1 can prune E_c2), and thus we still need to visit E_c2.

Figure 5: Illustration for the priorities of U_s2 (a) and U_ca (b)

2) CD maintains a max-priority queue U_ca on the customer entries that need to be checked as possible results. Each customer entry E_c is associated with a set E_c.S of service entries that may contribute to E_c. The key of an element E_c in U_ca is the total contribution number of the service entries in E_c.S to E_c.
Intuitively, a customer entry associated with a large number of service entries is likely to be diverse in its spatial and textual information, and thus it is difficult to process the entry as a whole. Hence we prioritize processing such entries (i.e., visiting their component entries). In contrast, the bounds for customer entries associated with fewer service entries are more likely to be tight, and thus it is more likely that such an entry can be determined, as a whole, to contain results or not without accessing its component entries. For example, we consider spatial information only in Figure 5(b) for intuitive illustration. Suppose E_c1.S = {E_s2} and E_c2.S = {E_s1, E_s2}. Hence the key of E_c1 is |E_s2| = 2, and the key of E_c2 is |E_s1| + |E_s2| = 4. Since the key of customer entry E_c2 is larger than that of E_c1, we visit E_c2 first. We observe that E_c2 has a larger MBR, so it is difficult to estimate tight bounds for E_c2. Hence, we visit it first to obtain its component entries (c_1 and c_2). In contrast, E_c1, with a smaller MBR, can be pruned (by service object s_1 in E_s2) as a whole, without visiting its components.

We next present a corollary to be used in Algorithm CD.

COROLLARY 2. Given a service entry E_s, if ∀E_c ∈ U_c, E_s ∉ E_c.S, then E_s will not be visited in Algorithm CD.

PROOF. If ∀E_c ∈ U_c, E_s ∉ E_c.S, then E_s cannot contribute to any of the customer entries that need to be checked in Algorithm CD. According to Corollary 1, Corollary 2 is true.
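The priority keys of the two queues can be sketched concretely. The sizes below are made up to reproduce the Figure 5(a) numbers (|E_c1| = 1, |E_c2| = 2); the same key function applies symmetrically to U_ca with E_c.S.

```python
# Sketch of the U_s2 priority key: a service entry is keyed by the
# total number of customer objects in the entries it may contribute
# to (E_s.C). A max-priority queue is emulated with negated keys.

import heapq

sizes = {"Ec1": 1, "Ec2": 2}                   # |E_c| per customer entry
C = {"Es1": ["Ec2"], "Es2": ["Ec1", "Ec2"]}    # E_s.C from Figure 5(a)

def key(es):
    return sum(sizes[ec] for ec in C[es])

heap = [(-key(es), es) for es in C]
heapq.heapify(heap)
order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

E_s2 (key 3) is popped before E_s1 (key 2), matching the visiting order argued in the text.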
Algorithm 2 CD (L_s, L_ca, q)
1:  Initialize two max-priority queues U_s2 and U_ca
2:  for each customer entry E_c in L_ca do
3:    for each service entry E_s in L_s do
4:      UpdateBounds(E_c, E_s, q, U_s2)
5:    if (IsHitOrDrop(E_c) = false) then EnQueue(U_ca, E_c)
6:  while U_ca ≠ ∅ do
7:    E_s ← DeQueue(U_s2)
8:    for each customer entry E_c in E_s.C do
9:      UCN_{E_c} ← UCN_{E_c} − |E_s|
10:     for each child entry CE_s of E_s do
11:       UpdateBounds(E_c, CE_s, q, U_s2)
12:     if (IsHitOrDrop(E_c) = true) then U_ca ← U_ca \ E_c
13:   E_c ← DeQueue(U_ca)
14:   for each child entry CE_c of E_c do
15:     LCN_{CE_c} = LCN_{E_c}; UCN_{CE_c} = LCN_{E_c}
16:     for each service entry E_s in E_c.S, in increasing order of UB∆((E_c,E_s)−(E_c,q)), do
17:       UpdateBounds(CE_c, E_s, q, U_s2)
18:     if (IsHitOrDrop(CE_c) = false) then EnQueue(U_ca, CE_c)
19: Return results

Procedure UpdateBounds(E_c, E_s, q, U_s2)
20: if LB∆((E_c,E_s)−(E_c,q)) ≥ 0 then  // E_s contributes to E_c
21:   LCN_{E_c} ← LCN_{E_c} + |E_s|; UCN_{E_c} ← UCN_{E_c} + |E_s|
22: else if UB∆((E_c,E_s)−(E_c,q)) ≥ 0 then  // E_s may contribute to E_c
23:   UCN_{E_c} ← UCN_{E_c} + |E_s|
24:   Add E_c into E_s.C; add E_s into E_c.S
    // otherwise, E_s cannot contribute to E_c
25: if ((E_s is an index node) ∧ (∃ce ∈ E_s.C, E_s ∈ ce.S)) then
26:   EnQueue(U_s2, E_s)

Procedure IsHitOrDrop(E_c)
27: if (LCN_{E_c} ≥ k or UCN_{E_c} < k) then
28:   ∀se ∈ E_c.S, se.C = se.C \ E_c
29:   if (UCN_{E_c} < k) then results.add(subtree(E_c))
30:   return true
31: else return false

The pseudocode of CD is outlined in Algorithm 2. The arguments L_s and L_ca store the service and customer entries passed from the PD step, respectively. For each customer entry E_c in L_ca, we compute the bounds of the contribution number for E_c using each service entry E_s in L_s (Lines 2–5) by invoking procedure UpdateBounds (Line 4). Then we check whether customer entry E_c is a result entry ("hit") or can be pruned ("drop") by invoking procedure IsHitOrDrop. We enqueue E_c into U_ca if E_c is neither a "hit" nor a "drop" (Line 5).
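A minimal sketch of the bound maintenance in procedures UpdateBounds and IsHitOrDrop (Lines 20–31 of Algorithm 2), using plain dictionaries in place of index entries; the gap values are assumed precomputed, and the set bookkeeping mirrors E_s.C and E_c.S.

```python
# Sketch: refine the (LCN, UCN) bounds of a customer entry with one
# service entry, recording the mutual may-contribute association used
# by the two priority queues of the CD step.

def update_bounds(ec, es, lb_gap, ub_gap):
    if lb_gap >= 0:                 # E_s certainly contributes (Line 21)
        ec["lcn"] += es["size"]
        ec["ucn"] += es["size"]
    elif ub_gap >= 0:               # E_s may contribute (Lines 23-24)
        ec["ucn"] += es["size"]
        es["C"].add(ec["name"])     # E_s.C: customers E_s may affect
        ec["S"].add(es["name"])     # E_c.S: services that may affect E_c

def is_hit_or_drop(ec, k):
    if ec["lcn"] >= k:
        return "drop"               # Lemma 1: prune E_c
    if ec["ucn"] < k:
        return "hit"                # Lemma 2: whole subtree is a result
    return None                     # keep E_c in U_ca

# Toy usage with invented gaps:
ec = {"name": "Ec", "lcn": 0, "ucn": 0, "S": set()}
es_sure = {"name": "Es2", "size": 2, "C": set()}
es_maybe = {"name": "Es1", "size": 3, "C": set()}
update_bounds(ec, es_sure, lb_gap=0.1, ub_gap=0.4)
update_bounds(ec, es_maybe, lb_gap=-0.3, ub_gap=0.2)
```

After the two updates, LCN = 2 and UCN = 5, so with k = 2 the entry is dropped by Lemma 1.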
In procedure UpdateBounds (Lines 20–26): 1) if E_s contributes to E_c, we increase both the lower and upper bounds of the contribution number of E_c by the number of service objects in E_s (Line 21). 2) If E_s may contribute to E_c, we increase the upper bound of the contribution number of E_c by the number of service objects in E_s (Line 23), and update the corresponding sets of contribution entries (Line 24). 3) Otherwise, we confirm that E_s cannot contribute to E_c. After that, if the following two conditions are met (Line 25): i) E_s is an index node, and ii) there exists a customer entry ce in E_s.C such that E_s may contribute to ce, then E_s is enqueued into U_s2 to be further considered (Line 26). Otherwise, we disregard E_s according to Corollary 2.

In procedure IsHitOrDrop, we check whether customer entry E_c can be pruned or reported as a result (Lines 27–31). 1) If its lower bound contribution number LCN_{E_c} ≥ k, we can prune E_c based on Lemma 1; 2) or, if its upper bound UCN_{E_c} < k, the objects in E_c are results (Line 29) based on Lemma 2. In either case we remove E_c from the corresponding places (Line 28) and return true (Line 30). Otherwise, we return false (Line 31).

Next we update the contribution number bounds for the candidates in U_ca (Lines 6–19) to identify whether each E_c in U_ca contains results. We use the child entries of the service index node E_s dequeued from U_s2 to update the contribution number bounds for the customer entries in E_s.C (Lines 7–12). The upper bound of each customer entry in E_s.C must first have |E_s| subtracted (Line 9), to avoid double counting the service objects in the subtree of E_s. Meanwhile, we use the service entries in E_c.S to update the bounds for the child customer entries of the E_c dequeued from U_ca (Lines 13–18). To accelerate the algorithm, the lower and upper bounds of the contribution number of each child entry are initialized with the lower bound of its parent entry (Line 15).
The elements E_s in E_c.S are ranked in ascending order of UB∆((E_c,E_s)−(E_c,q)), since an E_s with a larger key value is more likely to be one of the top-k services of E_c (under Guideline 1).

Example 2: We continue with Example 1 and illustrate the CD algorithm as shown in Table 3. Step 0: initially, from the PD step, the customer entries in L_ca are {c_3, c_4} and the service entries in U_s are {E_s1, s_3, s_4}. Step 1: we check whether each service entry in U_s can contribute to customer c_3 or c_4 (Lines 2–5). For c_3, service object s_4 in U_s contributes to it, and thus we prune c_3 (Line 28); c_4 may be contributed to by E_s1, and thus we add c_4 to E_s1.C and E_s1 to c_4.S (Line 24). Step 2: we expand service entry E_s1 to visit its child entries s_1 and s_2, and we check whether they can contribute to c_4 in E_s1.C (Lines 7–12). Finally, we report c_4 as a result, since service objects {s_1, s_2} cannot contribute to c_4 (Line 29).

Table 3: Trace of the CD step for BRSKkNN in Example 2
Step | Actions | L_s | L_ca | U_s2 | U_ca
0 | — | E_s1, s_3, s_4 | c_3, c_4 | ∅ | ∅
1 | Prune c_3; add c_4 to E_s1.C; add E_s1 to c_4.S | ∅ | ∅ | E_s1 (E_s1.C: c_4) | c_4 (c_4.S: E_s1)
2 | Visit E_s1; report c_4 as a result | ∅ | ∅ | ∅ | ∅

4.2.3 Computing Lower and Upper Bounds

We proceed to present a novel approach to effectively estimate the lower and upper bound gaps in Corollary 1, which are used to compute the bounds of the contribution number. The bound computation is an essential part of the proposed algorithm and is very important for its performance.

Let E.corners denote the four corners of the MBR of an entry E. Let o be the middle point between a service object s and the query object q. Let E_s.mint_i (resp. E_s.maxt_i) be the minimal (resp. maximal) weight of word t_i in the document E_s.doc of service entry E_s. Let E_c.intkw (resp. E_c.unikw) be the intersection (resp. union) of the keywords of the customer entry E_c. Further, let |p_1 p_2| denote the Euclidean distance between two points p_1 and p_2.
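The text-relevance component used in these bounds is the Okapi BM25 model of Eqn(6), with k_1 = 1.2 and b = 0.75 as in the paper. A direct transcription follows; the corpus statistics in the usage example are invented.

```python
# Sketch of the BM25 scoring of Eqn(6): sum over query terms of
# idf * saturated term frequency, with document-length normalization.

import math

K1, B = 1.2, 0.75  # parameter settings from [14], as used in the paper

def bm25(query_terms, doc_tf, doc_len, avg_len, N, df):
    """N: number of service objects; df: term -> document frequency;
    doc_tf: term -> frequency in this document."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        norm = K1 * ((1 - B) + B * doc_len / avg_len) + tf
        score += idf * (K1 + 1) * tf / norm
    return score

# Hypothetical toy corpus: 100 service documents, "pizza" in 10 of them.
s = bm25(["pizza"], {"pizza": 3}, doc_len=50, avg_len=50, N=100,
         df={"pizza": 10})
```

With doc_len equal to avg_len, the normalizer reduces to k_1 + tf, so the score is log(10) · 2.2 · 3 / 4.2.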
The text relevance Rel(s.doc|c.kw) of customer c and service object s in Eqn(1) is computed by the Okapi BM25 model [14] in Eqn(6):

Rel(s.doc|c.kw) = Σ_{t_i ∈ c.kw} log(N / df_i) · ((k_1 + 1) tf_{i,s}) / (k_1 ((1 − b) + b · L_s / L_avg) + tf_{i,s})   (6)

where N is the number of service objects in S, df_i is the number of service objects in S containing term t_i, tf_{i,s} is the frequency of term t_i in object s, L_s is the document length of s, and L_avg is the average document length of all objects; the parameters are set to k_1 = 1.2 and b = 0.75, following [14].

DEFINITION 7 (COMPUTING LB∆). Given a customer entry E_c, a service entry E_s and a query service object q, the lower bound gap LB∆((E_c,E_s)−(E_c,q)) is computed by:

LB∆((E_c,E_s)−(E_c,q)) = (1−α) ∆_MinT(E_c,E_s,q) / MaxR − α ∆_MaxD(E_c,E_s,q) / MaxD   (7)

where ∆_MaxD(E_c,E_s,q), i.e., Max_{c∈E_c.mbr, s∈E_s.mbr} {|cs| − |cq|}, is:

∆_MaxD(E_c,E_s,q) = Max_{s∈E_s.corners} { ∆_MaxD(E_c,s,q) }   (8)

with ∆_MaxD(E_c,s,q) = |cs| − |cq| if cos<→oc, →oq> ≤ 0, and |sq| cos<→oc, →oq> otherwise, where c ∈ E_c.corners such that cos<→oc, →oq> is maximum;

and ∆_MinT(E_c,E_s,q), i.e., Min_{∀c∈E_c, ∀s∈E_s} {Rel(c.kw|s.doc) − Rel(c.kw|q.doc)}, is defined as:

∆_MinT(E_c,E_s,q) = Σ_{t_i ∈ E_c.intkw} (E_s.mint_i − q.t_i) + Σ_{t_i ∈ E_c.unikw − E_c.intkw, E_s.mint_i < q.t_i} (E_s.mint_i − q.t_i)   (9)

Figure 6: Illustrations of ∆_MaxD(E_c,E_s,q) (a) and ∆_MinD(E_c,E_s,q) (b) when E_s is an object

When E_s is a service object s, as shown in Fig. 6(a), we plot the hyperbola with focal points s and the query service object q, with the middle point o of s and q as the origin, through an arbitrary point c in a customer entry E_c. We then plot the two asymptotes of the hyperbola as solid lines. Let θ_e = arctan K_e, where K_e is the slope of the asymptote on the side of point e on the hyperbola. Then we have the following geometric proposition:

PROPOSITION 1.
Given a hyperbola H: x²/a² − y²/b² = 1 (a > 0, b > 0) with focal points s and q, any point e on the hyperbola H satisfies |es| − |eq| = |sq| cos θ_e.

PROOF. It is known that (|es| − |eq|)² = (2a)², a² = (|sq|/2)² − b², and K_e = b/a. Then we have:

(|es| − |eq|)² = |sq|² / (1 + K_e²) = |sq|² (cos θ_e)²   (12)

Since |es| − |eq| and cos θ_e have the same sign, Proposition 1 holds.

If cos<→oc, →oq> ≤ 0, as in the cases E_c1 and E_c4 shown in Fig. 6(a), Eqn(10) (i.e., ∆_MaxD(E_c,E_s,q) ≥ |cs| − |cq| for all c ∈ E_c, s ∈ E_s) is obviously true. Otherwise, as in the cases E_c2 and E_c3, we have cos θ_e > cos θ_c. Further, according to Proposition 1, we have: ∀c ∈ E_c, ∀s ∈ E_s, |sq| · cos<→oc, →oq> = |sq| · cos θ_e > |sq| cos θ_c = |cs| − |cq|. Hence Eqn(10) is true when E_s is an object. When E_s is an index node, Eqn(10) also holds, since the values ∆_MaxD(E_c,s,q) at the corners of E_s dominate the values for the objects in E_s.

We proceed to prove Eqn(11). The ranking model BM25 in Eqn(6) is monotonic with respect to the number of specified keywords and the weights of the words in the documents of service objects. Thus, for the keywords E_c.kw, Σ_{t_i ∈ E_c.kw} (E_s.mint_i − q.t_i) is minimal. Hence, for all the keywords in E_c, the combined score of the two parts in Eqn(9) is no larger than rel(c.kw|s.doc) − rel(c.kw|q.doc), ∀c ∈ E_c, ∀s ∈ E_s. Hence Eqn(11) is true. Putting these together, Lemma 3 holds.

DEFINITION 8 (COMPUTING UB∆). Given a customer entry E_c, a service entry E_s and a query service object q, the upper bound gap UB∆((E_c,E_s)−(E_c,q)) is computed by:

UB∆((E_c,E_s)−(E_c,q)) = (1−α) ∆_MaxT(E_c,E_s,q) / MaxR − α ∆_MinD(E_c,E_s,q) / MaxD   (13)

where ∆_MinD(E_c,E_s,q), i.e., Min_{c∈E_c.mbr, s∈E_s.mbr} {|cs| − |cq|}, is:

∆_MinD(E_c,E_s,q) = Min_{s∈E_s.corners} { ∆_MinD(E_c,s,q) }   (14)

with ∆_MinD(E_c,s,q) = |cs| − |cq| if cos<→oc, →oq> > 0, and |sq| cos<→oc, →oq> otherwise, where c ∈ E_c.corners such that cos<→oc, →oq> is minimum;

and ∆_MaxT(E_c,E_s,q), i.e.,
Max_{∀c∈E_c, ∀s∈E_s} {Rel(c.kw|s.doc) − Rel(c.kw|q.doc)}, is defined as:

∆_MaxT(E_c,E_s,q) = Σ_{t_i ∈ E_c.intkw} (E_s.maxt_i − q.t_i) + Σ_{t_i ∈ E_c.unikw − E_c.intkw, E_s.maxt_i > q.t_i} (E_s.maxt_i − q.t_i)   (15)

We next prove that UB∆((E_c,E_s)−(E_c,q)) is an upper bound for the gap D_ST(c,s) − D_ST(c,q).

LEMMA 4. ∀c ∈ E_c, ∀s ∈ E_s, D_ST(c,s) − D_ST(c,q) ≤ UB∆((E_c,E_s)−(E_c,q)).

PROOF. According to Eqn(1) and Eqn(13), it suffices to show that, ∀c ∈ E_c, ∀s ∈ E_s:

(1−α) ∆_MaxT(E_c,E_s,q)/MaxR − α ∆_MinD(E_c,E_s,q)/MaxD ≥ (1−α) (rel(c.kw|s.doc) − rel(c.kw|q.doc))/MaxR − α (|cs| − |cq|)/MaxD.

Thus we need to prove that, ∀c ∈ E_c, ∀s ∈ E_s:

∆_MinD(E_c,E_s,q) ≤ dist(c.p,s.p) − dist(c.p,q.p)   (16)
∆_MaxT(E_c,E_s,q) ≥ rel(c.kw|s.doc) − rel(c.kw|q.doc)   (17)

First we prove Eqn(16), and then Eqn(17). When E_s is a service object s, if cos<→oc, →oq> > 0, as in the cases E_c2 and E_c3 shown in Fig. 6(b), Eqn(16) is obviously true. Otherwise, as in the cases E_c1 and E_c4, we have cos θ_e < cos θ_c. Further, based on Proposition 1, we have: ∀c ∈ E_c, ∀s ∈ E_s, |sq| cos<→oc, →oq> = |sq| cos θ_e < |sq| cos θ_c = |cs| − |cq|. Hence Eqn(16) holds. When E_s is an index node, Eqn(16) also obviously holds.

We proceed to prove Eqn(17). According to the BM25 model, Σ_{t_i ∈ E_c.kw} (E_s.maxt_i − q.t_i) is maximal. Correspondingly, Eqn(17) is true. Overall, Lemma 4 holds.

4.3 Search Algorithm for MRSKkNN

The guidelines discussed in Sec. 4.2 and the PD and CD search framework can be applied to process MRSKkNN queries with a straightforward extension. To illustrate this, we use the MRSKkNN query in Fig. 1 as an example, where the service objects compose the universal dataset and the index is given in Figure 2(a).

The three guidelines in Section 4.2 are applicable to MRSKkNN with adaptation. 1) Guideline 1.
For each result object s, the top-t most spatial-keyword relevant objects of s (the query object q is the (t+1)-th relevant object) must be visited, and so must the index nodes whose maximum spatial-keyword relevance to s is the same as or higher than that between s and q. For example, index node E_s1 must be visited, since the maximum relevance between result object s_4 and E_s1 is larger than that between s_4 and q. 2) Guideline 2. Index nodes that contain result objects must be visited. For example, index nodes SRoot and E_s2 must be visited since they cover result s_4. 3) Guideline 3. We aim to prune all the non-result objects while accessing a minimum set of entries. For example, ideally we visit only s_4, and then all the non-result objects {s_1, s_2, s_3} can be pruned.

4.4 Theoretical Analysis

Theorem 1 gives the I/O cost of the proposed algorithm for the BRSKkNN query. Its proof and the I/O cost analysis for MRSKkNN queries can be found in [1].

THEOREM 1. Assume that the locations of N_s service objects and N_c customer objects are uniformly distributed in a 2-dimensional space, the average number of words per customer is 1, and the word frequencies in each service object follow the Zipf distribution. With high probability, the total I/O cost of Algorithms 1 and 2 based on the IR-tree for the BRSKkNN query is O(⌈(N_c k + N_s k) / (N_s f_s)⌉), where k is the parameter in BRSKkNN and f_s (resp. f_c) is the fanout of the service (resp. customer) IR-tree.

5. EXPERIMENTAL STUDIES

We present evaluation results on the efficiency and scalability of the proposed search framework for answering BRSKkNN and MRSKkNN queries.

Implemented algorithms. For BRSKkNN queries, we compare our proposed algorithm (denoted by BiGuide) with the baseline method extended from the solution for MRSKkNN [11] (denoted by BiBase). For MRSKkNN queries, we compare our algorithm (denoted by MoGuide) with the baseline solution in [11] (denoted by MoBase).

Datasets and Queries. We use two datasets in our experiments: GN and WEB. Table 4 gives the properties of each dataset.
GN is a real dataset from the U.S. Board on Geographic Names, with a large number of words describing each geographic location. We use a real spatial dataset containing the TIGER census blocks in Iowa, Kansas, Missouri and Nebraska (www.rtreeportal.org) and a real document dataset, WEBSPAM-UK2007^1, consisting of a large number of web documents, to generate a dataset WEB by randomly selecting a document for each spatial object. These datasets differ from each other in terms of data size, spatial distribution and text size.

Table 4: Datasets for the experiments
Statistics | GN | WEB
Total # of objects | 1,868,821 | 579,727
Total # of unique words | 222,409 | 2,899,175
Total # of words | 18,374,228 | 249,132,883

For MRSKkNN queries, only service objects are required, and the two datasets are used as service objects. However, for BRSKkNN queries, apart from the service objects, we also need to generate synthetic keyword queries for the customer objects, since spatial-keyword queries of customers on the two datasets are not available. We generate a customer c as follows: we randomly generate coordinate values within the region of the dataset for c, and then randomly choose m words from the text of a service object s that is randomly selected from the original dataset. The number m is generated randomly following the query length distribution: most queries are of length 1-2, according to the analysis of a large map query log [22]. The ratio between the number of service objects and the number of generated customer objects is 1:10. For each dataset, we generate 100 query objects in the same way as we generate customers. We report the average running time of the 100 queries.

In addition, to evaluate scalability, we generate 5 datasets containing 1 to 9 million service objects: the location of each object is generated randomly in the space of the GN dataset, and we randomly select a document from the document collection of GN to associate with each data point.
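The synthetic customer generation described above can be sketched as follows. The function and the concrete length distribution are my own fill-in: the paper only states that m follows the observed map-query length distribution, with most queries of length 1-2 [22].

```python
# Sketch: generate one synthetic customer with a random location in
# the dataset region and m keywords drawn from a random service's text.

import random

def generate_customer(region, services, length_dist, rng=random):
    """region: ((xmin, ymin), (xmax, ymax));
    services: list of word lists, one per service object;
    length_dist: query length -> probability (assumed distribution)."""
    (xmin, ymin), (xmax, ymax) = region
    x = rng.uniform(xmin, xmax)
    y = rng.uniform(ymin, ymax)
    words = rng.choice(services)          # text of a random service
    lengths, weights = zip(*length_dist.items())
    m = rng.choices(lengths, weights=weights)[0]
    m = min(m, len(words))                # cannot sample more than exist
    return (x, y), rng.sample(words, m)

# Assumed distribution skewed toward 1-2 word queries:
dist = {1: 0.5, 2: 0.3, 3: 0.15, 4: 0.05}
services = [["pizza", "italian", "delivery"], ["coffee", "wifi"]]
loc, kws = generate_customer(((0, 0), (10, 10)), services, dist)
```

Each generated customer's keywords are, by construction, a subset of some service object's text, which keeps the synthetic queries answerable.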
Correspondingly, we generate 5 datasets containing 10 to 90 million customer objects according to the 1:10 ratio. In the experiments, by default we use the GN dataset listed in Table 4 as service objects, and a generated dataset (containing 18 million objects) as customer objects.

Setup. We implement all the algorithms in VC++ and run them on an Intel(R) Core(TM)2 Quad CPU Q8200 @2.33GHz with 4GB of RAM. We implement two versions of the proposed algorithms using disk-resident and memory-resident indexes, respectively. The disk-resident version is used as the default. The page size is set to 4KB. In our experiments, the parameter k varies from 1 to 32 and is set to 4 by default; the parameter α for both MRSKkNN and BRSKkNN queries varies from 0 to 1 and is set to 0.9 by default.

^1 barcelona.research.yahoo.net/webspam/datasets/uk2007

Performance and scalability of algorithms for BRSKkNN queries. This set of experiments evaluates the performance of the search algorithms for BRSKkNN queries: BiBase and BiGuide.

(1) As shown in Fig. 7, BiGuide runs about two orders of magnitude faster than BiBase on datasets of different sizes. This is because of the practical guidelines used to optimize the accesses in our algorithm. In addition, our search strategy is supported by tight gap bound estimations, which, with the text component, are much tighter than the spatial-keyword bound estimations in BiBase [11].

(2) Additionally, Fig. 8 demonstrates that BiGuide consistently outperforms BiBase at different k values.

(3) Fig. 9 shows the performance when we vary α from 0 to 1, adjusting the relative importance of keyword relevance and spatial proximity. We observe that BiGuide significantly outperforms BiBase irrespective of α. When the textual component plays an indispensable role (α in 0 to 0.8), both BiBase and BiGuide are insensitive to α. However, when α is in 0.8 to 1, the query times of BiBase and BiGuide drop.

(4) The next set of experiments studies the heuristic introduced in the PD algorithm when varying k for BRSKkNN queries.
As shown in Fig. 10, the heuristic significantly accelerates the search algorithm. This is because the heuristic can identify 1) customer entries that are more likely to contain results and 2) customer entries that cannot be effectively pruned by the service objects near the query object; we then pass them to the CD step to reduce the page accesses.

(5) This experiment evaluates the performance of BiBase and BiGuide on GN when the index is in memory. The results, shown in Fig. 11, are consistent with those obtained for the disk-resident dataset. BiGuide runs much faster than BiBase at different k values.

Experiments on the WEB dataset. We also conduct extensive experiments on dataset WEB. We only report results when varying parameter k of BRSKkNN, in Fig. 12. We observe that the results are consistent with those on GN shown in Fig. 8.

Performance study of algorithms for MRSKkNN queries. This set of experiments evaluates the performance and scalability of algorithms MoBase and MoGuide. Fig. 13 shows the query time when we vary the size of dataset GN from 11M to 99M. We can see that MoGuide outperforms MoBase significantly, and scales well with the size of the dataset. The results on dataset WEB when varying the data size are qualitatively similar and are omitted.

To summarize, our experimental results show that the proposed search framework can be successfully applied to BRSKkNN and MRSKkNN queries, and that it significantly outperforms the baseline method [11].

6. CONCLUSION

In this paper, we study the reverse spatial-keyword nearest neighbor query. We analyze an ideal case, which minimizes the page accesses, for processing BRSKkNN queries. Under the derived guidelines, we design an efficient search algorithm based on a novel search strategy and a new method of estimating bounds for the spatial-keyword relevance between nodes. The search algorithm is also successfully applied to process MRSKkNN queries.
The experimental study shows that the proposed algorithms outperform the state-of-the-art methods by orders of magnitude for BRSKkNN queries, and by several times for MRSKkNN.

Figure 7: Varying dataset sizes of GN for BRSKkNN: (a) query time; (b) page accesses
Figure 8: Varying k on GN for BRSKkNN
Figure 9: Varying α on GN for BRSKkNN
Figure 10: Performance of the heuristic
Figure 11: Memory-resident indexes for BRSKkNN
Figure 12: Varying k on WEB for BRSKkNN
Figure 13: Varying data size for MRSKkNN

7. REFERENCES

[1] Efficient algorithms for answering reverse spatial-keyword nearest neighbor queries. http://www.cs.usc.edu/research/technical-reports-list.htm?#2015.
[2] E. Achtert, H.-P. Kriegel, P. Kröger, M. Renz, and A. Züfle. Reverse k-nearest neighbor search in dynamic and general metric databases. In EDBT, pages 886–897, 2009.
[3] M. A. Cheema, X. Lin, W. Zhang, and Y. Zhang. Influence zone: Efficiently processing reverse k nearest neighbors queries. In ICDE, pages 577–588, 2011.
[4] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. In PVLDB, pages 337–348, 2009.
[5] C. Yang and K. I. Lin. An index structure for efficient reverse nearest neighbor queries. In ICDE, pages 485–492, 2001.
[6] M. H. DeGroot and M. J. Schervish. Probability and Statistics. Pearson Education (US), 2004.
[7] F. Korn and S. Muthukrishnan. Influenced sets based on reverse nearest neighbor queries. In SIGMOD, pages 201–212, 2000.
[8] A. Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD, pages 47–57, 1984.
[9] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In ICDE, pages 656–665, 2008.
[10] I. Stanoi, M. Riedewald, D. Agrawal, and A. E. Abbadi. Discovery of influence sets in frequently updated databases. In VLDB, pages 99–108, 2001.
[11] J. Lu, Y. Lu, and G. Cong. Reverse spatial and textual k nearest neighbor search. In SIGMOD, pages 349–360, 2011.
[12] Y. Lu, J. Lu, G. Cong, W. Wu, and C. Shahabi. Efficient algorithms and cost models for reverse spatial-keyword k-nearest neighbor search. ACM Trans. Database Syst., 39(2):13:1–13:46, May 2014.
[13] B. Martins, M. J. Silva, and L. Andrade. Indexing and ranking in geo-IR systems. In GIR, 2005.
[14] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC, pages 109–126, 1994.
[15] I. Stanoi, D. Agrawal, and A. E. Abbadi. Reverse nearest neighbor queries for dynamic databases. In ACM SIGMOD Workshop on DMKD, pages 44–53, 2000.
[16] Y. Tao, D. Papadias, and X. Lian. Reverse kNN search in arbitrary dimensionality. In VLDB, pages 744–755, 2004.
[17] Y. Tao, M. L. Yiu, and N. Mamoulis. Reverse nearest neighbor search in metric spaces. IEEE TKDE, 18:1239–1252, 2006.
[18] Y. Theodoridis and T. Sellis. A model for the prediction of R-tree performance. In PODS, pages 161–171, 1996.
[19] E. C. Titchmarsh. The Theory of the Riemann Zeta-Function. Oxford University Press, 2005.
[20] R. C.-W. Wong, M. T. Özsu, P. S. Yu, A. W.-C. Fu, and L. Liu. Efficient method for maximizing bichromatic reverse nearest neighbor. In PVLDB, volume 2, August 2009.
[21] W. Wu, F. Yang, C.-Y. Chan, and K.-L. Tan. FINCH: evaluating reverse k-nearest-neighbor queries on location data. In PVLDB, pages 1056–1067, 2008.
[22] X. Xiao, Q. Luo, Z. Li, X. Xie, and W. Ma. A large-scale study on map search logs. ACM Transactions on the Web (TWEB), volume 4, July 2010.

8. PROOF OF CORRECTNESS OF THE BRSKkNN ALGORITHM

THEOREM 2. Given an integer k, a query service object q, a customer index and a service index, Algorithms 1 and 2 correctly return all BRSKkNN customer results.

PROOF.
We prove (1) that Algorithms 1 and 2 return no false positives, i.e., all returned objects are desired answers; and (2) that the returned results are complete (no false negatives).

Correctness: The search strategy in Algorithms 1 and 2 is to report customer entries E_c by comparing the upper bound UCN_{E_c} with k (line 29 in Algorithm 2). Specifically, if UCN_{E_c} < k, then all the customers in the subtree of E_c are reported as results. From the definition of UCN_{E_c} in Definition 13, UCN_{E_c} is calculated by means of UB\Delta((E_c,E_s)-(E_c,q)). According to the property of UB\Delta((E_c,E_s)-(E_c,q)) in Lemma 4, together with the property of UCN_{E_c} in Lemma 2, customer entry E_c can be safely reported as a result entry if UCN_{E_c} < k, under the condition that at most k service objects (among all the service objects) are more relevant than the query service object q.

Completeness: According to Algorithms 1 and 2, each customer object in the database is either reported as a result or pruned. Furthermore, none of the customers pruned by the algorithms are results. In lines 22–23 of Algorithm 2, a customer entry E_c is pruned iff the lower bound contribution number LCN_{E_c} of E_c is equal to or larger than parameter k. We know that LCN_{E_c} is computed by LB\Delta((E_c,E_s)-(E_c,q)) (Definitions 5 and 7). According to the property of LB\Delta((E_c,E_s)-(E_c,q)) in Lemma 3, together with the property of LCN_{E_c} in Lemma 1, if customer entry E_c satisfies this condition, then E_c is not a result and can thus be safely pruned. Hence our algorithm is complete, i.e., it returns all the BRSKkNN customer objects. Therefore, Theorem 2 is true.

9. PROOFS OF PERFORMANCE ANALYSIS

We propose a cost model to analyze the performance of our search algorithms for the MRkNN, BRkNN, MRSKkNN, and BRSKkNN queries, respectively.

9.1 Cost estimation for MRkNN

THEOREM 3. Assume that the locations of N objects are uniformly distributed in 2-dimensional space.
The total I/O cost of the search algorithm based on the R-tree for the MRkNN query without text component is O(\lceil k/f \rceil), where k is the parameter in MRkNN, and f is the fanout of the R-tree.

PROOF. Following the assumption in [18], we assume the locations of the N objects are uniformly distributed in 2-dimensional space. Our goal is to estimate the expected number DA of R-tree leaf node accesses, which dominates the number of internal node accesses, for the search algorithm of MRkNN queries without text information.

[Figure 14: Layout of entries S_i and query object q. S_i denotes the entries on the i-th circuit around q. An arrow from a service entry S_1 to another entry S_2 means that S_1 "dominates" S_2, i.e., all the objects in S_1 are closer to query object q than all the objects in S_2. Grid diagram omitted.]

Let PS be the number of pruned entries in the R-tree. The layout of the R-tree leaf nodes and the query object q is shown in Figure 14. The fanout of the R-tree is f, so the number of leaf nodes shown in Figure 14 is N/f. Let t be the distance between the centers of two consecutive MBRs in Figure 14. Eqn (18) gives the computation of t [18]:

t = \sqrt{f/N}   (18)

As shown in Figure 14, index nodes on circuit i can be "dominated" by index nodes on circuit j, j < i. Thus, given k, our algorithm only needs to visit the \lceil k/f \rceil nodes closest to query object q, since they prune all of their outer nodes. Hence Theorem 3 holds.

9.2 Cost estimation for MRSKkNN

THEOREM 4. Assume that the locations of N objects are uniformly distributed in 2-dimensional space, and the word frequencies in each object follow the Zipf distribution. With high probability, the total I/O cost of the algorithm for the MRSKkNN query is O(\lceil k/f \rceil), where k is the parameter in MRSKkNN, and f is the fanout of the IR-tree.

PROOF.
Following the assumption in [18], we assume the locations of the N objects are uniformly distributed in 2-dimensional space, and that the word frequencies in each service object follow the Zipf distribution. Specifically, we assume that there is a word pool with M distinct words whose frequencies follow the Zipf distribution, i.e., the frequency of the k-th most popular word is \frac{1/k^s}{\sum_{i=1}^{M} 1/i^s}, where k \in [1,M] and s is a parameter characterizing the distribution. We then randomly select m words from the word pool for each object. Our goal is to estimate the expected number DA of IR-tree leaf node accesses, which dominates the number of internal node accesses, by Algorithms 1 and 2 for MRSKkNN queries.

We know that an IR-tree is an R-tree augmented with inverted lists for the textual information of each entry in the R-tree. Hence, considering only the spatial component, the number PS of index nodes pruned in the IR-tree is comparable to that of the RkNN query without text component, whose I/O cost O(\lceil k/f \rceil) is given in Theorem 3.

We next describe how to take the textual information into account. The main idea is to investigate how the textual information affects the number of pruned index nodes.

Claim 4.1: Considering both spatial and textual information, the number P of pruned entries in the IR-tree satisfies

P \ge PS - 8 \cdot \frac{MaxT(S,q) - MinT(S,S')}{t},

where MaxT(S,q) is the maximum textual relevance between an entry S and query object q, and MinT(S,S') is the minimum textual relevance between two entries S and S'.

Recall that in the CD algorithm, we can prune an entry S if and only if:

LB\Delta((S,S') - (S,q)) \ge k
\Rightarrow \alpha \left(1 - \frac{MaxDist(S,S')}{maxD}\right) + (1-\alpha) \frac{MinT(S,S')}{maxR} \ge \alpha \left(1 - \frac{MinDist(S,q)}{maxD}\right) + (1-\alpha) \frac{MaxT(S,q)}{maxR}
\Rightarrow MinDist(S,q) \ge MaxDist(S,S') + \frac{1-\alpha}{\alpha} \cdot \frac{maxD}{maxR} \cdot [MaxT(S,q) - MinT(S,S')]

The last inequality shows that the number of pruned nodes is reduced due to the textual values.
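The rearranged pruning condition can be evaluated directly. The following is a minimal Python sketch with toy values; the function name and parameters are ours and merely mirror the symbols in the derivation, so this is not the paper's actual CD-algorithm code.

```python
# Hypothetical sketch of the rearranged pruning test: entry S is prunable
# by S' when
#   MinDist(S,q) >= MaxDist(S,S')
#                 + ((1-alpha)/alpha) * (maxD/maxR) * (MaxT(S,q) - MinT(S,S')).
def prunable(min_dist_sq, max_dist_ss, max_t_sq, min_t_ss,
             alpha, max_d, max_r):
    # Textual slack: vanishes when alpha = 1 (purely spatial ranking).
    slack = (1 - alpha) / alpha * (max_d / max_r) * (max_t_sq - min_t_ss)
    return min_dist_sq >= max_dist_ss + slack

# Purely spatial case: reduces to MinDist(S,q) >= MaxDist(S,S').
print(prunable(0.5, 0.3, 0.9, 0.1, alpha=1.0, max_d=1.0, max_r=1.0))  # True
# A large textual gap MaxT(S,q) - MinT(S,S') enlarges the slack and
# keeps an otherwise prunable entry alive.
print(prunable(0.5, 0.3, 0.9, 0.1, alpha=0.5, max_d=1.0, max_r=1.0))  # False
```

This matches the intuition behind Claim 4.1: the larger the textual gap, the fewer entries pass the test, so fewer nodes are pruned.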
In particular, consider Figure 15 again. Since we can prove that the gap between two adjacent cells with respect to the values A'_{i,j} is no less than t, and there are at most 8 entries in any cell of Figure 15 (i.e., \forall i,j: B'_{i,j} \le 8), we can derive the adjusted number of pruned entries as shown in Claim 4.1. Hence Claim 4.1 holds.

[Figure 15: Summary of the minimal distances between entries S_{i,j} and query q in Figure 14, and of the numbers of such entries. S_{i,j} denotes the entry at row i and column j; A'_{i,j} is the minimal spatial distance MinDist(S_{i,j},q) between entry S_{i,j} and query q, and B'_{i,j} is the number of entries S_{i,j} such that MinDist(S_{i,j},q) = A'_{i,j}. Table of cell values omitted.]

Claim 4.2: Considering both spatial and textual information, the number P of pruned entries in the IR-tree satisfies:

P \ge PS - \frac{8}{t} \cdot \frac{1-\alpha}{\alpha} \cdot \frac{maxD}{maxR} \cdot \frac{\zeta(s)^{2fm}(2^s - 1) - 2^s + 1 + \frac{1}{m}}{\zeta(s)^{2fm} \sqrt{2^s - 1}},

where \zeta(s) = \lim_{M \to \infty} \sum_{i=1}^{M} 1/i^s is Riemann's zeta function [19], and s is the parameter of the Zipf distribution assumption.

We first estimate the values of MinT(S,S') and MinT(S,q) in Claim 4.1. Recall the assumption about the textual distribution: each object contains m words randomly selected from a word pool with M distinct words following the Zipf distribution, where the frequency of the k-th most popular word w_k is P_k = \frac{1/k^s}{\sum_{i=1}^{M} 1/i^s}. In our algorithm for MRSKkNN queries, MinT(S,S') (resp. MinT(S,q)) is given in [11]. For simplicity, assume that all weights are binary, i.e., 0 or 1.
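The Zipf word model assumed in this analysis is easy to instantiate. A minimal Python sketch of the distribution (the function name is ours; M, s, and the probabilities follow the definition above):

```python
# Zipf word-frequency model used in the cost analysis: the k-th most
# popular word of an M-word pool has probability (1/k^s) / sum_{i=1}^M 1/i^s.
def zipf_probabilities(M, s):
    norm = sum(1.0 / i**s for i in range(1, M + 1))  # truncated zeta sum
    return [(1.0 / k**s) / norm for k in range(1, M + 1)]

probs = zipf_probabilities(M=1000, s=1.5)
print(abs(sum(probs) - 1.0) < 1e-9)    # probabilities sum to 1
print(probs[0] > probs[1] > probs[2])  # frequency decreases with rank
# P_1 / P_2 = 2^s, the power-of-2^s ratio the proof exploits.
print(round(probs[0] / probs[1], 4))   # 2.8284 for s = 1.5
```

Note that the normalizer is exactly the truncated zeta sum \sum_{i=1}^{M} 1/i^s, which approaches \zeta(s) as M grows and approaches 1 as s grows, the two limits invoked at the end of this proof.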
Therefore, the key idea is to estimate the total number of intersection words per object between two entries S and S' (resp. between the objects in entry S and query object q). Let X_n be the random variable representing the number of intersection words of all the n objects. Then the probability that x common words appear in the n objects is no less than that of one special case, in which the word w_1 (the most popular word in the Zipf distribution) is the only common word of all n objects, w_1 appears x times in each of the n objects, the remaining m-x words of n-1 of the objects are also the word w_1, and the remaining m-x words of the last object are the word w_2. Therefore,

Pr(X_n = x) \ge (P_1^x)^n \cdot P_1^{(n-1)(m-x)} \cdot P_2^{m-x} \cdot \binom{n}{1} = \frac{2^{sx} \cdot n}{(\sum_{i=1}^{M} 1/i^s)^{nm} \cdot 2^{sm}}   (19)

Then the expectation E(X_n) of the random variable X_n is:

E(X_n) = \sum_{x=1}^{m} x \cdot Pr(X_n = x) \ge \frac{n}{(\sum_{i=1}^{M} 1/i^s)^{nm}} \cdot \frac{m 2^s - m - 1}{2^s - 1} \ge \frac{n}{\zeta(s)^{nm}} \cdot \frac{m 2^s - m - 1}{2^s - 1}   (20)

When n = 2f, the expected number of intersection words for the 2f objects in entries S and S' is \frac{2f}{\zeta(s)^{2fm}} \cdot \frac{m 2^s - m - 1}{2^s - 1}, and we obtain the expectation E(MinT(S,S')):

E(MinT(S,S')) \le \frac{2f}{\zeta(s)^{2fm}} \cdot \frac{m 2^s - m - 1}{2^s - 1} \cdot \frac{1}{2fm} = \frac{2^s - 1 - \frac{1}{m}}{\zeta(s)^{2fm}(2^s - 1)}   (21)

According to Markov's inequality [6], we have:

Pr\left(1 - MinT(S,S') \ge \frac{\zeta(s)^{2fm}(2^s-1) - 2^s + 1 + \frac{1}{m}}{\zeta(s)^{2fm}\sqrt{2^s-1}}\right) \le \frac{1 - E(MinT(S,S'))}{\frac{\zeta(s)^{2fm}(2^s-1) - 2^s + 1 + \frac{1}{m}}{\zeta(s)^{2fm}\sqrt{2^s-1}}} = \frac{1}{\sqrt{2^s-1}}   (22)

Hence, together with Claim 4.1 and Eqn (22), the number P of pruned entries considering both spatial and textual information can be derived as in Claim 4.2. Thus Claim 4.2 holds.

In particular, if s \to \infty, then the probability 1 - \frac{1}{\sqrt{2^s-1}} \to 1 and \zeta(s) \to 1 [19]. Then, with high probability, the total number of node accesses DA \le N/f - P \le \lceil k/f \rceil = O(\lceil k/f \rceil), which concludes the proof of Theorem 4.

9.3 Cost estimation for BRkNN

THEOREM 5. Assume that the locations of N_s service objects and N_c customer objects are uniformly distributed in 2-dimensional space.
The total I/O cost of Algorithms 1 and 2 based on the R-tree for the BRkNN query without text component is O(\frac{N_c k}{N_s f_c} + \frac{k}{f_s}), where k is the parameter in BRkNN, and f_s (resp. f_c) is the fanout of the service (resp. customer) R-tree.

PROOF. Following the assumption in [18], we assume the locations of the N_s service objects and N_c customer objects are uniformly distributed in 2-dimensional space. Our goal is to estimate the expected number DA of R-tree node accesses by Algorithms 1 and 2.

According to the proof of Theorem 3, considering only spatial information, our algorithm for MRkNN queries needs to access \lceil k/f_s \rceil service nodes, which prune all the service nodes on their outer circuits, as shown in Figure 14. For BRkNN queries, the N_c/f_c customer leaf nodes are distributed in the same space as in Figure 14. In the area covered by the \lceil k/f_s \rceil service nodes of the service R-tree near the query service object q, there are \lceil \frac{N_c k}{N_s f_c} \rceil customer nodes. Hence N_c/f_c - \lceil \frac{N_c k}{N_s f_c} \rceil customer nodes can be pruned by the \lceil k/f_s \rceil service nodes near q. Thus the total number of customer and service nodes to be visited is O(\frac{N_c k}{N_s f_c} + \frac{k}{f_s}), which proves that Theorem 5 is true.

9.4 Cost estimation for BRSKkNN

THEOREM 6. Assume that the locations of N_s service objects and N_c customer objects are uniformly distributed in 2-dimensional space, the average number of words per customer is 1, and the word frequencies in each service object follow the Zipf distribution. With high probability, the total I/O cost of Algorithms 1 and 2 based on the IR-tree for the BRSKkNN query is O(\lceil \frac{N_c k + N_s k}{N_s f_s} \rceil), where k is the parameter in BRSKkNN, and f_s (resp. f_c) is the fanout of the service (resp. customer) IR-tree.

PROOF. Following the assumption in [18], we assume the locations of the N_s service objects and N_c customer objects are uniformly distributed in 2-dimensional space. The average number of words per customer is 1, and the word frequencies in each service object follow the Zipf distribution.
Specifically, we assume that there is a word pool with M distinct words whose frequencies follow the Zipf distribution, i.e., the frequency of the k-th most popular word is \frac{1/k^s}{\sum_{i=1}^{M} 1/i^s}, where k \in [1,M] and s is a parameter characterizing the distribution. We then randomly select m words from the word pool for each service object. Our goal is to estimate the expected number DA of IR-tree node accesses by Algorithms 1 and 2.

Considering only spatial information, the number of pruned index nodes in the IR-tree is N_c/f_c - \lceil \frac{N_c k}{N_s f_s} \rceil, as given in Theorem 5. In the following, we show how to take the textual information into account.

Claim 6.1: Considering both spatial and textual information, the number P_c of pruned entries in the IR-tree satisfies:

P_c \ge PS_c - \frac{8}{t_c} \cdot \frac{1-\alpha}{\alpha} \cdot \frac{maxD}{maxR} \cdot \left(1 - \frac{\sum_{i=1}^{M} 1/i^{s-1}}{\sqrt{\sum_{i=1}^{M} 1/i^s}}\right),

where t_c = \sqrt{f_c/N_c} is the distance between the centers of two consecutive customer leaf MBRs, and s is the parameter of the Zipf distribution.

Claim 4.1, which was proved for MRSKkNN queries, still holds for BRSKkNN queries; i.e., the relationship between the number P_c of entries pruned considering both spatial and textual information and the number PS_c of entries pruned considering only spatial information is

P_c \ge PS_c - 8 \cdot \frac{MaxT(E_c,q) - MinT(E_c,E_s)}{t_c} \ge PS_c - 8 \cdot \frac{1 - MinT(E_c,E_s)}{t_c}.

We then compute the minimum textual relevance MinT(E_c,E_s) between a customer leaf node E_c and a service leaf node E_s. Recall the assumption about the textual distribution: each service object contains m words randomly selected from a word pool with M distinct words following the Zipf distribution (the frequency of the k-th most popular word w_k is P_k = \frac{1/k^s}{\sum_{i=1}^{M} 1/i^s}), and each customer keyword is randomly selected from the word pool. Further, we know that MinT(E_c,E_s) = \frac{\sum_{i=1}^{M} 1/i^{s-1}}{\sum_{i=1}^{M} 1/i^s}.
According to Markov's inequality [6], we have:

Pr\left(1 - MinT(E_c,E_s) \ge \frac{\sum_{i=1}^{M} 1/i^{s-1}}{\sqrt{\sum_{i=1}^{M} 1/i^s}}\right) \le \frac{1 - E(MinT(E_c,E_s))}{\frac{\sum_{i=1}^{M} 1/i^{s-1}}{\sqrt{\sum_{i=1}^{M} 1/i^s}}} = \frac{1}{\sqrt{\sum_{i=1}^{M} 1/i^s}}   (23)

Thus Claim 6.1 holds.

In particular, if s \to \infty, then the probability 1 - \frac{1}{\sqrt{\sum_{i=1}^{M} 1/i^s}} \to 1, the value \sum_{i=1}^{M} 1/i^s \to 1, and thus the number of pruned customer nodes P_c \to N_c/f_c - \lceil \frac{N_c k}{N_s f_s} \rceil. Then, with high probability, the total number of customer node accesses is no more than N_c/f_c - P_c \le \lceil \frac{N_c k}{N_s f_s} \rceil. Adding the number of service node accesses, \lceil k/f_s \rceil, the final number of node accesses is O(\lceil \frac{N_c k + N_s k}{N_s f_s} \rceil), as stated in Theorem 6. Therefore, Theorem 6 holds.
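The closed-form bounds of Theorems 5 and 6 are cheap to evaluate numerically. A back-of-the-envelope Python sketch (the function names and the toy parameter values are ours, purely for illustration):

```python
import math

# Theorem 5 (BRkNN, no text): roughly N_c*k/(N_s*f_c) customer leaf
# accesses plus k/f_s service leaf accesses.
def brknn_leaf_accesses(n_c, n_s, k, f_c, f_s):
    return math.ceil(n_c * k / (n_s * f_c)) + math.ceil(k / f_s)

# Theorem 6 (BRSKkNN): roughly ceil((N_c*k + N_s*k) / (N_s*f_s)) accesses.
def brskknn_leaf_accesses(n_c, n_s, k, f_s):
    return math.ceil((n_c * k + n_s * k) / (n_s * f_s))

# Toy workload: 1M customers, 100K services, k = 8, fanout 100 for both trees.
print(brknn_leaf_accesses(1_000_000, 100_000, 8, 100, 100))  # 2
print(brskknn_leaf_accesses(1_000_000, 100_000, 8, 100))     # 1
```

Both bounds grow linearly in k and in the customer/service ratio N_c/N_s, consistent with the experimental trend of query cost increasing with k and with dataset size.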
Ying Lu, Gao Cong, Jiaheng Lu, and Cyrus Shahabi. "Efficient algorithms for answering reverse spatial-keyword nearest neighbor queries." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 959 (2015).