Location-based spatial queries in mobile environments
LOCATION BASED SPATIAL QUERIES IN MOBILE ENVIRONMENTS

by Wei-Shinn Ku

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2007

Copyright 2007 Wei-Shinn Ku

Dedication

This dissertation is dedicated to my parents, whose love and support accompanied me the entire way.

Acknowledgements

There are numerous people, either at USC or in other institutes, who helped me in many different ways throughout my Ph.D. studies. Without them, the completion of my dissertation would not have been possible.

First and foremost, I would like to thank my advisor, Dr. Roger Zimmermann, who patiently led me through the endless Ph.D. study process. His advice and guidance have been invaluable for the duration of my Ph.D. studies and dissertation writing. With Dr. Zimmermann's mentoring, I have learned to explore research topics, to design efficient solutions, and to write papers for reporting research results, all of which are essential for a successful researcher. In short, I was very fortunate to have had Dr. Zimmermann as my advisor during my five years at USC.

Also, I am very grateful to Dr. Kai Hwang, who supervised my security and privacy related research in the USC Electrical Engineering Department. Through doing research with Dr. Hwang, I gained worthwhile knowledge and research experience. Dr. Hwang also provided me with much help in my job search in academia.

Dr. Cyrus Shahabi served on my defense committee and provided precious suggestions to improve my dissertation. Additional members of my qualifying exam committee were Dr. Dennis McLeod and Dr. Bhaskar Krishnamachari, extremely knowledgeable people in my area of research who provided me with expert opinions. Special thanks to Dr. Jean-Pierre Bardet, who was the PI of our NSF ITR project and provided me with plenty of advice and support.
There are many other people at USC who deserve my heartfelt thanks as well. Beomjoo Seo, Haojun Wang, Kun Fu, Min Qin, Yu Chen, Yuling Hsueh, Sakire Arslan, Chi-Ngai Wan, Fang Liu, and Amir Zand were a few of my close friends, fellow students, and colleagues during my time at USC. Many thanks to the rest of the DMRL (Data Management Research Laboratory) group at USC.

Finally, I want to thank the most important people in my life, my father Juin-Wei Ku and my mother Ping-Hsien Wei. Without their unconditional love, support, and encouragement, I would never have made it this far.

I acknowledge NSF ITR grant CMS-0219463, equipment gifts from Intel and Hewlett-Packard, and the Integrated Media Systems Center, a National Science Foundation Engineering Research Center, Cooperative Agreement No. EEC-9529152.

Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
1 Introduction
  1.1 Overview
  1.2 Characteristics of Location-based Services and Mobile Environments
  1.3 Contributions
  1.4 Organization
2 Related Work
  2.1 Wireless Data Broadcast
  2.2 Common Spatial Query Types
    2.2.1 Nearest Neighbor Queries
    2.2.2 Window Queries
    2.2.3 Skyline Queries
  2.3 Cooperative Caching
    2.3.1 Page-based Caching
    2.3.2 Semantic Caching
  2.4 Peer-to-Peer Data Sharing
  2.5 Location Privacy Preservation
3 Sharing-based Spatial Query Processing for Retrieving Static Information
  3.1 Overview
  3.2 System Architecture
  3.3 Sharing Based Nearest Neighbor Queries
    3.3.1 Nearest Neighbor Verification (NNV)
    3.3.2 Approximate Nearest Neighbor
    3.3.3 Broadcast Channel Data Filtering
  3.4 Sharing Based Window Queries
    3.4.1 Window Query Verification
    3.4.2 Broadcast Channel Data Filtering
    3.4.3 The Relationship Between the Verified Region Size and Query Window Size
  3.5 Evaluation
    3.5.1 Simulator Implementation
    3.5.2 Simulation Parameter Sets
    3.5.3 Experimental Results with the k Nearest Neighbor Query
    3.5.4 Experimental Results with the Window Query
    3.5.5 Experimental Results of the Broadcast Packet Access Rate
  3.6 Summary
4 Sharing-based Spatial Query Processing for Retrieving Dynamic Information
  4.1 Overview
  4.2 System Infrastructure
  4.3 Travel Time Networks
  4.4 Travel Time Network Nearest Neighbor Queries
    4.4.1 Local-based Greedy Nearest Neighbor Queries
    4.4.2 Traffic Event Collection and Distribution of the Traffic Information Server
    4.4.3 Global-based Adaptive Nearest Neighbor Queries
  4.5 Evaluation
    4.5.1 Simulator Implementation
    4.5.2 Implementation of Travel Time Network
    4.5.3 Experiments
  4.6 Summary
5 Privacy Protection for Query Processing in Mobile Environments
  5.1 Overview
  5.2 System Architecture
    5.2.1 The Location Cloaker
    5.2.2 Location-based Service Providers
  5.3 Privacy Protected Query Processing
    5.3.1 Privacy Protected Nearest Neighbor Query on Spatial Networks
    5.3.2 Privacy Protected Range Query on Spatial Networks
  5.4 Evaluation
    5.4.1 Simulator Implementation
    5.4.2 Experiments with Performance Influential Factors
    5.4.3 Experiments with Real World Parameter Sets
  5.5 Summary
6 A Distributed Geotechnical Data Management Architecture
  6.1 GIME Overview
    6.1.1 Geotechnical Web Services Functionalities
    6.1.2 Efficient Query Routing with Spatial Indexing
  6.2 GIME Client Application Example
  6.3 Experimental Validation
  6.4 Summary
7 Conclusions
  7.1 Summary
  7.2 Future Work
    7.2.1 Energy Consumption
    7.2.2 Communication Protocols
    7.2.3 Approximate Spatial Queries in Mobile Environments
Reference List

List Of Tables

3.1 Symbolic notations for Chapter 3.
3.2 The data structure of the heap H.
3.3 Parameters for the simulation environment.
3.4 The simulation parameter sets for Los Angeles County, Riverside County, and the synthetic suburbia.
4.1 Symbols of the GANN algorithm.
4.2 Parameters for the simulation environment.
4.3 The simulation parameter sets for Los Angeles County, Riverside County, and the synthetic suburbia.
5.1 Symbolic notations for Chapter 5.
5.2 The simulation parameter sets.
6.1 The relationship between data updates, MBR updates, and index updates as a function of the number of servers and updates.

List Of Figures

2.1 The data and index organization of the (1, m) indexing scheme with sample tuning time and access latency.
2.2 The Hilbert-curve based index structure. The numbers represent index values.
3.1 System environment.
3.2 An on-air kNN query example.
3.3 Because $e_1$ has the shortest distance to $q$ and $\|q, o_1\| \le \|q, e_1\|$, POI $o_1$ is verified as a valid NN of mobile host $q$.
3.4 Because of some unverified regions, $o_4$ cannot be verified as a top k NN of $q$.
3.5 The correctness probability of the unverified POI $o_4$ can be estimated based on the size of its unverified region.
3.6 A window query on the Hilbert-curve index structure.
3.7 POI $o_1$ and $o_4$ are the query results of this sharing based window query WQ1.
3.8 The access latency of the window query (WQ) can be largely decreased by the cached data.
3.9 The effect of various verified region size and query window size.
3.10 The percentage of resolved queries as a function of the wireless transmission range.
3.11 The percentage of resolved queries as a function of the mobile host cache capacity.
3.12 The percentage of resolved queries as a function of k.
3.13 The percentage of resolved queries as a function of the wireless transmission range.
3.14 The percentage of resolved queries as a function of the mobile host cache capacity.
3.15 The percentage of resolved queries as a function of query window size.
3.16 The packet access comparison between EOASQ and OASQ. We normalized the required packet number of EOASQ to OASQ.
4.1 The system infrastructure.
4.2 A spatial network (left) and its travel time network (right).
4.3 Nearest neighbor search in a spatial network environment with the IER algorithm.
4.4 Nearest neighbor search in a spatial network environment with the INE algorithm.
4.5 An example of traffic time networks in LANN.
4.6 The simulator and its visualization interface.
4.7 The percentage of driving time that is saved by the LANN algorithm as a function of the mobile host transmission range.
4.8 The percentage of driving time that is saved by the LANN and the GANN algorithm as a function of congestions on the pre-computed shortest path.
4.9 The percentage of driving time that is saved by the LANN and the GANN algorithm as a function of detour on the pre-computed shortest path.
5.1 The system architecture.
5.2 Two novel query types.
5.3 The two possible connection statuses of the network segments inside a cloaked area $A_c$.
5.4 Searching k network nearest neighbors with PSNN where the cloaked area contains k objects (k = 1 in this example).
5.5 Searching k network nearest neighbors with PSNN where the cloaked area contains fewer than k objects (k = 2 in this example).
5.6 Network range query with PSRQ.
5.7 The effect of the cloaked region size.
5.8 The effect of k.
5.9 The effect of the POI number.
5.10 The effect of k with real-world parameter sets.
6.1 The photographs illustrate the geotechnical boring activities from drilling until the soil samples are examined. The result is a boring log showing stratigraphy, physical sampling record and SPT blow counts.
6.2 The GIME architecture is composed of multiple, distributed data archives. Some archives are read-only while others allow read and write access. Each archive contains a middleware utilizing replicated spatial index structures.
6.3 A sample Borehole Query & Drafting client application. After a query has been issued, a small pop-up display (the left window) illustrates the metadata of a borehole when the user clicks on one of the dots (indicating a borehole location) on the map. The fence diagram rendering (the right window) is generated from three selected XML data files.
6.4 Performance of two query routing techniques. Fig. 6.4(a) compares the query message traffic between the EQR mechanism and the tree-based designs over a ten hour period. The update message traffic is negligible (synthetic and real-world data sets). Fig. 6.4(b) illustrates that the performance of the tree-based designs remains stable when the number of servers increases from ten to one thousand.

Abstract

Location-dependent queries, such as determining the proximity of points of interest (e.g., hotels, gas stations) to a mobile host, are an important class of inquiries. I propose novel approaches to support spatial queries from mobile hosts with high scalability, short response time, and strong user privacy protection. There are four main sub-topics in this dissertation. The first topic concerns the retrieval of static point-of-interest information (e.g., restaurants, gas stations).
I illustrate how previous query results cached in the local storage of neighboring mobile peers can be leveraged, through the sharing capabilities of wireless ad-hoc networks (e.g., IEEE 802.11b/g), to either compute spatial queries at a local host or improve the query result accuracy. Since users are mobile, the second topic concerns utilizing dynamic information (e.g., real-time traffic information) to improve query accuracy. Security and privacy are important issues for all systems. The third topic is therefore the design of solutions for spatial query privacy protection in mobile environments. The last topic concerns a distributed geotechnical information management architecture as an application of several proposed algorithms. Each topic is described in more detail as follows.

Static Information: In order to improve query efficiency, each mobile host caches data packets downloaded from the data source (e.g., a data broadcast channel or a DB server) in its local memory. Since all data objects in a mobile host's cache originate from database servers, I define the area covered by the cached data as a verified region. For any spatial query, a mobile host can execute both mechanisms: accessing the data source and requesting verified region information from peers. Afterward the mobile host reorganizes the cached data returned from peers and attempts to fulfill the spatial query. This process is termed the verification procedure. Because the access latency is relatively long in broadcast systems, the verification procedure can usually be accomplished before the required data packets arrive, which improves query efficiency.

Dynamic Information: Most nearest neighbor algorithms rely on static distance information to compute queries (e.g., Euclidean distance or spatial network distance). However, the final goal of a user when performing an NN search is often to travel to one of the points of the search result.
In this case, finding the nearest neighbors in terms of travel time is more important than the actual distance. In the existing NN algorithms, dynamic real-time events (e.g., traffic congestion, detours, etc.) are usually not considered and hence the pre-computed nearest neighbor objects may not accurately reflect the shortest travel time. I propose a novel travel time network that integrates both spatial networks and real-time traffic event information. Based on the travel time network, I develop a local-based greedy nearest neighbor algorithm and a global-based adaptive nearest neighbor algorithm that both utilize P2P sharing of real-time information to provide adaptive query search results.

Privacy Protection: With the proliferation of mobile devices, location-based services have become more and more popular in recent years. However, with existing service infrastructures users have to reveal their location information to access location-based services. Adversaries could collect this location information, which in turn invades users' privacy. There are existing solutions for query processing on spatial networks and for mobile user privacy protection in Euclidean space. However, there is no solution for solving queries on spatial networks with privacy protection. Therefore, I provide network distance spatial query solutions which preserve user privacy by utilizing K-anonymity mechanisms.

Geotechnical Information Management Architecture: Web services can help facilitate the exchange and utilization of geotechnical information. Such data is of critical interest to a growing number of municipal, state, and federal agencies as well as private enterprises. However, the lack of service infrastructures among heterogeneous data sources operating under different administrative organizations or agencies hampers the full use of geotechnical information.
I describe a Web-services-based architecture to manage geotechnical data via XML.

Chapter 1 Introduction

1.1 Overview

Spatial Query (SQ) processing, which includes several different query types (e.g., nearest neighbor queries, window queries, spatial join queries), is one of the principal problems in database research and the main topic of this dissertation. Spatial queries are extensively utilized in spatial databases, multimedia databases, geographical information systems, and many related applications.

With the advance of wireless communication technologies, location-based spatial queries have become popular, where a user with a location-aware mobile device launches spatial queries with respect to his/her current position (e.g., the user wants to search for the closest gas station as he/she moves along). The traditional solution is to forward the query to a centralized database server, where the query is processed and the result is returned to the user via wireless networks. However, due to the mobility of the user, the result may be invalidated before it is returned, and to avoid this the user has to pose new queries with high frequency. Consequently, the situation could lead to high network overhead and extra processing effort at the centralized database server. As we can observe, the conventional approach falls short of accuracy and scalability in mobile environments. We noticed that the wireless data broadcast model could be a promising solution; however, its main limitation is its high access latency. In Chapter 3, I present a novel query processing technique that, while maintaining high scalability and accuracy, manages to reduce the latency considerably in answering location-based spatial queries. Our approach is based on peer-to-peer sharing, which enables us to process queries without delay at a mobile host by using query results cached in its neighboring mobile peers.
Furthermore, nearest neighbor queries are of significant interest for applications that work with spatial data. A sample query could be to "find the nearest gas station from my current location." Previous work [RKV95, HS99] has resulted in efficient techniques to compute NN queries in Euclidean space. More recently, novel algorithms [SKS02, PZMT03] have been proposed to compute nearest neighbor queries in spatial networks. These methods extend nearest neighbor queries by considering the spatial network distance, which provides a more realistic measure for applications where objects are constrained in their movements. However, these existing techniques only consider static models of spatial networks: pre-defined road segments with fixed road conditions are used in computing nearest neighbors. Thus, any real-time events (e.g., detours, traffic congestion, etc.) affecting the spatial network cannot be reflected in the query result. For example, a traffic jam occurring on the route to the computed nearest neighbor most likely lengthens the total driving time. More drastically, the closure of a restaurant which was found as the nearest neighbor might even invalidate a query result. This motivates the need for new algorithms which extend existing nearest neighbor query techniques by integrating real-time event information; the details are presented in Chapter 4.

The convenience of location-based services has made them more and more popular in recent years. However, with existing service infrastructures mobile users have to reveal their location information to access location-based services. Adversaries could collect this location information, which in turn invades users' privacy. There are existing solutions for query processing on spatial networks and for mobile user privacy protection in Euclidean space. However, there is no solution for solving queries on spatial networks with privacy protection.
Therefore, we aim to provide network distance spatial query solutions which preserve user privacy by utilizing K-anonymity mechanisms [Swe02]. In this dissertation, I present two novel query algorithms, PSNN (Section 5.3.1) and PSRQ (Section 5.3.2), for answering nearest neighbor queries and range queries on spatial networks without revealing private information of the query initiator.

In addition to novel spatial query algorithms, I also present the practical use of Web services applied to a distributed architecture aimed at facilitating the exchange and utilization of geotechnical information and spatial data. Such data is of critical interest to a large number of municipal, state, and federal agencies as well as private enterprises involved with civil infrastructures. The utilization of geotechnical information is currently hampered by a lack of service infrastructure among the heterogeneous data sources operated under different administrative control. We describe a Web services based infrastructure to manage geotechnical data via XML as the common data format in Chapter 6.

1.2 Characteristics of Location-based Services and Mobile Environments

Location-Based Services (LBS) can be generally defined as services that integrate the location of mobile devices with other information so as to provide added value to mobile device users [Spi04]. For example, a motorist can find his/her closest gas station through LBS when he/she drives on a highway. LBS have been well developed during the past decade and many of them are very popular in our daily lives (e.g., navigation services, friend finding services [EA04], etc.). However, these prevalent location-based services could be a potential threat to user privacy. Consequently, both location-based service providers and mobile users should be careful and sensitive regarding the way location related information is handled. In addition, governments (both the U.S.
and EU) have also enacted regulations on the usage of personal location information [Spi04].

In a mobile environment, a mobile user with a mobile device (e.g., PDAs, cell phones, etc.) can access various information via wireless communication. There are several unique features of data management in mobile environments, summarized as follows [Bar99].

• Limited computational capacity: Compared with regular computers, mobile devices have limited CPU power and memory space. In addition, the battery-based power supply also limits their computational capacity. These constraints bring about challenges in designing efficient algorithms at the mobile user side.

• Unrestricted user mobility: Mobile users equipped with a wireless connection to information networks can move at will when communicating with servers and peers. From a data management perspective, user mobility leads to volatile data and query results, and it also influences the efficiency of algorithms for mobile computing.

• Communication models: For mobile devices, the energy required for message sending is more than that required for message receiving [JK96]. Furthermore, the energy consumed by a mobile computer in its active mode is more than that consumed in its sleep mode [JK96]. In view of this, a mobile device should remain in sleep mode as long as possible to save power.

• Constrained energy capacity: Mobile devices run on batteries without connecting to any permanent power supply. Sheng et al. estimated that only a modest improvement (around 20%) in battery capacity could happen over the next few years [SCB]. Consequently, an important algorithm design issue for mobile devices is energy conservation.

This research addresses the first three characteristics of mobile computing.

1.3 Contributions

The main contribution of this dissertation is a group of algorithms for processing spatial queries in mobile environments.
The innovations of this research are the sharing-based solutions for decreasing access latency of broadcast systems, dynamic information sharing mechanisms for improving query accuracy, privacy protection techniques for query processing in mobile environments, and a distributed geotechnical data management architecture. The itemized contributions are as follows.

• Identification of certain characteristics of location-based spatial queries that enable the development of effective sharing methods in broadcast environments.

• Introduction of a set of algorithms that verify whether data received from neighboring clients are complete, partial, or irrelevant answers to the posed query [KZW07].

• Utilization of a peer-to-peer (P2P) based sharing method to improve the current approaches in answering on-air k nearest neighbor queries and window queries.

• Introduction of the concept of travel time networks (TTN). A travel time network integrates real-time traffic information with the static data of a spatial network. The length of each edge in the travel time network represents the driving time of the corresponding road segment [KZWW05].

• Proposition of two adaptive nearest neighbor query algorithms (local and global) which incrementally select the road segments to the nearest neighbor based on the most current instantiation of the travel time network [KZWW05].

• Design of two novel query algorithms for answering nearest neighbor queries and range queries on spatial networks without revealing private information of the query initiator.

• Proposition of a Web services based infrastructure to manage geotechnical data via XML as the common data format [ZKW+06].

• Implementation and evaluation of the performance of the proposed algorithms and frameworks.

1.4 Organization

The remainder of this dissertation is organized as follows. In Chapter 2, I review the related work for this research.
Chapter 3 illustrates the algorithms for retrieving static information in mobile environments. The real-time information retrieval and utilization algorithms are discussed in Chapter 4. Chapter 5 describes the privacy protection techniques for processing spatial queries in mobile environments. Chapter 6 demonstrates the architecture for managing geotechnical information. Finally, Chapter 7 concludes this dissertation and presents directions for future research.

Chapter 2 Related Work

I summarize the related work on wireless data broadcast, common spatial query types, cooperative caching, peer-to-peer data sharing, and privacy issues in mobile environments.

2.1 Wireless Data Broadcast

Generally speaking, there are two approaches for mobile data access. One is the on-demand access model and the other is the wireless broadcast model. For the on-demand access model, point-to-point connections are established between the server and the mobile clients, and the server processes queries which the clients submit on demand. For the wireless broadcast model, the server repeatedly broadcasts all the information in wireless channels and the clients are responsible for filtering the information. The advantage of the broadcast model over the on-demand model is that it is a scalable approach. However, the broadcast model has large latency, as clients have to wait for the information they need in a broadcasting cycle. If a client misses the packets which it needs, it has to wait for the next broadcast cycle.

Figure 2.1: The data and index organization of the (1, m) indexing scheme with sample tuning time and access latency.

To facilitate information retrieval on wireless broadcast channels, the server usually broadcasts an index structure along with data objects.
A well-known broadcast index structure is the (1, m) indexing allocation method [IVB97]. As we can see from Figure 2.1, the whole index is broadcast preceding every 1/m fraction of the data file. Because the index is available m times in one cycle, a mobile client has easy access to the index and can predict the arrival time of its desired data in a timely manner. Once it knows the arrival time, it only needs to tune into the broadcast channel when the data bucket arrives. Thus, the general access protocol for retrieving data on a wireless broadcast channel involves three main steps [IVB97]:

• The initial probe: A client tunes into the broadcast channel and determines when the next index segment will be broadcast.

• Index search: The client accesses a sequence of pointers in the index segment to figure out when to tune into the broadcast channel to get the required data.

• Data retrieval: The client tunes to the channel when packets containing the required data arrive and then downloads all the required data.

Two parameters, access latency and tuning time, characterize the broadcast model. The access latency is the time from the point at which a client requests its data to the point at which the desired data is received. The tuning time is the amount of time spent by a client listening to the broadcast channel, which is proportional to the power consumption of the client [IVB97].

However, nearly all the existing spatial access methods are designed for databases with random access disks. These existing techniques cannot be used in a wireless broadcast environment, where only sequential data access is supported. Zheng et al. [ZLL04, ZLL03] proposed to index the spatial data on the server by a space-filling curve. The Hilbert curve [Jag97] is chosen for this purpose because of its superior locality. The index values of the data packets represent the order in which these data packets are broadcast.
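To make the idea of a space-filling-curve broadcast order concrete, the following is a minimal sketch of the standard iterative mapping from a grid cell to its Hilbert index. It is an illustration only, not the on-air index of [ZLL04, ZLL03]: the function names `hilbert_index` and `broadcast_order`, the power-of-two grid, and the integer cell coordinates are all assumptions made for this example.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the Hilbert curve. Hypothetical helper for illustration."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each quadrant contributes s*s cells, visited in the order 0, 1, 2, 3.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def broadcast_order(points, n):
    """Sort grid points by Hilbert index, i.e. the assumed broadcast sequence."""
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
```

Under these assumptions, consecutive index values always belong to neighboring grid cells, which is the locality property that motivates the choice of the Hilbert curve.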
For example, the Hilbert curve in Figure 2.2 tries to group data of close values so that they can be accessed within a short interval when they are broadcast sequentially. The mobile hosts use on-air search algorithms [ZLL04, ZLL03] to answer location-based spatial queries (k nearest neighbor and window queries) over data that arrives in the order prescribed by the Hilbert curve.

Figure 2.2: The Hilbert-curve based index structure. The numbers represent index values.

Current research on data broadcast can be divided into three categories [LC00]:

• Determining the data for broadcasting: The volume of broadcast data influences the access latency, so a small set of broadcast data items is preferred. Usually, only the most frequently accessed data items are broadcast. The main problem is how to dynamically broadcast the most frequently accessed data. A common technique is to drop a data item from the broadcast channel and re-estimate its access frequency from the on-demand data requests to make sure that the dropped data item is indeed no longer frequently accessed.

• Scheduling the data for broadcasting: The data items to be broadcast may have different access frequencies. Given the data access frequencies, the problem is how to arrange the broadcast data to minimize the average response time for data requests.

• Indexing the broadcast data: Building indexes on the broadcast data helps users decide whether their desired data are in the broadcast channel and when they become available. While waiting for the data, mobile devices can be switched into a power saving mode. The problem is how to mix the index with the data items such that the response time and the number of index accesses on the broadcast channel are minimized.

2.2 Common Spatial Query Types

There are several common spatial query types. In this section, we only cover the ones which are related to this research.
2.2.1 Nearest Neighbor Queries

During the last two decades, numerous algorithms for k nearest neighbor queries have been proposed. In this section I roughly divide these solutions into three groups: regular k nearest neighbor queries, continuous k nearest neighbor queries, and spatial network nearest neighbor queries.

Regular Nearest Neighbor Queries. A k nearest neighbor (kNN) query retrieves the k (k ≥ 1) data objects closest to a query point q. The R-tree [Gut84] and its derivatives [SRF87, BKSS90] have been prevalent methods to index spatial data and increase query performance. To find nearest neighbors, branch-and-bound algorithms have been designed that search an R-tree in a depth-first manner [RKV95] or a best-first manner [HS99]. The NN search algorithm proposed in [HS99] is optimal, i.e., it visits only the nodes necessary for obtaining the nearest neighbors, and incremental, i.e., it reports neighbors in ascending order of their distance to the query point. Both algorithms can be easily extended for the retrieval of k nearest neighbors.

Continuous Nearest Neighbor Queries. The NN algorithms discussed in the previous paragraph are mainly designed for searching stationary objects. With the emergence of mobile devices, attention has focused on the problem of continuously finding k nearest neighbors for moving query points. A naive approach might be to continuously issue kNN queries along the route of a moving object. This solution results in repeated server accesses and nearest neighbor computations and is therefore inefficient. Sistla et al. [SWCD97] were the first to point out the importance of continuous nearest neighbor queries and to describe modelling methods and related query languages; however, they did not discuss processing methods. Song et al. [SR01] proposed the first algorithm for continuous NN queries. Their approach is based on performing several point NN queries at predefined sample points. Saltenis et al. [SJLL00] proposed the time parameterized R-tree, an index structure for moving objects, to address continuous kNN queries for moving objects. Tao et al. [TPS02] presented a solution for continuous NN queries that performs a single query for the entire route based on the time parameterized R-tree. The main shortcoming of this solution is that it is designed for Euclidean spaces and users have to submit predefined trajectories to the database server.

Spatial Network Nearest Neighbor Queries. Initially, nearest neighbor searches were based on the Euclidean distance between the query object and the sites of interest. However, in many applications objects cannot move freely in space but are constrained by a network (e.g., cars on roads, trains on tracks). Therefore, in a realistic environment the nearest neighbor computation must be based on the spatial network distance, which is more expensive to compute. A number of techniques have been proposed to manage the complexity of this problem [PZMT03, KS04].

2.2.2 Window Queries

Window queries find objects within a given window. The R-tree family [SRF87, BKSS90] and the Quadtree [AA01] provide efficient access methods for answering window queries in disk-based databases. Basically, a tree index structure groups objects close to each other into a minimum bounding rectangle (MBR), and a window query only visits the MBRs that overlap with the query window.

2.2.3 Skyline Queries

The skyline of a d-dimensional dataset contains the points which are not dominated by any other data object on all dimensions. Given a set of objects $p_1, p_2, \ldots, p_N$, the operator returns all objects $p_i$ such that $p_i$ is not dominated by another object $p_j$. Skyline computation has recently received considerable attention in the database community, especially for progressive methods that can quickly return the initial results without reading the entire database [PTFS05].
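The dominance test behind this definition can be illustrated with a naive quadratic scan. This is only a sketch: it assumes that smaller values are preferable in every dimension, the names `dominates` and `skyline` and the sample hotel data are made up for the example, and it is not one of the progressive algorithms cited in this section.

```python
def dominates(p, q):
    """Return True if p dominates q.

    Assumption: smaller is better in every dimension. p dominates q when it is
    no worse in all dimensions and strictly better in at least one.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points):
    """Return the points that are not dominated by any other point (O(N^2))."""
    return [p for p in points if not any(dominates(o, p) for o in points)]


if __name__ == "__main__":
    # Hypothetical hotels as (price, distance to the beach).
    hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
    print(skyline(hotels))  # [(50, 8), (60, 3), (80, 1)]
```

In this data set (70, 5) is dominated by (60, 3) and (90, 2) is dominated by (80, 1), so neither belongs to the skyline.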
The first work addressing skyline queries in the context of databases was by Borzsonyi et al. [BKS01]. Kossmann et al. [KRR02] presented an algorithm called NN due to its reliance on nearest neighbor search, which applies the divide-and-conquer framework on data sets indexed by R-trees. The NN algorithm outperforms previous skyline algorithms in terms of overall performance. However, NN also has some shortcomings such as the need for duplicate elimination, multiple node visits, and large space requirements. Consequently, Papadias et al. proposed an improved progressive algorithm called branch and bound skyline (BBS) [PTFS03], which (like NN) is based on nearest neighbor search on multidimensional access methods, but (unlike NN) is optimal in terms of node accesses. The experimental evaluation of BBS in [PTFS05] showed that BBS outperforms NN for all problem instances, while incurring less space overhead.

2.3 Cooperative Caching

Caching is a key technique to improve data retrieval performance in widely distributed environments. With the increasing deployment of new peer-to-peer wireless communication technologies (e.g., IEEE 802.11b/g and Bluetooth), peer-to-peer cooperative caching becomes an effective sharing alternative [CLC04]. With this technique, mobile hosts communicate with neighboring peers in an ad-hoc manner for information sharing, instead of relying solely on communication with remote information sources. Peer-to-peer cooperative caching can bring several distinctive benefits to a mobile system: improved access latency, reduced server workload, and less point-to-point channel congestion. Two common caching models, page caching and semantic caching, have been studied in mobile environments and they are summarized as follows.

2.3.1 Page-based Caching

The page-based caching model is widely used in RDBMS. The cached items are typically disk pages or tuples, which can be looked up by their identifiers. The mobile caching model in on-demand environments was first studied in [HSW94], where a dynamic client data replication mechanism was proposed. Chan et al. [CSL01] found that page-based caching is not suitable in mobile environments due to the lack of locality among data objects. They proposed three different levels of caching granularity (attribute caching, object caching, and hybrid caching) for utilizing spatial and temporal localities of client queries. Since, unlike semantic caching, no query-related information is stored, page caching can only support new queries that specify object identifiers.

2.3.2 Semantic Caching

The idea of semantic caching is that the client maintains in the cache both the semantic descriptions and the results of previous queries. If a new query is totally answerable from the cache, no communication with the server is necessary. If the query can only be partially answered, the original query is trimmed and the trimmed part is sent to the server and processed there. By processing queries in this way, the amount of data transferred over the network can be substantially reduced [RDK03, LLZX06].

2.4 Peer-to-Peer Data Sharing

Recently, considerable attention has focused on an emerging class of large scale distributed data management systems classified as peer-to-peer (P2P) systems. Some of the key characteristics of P2P systems are their very dynamic topology, their heterogeneity, and their self-organization. The query processing and routing approaches of some of the initial P2P systems focused on either a centralized index server (e.g., Napster, http://www.napster.com/) or a flooding mechanism (e.g., Gnutella, http://www.gnutella.com/). These techniques are considered either not very scalable or inefficient. More recently, distributed hash tables (DHT) have been proposed to achieve massive scalability and efficient query forwarding. The most prominent representatives are Chord [SMK+01, SMLN+03] and CAN [RFH+01].
DHTs provide a mechanism to perform object location within a potentially very large overlay network of nodes connected to the Internet. The distributed hash mechanisms work by transforming a key value into a number that is then mapped to a node whose identifier is numerically closest to the key. Key location is very efficient and the expected number of routing steps is O(log N), where N is the number of nodes in the system. An important property of a good hash function (e.g., SHA-1) is a close to uniformly random distribution of keys to numeric values. Hence, key values that are close to each other (or otherwise related) can be arbitrarily far apart in the generated index space. This property makes standard DHTs unsuitable for range queries, which are very common with spatial data. Some approaches that preserve spatial locality information while keeping some of the load balancing properties of DHT-based systems are discussed in [WZK05, ZKW04].

2.5 Location Privacy Preservation

With the popularity of location-based services, privacy protection for mobile users has become an important issue [GL04, SHG03]. Gruteser et al. [GG03] proposed a middleware architecture and algorithms for maintaining location K-anonymity. Their algorithms adjust the resolution of location information along spatial or temporal dimensions to fulfill the required anonymity constraints. Based on the work in [GG03], a unified privacy personalization framework is proposed in [GL05] to support different levels of anonymity according to the requests of users. However, these previous research approaches mainly focused on the system architecture or the design of the location anonymizing mechanism rather than on query processing. Mokbel proposed to employ a trusted third party, the location anonymizer, which expands the user location into a spatial region for protecting user privacy [Mok06]. Mokbel et al. also proposed related privacy-aware query processing algorithms [MCA06].
However, their query processing solutions are based on Euclidean metrics. In real life, mobile users cannot move freely in space but are usually constrained by underlying networks (e.g., cars on roads, trains on tracks, etc.). Therefore, we need solutions for processing privacy protected queries on spatial networks.

Chapter 3

Sharing-based Spatial Query Processing for Retrieving Static Information

In this chapter, I describe our approach for supporting LBSQs of static data objects in a wireless broadcast environment. The fundamental idea behind my methodology is to leverage the cached results from prior spatial queries at reachable mobile hosts for answering future queries at the local host.

3.1 Overview

The wireless data broadcast model has good scalability for supporting an almost unlimited number of clients [IVB97]. Its main limitation lies in its sequential data access; the access latency becomes longer as the number of data items increases. If we can give approximate, but highly accurate answers to spatial queries before the arrival of the related data packets, we will overcome the limitation of the broadcast model. A novel component in our methodology is a verification algorithm that verifies whether a data item from neighboring peers is part of the solution set to a spatial query. Even if the verified results constitute only part of the solution set, in which case the query client needs to wait for the required data packets to get the remaining answers, the partial answer can be utilized by many applications that do not need exact solutions but require a short response time (for example, the query “What are the top three nearest hospitals?” issued by a motorist on a highway).

In this chapter I detail how k nearest neighbor queries and window queries can be processed by cooperating mobile hosts to improve the performance of on air spatial queries. I apply the spatial query algorithms proposed in [ZLL04] to illustrate our techniques.
However, our sharing based solution can be applied as a general method to any broadcast system. In Section 3.2, I introduce the infrastructure that I assume for this work. The method for verifying query results from neighboring peers for processing kNN queries is discussed in Section 3.3. This section also explains how to use partial peer query results to provide approximate NN results and decrease the required data packets from the broadcast channel. Section 3.4 extends our algorithms for solving window queries.

3.2 System Architecture

Figure 3.1 depicts our operating environment with two main entities: a remote wireless information server and mobile hosts. We are considering mobile clients, such as vehicles, that are instrumented with a global positioning system (GPS) for continuous position information. Furthermore, we assume that the wireless information server broadcasts information in a wireless channel periodically and the channel is open to the public. In addition, there are short-range networks that allow ad-hoc connections with neighboring mobile clients. Technologies that enable ad-hoc wide band communication include, for example, IEEE 802.11b/g. Benefiting from the power capacities of vehicles, we assume that each mobile host has a significant transmission range and virtually unlimited power lifetime. The architecture also supports hand-held mobile devices.

In Figure 3.1, when a mobile host p issues a spatial query, it tunes into the broadcast channel and waits for the data. In the meantime, p can collect cached spatial data from peers to harvest existing results in order to complete its own spatial query. Because memory space is scarce in mobile devices, we assume that each mobile host p caches a set of POIs in an MBR related to its current location. Since the POIs located inside the MBR are from the wireless information server, we define the area bounded by the MBR as a verified region, $p.VR$, with regard to p's location.
Figure 3.1: System environment.

3.3 Sharing Based Nearest Neighbor Queries

Figure 3.2 shows an example of an on air kNN query based on a Hilbert curve index structure [ZLL04]. At first, by scanning the on air index, the k-th nearest object to the query point is found and a minimal circle centered at q and containing all those k objects is constructed. The MBR of that circle, enclosing at least k objects, serves as the search range. Consequently, q has to receive the data packets that cover the MBR from the broadcast channel for retrieving its k nearest objects. As shown in Figure 3.2, the related packets span a long segment in the index sequence, between 5 and 58, which will require a long retrieval time. The other problem of this search algorithm is that the indexing information has to be replicated in the broadcast program to enable two scans. The first scan is for deciding the kNN search range and the second scan is for retrieving k objects based on the search range [ZLL04].

Figure 3.2: An on air kNN query example.

Therefore, we propose a sharing based nearest neighbor (SBNN) query approach to improve the current on air kNN query algorithm. The SBNN algorithm attempts to verify the validity of k objects by processing results obtained from several peers. Table 3.1 summarizes the symbolic notations used throughout this chapter.
Symbol          Meaning
$P$             The set of all the peers that respond to the query issued by q
$p.O$           The cached POI set of a mobile host p, where $p \in P$
$p.VR$          The verified region of a mobile host p
$M_{VR}$        The merged verified region
$e_s$           The edge of $M_{VR}$ which has the shortest distance to q
$o_i$           A nearest neighbor element in $p.O$
$H$             A heap for storing SBNN query results. Its verified and unverified elements are denoted H.verified and H.unverified, respectively.
$O$             The set of all the received POIs from peers
$q$             A query mobile user
$|A|$           The cardinality of set A
$\|a, b\|$      The Euclidean distance between objects a and b
$Dist(a, b)$    The network distance between objects a and b

Table 3.1: Symbolic notations for Chapter 3.

3.3.1 Nearest Neighbor Verification (NNV)

When a mobile host q executes SBNN, it first broadcasts a request to all its single-hop peers for their cached spatial data. Each peer that receives the request returns the verified region MBR and the cached points of interest to q. Then, q combines the verified regions of all the responding peers, each bounded by its MBR, into a merged verified region $M_{VR}$ (the polygon in Figure 3.3). The merging process is carried out by the MapOverlay algorithm [dBvKOS00] (line 4 of Algorithm 1). The core of SBNN is the nearest neighbor verification (NNV) method, whose objective is to verify whether a POI $o_i$ obtained from peers is a valid (i.e., top k) nearest neighbor of the mobile host q.

Figure 3.3: Because $e_1$ has the shortest distance to q and $\|q, o_1\| \le \|q, e_1\|$, POI $o_1$ is verified as a valid NN of mobile host q.

Let P denote the data collected by q from j peers $p_1, \ldots, p_j$. Consequently, the merged verified region $M_{VR}$ can be represented as:

$$M_{VR} = p_1.VR \cup p_2.VR \cup \cdots \cup p_j.VR.$$
Suppose the boundary of $M_{VR}$ consists of k edges, $E = \{e_1, e_2, \ldots, e_k\}$, and there are l points of interest, $O = \{o_1, o_2, \ldots, o_l\}$, inside $M_{VR}$. Let $e_s \in E$ be the edge that has the shortest distance to q. An example is given by Figure 3.3, where k = 10 and $e_1$ has the shortest distance to q.

Lemma 3.3.1 Let $O' = \{o'_1, o'_2, \ldots, o'_v\}$ be a set of POIs each of which is closer to q than $e_s$. Then, $o'_1, o'_2, \ldots, o'_v$ are the top v nearest neighbors of q.

Proof: Assume $o_m$ is one of the top v nearest neighbors of q, but $o_m \notin O'$. Then, $\|q, o_m\| < \|q, o'_v\|$ and $\|q, o_m\| < \|q, e_s\|$. Since $\|q, o_m\| < \|q, e_s\|$, $o_m$ must be inside $M_{VR}$ and $o_m \in O$. Based on the definition of $O'$, $o_m$ must be a member of $O'$. However, this contradicts the assumption that $o_m \notin O'$. Therefore, $O'$ must cover the top v nearest neighbors of q.

In Figure 3.3, according to Lemma 3.3.1, the POI $o_1$ can be verified as the nearest neighbor of q and is termed a verified nearest neighbor, because the Euclidean distance between $o_1$ and q is no greater than the Euclidean distance between $e_1$ and q. Figure 3.4 demonstrates a counter example. Since we are not sure if there is any POI within the unverified regions, $o_4$ cannot be verified as a top kNN of q.

Figure 3.4: Because of some unverified regions, $o_4$ cannot be verified as a top k NN of q.

The NNV method uses a heap H to maintain the entries of verified and unverified points of interest discovered so far (Table 3.2). Initially H is empty. The NNV method adds POIs to H as it verifies objects from mobile hosts in the vicinity of q. The heap H maintains the POIs in ascending order of their Euclidean distances to q. Unverified objects are kept in H only if the number of verified objects is lower than requested by the query. The nearest neighbor verification method is formalized in Algorithm 1.

POI      verified?   distance to q [miles]   correctness probability   surpassing ratio (r'/r)
$o_1$    yes         2                       -                         -
$o_5$    yes         3                       -                         -
$o_4$    no          5                       55%                       1.67
$o_3$    no          6                       40%                       2.0

Table 3.2: The data structure of the heap H.

Algorithm 1 NNV(q, H, k)
1: $P \leftarrow$ peer nodes responding to the query request issued from q
2: $M_{VR} \leftarrow \emptyset$
3: for all $p \in P$ do
4:   $M_{VR} \leftarrow M_{VR} \cup p.VR$ and $O \leftarrow O \cup p.O$
5: end for
6: Sort all $o_i \in O$ in ascending order of $\|q, o_i\|$
7: Compute $\|q, e_s\|$ where edge $e_s$ has the shortest distance to q among all the edges of $M_{VR}$
8: $i = 1$
9: while $|H| < k$ and $i \le |O|$ do
10:   if $\|q, o_i\| \le \|q, e_s\|$ then
11:     H.verified $\leftarrow$ H.verified $\cup \{o_i\}$
12:   else
13:     H.unverified $\leftarrow$ H.unverified $\cup \{o_i\}$
14:   end if
15:   $i \leftarrow i + 1$
16: end while
17: return H

If k elements in H are all verified by NNV, the kNN query is fulfilled. There will be cases where the NNV method cannot fulfill a kNN query. In that case, a set which contains unverified elements is returned. If the response time is critical, a user may agree to accept a kNN data set with unverified elements, where the objects are not guaranteed to be the top k nearest neighbors. However, the correctness of these approximate results can be estimated, as discussed in Section 3.3.2. If the result quality is the most important concern, the client has to wait until it receives all the required data packets from the broadcast channel. Nevertheless, the partial results in H can be used to decrease the required data packets and thus speed up the on air data collection (Section 3.3.3).

3.3.2 Approximate Nearest Neighbor

We calculate the probability that an unverified i-th nearest neighbor o of a query point q is actually the true i-th nearest neighbor of q. The POI o cannot be verified because there is a region which is not covered by q's neighboring peers. If a POI exists in that region, then o cannot be q's i-th nearest neighbor. We denote such a region as o's unverified region. Figure 3.5 shows an example. POI $o_4$ is the unverified 3rd nearest neighbor of q because there is a possibility that another POI may exist in the shaded unverified region.

We assume the POIs are Poisson distributed in the environment. The probability of finding another POI in the unverified region $U_i$ of an unverified POI $o_i$ can be calculated according to the area of $U_i$. We formulate the correctness of an unverified POI based on probability models as Lemma 3.3.2.

Lemma 3.3.2 Assume the POIs in an area E are Poisson distributed with an average of $\lambda$ POIs per square unit. Let q be a query mobile host which has retrieved x verified and y unverified NN from $M_{VR}$ for a kNN query. If the unverified region $U_j$ of an unverified POI $o_j$ of q covers an area of u square units, then the probability that $o_j$ is the j-th NN of q is $e^{-\lambda u}$.

Proof: Let $\|q, o_j\| = r'$ and let the circle C be defined by center point q with radius $r'$. According to the definition of the Poisson distribution, we have:

$$P\{N(t + s) - N(s) = n\} = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}, \quad n = 0, 1, \ldots \tag{3.1}$$

With the memoryless property of the Poisson distribution, we can assume t stands for the unverified region $U_j$ within C and s stands for the verified region within C. $N(t)$ represents the total number of POIs that are located inside $U_j$. Since we know the area of the unverified region of $o_j$ is u square units, the probability of no POI in u square units is $e^{-\lambda u}$.

Figure 3.5 shows an example. Suppose the average number of POIs per square unit (the value of $\lambda$) is 0.3 and the unverified region of $o_4$ covers 2 square units. The probability that $o_4$ is the true third nearest POI of q is then $e^{-0.6} \approx 0.5488$, i.e., about 55%.

In addition, the relationship between the inner circle $c_i$ (with radius $r$) and the outer circle $c_o$ (with radius $r'$) is also a useful metric, as demonstrated in Figure 3.5. We name the metric the surpassing ratio and it stands for the ratio of $r'$ to $r$ based on the Euclidean distance.
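A minimal sketch of the verification test and of the probability in Lemma 3.3.2 follows. It assumes, purely for brevity, that the merged verified region is a single axis-aligned rectangle containing q, so the distance to the nearest boundary edge is a simple minimum; the thesis merges several peer MBRs into a polygon with MapOverlay, which is not reproduced here. The function names and sample coordinates are hypothetical.

```python
import math


def nnv(q, pois, region, k):
    """Split the k closest cached POIs into verified and unverified lists.

    Assumption: `region` is one rectangle (xmin, ymin, xmax, ymax) that
    contains q and all POIs, standing in for the merged verified region.
    """
    xmin, ymin, xmax, ymax = region
    # Distance from q to the closest boundary edge (the role of e_s).
    d_edge = min(q[0] - xmin, xmax - q[0], q[1] - ymin, ymax - q[1])
    verified, unverified = [], []
    for o in sorted(pois, key=lambda o: math.dist(q, o)):
        if len(verified) + len(unverified) >= k:
            break
        if math.dist(q, o) <= d_edge:
            verified.append(o)      # the circle through o lies inside the region
        else:
            unverified.append(o)    # part of the circle is not covered
    return verified, unverified


def correctness_probability(lam, u):
    """Probability that no POI lies in an unverified area of u square units,
    for a Poisson density of lam POIs per square unit (Lemma 3.3.2)."""
    return math.exp(-lam * u)


if __name__ == "__main__":
    q = (5.0, 5.0)
    pois = [(5.0, 6.0), (7.0, 5.0), (9.0, 5.0), (1.0, 1.0)]
    print(nnv(q, pois, region=(0.0, 0.0, 10.0, 8.0), k=3))
    print(round(correctness_probability(0.3, 2.0), 4))  # 0.5488
```

With this rectangle the nearest edge is 3 units from q, so the two POIs at distances 1 and 2 are verified and the POI at distance 4 is returned as unverified.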
For example, if a motorist decides to take $o_4$ in the heap H (Table 3.2) as his destination, in the worst case ($o_4$ is not the true third NN and the true third NN is a little bit further than $o_5$) he has to drive around two more miles ($3 \times 0.67 \approx 2$). The correctness probability and the surpassing ratio of these unverified POIs are also stored in the heap H and they can be utilized by applications with different result quality requirements.

Figure 3.5: The correctness probability of the unverified POI $o_4$ can be estimated based on the size of its unverified region.

3.3.3 Broadcast Channel Data Filtering

Under most conditions there are verified and unverified entries in H when the NNV method cannot completely fulfill a kNN query. For applications which require accurate NN information, we can utilize the partial results to calculate data packet search bounds from the entries in heap H to speed up their on air NN search process. The heap H is in one of six different states after a mobile host has executed the NNV mechanism without retrieving k verified objects:

• State 1: H is full and contains both verified and unverified entries.
• State 2: H is full and contains only unverified entries.
• State 3: H is not full and contains both verified and unverified entries.
• State 4: H is not full and contains only verified entries.
• State 5: H is not full and contains only unverified entries.
• State 6: H contains no entries.

In State 1 there may exist some POIs which are closer to q than the last element in H. Hence, we can consider the last entry of H as the final candidate nearest neighbor in the NN search and utilize its distance as the search upper bound. In addition, the distance attribute $d_v$ of the last verified entry can be another bound, the search lower bound.
Since we are certain about the POIs within the circle region $C_i$ with radius $d_v$ and center point q, q does not have to receive any data packet which is completely covered by $C_i$. Conversely, when H is full and contains just unverified entries, we can infer only the upper bound (State 2). In States 3 and 4, fewer than k POIs have been found after the mobile host performed the NNV algorithm. Therefore, we can only infer the lower bound from the distance attribute of the last verified element in H. In the last two states, H is not full and contains only unverified entries or no entry at all. Consequently we cannot infer any search bounds from them.

Based on the discussion in Sections 3.3.1, 3.3.2, and 3.3.3, the complete procedure of SBNN is described in Algorithm 2.

Algorithm 2 SBNN(q, H, k)
1: H = NNV(q, H, k)
2: if (|H.verified| = k) or (|H| = k and accept = true) then
3:   return H {if k verified NN have been retrieved, or the heap is full and q accepts approximate results.}
4: end if {if H is not full or q denies any approximate results, utilize the search upper and lower bounds to improve the on air query efficiency.}
5: $H \leftarrow H \cup$ kNN query results returned from the updated on air NN query
6: return H

3.4 Sharing Based Window Queries

As proposed in [ZLL04], the basic idea for a mobile host to process a window query w based on a space-filling curve index is to decide a candidate set of points along the Hilbert curve. The candidate set includes all the points that fall within the query window of w. Then the MH retrieves the related packets and filters out data objects which are located outside of the query window. As illustrated in Figure 3.6, the dashed-line rectangle represents the query window of w. We can find a first point a and a last point b according to the order in which they occur on the Hilbert curve. Consequently, all the points inside this query window should lie on the Hilbert curve segmented by points a and b.
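The candidate segment between a and b can be sketched by brute force, as below. This is an illustration under several assumptions: the window is aligned to grid cells, each broadcast packet carries the POIs of one cell, the Hilbert conversion is the textbook one (its orientation may differ from Figure 3.6), and the function names are hypothetical. It is not the bounding-point search of [ZLL04].

```python
def hilbert_index(n, x, y):
    """Textbook conversion of cell (x, y) in an n-by-n grid to a Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def window_segment(n, x0, y0, x1, y1):
    """Return (a, b): the first and last Hilbert index of any cell in the
    window of cells [x0..x1] x [y0..y1] (inclusive)."""
    values = [hilbert_index(n, x, y)
              for x in range(x0, x1 + 1)
              for y in range(y0, y1 + 1)]
    return min(values), max(values)


def window_query(n, window, broadcast):
    """Listen to packets (index, pois) in broadcast order, keep only those on
    the segment [a, b], and filter out POIs outside the window."""
    x0, y0, x1, y1 = window
    a, b = window_segment(n, x0, y0, x1, y1)
    result = []
    for index, pois in broadcast:
        if index < a:
            continue            # not yet on the candidate segment
        if index > b:
            break               # past the last candidate packet
        result.extend(p for p in pois
                      if x0 <= p[0] < x1 + 1 and y0 <= p[1] < y1 + 1)
    return result
```

The segment usually contains packets for cells outside the window, which is why the final filtering step is needed and why the access latency can remain long even for a small window.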
Figure 3.6: A window query on the Hilbert-curve index structure.

Although the algorithm proposed in [ZLL04] can find entry and exit bounding points on a Hilbert curve index to decrease the number of candidate points, the access latency is still very long. As shown in the example, the required data packets span between index value 9 and 54 and cover around 70% of the whole data file. Although a search space partition technique was proposed in [ZLL04] for improving the performance, it still cannot mitigate the overhead of access latency.

Therefore, we propose a Sharing Based Window Query (SBWQ) method to improve the current on air window query algorithm. For SBWQ, a mobile host q has to merge peer verified regions ($p.VR$) and collect related POI data from peers. Then q computes the spatial relationship between the query window of w and the merged verified region $M_{VR}$. If w can be totally covered by $M_{VR}$, the window query can be fulfilled. Otherwise, the whole or part of the query window must be solved as an on air window query. However, under these conditions we may be able to reduce the query window.

3.4.1 Window Query Verification

The MH q first broadcasts a request to all its single-hop peers for their cached spatial data. Then it combines the returned verified regions $p.VR$, each bounded by its MBR, into a merged verified region $M_{VR}$. Afterward q computes the spatial relationship between the query window w and $M_{VR}$. If w falls entirely inside $M_{VR}$, SBWQ will return the POIs which overlap with w (e.g., WQ1 in Figure 3.7).

3.4.2 Broadcast Channel Data Filtering

There will be cases when the SBWQ algorithm can provide only a partial result to a window query (e.g., WQ2 in Figure 3.7).
Consequently one (or several) updated (i.e., reduced) query window(s) $w'$ will be utilized to decide the new search bound on the Hilbert curve index and the on air window query algorithm is executed for solving $w'$. Since $w'$ is much smaller than w under most conditions, the access latency can be markedly decreased. The SBWQ algorithm is formalized in Algorithm 3.

Algorithm 3 SBWQ(q, w)
1: $P \leftarrow$ peer nodes responding to the query request issued from q
2: for all $p \in P$ do
3:   $M_{VR} \leftarrow M_{VR} \cup p.VR$ and $O \leftarrow O \cup p.O$
4: end for
5: $WQ \leftarrow$ all $o \in O$ which overlap with w
6: if $w \subset M_{VR}$ then
7:   return WQ
8: else
9:   $WQ \leftarrow WQ \cup$ query results returned from the on air window query with $w'$ {if $w \not\subset M_{VR}$, utilize $w'$ to compute the new search bounds and results.}
10:   return WQ
11: end if

Figure 3.7: POI $o_1$ and $o_4$ are the query results of the sharing based window query WQ1.

3.4.3 The Relationship Between the Verified Region Size and Query Window Size

Since the efficiency of our techniques is mainly based on the cached previous query results, we are interested in the relationship between the verified region size and the query window size. We defined a metric, the access time saving ratio (ATSR), for evaluating this relationship. The ATSR is calculated by comparing the access latency with a given amount of verified area in cache against the access latency without any verified region, using the same query window size. Figure 3.8 demonstrates an example. The merged verified region of a mobile host q covers broadcast cells 30 and 31, and the query window (w) overlaps with cells 10, 11, 30, and 31. Assume that the broadcast starts from cell 0. The mobile host q has to wait until the communication channel finishes the broadcasting of cell 31 before it can answer the query. However, with the aid of the verified region, q only needs to wait until the end of the cell 11 transmission.
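The latency comparison in this example can be reproduced with a short sketch. It assumes that each cell takes one time unit to broadcast, that the broadcast starts at cell 0, and that the query window needs at least one cell; the function names are hypothetical.

```python
def access_latency(needed_cells):
    """Time units until the last needed cell has been broadcast.

    Assumption: cells are sent in index order starting from cell 0 and each
    cell takes one time unit.
    """
    return max(needed_cells) + 1 if needed_cells else 0


def atsr(window_cells, verified_cells):
    """Access time saving ratio: relative latency saved by the cached cells."""
    without_cache = access_latency(set(window_cells))
    with_cache = access_latency(set(window_cells) - set(verified_cells))
    return (without_cache - with_cache) / without_cache


if __name__ == "__main__":
    # The example of Figure 3.8: the window needs cells 10, 11, 30, 31 and the
    # merged verified region already covers cells 30 and 31.
    print(atsr({10, 11, 30, 31}, {30, 31}))  # 0.625
```

The saving depends on which cells are cached, not only on how many: caching cells 10 and 11 instead would leave the latency unchanged because cell 31 would still have to be awaited.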
Consequently, the mobile host can save 62.5% ($\frac{32 - 12}{32}$) of the access latency in this example. Note that the verified region size represents only around 3.1% ($\frac{2}{64}$) of the whole search space and the cached data is collected from numerous neighboring peers. In our experiments we explore how much data a mobile host has to collect to achieve an ideal access time saving ratio.

Figure 3.8: The access latency of the window query (WQ) can be largely decreased by the cached data.

Figure 3.9 illustrates the relationship between the verified region size and the query window size with the average values of one thousand experiments. In Figure 3.9a, we increased the verified region size from 1% to 20% of the whole search space with a fixed query window size (2% of the whole search space), and the ATSR increased from 3% to 70%. As demonstrated in the figure, if the verified region is around 5% of the whole search space, we can save more than 50% of the access latency. In addition, we also enlarged the query window size from 1% to 20% with a fixed verified region size (6%); the saved access latency becomes very limited when the query window size is larger than 10%, as shown in Figure 3.9b. However, most location-based service applications use relatively small query windows [Spi04].

Figure 3.9: The effect of various verified region sizes and query window sizes. (a) Verified region size: access time saving ratio (%) versus verified region size (%). (b) Query window size: access time saving ratio (%) versus query window size (%).

3.5 Evaluation

To evaluate the performance of our approach we have implemented the sharing based spatial query algorithms within a simulator.
In addition to enabling efficient and decentralized applications, the objective of our peer-to-peer design is to decrease access latency in two ways. First, the access latency can be reduced as queries are answered directly by peers. Second, for the remaining queries that require packets from the broadcast channel, our technique diminishes the required number of packets by providing search bounds for the spatial query algorithms. Consequently, the focus of our simulations is to quantify the access latency variations as a function of two main parameters, the Peer Query Fulfilling Rate (PQFR) and the Broadcast Packet Access Rate (BPAR). PQFR quantifies what percentage of the client spatial query requests are fulfilled by peers, and BPAR denotes how many broadcast data packets are required compared with the solution in [ZLL04] for a sequence of queries with partial results from sharing based queries. Our experiments were performed with both synthetic and real-world parameter sets.

3.5.1 Simulator Implementation

Our simulator consists of two main modules, the mobile host module and the base station module. The mobile host module generates and controls the movements and query launch patterns of all mobile hosts (MH). Each mobile host is an independent object which decides its movement autonomously. The base station module operates a broadcast channel for continuously sending data packets to MHs. Spatial data indexing is provided with the well known Hilbert curve [Jag97]. We implemented the SBNN and SBWQ algorithms in the mobile host module.
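To make the role of the Hilbert curve index concrete, the following minimal sketch maps a grid cell (x, y) to its position along the curve using the widely known iterative conversion. It is not taken from the simulator: the function names, the grid side n (assumed to be a power of two), and the sample POI cells are illustrative assumptions.

```python
def _rotate(n, x, y, rx, ry):
    """Rotate or flip a quadrant so that the curve orientation is preserved."""
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y


def hilbert_index(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its Hilbert curve position."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rotate(n, x, y, rx, ry)
        s //= 2
    return d


# Hypothetical POI cells on an 8 x 8 grid, ordered as they would appear in a
# broadcast cycle that follows the curve.
poi_cells = [(5, 2), (0, 0), (3, 7), (1, 1)]
broadcast_order = sorted(poi_cells, key=lambda c: hilbert_index(8, *c))
print(broadcast_order)
```

Cells that are consecutive on the curve are always adjacent in the grid, which is why a search bound expressed as a range of curve positions tends to correspond to a compact spatial region.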
Parameter      Description
POI_Number     The number of points of interest in the system
MH_Number      The number of mobile hosts in the simulation area
C_Size         The cache capacity per data type of each mobile host
M_Velocity     The mobile host movement velocity (mph)
λ_Query        The mean number of queries per minute
Tx_Range       The transmission range of queries
λ_kNN          The mean number of queried nearest neighbors
λ_Window       The mean size of query windows
λ_Distance     The mean distance between a query MH and the center point of its query window
T_execution    The length of a simulation run

Table 3.3: Parameters for the simulation environment.

Each mobile host is implemented as an independent object that encapsulates all its related parameters such as the movement velocity M_Velocity, the cache capacity C_Size, the wireless transmission range Tx_Range, etc. All MHs move inside a geographical area measuring 30 miles by 30 miles. Additionally, user adjustable parameters are provided for the simulation, such as the execution length, the number of MHs and their query frequency, the number of POIs, etc. Table 3.3 lists all of the simulation parameters.

The simulation is initialized by randomly choosing a starting location for each mobile host within the simulation area. The movement generator then produces trajectories with an underlying road network, and the mobile host travel speed s is determined by the speed limit on the corresponding road segment. We employed the random waypoint model [BMJ+98] as our mobility model. Each MH selects a random destination point inside the simulation area and progresses towards it. Upon reaching that location, it pauses for a random interval and decides on a new destination for the next travel period. This process repeats for all MHs until the end of the simulation.
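The random waypoint behavior described above can be sketched as follows. This is a simplified illustration, not the simulator's movement generator: it moves a host in straight lines at a constant speed and ignores the road network, and the time step, the maximum pause length, and the function name are assumptions made for the example.

```python
import math
import random

AREA_MILES = 30.0  # side length of the square simulation area


def random_waypoint(rng, speed_mph=30.0, duration_hr=1.0, step_hr=0.01,
                    max_pause_hr=0.05, area=AREA_MILES):
    """Return a list of (time_hr, x, y) samples for one mobile host."""
    x, y = rng.uniform(0, area), rng.uniform(0, area)
    dest = (rng.uniform(0, area), rng.uniform(0, area))
    pause_left = 0.0
    trace = [(0.0, x, y)]
    steps = int(round(duration_hr / step_hr))
    for i in range(1, steps + 1):
        if pause_left > 0:
            # The host is resting at its last destination.
            pause_left -= step_hr
        else:
            dx, dy = dest[0] - x, dest[1] - y
            dist = math.hypot(dx, dy)
            reach = speed_mph * step_hr
            if dist <= reach:
                # Arrive, pause for a random interval, then pick the next destination.
                x, y = dest
                pause_left = rng.uniform(0, max_pause_hr)
                dest = (rng.uniform(0, area), rng.uniform(0, area))
            else:
                x += dx / dist * reach
                y += dy / dist * reach
        trace.append((i * step_hr, x, y))
    return trace


trace = random_waypoint(random.Random(1), duration_hr=2.0)
print(len(trace), trace[-1])
```

Passing a seeded random.Random instance keeps a run repeatable, which helps when comparing parameter sets.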
Every simulation has numerous intervals (whose lengths are Poisson distributed), and during each interval the simulator selects a random subset of the mobile hosts to launch spatial queries (the query intervals are also based on a Poisson distribution). The subset size is controlled via the λ_Query parameter (e.g., 1,000 queries per minute). These mobile hosts then execute the SBNN or the SBWQ algorithm by interacting with their peers. A mobile host will first attempt to answer each spatial query via the sharing based approaches. If this is unsuccessful, the query will be solved by listening to the broadcast channel. Each mobile host manages its local query result cache with a combination of the following two policies:

1. A MH stores all the verified POIs and their minimum bounding boxes. The cache replacement is based on the Least Recently Used (LRU) policy.

2. If a spatial query must be solved by listening to the broadcast channel, the MH will store as many received POIs as its cache capacity allows (e.g., for a 5-NN query, if the downloaded broadcast packets contain 15 POIs and the cache capacity is 30 POIs for each data type, the MH will store all of them and their MBR).

The sharing based nearest neighbor query algorithm is implemented according to the method detailed in Section 3.3. Multiple, potentially overlapping MBRs must be combined to provide the verified region. The simulator sequentially merges the MBRs returned by peers into a merged verified region M_VR by performing the MapOverlay algorithm, and also combines the returned POIs into a candidate list O. Afterwards, a MH sequentially verifies the objects in O with our verification technique based on M_VR. Similarly, we implemented the sharing based window query algorithm (Section 3.4) in the simulator.
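The LRU replacement in the first cache policy above can be sketched with an ordered dictionary. This is a minimal illustration under simplifying assumptions: the class name PoiCache and its methods are hypothetical, each entry is a POI identifier with a location, and the bookkeeping of minimum bounding boxes is omitted.

```python
from collections import OrderedDict


class PoiCache:
    """Fixed-capacity POI cache with Least Recently Used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # poi_id -> (x, y), oldest entry first

    def get(self, poi_id):
        """Return the cached location, marking the entry as recently used."""
        if poi_id not in self._items:
            return None
        self._items.move_to_end(poi_id)
        return self._items[poi_id]

    def put(self, poi_id, location):
        """Insert or refresh an entry, evicting the least recently used ones if full."""
        if poi_id in self._items:
            self._items.move_to_end(poi_id)
        self._items[poi_id] = location
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)

    def ids(self):
        return list(self._items)


cache = PoiCache(capacity=2)
cache.put("gas_a", (1.0, 2.0))
cache.put("gas_b", (3.0, 4.0))
cache.get("gas_a")              # gas_a becomes the most recently used entry
cache.put("gas_c", (5.0, 6.0))  # evicts gas_b
print(cache.ids())
```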
Parameter      LA County   Riverside County   Synthetic Suburbia   Units
POI_Number     4050        2160               3105
MH_Number      121500      11700              66600
C_Size         50          50                 50
M_Velocity     30          30                 30                   mph
λ_Query        8100        780                4440                 min^-1
Tx_Range       200         200                200                  m
λ_kNN          5           5                  5
λ_Window       3           3                  3                    %
λ_Distance     1           1                  1                    mile
T_execution    10          10                 10                   hr

Table 3.4: The simulation parameter sets for Los Angeles County, Riverside County, and the synthetic suburbia.

3.5.2 Simulation Parameter Sets

To obtain results that closely correspond to real world conditions, we obtained the simulation parameters from public data sets, for example, car and gas station densities in urban areas. We term the two parameter sets based on these real-world statistics the Los Angeles County parameter set and the Riverside County parameter set.

• Points of Interest: We obtained information about the density of interest objects (e.g., gas stations, restaurants, hospitals, etc.) in the Greater Los Angeles area from two online sites: GasPriceWatch.com (footnote 1) and CNN/Money. Because gas stations are commonly the target of spatial queries, we use them as the sample POI type for the simulations. The peer query fulfilling rate of other POI types is expected to be very similar.

• Mobile Hosts: We collected vehicle statistics of the Greater Los Angeles area from the Federal Statistics web site (footnote 2). The data provide the number of registered vehicles in the Los Angeles and Riverside Counties (5,498,554 and 944,645, respectively). In the simulations we assume that about 10% of these vehicles are on the road during non-peak hours according to the traffic information from Caltrans (footnote 3). We further obtained the land area of each county to compute the average vehicle density per square mile.

Footnote 1: http://www.gaspricewatch.com

The Los Angeles and the Riverside County parameter sets represent a very dense, urban area and a low-density, more rural area, respectively. Hence, for comparison purposes we blended the two real parameter sets to generate a third, synthetic set.
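The text does not state how the two sets were blended, but the density-dependent values in Table 3.4 are consistent with an arithmetic mean of the two county sets. The short check below is only a sketch of that observation; the dictionary keys mirror the table rows, and the 900 square mile area follows from the 30 mile by 30 mile simulation area.

```python
AREA_SQ_MILES = 30 * 30  # simulation area from Section 3.5.1

la_county = {"POI_Number": 4050, "MH_Number": 121500, "lambda_Query": 8100}
riverside = {"POI_Number": 2160, "MH_Number": 11700, "lambda_Query": 780}

# Candidate blending rule (an assumption): the mean of the two county values.
synthetic = {key: (la_county[key] + riverside[key]) // 2 for key in la_county}

# Mobile hosts per square mile implied by each parameter set.
densities = {
    name: params["MH_Number"] / AREA_SQ_MILES
    for name, params in [("LA County", la_county),
                         ("Riverside County", riverside),
                         ("Synthetic Suburbia", synthetic)]
}

print(synthetic)
print(densities)
```

The resulting synthetic values match the Synthetic Suburbia column of Table 3.4 for these three rows.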
The synthetic data set demonstrates vehicle and interest object densities in between Los Angeles County and Riverside County, representing a suburban area. Table 3.4 lists the three parameter sets.

Footnote 2: http://www.fedstats.gov/
Footnote 3: http://www.dot.ca.gov/hq/traffops/saferesr/trafdata/

3.5.3 Experimental Results with the k Nearest Neighbor Query

We used all three input parameter sets (Los Angeles County, Riverside County, and Synthetic Suburbia) to simulate our peer sharing techniques for solving kNN queries. We varied the following parameters to observe their effect on the system performance: the wireless transmission range, the cache capacity, and the nearest neighbor number k. The performance metric in the mobile host module was PQFR. The primary difference between the three parameter sets is the vehicle and the POI density. Hence, we utilized the simulation to verify the applicability of our design to different geographical areas. All simulation results were recorded after the system reached a steady state.

Effect of the Transmission Range. In the first experiment we varied the mobile host wireless transmission range from 10 meters to 200 meters, with all other parameters unchanged. We chose 200 meters as a practical upper limit on the transmission range of the IEEE 802.11b/g technology. Although the reliable coverage range for IEEE 802.11b/g in open space with good antennas can be more than 300 meters [GC04], obstacles such as buildings could diminish the range to 200 meters or less in urban areas. Figure 3.10 illustrates the percentage of the queries that can be resolved by SBNN, approximate SBNN (with POI correctness probability higher than 50%), or the broadcast channel with the Los Angeles County, the Synthetic Suburbia, and the Riverside County parameter sets, respectively. As the transmission range extends, an increasing number of queries can be answered by surrounding peers.
As expected, the effect is most pronounced in Los Angeles County, because of its high vehicle density. With a transmission range of 200 m, less than 20% of the queries must be solved by listening to the broadcast channel.

Figure 3.10: The percentage of resolved queries as a function of the wireless transmission range. Fig. 3.10a. Los Angeles County. Fig. 3.10b. Synthetic Suburbia. Fig. 3.10c. Riverside County. (Axes: Transmission Range (Meters) versus Percentage of Total Queries. Series: Queries Solved by SBNN, Queries Solved by Approximate SBNN, Queries Solved by the Broadcast Channel.)

Effect of the Mobile Host Cache Capacity. Next we varied the mobile host cache capacity, which denotes how many nearest neighbor objects a MH can store. Figure 3.11 illustrates cache capacities from 6 to 30 with the three parameter sets. In Figure 3.11a, even though the number of interest objects is much larger than the maximum capacity of the cached NN query results, we can observe a remarkable peer query fulfilling rate increase with a higher MH cache capacity.

Fig. 3.11a. Los Angeles County.
Fig. 3.11b. Synthetic Suburbia. Fig. 3.11c. Riverside County.
Figure 3.11: The percentage of resolved queries as a function of the mobile host cache capacity. (Axes: Number of Cached Items versus Percentage of Total Queries. Series: Queries Solved by SBNN, Queries Solved by Approximate SBNN, Queries Solved by the Broadcast Channel.)

Effect of k. We were also interested in the effect that varying the number of requested nearest neighbors, i.e., k, would have on the system performance. In the simulation we chose k randomly for each host and each query in the range from 3 to 15. Figure 3.12 illustrates the results. The server workload of the Los Angeles County parameter set increases by 28% when we raise k from 3 to 15. The server workload of the Riverside County parameter set increases by only 21%, because its starting level is much higher. Not surprisingly, result sharing is much more effective for small values of k.

Figure 3.12: The percentage of resolved queries as a function of k. Fig. 3.12a. Los Angeles County. Fig. 3.12b. Synthetic Suburbia. Fig. 3.12c. Riverside County. (Axes: Number of k versus Percentage of Total Queries. Series: Queries Solved by SBNN, Queries Solved by Approximate SBNN, Queries Solved by the Broadcast Channel.)

3.5.4 Experimental Results with the Window Query

Similar to Section 3.5.3, we utilized all three input parameter sets to simulate the peer sharing techniques for solving window queries.
We varied the following parameters to observe their effect on the system performance: the wireless transmission range, the cache capacity, and the query window size.

Effect of the Transmission Range. In this experiment we varied the mobile host wireless transmission range from 10 meters to 200 meters, with all other parameters unchanged. Figure 3.13 demonstrates the proportion of window queries that can be resolved by SBWQ or the broadcast channel with the three data sets. The trend of the simulation results is similar to that of kNN queries in our system. With increasing transmission range, more queries can be fulfilled by surrounding peers.

Figure 3.13: The percentage of resolved queries as a function of the wireless transmission range. Fig. 3.13a. Los Angeles County. Fig. 3.13b. Synthetic Suburbia. Fig. 3.13c. Riverside County. (Axes: Transmission Range (Meters) versus Percentage of Total Queries. Series: Queries Solved by SBWQ, Queries Solved by the Broadcast Channel.)

Effect of the Mobile Host Cache Capacity. We studied the effect of various mobile host cache capacities by changing the cache capacity from 6 to 30 with the three parameter sets; the results are shown in Figure 3.14. We observe that with the increase of cache capacity, more window queries can be fulfilled by peers. Therefore, mobile hosts can have a shorter access latency with a higher cache capacity.

Figure 3.14: The percentage of resolved queries as a function of the mobile host cache capacity. Fig. 3.14a. Los Angeles County. Fig. 3.14b. Synthetic Suburbia. Fig. 3.14c. Riverside County. (Axes: Number of Cached Items versus Percentage of Total Queries.)

Effect of the Query Window Size. We studied the effect that varying the query window size would have on the system performance. In the simulation we varied the window size from 1% to 5% of the whole search space. The center location of the query window is randomly decided, with a distance to the query MH based on the normal distribution. Figure 3.15 illustrates the results. With a relatively small query window (less than 3%), more than 50% of the window queries can be fulfilled by our sharing mechanism.

3.5.5 Experimental Results of the Broadcast Packet Access Rate

In order to evaluate the spatial query search bounds of Sections 3.3 and 3.4, we extended the on air spatial query (OASQ) algorithms proposed in [ZLL04] with search bounds. The performance metric for comparing the extended on air spatial query (denoted by EOASQ) and OASQ is the broadcast packet access rate. For each spatial query which cannot be fulfilled by the sharing based mechanism, the mobile host module executes both the OASQ and EOASQ algorithms to compare the performance improvement with respect to packet access of the broadcast channel.
We examined the behavior of the original and the extended solutions as k and the query window size increase. Because spatial queries are generated by randomly selected mobile hosts, query points are uniformly distributed over the simulation area. The experiments are executed sufficiently often to obtain consistent results. Since EOASQ usually requests fewer data packets than OASQ, we believe that the search bounds can decrease the access latency and tuning time.

Figure 3.15: The percentage of resolved queries as a function of query window size. Fig. 3.15a. Los Angeles County. Fig. 3.15b. Synthetic Suburbia. Fig. 3.15c. Riverside County. (Axes: Query Window Size (%) versus Percentage of Total Queries.)

During the simulation process the mobile host module counts the number of data packet accesses, which correspond to both access latency and tuning time. As shown in Figure 3.16, the EOASQ algorithm performs consistently better than OASQ for various values of k and query window sizes. We conclude that the search bound technique can effectively decrease the number of broadcast packet accesses. We varied k from 3 to 15 with the three parameter sets, and the EOASQ algorithm accesses 66% to 14% fewer packets than OASQ. Similarly, EOASQ accesses 51% to 12% fewer packets than OASQ when we increased the query window size from 1% to 5%. We conclude from all the performed experiments that the mobile host density has a considerable impact on the peer query fulfilling rate.
As a result, if more mobile hosts travel in a specific area, each MH has a higher opportunity to fulfill its spatial queries by peers. Furthermore, the spatial query search bounds also have a significant positive effect on the broadcast packet access rate and successfully decrease the access latency and tuning time.

Figure 3.16: The packet access comparison between EOASQ and OASQ. We normalized the required packet number of EOASQ to OASQ. Fig. 3.16a. kNN queries (Required Packet Ratio (%) versus Number of k). Fig. 3.16b. Window queries (Required Packet Ratio (%) versus Query Window Size (%)). (Series: EOASQ (LA), EOASQ (SYN), EOASQ (RV).)

3.6 Summary

I have presented a novel approach for answering spatial queries by leveraging results from neighboring peers within wireless broadcast environments. Significantly, the method allows a mobile peer to locally verify whether candidate objects received from neighbors are indeed part of its own spatial query result set. The simulation results indicate that the technique can reduce the access to the wireless broadcast channel by a significant amount, for example up to 80% in a dense urban area. This is achieved with minimal caching at the peers. By virtue of its peer-to-peer architecture, the method exhibits great scalability: the higher the mobile peer density, the more queries can be answered by peers. Therefore, the query access latency can be markedly decreased with the increase of clients. The mechanisms proposed in this chapter are mainly for searching static information; the techniques for employing dynamic information are discussed in the following chapter.

Chapter 4
Sharing-based Spatial Query Processing for Retrieving Dynamic Information

4.1 Overview

Recently novel algorithms [PZMT03, SKS02] have been proposed to compute NN queries in spatial networks.
These methods extend NN queries by considering the spatial network distance, which provides a more realistic measure for applications where objects are constrained in their movements. However, these existing techniques only consider static models of spatial networks: pre-defined road segments with fixed road conditions are used in computing nearest neighbors. Thus, any real-time events (e.g., detours, traffic congestion, etc.) affecting the spatial network cannot be reflected in the query result. For example, a traffic jam occurring on the route to the computed nearest neighbor most likely elongates the total driving time. More drastically, the closure of a restaurant which was found as the nearest neighbor might even invalidate a query result. This motivates the need for new algorithms which extend existing NN query techniques by integrating real-time event information [KZWW05, KZWN06]. By leveraging ad hoc networks, traffic information can be shared in a P2P manner among mobile hosts, and thus local traffic information (e.g., the driving speed of vehicles) can be considered when computing NN queries. Furthermore, broadcast channels and cellular communication enable remote traffic information server (TIS) access such that collecting and disseminating traffic information for a much wider area becomes possible.

In this chapter I propose two novel nearest neighbor query algorithms which incorporate real-time traffic information. Compared with existing work, my design leverages the communication among mobile hosts and traffic information servers to adaptively compute more accurate nearest neighbor results.

The rest of this chapter is organized as follows. The system infrastructure and travel time network are illustrated in Section 4.2 and Section 4.3, respectively. In Section 4.4, I introduce the LANN and the GANN algorithms. The experimental validation of my design is presented in Section 4.5.
4.2 System Infrastructure

Figure 4.1 illustrates the system infrastructure of our design. We consider mobile hosts with abundant power capacity, such as vehicles, that are equipped with a Global Positioning System (GPS) receiver for obtaining continuous position information. In addition, mobile hosts also maintain the road network data and the set of points of interest (POI) in local memory (for example, stored on a CD). The road network data (e.g., the US Census TIGER/Line data set) covers the road segments of highways, primary roads, rural roads, etc. These different road types are defined as road class attributes in the TIGER data set. Since the road class attribute does not convey speed limit information, we define a fixed speed limit for each road class (e.g., 65 mph for highways).

Furthermore, we assume that two tiers of wireless connections are available on each mobile host. Cellular-based networks (such as those utilized by the OnStar service) allow medium range connections to base stations that interface with the wired Internet infrastructure. As a second tier, short-range ad hoc communication protocols (e.g., IEEE 802.11x) are supported for communication between neighboring peers. Mobile hosts can either broadcast requests for traffic information to peers within the communication range (local solution) or send requests to the traffic information server directly (global solution). Currently there are many real-time traffic event providers (e.g., California Highway Patrol, SIGALERT.com real-time traffic information (footnote 1), etc.) which supply traffic information for many urban areas. These web sites can be easily integrated with TIS servers through Web service interfaces in the future. A local travel time network in each mobile host is thus built by integrating the traffic event information from peers or a TIS with the locally stored road network for processing nearest neighbor queries.

Footnote 1: http://www.sigalert.com/
Figure 4.1: The system infrastructure. (Labeled components: mobile hosts with their transmission range and the peer-to-peer channel, base stations with their transmission range and the peer-to-base channel, a spatial database, and a GPS satellite.)

4.3 Travel Time Networks

Leveraging existing methods, we assume a digitization process that generates a modeling graph from an input spatial network. The modeling graph contains three categories of graph nodes: the network junctions, the start/end points of a road segment, and other auxiliary points (e.g., speed limit change points). However, a digitized spatial network (as shown in Figure 4.2a) cannot reflect real-time traffic events, and this limitation decreases the accuracy of NN query algorithms [SKS02, PZMT03] which utilize spatial networks. In order to obtain a more accurate estimation of the nearest POI (and its travel time) to a query point Q, we propose to combine real-time traffic information with spatial road networks to generate travel time networks (TTN). A travel time network uses the travel time between nodes as the graph edge weight, rather than the network distance as in spatial networks. For example, assume that the spatial distance of edge e is five miles (the leftmost road segment in Figure 4.2a) and the speed limit of e is thirty miles per hour (mph). Hence, we compute the minimum driving time between nodes A and B to be ten minutes, and we set this value as the distance between nodes A and B on the travel time network (the leftmost road segment in Figure 4.2b).

With travel time networks, real-time events can be readily integrated into affected road segments by converting their effect to time. For example, if the driving speed of a road segment is slower than its speed limit because of traffic congestion, the travel time between its start and end points can be dynamically updated to reflect the congestion. In addition, if a road segment is closed after a traffic accident, this road segment can be temporarily removed from its travel time network.
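The conversion from a spatial network to a travel time network can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the class and method names are hypothetical, segments are treated as undirected, and the congestion speed of 15 mph is an assumed value for the example.

```python
class TravelTimeNetwork:
    """Road segments whose edge weight is the driving time in minutes."""

    def __init__(self):
        self._segments = {}   # frozenset({u, v}) -> [miles, speed limit, current speed]
        self._closed = set()  # segments temporarily removed from the network

    def add_segment(self, u, v, miles, limit_mph):
        self._segments[frozenset((u, v))] = [miles, limit_mph, limit_mph]

    def report_speed(self, u, v, mph):
        """Congestion event: the observed speed replaces the speed limit."""
        self._segments[frozenset((u, v))][2] = mph

    def clear_congestion(self, u, v):
        seg = self._segments[frozenset((u, v))]
        seg[2] = seg[1]

    def close(self, u, v):
        self._closed.add(frozenset((u, v)))

    def reopen(self, u, v):
        self._closed.discard(frozenset((u, v)))

    def minutes(self, u, v):
        """Edge weight in minutes, or None while the segment is closed."""
        key = frozenset((u, v))
        if key in self._closed:
            return None
        miles, _, speed = self._segments[key]
        return miles * 60.0 / speed


ttn = TravelTimeNetwork()
ttn.add_segment("A", "B", miles=5, limit_mph=30)
print(ttn.minutes("A", "B"))   # edge e of Figure 4.2: five miles at 30 mph
ttn.report_speed("A", "B", 15)
print(ttn.minutes("A", "B"))   # the same edge under assumed congestion
```

Keeping the speed limit alongside the current speed makes it straightforward to restore the default weight once a congestion event is relieved.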
Since turn restrictions (e.g., right turn prohibition) can be modelled by including extra nodes in the spatial network, we do not consider them as real-time events here.

Figure 4.2: A spatial network (left) and its travel time network (right). Fig. 4.2a. A spatial network with different road classes. Thick lines have higher speed limits. Fig. 4.2b. A travel time network with the driving time between nodes.

4.4 Travel Time Network Nearest Neighbor Queries

In the real world, mobile objects often move on pre-defined networks (e.g., roads, railways, etc.). In this scenario, the spatial network distance provides a more exact estimation of the travel distance between any two objects than the Euclidean distance. Papadias et al. [PZMT03] have proposed the Incremental Euclidean Restriction (IER) algorithm and the Incremental Network Expansion (INE) algorithm to solve spatial network nearest neighbor queries. Here we extend the IER algorithm to solve NN queries on travel time networks, because according to [PZMT03] IER has a better performance in denser, more regular networks (e.g., city blocks) and our algorithms are most likely to be applied in urban areas. We propose to use travel time networks to replace spatial networks and to take the driving time as the length between two nodes in order to utilize real-time traffic information. We call the modified method the travel time network nearest neighbor query.

Incremental Euclidean Restriction. The IER algorithm is based on the multi-step kNN technique [FRM94, SK]. To execute a nearest neighbor search for query point Q, IER first retrieves the first Euclidean distance nearest neighbor n_1 of Q and computes the Euclidean distance ED(Q, n_1). Next it calculates the network distance from Q to n_1, ND(Q, n_1).
Subsequently, the IER algorithm can use Q as the center to draw two concentric circles with radii ED(Q, n_1) and ND(Q, n_1), respectively. Due to the Euclidean lower bound property (for any two nodes n_i and n_j, their Euclidean distance ED(n_i, n_j) always provides a lower bound on their network distance ND(n_i, n_j)), objects closer to Q than n_1 in the network must be within the circle with radius ND(Q, n_1). Therefore, the search space becomes the ring area between the two circles, as shown in Figure 4.3a. In the next iteration, the second closest object n_2 is retrieved (by Euclidean distance). Since in the given example ND(Q, n_2) < ND(Q, n_1), n_2 becomes the current candidate for the spatial network nearest neighbor and the search upper bound becomes ND(Q, n_2). This procedure is repeated until the next Euclidean nearest neighbor is located beyond the search region (as n_3 in Figure 4.3b).

Incremental Network Expansion. Figure 4.4 illustrates the INE algorithm. The points represent the nodes in the modeling graph and the triangles denote POIs. The closest POI to the query point Q is n_5. The subscripts of the POIs (n_1, n_2, ..., n_5) are in ascending order according to their Euclidean distance to Q. The INE algorithm performs network expansion from Q and examines the POIs in the order in which they are encountered. In this example, INE first locates the segment p1p2 on which Q lies and then retrieves all POIs on p1p2. Since there is no POI on p1p2 in this example, the point p1, which is the closest to Q, is expanded. No POI can be found on p1p9, and p9 is inserted into Queue = <(p2, 5), (p9, 9)>. The expansion of p2 leads to p3 and p4, and then Queue = <(p3, 9), (p4, 9), (p9, 9)> and POI n_5 is discovered on p2p4. The network distance from Q to n_5 is 8, and it provides a bound on the search space. The search terminates here since the next entry p3 in Queue has a longer distance.
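The IER iteration described above can be sketched in a few lines. This is a simplified illustration rather than the algorithm of [PZMT03] as published: sorting the POIs by Euclidean distance stands in for incremental retrieval from a spatial index, POIs are assumed to sit on graph nodes, and the small network, its coordinates, and all names are made up for the example. Each edge length is at least the Euclidean distance between its endpoints, so the lower bound property holds.

```python
import heapq
import math


def network_distance(graph, source, target):
    """Shortest path length from source to target (Dijkstra's algorithm)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


def ier_nearest(graph, coords, query, pois):
    """Return (nearest POI, its network distance, POIs whose network distance was computed)."""
    by_euclid = sorted(pois, key=lambda p: math.dist(coords[query], coords[p]))
    best, best_nd, examined = None, math.inf, []
    for p in by_euclid:
        # Euclidean lower bound: no remaining POI can beat the current candidate.
        if math.dist(coords[query], coords[p]) > best_nd:
            break
        examined.append(p)
        nd = network_distance(graph, query, p)
        if nd < best_nd:
            best, best_nd = p, nd
    return best, best_nd, examined


def undirected(edges):
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    return graph


coords = {"Q": (0, 0), "n1": (2, 0), "n2": (0, 3), "p": (2, 3), "n3": (9, 0)}
graph = undirected([("Q", "n2", 3.0), ("n2", "p", 2.0), ("p", "n1", 3.0), ("n1", "n3", 7.0)])
print(ier_nearest(graph, coords, "Q", ["n1", "n2", "n3"]))
```

In this toy network n1 is the Euclidean nearest neighbor but is only reachable by a detour, so n2 replaces it as the candidate and n3 is never expanded, mirroring the situation of Figure 4.3.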
Figure 4.3: Nearest neighbor search in a spatial network environment with the IER algorithm. Fig. 4.3a. The 1st Euclidean NN. Fig. 4.3b. The 2nd Euclidean NN.

Figure 4.4: Nearest neighbor search in a spatial network environment with the INE algorithm.

4.4.1 Local-based Greedy Nearest Neighbor Queries

The travel time network enables the integration of real-time traffic information into a spatial representation. Since cellular-based communication is much more expensive than short-range ad hoc communication, traffic information can be exchanged in a local area with a much lower cost. Based on this observation, we propose a Local-bAsed greedy Nearest Neighbor query algorithm (LANN). LANN relies on exchanging local traffic information between peers to build a travel time network. With LANN, the mobile host first computes the nearest neighbor by executing a travel time network nearest neighbor query. Next, the mobile host incrementally updates the local traffic information and correspondingly selects a road segment as the shortest path to the computed nearest neighbor. As the mobile host begins to navigate on a road segment, it broadcasts requests to collect local traffic information from peers within the ad hoc communication range. Only the traffic information of the surrounding road segments is requested. In the case that the traffic information cannot be collected from peers, the default speed limits of the corresponding road segments are used. A travel time network, which evaluates each surrounding road segment as the sum of the actual travel cost and a heuristic travel cost, is hence built up.
The travel time network utilizes the travel time of the surrounding road segments as the actual cost g. The heuristic cost h is computed as the Euclidean distance from the end of each surrounding road segment to the destination point, divided by a heuristic travel speed (e.g., the average travel speed on the spatial network). The mobile host selects the road segment with the cost f = min(g + h) as the shortest path to the computed nearest neighbor and starts to navigate on that road segment.

Figure 4.5 demonstrates an example of a TTN and the LANN computation. Assume a mobile host at location A executes a travel time NN query and the computed nearest neighbor from the spatial road network is at location B. Figure 4.5a shows the corresponding spatial road network and the Euclidean distance from the end of each surrounding road segment of location A (AC, AD, and AE in this example) to the destination B. Next, the mobile host broadcasts to collect the traffic information on road segments AC, AD, and AE from peers. Assume the traffic speed of road segment AC is 30 mph, the speeds of AD and AE are 60 mph, and the average travel speed on the spatial network is 55 mph. A TTN is constructed based on this traffic information, as shown in Figure 4.5b. The heuristic costs of AC, AD, and AE are computed with the average travel speed on the spatial network. The length of each edge in the TTN represents the corresponding travel time. Hence the cost of AC is the actual cost (8 minutes) plus the heuristic cost (32.7 minutes), which in total equals 40.7 minutes.

Figure 4.5: An example of travel time networks in LANN. Fig. 4.5a. A road network with B as the nearest neighbor POI of A. Fig. 4.5b. The corresponding TTN used in LANN.
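The segment selection in this example can be reproduced with a short calculation. The sketch below uses the actual travel times and the 55 mph heuristic speed stated in the text; the Euclidean distances of 30, 15, and 35 miles for the ends of AC, AD, and AE are taken to match the stated heuristic costs (32.7, 16.4, and 38.2 minutes). The function and variable names are illustrative only.

```python
HEURISTIC_MPH = 55.0  # average travel speed on the spatial network

# Actual travel time g (minutes) on each segment leaving A, and the Euclidean
# distance (miles) from the end of that segment to the destination B.
candidates = {
    "AC": {"actual_min": 8.0, "euclid_miles": 30.0},
    "AD": {"actual_min": 12.0, "euclid_miles": 15.0},
    "AE": {"actual_min": 2.0, "euclid_miles": 35.0},
}


def heuristic_minutes(euclid_miles, mph=HEURISTIC_MPH):
    """Heuristic cost h: straight-line distance driven at the heuristic speed."""
    return euclid_miles * 60.0 / mph


def segment_cost(segment):
    """Total cost f = g + h in minutes."""
    return segment["actual_min"] + heuristic_minutes(segment["euclid_miles"])


def choose_segment(segments):
    """Greedy LANN step: pick the surrounding segment with the minimum f."""
    return min(segments, key=lambda name: segment_cost(segments[name]))


for name, seg in candidates.items():
    print(name, round(segment_cost(seg), 1))
print("selected:", choose_segment(candidates))
```

Because only the segments leaving the current node are evaluated, each step is cheap, at the price of a greedy choice that is re-evaluated at the end of every segment.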
Similarly, the cost of AD is 12 + 16.4 = 28.4 minutes and the cost of AE is 2 + 38.2 = 40.2 minutes. Road segment AD, which has the minimum cost, is selected in this example.

LANN is executed in an incremental manner: when the mobile host reaches the end of the selected road segment, it broadcasts again to collect local traffic information from peers and update the travel time network. Next, the mobile host selects a road segment to continue navigating based on the updated TTN. The mobile host keeps executing the algorithm until it reaches the computed nearest neighbor. The complete algorithm of LANN is shown in Algorithm 4.

Algorithm 4 Local-based Greedy Nearest Neighbor (Q)
1: /* Q is the query mobile host */
2: Execute a spatial road network NN query for retrieving the nearest POI and the shortest path to it.
3: Take the returned nearest POI as the destination, D
4: for each surrounding road segment in the TTN do
5:   select the minimum of actual traffic time cost T_g plus heuristic traffic time T_h
6: end for
7: Take the returned road segment as the current route, S_route
8: while Dist(Q, D) ≠ 0 do
9:   repeat
10:    Navigate the mobile host to D
11:  until the mobile host reaches the end of S_route
12:  broadcast to update the TTN
13:  for each surrounding road segment in the TTN do
14:    select the minimum of actual traffic time cost T_g plus heuristic traffic time T_h
15:  end for
16:  Take the returned road segment as the current route, S_route
17: end while

4.4.2 Traffic Event Collection and Distribution of the Traffic Information Server

In order to support the Global-based Adaptive Nearest Neighbor Queries, a traffic information server (TIS) has to maintain valid traffic events for mobile hosts to access. Traffic events can be broadly classified into four categories:

• Category 1 - Congestion Events: The real traffic speed of a road segment is much lower than the speed limit. This condition is usually caused by traffic accidents or traffic jams.
• Category 2 - Detour Events: A road segment is closed and the mobile host has to detour.
• Category 3 - Closure Events: The selected POI is closed and the mobile host has to search for another nearest POI.
• Category 4 - Recovery Events: A mobile host can recover its local TTN from previous events: traffic congestion has been relieved, a detour has been removed, or a POI has reopened.

Since mobile hosts send requests to the TIS to acquire new traffic events, they can simultaneously upload the speed of their current road segments and report any real-time traffic events. Consequently, the TIS aggregates traffic congestion, accidents, and other real-time traffic events. In addition, transportation and law enforcement agencies can report road construction and accident information to the TIS. Commercial businesses (e.g., gas stations) can also report closure events to the TIS. Every traffic event is time-stamped, and a mobile host can synchronize traffic information with the TIS by checking the latest time-stamp in its local memory.

4.4.3 Global-based Adaptive Nearest Neighbor Queries

The purpose of the LANN algorithm is to ensure that a mobile host can make efficient local navigation decisions at the end of each road segment and always stays on the shortest path to a pre-selected nearest neighbor. Because of the communication range limitation of IEEE 802.11 wireless networks, a mobile host cannot access real-time traffic events that happened far away (assuming one-hop broadcast). However, these events could have a significant influence on its driving time. For example, a POI may close unexpectedly, and a mobile host may not become aware of the event until it reaches the vicinity of the POI, which may be too late. In addition, if a mobile host holds knowledge of global traffic events, it can update the current NN adaptively when receiving related traffic events.
For example, if a road segment on the path to the current nearest neighbor is temporarily closed, this traffic event changes the TTN and may result in another POI being selected as the new nearest neighbor. Therefore, utilizing the cellular communication device on mobile hosts to access current traffic events from the traffic information server is desirable. The TIS access frequency can be decided by each mobile host, and all the traffic events related to the current route can be accessed at one time.

We propose a Global-based Adaptive Nearest Neighbor query algorithm (GANN) that computes the nearest neighbor in a best-first manner with a global travel time network. At the start of a trip, a mobile host M executes the GANN algorithm to compute a nearest POI as the destination D and the shortest path to D as the selected route S_route. Afterwards, the mobile host follows S_route toward D until it updates traffic events with the server. When M receives new traffic information T_info from the TIS, it needs to determine whether T_info has any influence on S_route (e.g., traffic jams usually slow down the traffic on road segments). If T_info has no influence on the current route of M, the mobile host only needs to integrate T_info into its local TTN for future usage. However, if T_info is related to the current journey of M, the mobile host has to take further steps. As discussed in Section 4.4.2, there are four traffic event categories. For categories 1 and 2, the mobile host has to update its local TTN (removing the edge of the closed road segment for category 2) and recalculate the driving time Dr_time from the current location to D. Then it launches a travel time network NN query with Dr_time as the upper search bound S_bound. Afterwards, if a new NN has been found, GANN chooses the POI with the shortest driving time within S_bound as the new destination and navigates the mobile host there. In category 3, a selected POI can be closed unexpectedly after a mobile host starts its trip.
When receiving a closure event for the POI that is its current destination, a mobile host has to launch a travel time network NN query to find a new nearest POI. With category 4, if the recovery concerns a traffic congestion and is related to the route to the current nearest neighbor, the mobile host only needs to update its local TTN. Otherwise, the mobile client launches a travel time network NN query on the updated TTN with the current driving time Dr_time as the search upper bound S_bound. The complete algorithm of GANN is formalized in Algorithm 5 and its symbols are listed in Table 4.1.

Table 4.1: Symbols of the GANN algorithm.
Parameter    Description
S_route      The pre-selected route to the destination D
T_info       Real-time traffic information received from the TIS
Dr_time      The driving time from the current location of the query mobile host Q to the destination D
S_bound      The upper bound of a nearest neighbor search
D_new        The newly selected POI for updating the original destination D
M            A query mobile user
|A|          The cardinality of set A
||a,b||      The Euclidean distance between objects a and b
Dist(a,b)    The network distance between objects a and b

Algorithm 5 Global-based Adaptive Nearest Neighbor (Q)
1: /* Q is the query mobile host */
2: Execute a travel time network NN query for retrieving the nearest POI and the shortest path to it.
3: Take the returned nearest POI as the destination, D, and the shortest path to D as the current route, S_route
4: while Dist(Q, D) ≠ 0 do
5:   Navigate the mobile host to D and request traffic events from the server with a frequency f
6:   if the received information T_info is related with S_route or is a recovery event then
7:     if T_info is a traffic congestion recovery of S_route then
8:       Update the local time network with T_info
9:     else
10:      if D is not closed then
11:        Update the time network with T_info and utilize the current location of Q for calculating a new driving time Dr_time to D
12:        Execute the travel time network NN query with Dr_time as a search upper bound S_bound
13:        if any closer POI is found then
14:          Pick the closest POI within S_bound as D_new
15:          Replace D with D_new and update S_route with the route to D_new
16:        end if
17:      else
18:        Execute the travel time network NN query with the current location of Q for finding a new nearest POI D_new.
19:        Replace D with D_new and update S_route with the route to D_new
20:      end if
21:    end if
22:  else
23:    Update the time network with T_info
24:  end if
25: end while

4.5 Evaluation

We implemented the adaptive nearest neighbor query algorithms in a simulator to evaluate the performance of the approach. The main objective is to increase the accuracy of nearest neighbor queries and decrease the driving time. First, real-time traffic information can be easily integrated into the underlying network. Second, the shortest path to a destination (POI) can be generated incrementally (LANN) or dynamically (GANN). Consequently, the focus of our simulation is on quantifying the driving time variations. We have performed the experiments with both real-world and synthetic parameter sets.

4.5.1 Simulator Implementation

Our simulator consists of three main modules: the navigation module, the server module, and the baseline module. The objective of the navigation module is to generate and control the movements and the NN query launches of all mobile hosts. Each mobile host is an independent object which encapsulates all its related parameters (such as the movement velocity Move_Velo and the wireless transmission range TR_Rang) and decides its movement autonomously.
Spatial data (POI) indexing is provided by the well-known R-tree algorithm with the quadratic splitting method [Gut84]. All mobile hosts move inside a geographical area measuring 4 miles by 4 miles. Additionally, there are user-adjustable parameters for the simulation such as the execution length, the number of mobile hosts, and the number of POIs. The server module interacts with the mobile hosts that execute the GANN algorithm in order to disseminate real-time traffic events. In addition, the purpose of the baseline module is to simulate traditional road network NN query solutions [PZMT03] and compute a shortest path S_P to a NN without considering any related traffic events (e.g., congestions, detours, etc.). The driving time of following S_P can then be compared with the driving times of the LANN and the GANN algorithms. Table 4.2 lists all the simulation parameters.

Table 4.2: Parameters for the simulation environment.
Parameter    Description
POI_Numb     The number of points of interest in the system
Mobi_Host    The number of mobile hosts in the simulation area
Move_Velo    The mobile host movement velocity (MPH)
λ_Cong       The mean number of congestions per hour
λ_Deto       The mean number of detours per hour
λ_Clos       The mean number of POI closures per hour
TR_Rang      The transmission range of queries
Time_Exec    The length of a simulation run
Expe_Regn    The measure of the simulation region

The simulation is initialized by randomly choosing a starting location for each mobile host (MH) within the simulation area. Since mobile hosts are not always searching for nearest neighbors, we assume each mobile host has two modes, NN search mode and driving mode. When a mobile host M is in the NN search mode, the navigation module navigates M to its queried nearest neighbor n by employing both the LANN and the GANN algorithms (for comparison purposes).
At the same time, the baseline module computes the shortest path S_P from the start location of M to n and estimates the driving time. Afterwards, the comparison module records the driving time of utilizing the LANN algorithm, the GANN algorithm, and following the pre-computed shortest path S_P. Furthermore, we employ the random waypoint model [BMJ+98] as the mobility model for the driving mode. An MH in the driving mode selects a random destination inside the simulation area and progresses greedily toward it. When reaching that location, the MH pauses for a random interval and decides on a new destination for the next travel period. Both processes (NN search and driving) repeat until the end of the simulation. Users can decide the number of mobile hosts which operate in the NN search mode and the driving mode. The movement of each MH follows the underlying road network and its travel speed s is determined by the traffic speed on the corresponding road segment. Figure 4.6 demonstrates the simulator interface.

Simulation Parameter Sets

In order to acquire results that closely correspond to real-world traffic conditions, we obtained the simulation parameters from data sets which report, for example, vehicle density, POI density, and traffic event statistics in the Southern California area. We term the two parameter sets based on real-world statistics the Los Angeles County parameter set and the Riverside County parameter set.

• Mobile Hosts: We collected vehicle statistics of Southern California from the Federal Statistics web site. The data provides the number of registered vehicles in Los Angeles and Riverside Counties (5,498,554 and 944,645, respectively). In our simulations we assume that about 10% of these vehicles are on the road during non-peak hours according to the traffic information from Caltrans. We further obtained the land area of each county to compute the average vehicle density per square mile.
• Points of Interest: We obtained information about the density of interest objects (e.g., gas stations, restaurants, hospitals, etc.) in Southern California from two online sites: GasPriceWatch.com and CNN/Money. Because gas stations are commonly the target of NN queries, we use them as the point of interest type for our simulations.
• Traffic Events: According to the National Transportation Statistics^2 and data from traffic related agencies (e.g., Caltrans Office of Traffic Safety, SIGALERT.com real-time traffic information, etc.), we acquired the traffic event statistics of Southern California and categorized these events into two main categories: traffic congestions (e.g., traffic hazards, traffic collisions, etc.) and detours (e.g., road constructions, road closures, etc.). Because there is no official survey of POI closure events, we assume a low occurrence rate.

The event generation module is designed to plug in the collected traffic statistic data and produce four types of traffic/POI events: (1) traffic congestions, (2) detours, (3) POI closures, and (4) event recoveries. We categorized traffic congestions into two levels: medium and heavy. The corresponding speeds are 10 to 20 miles per hour (mph) and 0 to 10 mph, respectively, on local routes, and 20 to 40 mph and 0 to 20 mph on highways. The ratio between medium and heavy congestions is also based on the traffic statistic data. The appearance ratio of these events is based on statistical traffic data and the interval between events is based on the Poisson distribution.

^2 http://www.bts.gov/publications/national_transportation_statistics/

The Los Angeles and the Riverside County parameter sets represent a very dense, urban area and a low-density, more rural area, respectively. Hence, for comparison purposes we blended the two real parameter sets together to generate a third, synthetic parameter set.
The synthetic data set exhibits a vehicle density, POI density, and traffic event frequency in between those of Los Angeles County and Riverside County, representing a suburban area. Table 4.3 lists the three parameter sets.

Table 4.3: The simulation parameter sets for Los Angeles County, Riverside County, and the synthetic suburbia.
Parameter    Los Angeles County   Riverside County   Synthetic Suburbia   Units
POI_Numb     65                   21                 43
Mobi_Host    1852                 204                1028
λ_Cong       39                   11                 25                   hr^-1
λ_Deto       7                    3                  5                    day^-1
λ_Clos       3                    2                  3                    day^-1
TR_Rang      200                  200                200                  m
Time_Exec    10                   10                 10                   hrs
Expe_Regn    16                   16                 16                   mile^2

4.5.2 Implementation of Travel Time Network

We acquired our road network data from the TIGER/LINE street vector data available from the U.S. Census Bureau. The road segments belong to several different categories, such as primary highways, secondary and connecting roads, and rural roads. Segments of different road classes are associated with different maximum driving speeds. We define a road segment as the road section between two crossroads, and the driving time of a road segment can be derived from its speed limit, the current traffic events on it, and its length. During the execution of a simulation, each mobile host monitors the speed on the road that it is currently travelling on and adjusts its velocity accordingly.

[Figure 4.6: The simulator and its visualization interface.]

One of the challenges when integrating road segments into a complete travel time network is to isolate intersecting paths and determine if they are indeed intersections. For example, freeways generally project many intersections in two-dimensional space; however, many of them are over-passes or bridges. My solution is to detect intersection points with the help of their endpoint coordinates. In addition, differing road classes let me distinguish over-passes from intersections.
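The derivation of a segment's driving time from its length, speed limit, and current traffic events can be sketched as follows. This is a minimal illustration under assumptions, not the simulator's code: the function name, the parameters, and the choice to model a closed segment with an infinite cost are all hypothetical.

```python
import math


def segment_driving_time(length_miles, speed_limit_mph, observed_mph=None, closed=False):
    """Driving time of one road segment in minutes.

    observed_mph is the current traffic speed when an event such as a
    congestion is reported; without one, the speed limit is used.
    A closed segment (detour event) gets an infinite cost so that a
    shortest path search never routes through it (an assumption here).
    """
    if closed:
        return math.inf
    speed = speed_limit_mph if observed_mph is None else observed_mph
    if speed <= 0:
        return math.inf  # standstill traffic: treat the segment as impassable
    return length_miles * 60.0 / speed


# Hypothetical 4-mile segment with a 60 mph speed limit.
free_flow = segment_driving_time(4.0, 60.0)
congested = segment_driving_time(4.0, 60.0, observed_mph=30.0)
blocked = segment_driving_time(4.0, 60.0, closed=True)
print(free_flow, congested, blocked)
```

Using an infinite weight instead of deleting the edge keeps the graph structure intact, which makes a later recovery event a simple weight update.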
4.5.3 Experiments

We used all three parameter sets, Los Angeles County, Riverside County, and synthetic, to simulate our two adaptive nearest neighbor query algorithms. We varied the following parameters to observe their effects on the average driving time: the wireless transmission range, the number of congestions, and the number of detours. The performance metric of the simulation is the Driving Time Savings Rate (DTSR), which normalizes the saved driving time to the driving time of the pre-computed S_P. The LANN algorithm queries for one nearest POI as the destination at the beginning and then continuously searches for locally optimal paths to that destination, while the GANN algorithm updates its destination POI if a POI with a shorter travel time is found. We therefore record the driving time to the present destination POI of GANN for comparison with LANN. The primary differences between the three parameter sets are vehicle density, POI density, and the number of traffic events. Therefore, the simulation results reveal the applicability of our algorithms to different geographical areas. All simulation results were recorded after the system reached a steady state.

Effect of the Transmission Range

In the first experiment we varied the mobile host wireless transmission range from 20 meters to 200 meters, with all other parameters unchanged. We chose 200 meters as a practical upper limit on the transmission range of the IEEE 802.11 technology. Figure 4.7 illustrates the percentage of the driving time which is saved by employing the LANN algorithm compared with following the pre-computed S_P. As the transmission range extends, a mobile host can reach more peers and more related traffic events can be retrieved. As expected, the effect is most pronounced in Los Angeles County because of its high vehicle density. At a transmission range of 200 m, the LANN algorithm can save approximately 10% to 15% of the driving time compared with the pre-computed shortest path solution.
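The DTSR metric defined above can be written down as a small sketch. The function name and the sample driving times are hypothetical and serve only to illustrate the normalization.

```python
def dtsr(baseline_minutes, adaptive_minutes):
    """Driving Time Savings Rate in percent.

    baseline_minutes: driving time of the pre-computed shortest path S_P.
    adaptive_minutes: driving time achieved by LANN or GANN on the same trip.
    """
    if baseline_minutes <= 0:
        raise ValueError("baseline driving time must be positive")
    return (baseline_minutes - adaptive_minutes) / baseline_minutes * 100.0


# Hypothetical trip: 40 minutes along S_P versus 34 minutes with an adaptive algorithm.
print(f"DTSR = {dtsr(40.0, 34.0):.1f}%")
```

A negative value would indicate that the adaptive route took longer than the pre-computed path.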
[Figure 4.7: The percentage of driving time that is saved by the LANN algorithm as a function of the mobile host transmission range. Panels: (a) Los Angeles County; (b) Synthetic Suburbia; (c) Riverside County. Axes: Communication Range (Meter) versus DTSR (%).]

Effect of Congestion Frequency

Since congestions have a significant impact on driving time, we studied the effect of varying the congestion frequency on the pre-computed shortest path S_P by changing the number of congestions from 2 to 10. The results are shown in Figure 4.8. We observe that when the congestion frequency increases, our algorithms perform better in areas with a higher vehicle density. This is because more traffic information can be detected by peer vehicles. In addition, the advantage of the travel time network becomes more pronounced when congestions on S_P increase.

[Figure 4.8: The percentage of driving time that is saved by the LANN and the GANN algorithms as a function of congestions on the pre-computed shortest path. Panels: (a) Los Angeles County; (b) Synthetic Suburbia; (c) Riverside County. Axes: Congestions on the pre-computed shortest path versus DTSR (%).]

Effect of Detour Frequency

Next we varied the detour frequency from 2 to 10 on S_P. Figure 4.9 illustrates the simulation results of the three parameter sets. As we can observe, both of our algorithms achieve shorter driving times than the pre-computed shortest path solution. However, since a mobile host can avoid a detour by taking nearby routes, all the resulting DTSR values are lower than in the previous experiments.

[Figure 4.9: The percentage of driving time that is saved by the LANN and the GANN algorithms as a function of detours on the pre-computed shortest path. Panels: (a) Los Angeles County; (b) Synthetic Suburbia; (c) Riverside County. Axes: Detours on the pre-computed shortest path versus DTSR (%).]

We conclude from all the performed experiments that my algorithms outperform the pre-computed shortest path solution in all experimental cases. We can also observe that the mobile host density has a considerable influence on the driving time savings rate.

4.6 Summary

Geographic information systems are getting increasingly sophisticated, and finding nearest neighbor objects represents a significant class of queries. Existing algorithms work on realistic, but static, spatial networks. The next generation of applications will require real-time information to be integrated to produce search results that reflect the most current network conditions. To this end I have presented the concept of a travel time network that is dynamically and continuously updated. Additionally, I have introduced two nearest neighbor query algorithms that operate on such travel time networks. I have shown through simulation results that my techniques outperform the static approaches and reduce the travel time when dynamic events occur.
Chapter 5

Privacy Protection for Query Processing in Mobile Environments

5.1 Overview

Due to recent advances in low-power technologies, mobile devices with computation, storage, and wireless communication capabilities have become increasingly popular. At the same time, positioning technology is being embedded into these mobile devices. As a result, new mobile applications allow users to issue location-dependent queries in a ubiquitous manner. Examples of such location-dependent queries include "find the nearest gas station" and "find the top three closest French restaurants". To get location-dependent data, users have to reveal their current locations when launching location-dependent queries. From the location-dependent query logs of location-based service providers (LBSP), adversaries could collect the location history and monitor the behavior of some users, which in turn invades their privacy. Therefore, with the popularity of location-based services (LBS), user privacy protection is a very important research issue to be studied.

Recent research has explored the K-anonymity concept [Swe02]^1, in which one trusted server is needed to cloak at least K users' locations for protecting location privacy. In order to implement K-anonymity, one trusted server is set up to collect user location information and perform cloaking procedures in which the exact location of the query requester is blurred into a cloaked spatial area whose boundary is defined by the locations of K − 1 other users. Then, the trusted server sends the location-dependent query along with the cloaked spatial area to location-based service providers to retrieve location-dependent data. Note that since the query location is an area instead of a single query point, location-dependent service providers should fetch the query results based on the cloaked spatial region.
Prior work in [MCA06] proposed a framework for location services without compromising location privacy. However, only a free space environment is considered, which is not fully applicable in real world environments. On the other hand, recent research invented novel mechanisms to compute location-dependent queries on spatial networks. For example, executing nearest neighbor queries based on the spatial network distance provides a more realistic measure for applications where mobile user movements are constrained by underlying networks. Although they devised spatial query schemes for spatial networks, the prior works in [PZMT03, KS04] did not consider location privacy issues. Thus, we aim to provide spatial query solutions (for nearest neighbor queries and range queries) that take privacy protection into account in this study. With our techniques, location-based service users can obtain high quality services without sacrificing their privacy.

^1 Note that we use the symbol K for the degree of anonymity and k for k nearest neighbor queries.

5.2 System Architecture

In this section, we describe the system architecture for supporting privacy protected spatial queries with underlying spatial networks. Figure 5.1 depicts our operating environment with three main entities: mobile users, the location cloaker, and location-based service providers. We consider mobile clients such as cell phones, personal digital assistants (PDA), and laptops that are instrumented with a global positioning system (GPS) for continuous position information. Furthermore, we assume that there are access points/base stations distributed in the system environment for mobile devices to communicate with the location cloaker. All users are mobile and travel on the underlying network, and they also hold privacy policies which specify the privacy requirements of each user.
In this research we focus on two parameters that are included in the user privacy policies: the degree of anonymity, K, and the minimum cloaked region size, R_min. A user can demand that the cloaked area cover the locations of the K − 1 closest peers for anonymizing its exact location. In order to keep a reasonable size of the cloaked area in high user density regions, the user decides the minimum acceptable size R_min. Based on privacy requirements at different locations or time slots, a user can update his/her privacy policies at any time. Table 5.1 summarizes the symbolic notation used throughout this chapter.

[Figure 5.1: The system architecture. Entities shown: mobile users with privacy policies, an access point, the location cloaker, and location-based service providers with spatial databases. Mobile users send location updates and queries to the location cloaker, which forwards cloaked areas and spatial queries to the service providers; query results flow back along the same path.]

Table 5.1: Symbolic notations for Chapter 5.
Symbol       Meaning
A_c          A cloaked area
K            The degree of anonymity
R_min        The minimum cloaked region size
Q            A priority queue
Seg_i        The network segments inside the cloaked area A_c
Seg_o        The network segments outside the cloaked area A_c
T            The set of all intersection points between the cloaked area A_c and the underlying networks
q            A query mobile user
|A|          The cardinality of set A
||a, b||     The Euclidean distance between objects a and b
Dist(a, b)   The network distance between objects a and b

5.2.1 The Location Cloaker

Compared with location-based service providers, the location cloaker is an intermediate agent which can be trusted by mobile users. The location cloaker receives continuous location updates from mobile users and blurs their exact locations into cloaked areas A_c according to individual user privacy policies before forwarding the information to location-based service providers (e.g., for buddy searching services).
In addition, the location cloaker also anonymizes the location of any query requesting user q to a cloaked region before forwarding the query to related location-based service providers. Note that any user identity related information in the query is also removed by the location cloaker during the cloaking process. According to previous research [MCA06, GG03, GL05], there are several different mechanisms to support location anonymization. However, compared with existing solutions, the location anonymizing technique proposed by Mokbel et al. [MCA06] offers notable efficiency (low cloaking time) and flexibility (user defined privacy profile). Therefore, we adopt the grid-based complete pyramid data structure proposed in [MCA06] for our system to provide cloaking functionalities.

5.2.2 Location-based Service Providers

Location-based service providers play the role of spatial data maintainers and spatial query processors in our system. In order to handle privacy protected spatial queries, location-based service providers implement privacy protected query processors in their databases. The privacy protected query processor has the ability to process cloaked spatial queries efficiently and retrieves the inclusive result set (i.e., the minimal set which covers all the possible answers) for query requesters. After receiving the result set, mobile users can distill the exact answers from their locations in linear time. The privacy policies of a user determine the computational complexity of his/her spatial queries. Strict privacy requirements (i.e., large K and R_min values) increase the complexity of processing the query. In order to improve efficiency, only cloaked spatial queries have to be processed by the privacy protected query processor; non-cloaked queries can be processed with existing spatial query algorithms.
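The client-side distillation step mentioned above can be illustrated with a brief sketch. It is an assumption-laden illustration rather than the thesis implementation: the function name and the candidate layout are hypothetical, and the Euclidean distance (`math.dist`) stands in for the network distance Dist(a, b) that a real client would evaluate on the road network.

```python
import heapq
import math


def distill_knn(user_location, candidates, k, distance=math.dist):
    """Pick the k candidates closest to the user's exact location.

    candidates is the inclusive result set returned for the cloaked area,
    given as (name, (x, y)) pairs. Only the client knows user_location, so
    this filtering happens on the mobile device. For a fixed k, one pass
    over the candidates suffices.
    """
    return heapq.nsmallest(k, candidates, key=lambda c: distance(user_location, c[1]))


# Hypothetical inclusive result set and exact user position.
candidates = [
    ("p1", (0.0, 5.0)),
    ("p2", (2.0, 1.0)),
    ("p3", (6.0, 6.0)),
    ("p4", (1.0, -1.0)),
]
user = (1.0, 1.0)

nearest_two = [name for name, _ in distill_knn(user, candidates, k=2)]
print(nearest_two)
```

Passing the distance function as a parameter keeps the sketch independent of how the network distance is actually computed.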
In addition to a privacy protected data processor, an LBSP also needs to maintain spatial databases for storing (cloaked) user locations, spatial data, and road networks. The stored spatial data can be categorized as public data and private data. Public data covers static objects such as restaurants, hotels, and gas stations, as well as dynamic information (e.g., real-time bus locations) which is directly open to public queries. In contrast, private data mainly comprises cloaked mobile user locations from the location cloaker. Based on the two data categories, the spatial queries submitted to an LBSP can be classified into four types: (1) public queries over public data, (2) public queries over private data, (3) private queries over public data, and (4) private queries over private data. For the first query type, solutions have already been proposed in [PZMT03, KS04]. Because the movement of mobile users is limited by the underlying road networks, we can easily extend the mechanisms proposed in [PZMT03, KS04] to solve queries of the second query type (e.g., Figure 5.2a) with probability density functions [CKP03]. To the best of our knowledge, there is no existing solution for the third query type, and the fourth query type (e.g., Figure 5.2b) can be answered by extending the algorithms of the third query type. Therefore, we propose our novel techniques for solving private queries over public data on spatial networks in Section 5.3.

[Figure 5.2: Two novel query types. Panels: (a) public query over private data; (b) private query over private data.]

In order to support queries on spatial networks, we assume a digitization process that generates a modeling graph from an input spatial network. The modeling graph contains three categories of graph nodes: the network junctions, the start/end points of a road segment, and other auxiliary points (e.g., speed limit change points).
In addition, as discussed in [PZMT03], we assume that the spatial network database supports the following primitive operations:

• inside_segments(A_c): returns the set of subsegments of a network N which intersect with the cloaked area A_c.
• find_objects(segment_x): returns the data objects which fall on the input network segment segment_x.
• Dist(p_1, p_2): calculates the network distance between two input points, p_1 and p_2, in the underlying network by applying an algorithm (e.g., Dijkstra's algorithm [Dij59]) to compute the shortest path between p_1 and p_2.

5.3 Privacy Protected Query Processing

In this section we illustrate our mechanisms for solving private queries over public data on road networks. We focus on two popular query types, nearest neighbor queries and range queries.

5.3.1 Privacy Protected Nearest Neighbor Query on Spatial Networks

Given a query point q and an object data set S, a network k nearest neighbor query retrieves the k objects of S closest to q based on the network distance. Papadias et al. [PZMT03] proposed two algorithms (incremental Euclidean restriction and incremental network expansion) to efficiently solve nearest neighbor queries with spatial network databases. However, in order to protect user privacy, a location-based service provider only receives cloaked spatial areas from users. Therefore, LBSPs need a mechanism for retrieving an inclusive query result set based on the input cloaked area and the underlying spatial network. We design a privacy protected spatial network nearest neighbor query (PSNN) algorithm by extending the incremental network expansion solution [PZMT03]. PSNN first locates all the intersection points between the edges of the input cloaked area A_c and the spatial network as a point set T by executing primitive database operations. If T is not empty, A_c covers at least one network segment.
We define the network segment(s) within A_c as Seg_i and the segments outside A_c as Seg_o. Depending on the network topology, the network edges inside A_c can be fully connected or comprise several separate subgroups. Figure 5.3 illustrates the two cases. Since our solutions are designed for the case in which the network edges inside A_c are fully connected, we use a divide-and-conquer strategy to handle the case demonstrated in Figure 5.3b: a pre-process splits the input cloaked area until each subregion contains only one connected network segment set. We then execute our algorithms on each subregion separately and merge their results after the whole computation. For ease of presentation, we assume in the following sections that the pre-process has been done. The number of data objects inside a cloaked area A_c meets one of two conditions: (i) there are at least k objects inside A_c, or (ii) there are fewer than k objects within A_c.

Figure 5.3: The two possible connection statuses of the network segments inside a cloaked area A_c. (a) The network segments in A_c are totally connected. (b) The network segments in A_c comprise two separate subgroups.

The Cloaked Area Contains at Least k Objects. Since Seg_i contains a limited number of network segments, PSNN starts the search on Seg_i to retrieve the data objects inside A_c. If Seg_i covers k or more data objects, the search can be finished by expanding the intersection points in T. First, PSNN inserts the data objects within A_c into the result set R. Then, for each point t_i in T, PSNN calculates the distance from t_i to its k-th nearest object inside A_c as Dist(t_i, n_k), and expands outward from t_i to search for data objects on Seg_o within distance Dist(t_i, n_k). Consequently, we also cover the special case in which q is located exactly at t_i.
All the newly discovered data objects are inserted into the result set R. Figure 5.4 demonstrates an example where the circles represent the nodes in the modeling graph, triangles indicate data objects, and the gray rectangle stands for the cloaked area. Assuming that only one nearest neighbor is queried (k = 1), PSNN first retrieves n_1 by searching the network segments inside A_c. Since the number of data objects found in A_c is equal to 1, PSNN computes the network distances from t_1, t_2, and t_3 to n_1 as 2, 5, and 6, respectively. Afterwards, PSNN expands the search space outward from the three intersection points according to their distances to n_1. No data object is found from the expansion of t_1 and t_3. The expansion of t_2 reaches n_2, which is inserted into the result set. Consequently, the final search result set covers objects n_1 and n_2.

Figure 5.4: Searching k network nearest neighbors with PSNN where the cloaked area contains k objects (k = 1 in this example). Legend: point of interest, junction, intersection.

The Cloaked Area Contains Fewer than k Objects. If fewer than k data objects are found on Seg_i, PSNN has to search the network segments outside of A_c. Since the points in T are all on the boundary of A_c, they determine the upper bound of the network search expansion. Consequently, PSNN executes the network expansion from all the intersection points to retrieve an inclusive result set. For every point t_i in T, the PSNN algorithm first retrieves the network segment p_m p_n which passes through t_i and searches all data objects on this segment. In the meantime, the two end points p_m and p_n are inserted into a queue Q with their distances to t_i.
Afterward, if the search found fewer than k objects on p_m p_n, or it retrieved at least k objects but the distance from one end point of p_m p_n to t_i is shorter than Dist(t_i, n_k) (where n_k is the k-th nearest neighbor of t_i found so far), the end point p_x which is closer to t_i is popped from Q and expanded. For each non-visited adjacent point p_y of p_x, PSNN searches p_x p_y, updates the result set, and inserts p_y with its distance to t_i into Q. Then the point in Q with the shortest distance to t_i is de-queued. The procedure repeats until the k nearest neighbors of t_i are found, and these k objects are inserted into the result set R. PSNN repeats the whole process until it has expanded all the points in T.

Figure 5.5 illustrates an example with k equal to two. PSNN retrieves only n_1 inside A_c, so it has to expand the three intersection points t_1, t_2, and t_3 outward. First, PSNN locates the segment p_1 p_2 which covers t_1; this segment only covers object n_1. Since p_2 is closer to t_1 than p_1, it is expanded, and p_1 is inserted into Q with its distance to t_1. The expansion of p_2 reaches p_3 and p_4, and object n_3 is found on p_2 p_4. At this moment, since Q contains (p_1, 4), (p_3, 6), (p_4, 6) and Dist(t_1, n_3) = 4, the search terminates. The top two nearest neighbors of t_1 are n_1 and n_3, and they are inserted into R. PSNN continues the process with t_2 and retrieves the top two nearest neighbors of t_2 as n_2 and n_1. The search from t_3 demonstrates why we have to expand the joint road segments until we reach the search bound. The search on p_1 p_5 retrieves n_4 and we know that A_c covers n_1; however, n_4 and n_1 are not the true top two nearest neighbors of t_3. After finishing the complete search procedure, we retrieve the correct nearest neighbor set of t_3 as n_4 and n_5. The final result set R covers n_1, n_2, n_3, n_4, and n_5.
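The expansion from a single intersection point described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in our simulator: the intersection point is treated as an ordinary graph node, the network is a small hypothetical one (it is not the network of Figure 5.5), and the names `k_nearest_from` and `expand_all` are invented for the sketch.

```python
import heapq


def k_nearest_from(graph, objects, source, k):
    """Expand the network outward from `source`; return its k nearest objects.

    graph   : {node: {neighbor: edge_length}} for an undirected network
    objects : list of (name, u, v, offset); the object lies on edge (u, v)
              at distance `offset` from u
    """
    # Index objects under both edge directions so either end point finds them.
    on_edge = {}
    for name, u, v, off in objects:
        length = graph[u][v]
        on_edge.setdefault((u, v), []).append((name, off))
        on_edge.setdefault((v, u), []).append((name, length - off))

    best = {}                      # object name -> shortest distance found so far
    settled = set()
    heap = [(0.0, source)]

    def dist_max():
        # Distance to the current k-th nearest object (infinite if fewer than k).
        found = sorted(best.values())
        return found[k - 1] if len(found) >= k else float("inf")

    while heap:
        d, p = heapq.heappop(heap)
        if p in settled:
            continue
        if d >= dist_max():
            break                  # nothing beyond p can improve the k nearest
        settled.add(p)
        for nbr, length in graph[p].items():
            for name, off in on_edge.get((p, nbr), []):
                if d + off < best.get(name, float("inf")):
                    best[name] = d + off
            if nbr not in settled:
                heapq.heappush(heap, (d + length, nbr))
    return sorted(best.items(), key=lambda item: item[1])[:k]


def expand_all(graph, objects, intersection_points, k):
    """Union of the k nearest objects of every intersection point."""
    result = set()
    for t in intersection_points:
        result.update(name for name, _ in k_nearest_from(graph, objects, t, k))
    return result


# Hypothetical network: t is an intersection point on the cloaked-area boundary.
graph = {
    "t": {"a": 2, "b": 3},
    "a": {"t": 2, "c": 4},
    "b": {"t": 3, "c": 1},
    "c": {"a": 4, "b": 1},
}
objects = [("n1", "t", "a", 1), ("n2", "a", "c", 3), ("n3", "b", "c", 0.5)]
nearest_two = k_nearest_from(graph, objects, "t", 2)
```

The loop stops as soon as the next de-queued vertex is at least as far away as the current k-th nearest object, which is the same termination test as the Dist_max bound in the text. The sketch covers only the outward expansion; the objects found inside the cloaked area would still be added to R separately.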
The result set R is then returned to the querying mobile user q, and q evaluates the query locally over the received R to retrieve the exact query result. The complete algorithm of PSNN is shown in Algorithm 6.

Figure 5.5: Searching k network nearest neighbors with PSNN where the cloaked area contains fewer than k objects (k = 2 in this example). Legend: point of interest, junction, intersection.

5.3.2 Privacy Protected Range Query on Spatial Networks

We define a spatial network range query as follows: given a query point q, a range value r, and an object data set S, the query retrieves all elements of S that are within network distance r from q. To execute a privacy protected spatial network range query (PSRQ), we first locate the intersection points of the cloaked spatial area A_c and the underlying road network as a set T = {t_1, ..., t_m}. Then, PSRQ searches the network segments inside A_c and inserts the retrieved objects into R. For each point t_i in T, we compute a set of candidate segments within network range r from t_i and then retrieve the data objects falling on these segments.
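The per-point range step can be illustrated with a bounded network expansion. The sketch below rests on several assumptions: the intersection point is modeled as a graph node, the objects are checked with a plain scan over a list instead of an index, and the network and the name `objects_within_range` are hypothetical.

```python
import heapq


def objects_within_range(graph, objects, source, r):
    """Return {object name: network distance} for objects within r of `source`.

    graph   : {node: {neighbor: edge_length}} for an undirected network
    objects : list of (name, u, v, offset); the object lies on edge (u, v)
              at distance `offset` from u
    """
    inf = float("inf")
    # Dijkstra expansion that never leaves the range r.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        d, p = heapq.heappop(heap)
        if p in settled:
            continue
        settled.add(p)
        for nbr, length in graph[p].items():
            nd = d + length
            if nd <= r and nd < dist.get(nbr, inf):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))

    # An object qualifies if it can be reached within r through either end
    # point of the segment it lies on.
    hits = {}
    for name, u, v, off in objects:
        length = graph[u][v]
        d = min(dist.get(u, inf) + off, dist.get(v, inf) + (length - off))
        if d <= r:
            hits[name] = d
    return hits


# Hypothetical network: t is an intersection point on the cloaked-area boundary.
graph = {
    "t": {"a": 2, "b": 3},
    "a": {"t": 2, "c": 4},
    "b": {"t": 3, "c": 1},
    "c": {"a": 4, "b": 1},
}
objects = [("n1", "t", "a", 1), ("n2", "a", "c", 3), ("n3", "b", "c", 0.5)]
hits = objects_within_range(graph, objects, "t", r=4)
```

Running this for every t_i in T and taking the union, together with the objects found inside A_c, would give an inclusive result set in the sense used above.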
Algorithm 6 PSNN(q, k, A_c)
locate the intersection points between A_c and the underlying network as T = {t_1, ..., t_m}
Seg_i = inside_segments(A_c)
search objects on Seg_i and insert the retrieved objects into R
if Seg_i covers ≥ k objects then
    for all t_i ∈ T do
        expand t_i outward from A_c, searching for objects within distance Dist(t_i, n_k)
        insert any discovered objects into R
    end for
else
    for all t_i ∈ T do
        p_m p_n = find_segment(t_i)
        R_i = R_i ∪ find_objects(p_m, p_n)
            /* {n_1, ..., n_k} = the k nearest objects in R_i sorted in ascending order; n_j, n_j+1, ..., n_k may be ∅ if |R_i| < k */
        Dist_max = Dist(t_i, n_k)    /* Dist_max = ∞ if n_k = ∅ */
        Q = ⟨(p_m, Dist(p_m, t_i)), (p_n, Dist(p_n, t_i))⟩
        de-queue the node p in Q with the smaller Dist(p, t_i)
        while Dist(p, t_i) < Dist_max do
            for each non-visited adjacent vertex p_x of p do
                R_i = R_i ∪ find_objects(p_x, p)
                update Dist_max with sorted R_i
                en-queue (p_x, Dist(p_x, t_i))
            end for
            de-queue the next vertex p in Q
        end while
        R = R ∪ R_i
    end for
end if
sort R to remove duplicates
return R

Because the search range r could be large and could cover many candidate segments, it is inefficient to check each candidate segment with the primitive operation find_objects(segment). Therefore, we utilize the intersection join function [BKS93] to retrieve all intersecting object pairs from the spatial network R-tree and the object R-tree. When reaching the leaf node level, PSRQ executes the plane-sweep method with the object R-tree nodes which intersect the MBR of at least one candidate segment and inserts the qualifying objects into R.

Figure 5.6: Network range query with PSRQ. Legend: point of interest, junction, intersection.

An example is demonstrated in Figure 5.6.
Assuming r is equal to 7 distance units, PSRQ joins the candidate segments (solid lines) with the object R-tree and retrieves leaf node E_5, which intersects with segment p_4 p_6. After executing the intersection test (plane-sweep method), PSRQ retrieves n_4 as the query result. The complete algorithm of PSRQ is shown in Algorithm 7.

Algorithm 7 PSRQ(q, r, A_c)
locate the intersection points between A_c and the underlying network as T = {t_1, ..., t_m}
Seg_i = inside_segments(A_c)
search objects on Seg_i and insert the retrieved objects into R
for each point t_i in T do
    compute all the candidate segments within distance r from t_i as C
    intersection-join C with the object R-tree to find the intersecting leaf nodes
    for each retrieved leaf node E_j do
        R_ti = R_ti ∪ (intersection test of the intersected segments with the data objects in E_j)
    end for
    R = R ∪ R_ti
end for
sort R to remove duplicates
return R

5.4 Evaluation

In this section, we present extensive simulation results of our query processing algorithms. We implemented our privacy protected query algorithms in a simulator to evaluate the performance of our approach. Our main objectives are to observe the influence of performance-related factors (e.g., cloaked region size) on the system and to test the feasibility of our approach with real-world parameter sets. Performance is measured in terms of the result set size and CPU time. All simulation results were recorded after the system model reached steady state.

5.4.1 Simulator Implementation

Our simulator consists of three main components: the mobile environment, the location cloaker, and the location-based service provider. For the mobile environment, we utilized the network-based moving objects generation framework [Bri02] to generate a set of mobile users and the underlying road network inside a geographical area measuring 10 miles by 10 miles.
Each mobile user is an independent object which encapsulates all its related parameters (e.g., its current speed and destination). We implemented the location cloaker as a new module in the framework that interacts with mobile users to anonymize spatial queries. Our privacy protected query approaches (Section 5.3.1 and Section 5.3.2) were also implemented inside the framework as new functions and play the role of the LBSP. We obtained our road network data for the City of Los Angeles and Riverside County from the TIGER/Line street vector data available from the U.S. Census Bureau.

5.4.2 Experiments with Performance Influential Factors

We are interested in the effect of three major performance-influencing factors of spatial queries: the cloaked region size, the value of k, and the number of points of interest (POIs). We experimented with both nearest neighbor queries and range queries with these factors as follows.

Effect of the Cloaked Region Size. We first varied the cloaked region size from 2% to 10% of the whole experimental region, with all the other parameters as described by the Synthetic Suburbia data set in Table 5.2. Figure 5.7 shows the result set size and query processing time with different cloaked region sizes. Since a bigger cloaked region usually intersects with more underlying network segments, it generates a larger candidate result set and takes longer to process. As shown in Figure 5.7a, the curve of PSRQ increases markedly because the result set covers all the POIs within the cloaked region and the search range r. In addition, the query processing time of PSNN increases notably with an enlarged cloaked region size, as illustrated in Figure 5.7b.

Fig. 5.7a. Result set size of PSNN and PSRQ versus cloaked region size (%). Fig. 5.7b. Query processing time (sec.) of PSNN and PSRQ versus cloaked region size (%).
Figure 5.7: The effect of the cloaked region size.

Effect of k. Next we tested the impact of varying the number of requested nearest neighbors, i.e., k. We varied k from 4 to 20 for each query, with all the other parameters at their default values. As shown in Figure 5.8, the result set size grows as k is raised from 4 to 20, and the result set grows faster when k is relatively large. The result set covers around 25% more objects than the requested k when k is equal to 20. The factor has a similar influence on the query processing time, as demonstrated in Figure 5.8b.

Figure 5.8: The effect of k. (a) Result set size of PSNN versus k. (b) Query processing time (sec.) of PSNN versus k.

Effect of the Number of POIs. To see the effect of varying the total number of POIs, we increased it from 200 to 1000 in the simulation environment, with all the other parameters at their default values. Figure 5.9 illustrates the result set size and query processing time of PSNN and PSRQ with an increasing number of POIs. With a higher POI density, the result set of PSRQ increases conspicuously compared with that of PSNN. This is expected, since PSRQ has to report all the objects inside the cloaked region. In addition, the query processing time of PSNN is always longer than that of PSRQ.

Figure 5.9: The effect of the POI number. (a) Result set size of PSNN and PSRQ versus number of POIs. (b) Query processing time (sec.) of PSNN and PSRQ versus number of POIs.
5.4.3 Experiments with Real World Parameter Sets

In order to test our approach in realistic environments, we obtained our simulation parameters from public data sets which report vehicle density and POI density in the City of Los Angeles and Riverside County. We term the two parameter sets based on these real-world statistics the City of Los Angeles parameter set and the Riverside County parameter set. The details of the parameters are as follows.

• Points of Interest: We obtained information about the density of interest objects (e.g., gas stations, restaurants, hospitals, etc.) in the Greater Los Angeles area from two online sites: GasPriceWatch.com and CNN/Money. Because gas stations are commonly the target of spatial queries, we use them as the sample POI type for our simulations. The system performance for other POI types is expected to be very similar.
• Mobile Hosts: We collected vehicle statistics of the Greater Los Angeles area from the Federal Statistics web site. The data provide the number of registered vehicles in the City of Los Angeles and Riverside County (1,092,939 and 944,645, respectively). In our simulations we assume that about 10% of these vehicles are on the road during non-peak hours according to the traffic information from Caltrans. We further obtained the land area of the City of Los Angeles and Riverside County to compute the average vehicle density per square mile.

Parameter     City of Los Angeles   Riverside County   Synthetic Suburbia   Units
POI Num       580                   260                420
MU Num        15325                 3430               9378
K-anonymity   150                   150                150                  peers
R_min         1                     1                  1                    mile^2
k             5                     5                  5
r             1                     1                  1                    mile

Table 5.2: The simulation parameter sets.

The City of Los Angeles and the Riverside County parameter sets represent a very dense, urban area and a low-density, more rural area, respectively. For experimental purposes, we mixed the two real parameter sets to generate a third, synthetic data set.
Table 5.2 lists the parameter sets, where POI Num stands for the total number of POIs and MU Num stands for the total number of mobile users. We used both real-world parameter sets, City of Los Angeles and Riverside County, to simulate our techniques for solving kNN queries and range queries (the range query results are omitted due to space limitations). The simulation results are shown in Figure 5.10. The City of Los Angeles data set generates a larger result set and requires a longer CPU time because of its high user mobility and POI density. However, the performance with the City of Los Angeles data set did not deteriorate much compared with the far sparser Riverside County parameter set. The results for range queries exhibit a similar trend as for kNN queries.

Figure 5.10: The effect of k with real-world parameter sets. (a) Result set size for the Los Angeles and Riverside parameter sets versus k. (b) Query processing time (sec.) for the Los Angeles and Riverside parameter sets versus k.

5.5 Summary

In this chapter we proposed two novel algorithms for processing k nearest neighbor queries and range queries on spatial networks with privacy protection. The main idea is to hide the exact mobile user location with a cloaked region. The cloaked region covers the query requester and at least K − 1 other users based on the K-anonymity concept. The spatial queries are then executed based on both the cloaked region and the underlying networks. A candidate result set is returned to the requesting user, who can extract the exact answer from it. Our comprehensive simulations with real and synthetic parameter sets demonstrate the efficiency of our methods.

Chapter 6

A Distributed Geotechnical Data Management Architecture

Geotechnical information on soil deposits is critical for civil infrastructures.
Local, state and federal agencies, universities, and companies need this information for a variety of civil engineering applications, including land usage and development, and mapping of natural hazards such as soil liquefaction and earthquake ground motions. As the foremost sources of geotechnical information, geotechnical boreholes are vertical holes drilled in the ground for the purpose of obtaining samples of soil and rock materials and determining the stratigraphy, groundwater conditions and/or engineering soil properties [Hun84]. In spite of rather costly drilling operations, boreholes remain the most popular and economical means to obtain subsurface information. These types of data range from basic borehole logs containing a visual inspection report of soil cuttings to sophisticated composite boreholes combining visual inspection and in-situ, laboratory geotechnical and geophysical tests. Figure 6.1(a) shows an example transcript of the Standard Penetration Test (SPT), a particular type of geotechnical borehole test.

Significant amounts of geotechnical borehole data are generated in the field from engineering projects each year. As data collection technologies improve, more and more geotechnical borehole data from the field and laboratory are directly produced in, or converted to, a digital format. Furthermore, with the recent ubiquity of communication networks, particularly the Internet, the trend towards electronic storage and exchange of geotechnical borehole data has accelerated. One significant constraint is that geotechnical data is collected and managed by a multitude of private and public agencies, such as the U.S. Geological Survey (USGS), the California Department of Transportation (CalTrans), and others. Several pilot efforts are underway that aim to facilitate electronic access to geotechnical information.
For instance, the ROSRINE (Resolution of Site Response Issues from the Northridge Earthquake) project has produced an integrated system based on a relational database management system (RDBMS), geographic information system (GIS) and Internet Map Server (IMS) to disseminate geotechnical data via the Internet [SBHN02]. The USGS is continually publishing seismic Cone Penetration Test (CPT) data through a web-based system managed by the Earthquake Hazards Program [Hol01].

Figure 6.1: The photographs illustrate the geotechnical boring activities from drilling until the soil samples are examined. The result is a boring log showing stratigraphy, physical sampling record and SPT blow counts. (a) Example of boring log. (b) Drilling and sampling activities (Pitcher-type sample tubes; SPT-type sample).

The goal of the Geotechnical Information Management and Exchange (GIME) project is to overcome the challenges inherent in data sharing among heterogeneous database repositories under different administrative control. Specifically, the following features and goals have guided our design.

• Autonomy: Each of the archives contains data that is maintained by a specific organization (e.g., USGS). For organizational rather than technical reasons, it is undesirable to replicate or cache the data sets at other participating archives. Data sets may geographically overlap.
• Standardized access: It is desirable to allow direct, programmatic access to distributed data sets from end user applications. To hide the heterogeneity of the numerous data sources, Web services are employed to provide a standardized interface. Web services build upon the idea of accessing resources (storage space, compute cycles, etc.) from a local machine on a powerful remote computer. Unlike earlier attempts to enable this functionality, Web services are broadly accepted and open standards that are supported by all major industry vendors.
• Cooperative and efficient query processing: When presented with a query at any one of the participating database nodes, the overall system must cooperatively execute the request and return all relevant data. For this purpose, an efficient access method is required which can rapidly decide which other nodes contain potentially relevant data and which do not. The query must then be forwarded to the candidate nodes and the result returned expediently to the querying host.

We describe our design and implementation of the distributed GIME infrastructure, comprised of a number of geographically distributed spatial databases as illustrated in Figure 6.2.

6.1 GIME Overview

Figure 6.2 illustrates the architecture of the GIME system. Multiple, distributed geotechnical data archives are accessible via the Internet. Services are provided such that practitioners in the field can directly store newly acquired data in a repository while the data customers are able to access these data sets in a uniform way. By adopting a Web services infrastructure, multiple applications, such as a soil liquefaction analysis or a borehole visualization, can be built in a modular fashion.

GIME distinguishes two types of archives on the basis of the data access allowed: read-write (RW) or read-only (RO). RW archives host three geotechnical Web services to provide the interface for distributed applications to store (File Storage Web service, FSWS), query and retrieve (Query & Exchange Web service, QEWS) and visualize (Visualization Web service, VWS) the geotechnical information. RO archives implement only the QEWS Web service. In this case, an on-site database administrator may insert the data directly into the local database. Figure 6.2 illustrates the components of an RW archive in the upper, left corner.

Finding the relevant data sets required for a specific application among all the geotechnical archives can be a daunting task.
To conveniently process the spatial queries and locate the relevant information, we have designed an efficient query routing algorithm for GIME that automatically forwards queries to other known archives and collects the results before returning the data to the application (see Section 6.1.2). Such forwarding mechanisms can be effective, as demonstrated previously by the SkyQuery project [MSBT03] and our own distributed query routing techniques [ZKC04].

The retrieved geotechnical borehole data is complex in that it contains both well structured and semi-structured elements. In Figure 6.1(a), for example, the Material Description field (3rd column from left) contains free-form text, while some of the other columns are well structured. Therefore, an efficient data format for storage and exchange is required that is suitable for the diversity of geotechnical borehole data. In GIME, we use XML as the preferred container format for both storage and exchange of borehole data [ZBK+03]. XML offers many advantages over other data formats for borehole data [AGS99, Sur02]. Its tree structure and flexible syntax are ideally suited for describing constantly evolving and annotated borehole data. It also lends itself to an automated visualization capability that converts XML geotechnical data into a graphical view similar to the traditional hardcopy format. The output can be presented as Scalable Vector Graphics (SVG)¹. Note that any uploaded data file is first placed in a temporary space where it is validated against the Document Type Definition (DTD) or XML Schema for geotechnical data sets. If the file is accepted, the data is then stored in the main database.

¹ http://www.w3.org/Graphics/SVG/

Figure 6.2: The GIME architecture is composed of multiple, distributed data archives. Some archives are read-only while others allow read and write access. Each archive contains a middleware utilizing replicated spatial index structures.

6.1.1 Geotechnical Web Services Functionalities

Web services commonly operate from a combination of a Web server and an application server, and they can be implemented using many existing tools. Our local GIME prototype testbed utilizes the open source software components Apache Tomcat (web server) and Apache Axis (application server). The application code specific to GIME is embedded within Axis, which is convenient to use: when Tomcat starts, Axis automatically compiles the application code located in its working directory and generates the necessary object class files.

The three main geotechnical Web services provide a number of specialized methods for programmatic access to the geotechnical data. The file storage service provides client programs with an interface to upload their XML data files into the main database. During the upload process, meta-data is extracted and stored. The meta-data includes specific elements of the imported files to facilitate querying.
The main purpose of the query and exchange Web service is to facilitate the dissemination of the valuable geotechnical data sets and encourage their usage in broad and novel applications. XML borehole files, although easily readable by computers, become meaningful to geologists and civil engineers only after they are rendered into images (e.g., SVG format). Therefore, generating SVG files is the main purpose of the visualization service. The following is a list of the GIME Application Programming Interface (API) methods.

• GeoPutDTD(): Upload and store a new DTD file on the server (FSWS).
• GeoGetDTD(): Retrieve the current DTD file (FSWS).
• GeoPutXMLFile(): Upload an XML borehole data file and store it on the server (FSWS).
• GeoQuery(): Execute a query expression and return a list of unique identification numbers, one for each borehole file in the result set (QEWS).
• GeoGetXMLFile(): Retrieve an XML borehole data file based on a unique identification number (QEWS).
• GeoVisualization(): Transform the XML borehole file selected with the unique identification number into SVG format on the server (VWS).
• GeoGetSVGFile(): Retrieve an SVG borehole file based on a unique identification number (VWS).

These methods allow uniform access to the GIME functionality while hiding the heterogeneous resources.

6.1.2 Efficient Query Routing with Spatial Indexing

Given a federation of independently managed spatial database servers, one research challenge is the efficient querying of this distributed infrastructure. Note that the data set in each repository may be disjoint or may overlap with other archives. So that an application does not have to contact each and every repository, we have implemented a distributed query mechanism that efficiently and automatically forwards queries to other known archives and collects the results before returning the data to the application. Repository data sets are spatially indexed at a middleware layer via replicated R-trees or Quadtrees [ZKC04].
The concept is illustrated in Figure 6.2. We first introduce a baseline algorithm for comparison purposes.

Baseline Method: Exhaustive Query Routing. We assume a non-volatile environment where occasionally, but not very frequently, a repository (node) leaves or joins. The data sets that we consider are large and valuable, and hence they are professionally managed. As a result, we can compile a list of all the participating archives. This list may not be completely up-to-date at a specific time instance, but it is accurate enough to result in few disruptions. The list is distributed to every archive, and a query q that arrives at a specific node is forwarded to all other nodes for exhaustive processing. We call this naive method exhaustive query routing (EQR). Even though EQR is inefficient, it is useful as a baseline mechanism to compare our more sophisticated models against.

The metric that we use to compare the different techniques is the total number of messages created in the system to execute a query q and collect the results. A lower number of messages reduces network traffic and indicates better scalability of the system. The number of messages generated by queries with EQR can be represented as M = 2 × Q × (N − 1): each of the Q queries is forwarded to the N − 1 other archives, where N is the number of archives in the system, and the total is doubled because an equal number of result messages are generated.

Query Routing with R-Trees and Quadtrees Spatial Indexing. For spatial data indexing, the R-tree [Gut84] and Quadtree [FB74] families of algorithms are well established. Both build a tree structure that partitions the overall space into successively smaller areas at lower levels of the index hierarchy. R-trees and Quadtrees are very successfully used in the core engines of spatial database systems. We use them in a novel way as index structures across multiple spatial databases to decrease the query forwarding traffic.
Specifically, we insert the minimum bounding rectangle (MBR) of the data set of each archive into a global R-tree or Quadtree. Because we prefer to avoid a centralized index server, we further distribute copies of this global index structure to each archive. During query processing, an archive intersects each query rectangle with the archive MBRs stored in the global index. The query is then forwarded only to candidate archives whose MBR overlaps with the query rectangle, which significantly reduces inter-node message traffic. Note that forwarded queries are marked to show that they originated from a server rather than a client, to avoid query loops. The results of forwarded queries are returned to the initially contacted server, which aggregates them and returns the set to the client.

An overhead cost is incurred with the above design because it requires the global index structures to be synchronized. Furthermore, the complexity of keeping a distributed data structure consistent is added. However, one characteristic of this technique greatly reduces the overhead and makes it an attractive solution. Because the global index structures manage bounding rectangles, changes to the data set of any individual archive result in index updates only if the MBR changes, which is very infrequent. Consider the following example. An archive manages 1,000 two-dimensional spatial data objects. The MBR is defined by at most four of them (see footnote 2). Any insertion or deletion of data objects confined within the MBR does not affect the global index; only changes that either stretch or shrink the MBR need to be propagated. Equation 6.1 shows the estimation function of the number of messages when a global index is used.

$$M = [2 \times Q \times (N - 1) \times S_Q] + [U \times (N - 1) \times S_U] \qquad (6.1)$$

The query traffic is reduced by a factor equal to the selectivity $S_Q$ of the query. The selectivity, $0 \le S_Q \le 1$, estimates what fraction of the total number of objects may be in the result set.
Similarly, the update message traffic U is diminished by the factor $0 \le S_U \le 1$ that describes how many of the data updates (including insertions and deletions) actually result in global index changes. Consequently, the aggregate number of messages is dependent on the frequency of data updates in the system. Table 6.1 shows an example of $S_U$ values ranging from 0.002 to 0.045 in one of our experiments when increasing both the number of servers and updates linearly. Recall that the exhaustive query routing technique has no such dependency. As we will show in Section 6.3, global indexing generally results in a significant reduction of message traffic, with the benefits being highest when few updates must be propagated.

Footnote 2: More than four points may define a rectangle if some of them have the exact same x- or y-coordinate values.

R-tree Based Design
Servers   Updates   MBR Changes   Index Structure Changes   S_U
10        1000      45            45                        0.045
100       10000     398           398                       0.0398
200       20000     806           806                       0.0403
400       40000     1582          1582                      0.0396
600       60000     2457          2457                      0.0410
800       80000     3359          3359                      0.0420
1000      100000    4083          4083                      0.0408

Quadtree Based Design
Servers   Updates   MBR Changes   Index Structure Changes   S_U
10        1000      45            3                         0.003
100       10000     398           20                        0.002
200       20000     806           54                        0.0027
400       40000     1582          91                        0.0023
600       60000     2457          143                       0.0024
800       80000     3359          186                       0.0023
1000      100000    4083          207                       0.0021

Table 6.1: The relationship between data updates, MBR updates, and index updates as a function of the number of servers and updates.

For the R-tree based design, every MBR change translates into an index structure update. By contrast, in the Quadtree design the index structure updates are reduced by an order of magnitude. Hence we can conclude that the update message traffic to synchronize distributed Quadtrees is much lower than for R-trees. This is because most MBR updates do not affect the Quadtree structure, hence no index synchronization is necessary.
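As a rough worked example of the two message estimates in this section, the following Python sketch evaluates the EQR formula and Equation 6.1 side by side. The server count, update count, and S_U values come from the 100-server rows of Table 6.1; the query count Q = 10,000 and the selectivity S_Q = 0.3 are illustrative assumptions rather than measurements, and the function names are hypothetical.

```python
def eqr_messages(queries: int, servers: int) -> int:
    """EQR estimate: M = 2 * Q * (N - 1)."""
    return 2 * queries * (servers - 1)


def indexed_messages(queries: int, servers: int, updates: int,
                     s_q: float, s_u: float) -> float:
    """Equation 6.1: query traffic scaled by S_Q plus update traffic scaled by S_U."""
    query_traffic = 2 * queries * (servers - 1) * s_q
    update_traffic = updates * (servers - 1) * s_u
    return query_traffic + update_traffic


if __name__ == "__main__":
    N, U = 100, 10_000          # 100-server row of Table 6.1
    Q, S_Q = 10_000, 0.3        # assumed values, for illustration only
    print("EQR:     ", eqr_messages(Q, N))
    print("R-tree:  ", indexed_messages(Q, N, U, S_Q, s_u=0.0398))
    print("Quadtree:", indexed_messages(Q, N, U, S_Q, s_u=0.002))
```

Under these assumed inputs the update term is roughly 39,000 messages for the R-tree design and about 2,000 for the Quadtree design, against roughly 594,000 query messages in both cases, so the query selectivity dominates the estimate.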
Table 6.1 illustrates the case where the activity per server is relatively constant. Based on these results our techniques are expected to scale well.

6.2 GIME Client Application Example

The feasibility of the GIME concepts is illustrated with a sample client application shown in Figure 6.3. This borehole query and drafting (BQ&D) application is written in Java and demonstrates all the features of GIME, i.e., query and visualization of borehole data, and exchange of XML files. The background map is assembled from aerial images retrieved from the Microsoft TerraServer Web service [BGS+02]. A query window can be selected graphically and the meta-data of matching borehole files are obtained from GIME and their locations displayed as dots over the background map (Figure 6.3). Borehole meta-data can be viewed by clicking on the dots. One advantage of a Web services infrastructure over traditional browser based applications is that raw data can be programmatically accessed and locally processed. For example, stand-alone engineering programs can import information from GIME.

This integration is illustrated with a drafting capability in our client. The drafting component produces fence diagrams and Log of Test Borings (LOTBs) along user defined alignments, based on the data retrieved from GIME Web services. A fence diagram is a two dimensional interpretation of the soil stratigraphy along a (usually) vertical plane. The right-most pop-up window in Figure 6.3 illustrates the concept of a fence diagram. Fence diagrams and LOTBs are the fundamental tools for civil engineers to perform major geotechnical studies. Traditionally, the production of these diagrams has required significant drafting efforts. The drafting application reduces this effort by automating the drafting process and directly accessing the required borehole data via the GIME infrastructure. An earlier version of the visualization component generates SVG files using Apache Batik.
It can be downloaded from the GIME website and installed as a standalone Java program along with the Batik Squiggle package to display SVG files. More details are available from the GIME website at http://datalab.usc.edu/gime/.

6.3 Experimental Validation

To validate the efficiency of our query routing design with a controlled and large amount of access traffic, we ported the GIME modules to a simulation environment. The query routing component was implemented with a plug-in module for both the R-tree and the Quadtree algorithms. In a distributed environment, the search complexity is dominated by the communication overhead between servers. Therefore, the focus of our simulation was on quantifying the query routing traffic generated by processing a sequence of spatial range queries and updates. If a query window intersected with several server MBRs, then the query was forwarded to each of those repositories. Tree update information was broadcast to all servers. We performed our experiments with both synthetic and real-world spatial data sets.

Experiments

The total simulation time for all our experiments was ten hours. For each query event, a server was randomly chosen as the injection point to receive the query from a client. Update events were randomly decided to be either insertions or deletions. For an insertion event, the simulator randomly chose a server to receive the new borehole location, which could expand that server's MBR. For a deletion event, the simulator simply deleted a randomly selected borehole location from a randomly chosen server. If the deletion resulted in a tree index change (i.e., the MBR shrank), the simulator counted its related update synchronization traffic cost. We also created a simulation module to analyze the message traffic generated by the exhaustive query routing (EQR) mechanism with the same event sequence. Consequently, we were able to measure the performance differences between the two approaches.
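The per-event message accounting described above can be sketched as follows. This is a minimal illustration and not the simulator used in the experiments: it scans a plain list of server MBRs instead of an R-tree or Quadtree, it treats every MBR change as an index change (the R-tree case in Table 6.1), it covers insertions only, and all names are hypothetical.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two axis-aligned rectangles share at least one point."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def eqr_query_messages(num_servers: int) -> int:
    """EQR: forward to every other server and receive one result from each."""
    return 2 * (num_servers - 1)


def indexed_query_messages(server_mbrs: List[Rect], entry: int, window: Rect) -> int:
    """Forward only to other servers whose MBR overlaps the query window."""
    targets = [i for i, mbr in enumerate(server_mbrs)
               if i != entry and intersects(mbr, window)]
    return 2 * len(targets)


def insertion_update_messages(server_mbrs: List[Rect], server: int,
                              x: float, y: float) -> int:
    """Insert a point; broadcast to all other servers only if the MBR grows."""
    old = server_mbrs[server]
    new = Rect(min(old.xmin, x), min(old.ymin, y),
               max(old.xmax, x), max(old.ymax, y))
    if new == old:
        return 0                      # point lies inside the MBR: no sync traffic
    server_mbrs[server] = new
    return len(server_mbrs) - 1       # one update message per other server


if __name__ == "__main__":
    mbrs = [Rect(0, 0, 9, 9), Rect(10, 0, 19, 9),
            Rect(0, 10, 9, 19), Rect(10, 10, 19, 19)]
    window = Rect(5, 5, 12, 8)        # overlaps servers 0 and 1 only
    print(eqr_query_messages(len(mbrs)))
    print(indexed_query_messages(mbrs, entry=0, window=window))
    print(insertion_update_messages(mbrs, server=2, x=5, y=25))
```

Handling deletions would additionally require each server to keep its point set so that a shrunken MBR can be recomputed; that bookkeeping is omitted here to keep the sketch short.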
Discussion of Results

Figure 6.4(a) demonstrates that our design improves the query routing performance significantly with both synthetic and real-world data sets. The tree-based designs result in a decrease of approximately 60% to 70% of inter-server message traffic compared with exhaustive query routing (with query window sizes of 10% to 20% of the data area). We also defined a comparison metric termed the network performance improvement rate as

$$\mathrm{NPIR} = \frac{T_{EQR} - T_{tree}}{T_{EQR}}.$$

Figure 6.4(b) shows the results of our scalability experiments where we increased the number of servers from ten to one thousand while keeping the other parameters constant. As illustrated, the NPIR stays remarkably constant over the full spectrum of configurations and we conclude that the tree-based designs scale well to large distributed systems. Additionally, we found that the data distribution influences the performance of the systems with tree-based query routing. The best condition exists when there is no overlap among any server MBRs. With no MBR overlap the tree-based designs can reduce inter-server traffic by up to 90%. In the worst case, the performance drops to the same level as EQR or slightly below it (because of the update costs). Therefore, a system designer needs to consider the characteristics of the data set before opting for the tree-based query routing algorithms.

Figure 6.3: A sample Borehole Query & Drafting client application. After a query has been issued a small pop-up display (the left window) illustrates the metadata of a borehole when the user clicks on one of the dots (indicating a borehole location) on the map. The fence diagram rendering (the right window) is generated from three selected XML data files.
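The NPIR definition from the discussion above can be evaluated directly. The sketch below assumes that T_EQR and T_tree denote total message counts and that the tree-based total includes update synchronization messages; the sample counts are made up for illustration and are not taken from the experiments.

```python
def npir(t_eqr: float, t_tree: float) -> float:
    """Network performance improvement rate: (T_EQR - T_tree) / T_EQR."""
    if t_eqr <= 0:
        raise ValueError("t_eqr must be positive")
    return (t_eqr - t_tree) / t_eqr


if __name__ == "__main__":
    # Hypothetical totals: fewer query messages plus some update messages.
    print(npir(600_000, 200_000 + 10_000))
    # Hypothetical worst case: heavy MBR overlap, so query traffic matches EQR
    # and the update messages push the tree-based total above it.
    print(npir(600_000, 600_000 + 10_000))
```

A negative value corresponds to the worst case described in the text, where update costs make the tree-based design slightly worse than EQR.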
Figure 6.4(a): accumulated messages versus time in minutes, with curves for update messages, EQR query messages, R-tree query messages, and Quadtree query messages. Figure 6.4(b): network performance improvement rate (%) versus total number of servers, with curves for the R-tree based design and the Quadtree based design.

Figure 6.4: Performance of two query routing techniques. Fig. 6.4(a) compares the query message traffic between the EQR mechanism and the tree-based designs over a ten hour period. The update message traffic is negligible (synthetic and real-world data sets). Fig. 6.4(b) illustrates that the performance of the tree-based designs remains stable when the number of servers increases from ten to one thousand.

6.4 Summary

We presented our Geotechnical Information Management and Exchange architecture aimed at facilitating the utilization of geotechnical information. We illustrated the usefulness of this design with a relevant application that implements direct programmatic access to geotechnical data via our Web services. The Web services are being presently tested in realistic work environments in collaboration with municipal, state, and federal agencies as well as international partners.

Chapter 7 Conclusions

This dissertation presents novel approaches to support spatial queries in mobile environments with short access latency and higher query result accuracy. Mobile user spatial query privacy protection techniques are also illustrated. In addition, a distributed geotechnical information management architecture is introduced as an application of the proposed algorithms.

7.1 Summary

The SBNN and SBRQ algorithms are the mechanisms to reduce access latencies of nearest neighbor queries and window queries in broadcast systems.
The two techniques allow a mobile peer to locally verify whether candidate objects received from neighbors are indeed part of its own spatial query result set. By virtue of its peer-to-peer architecture, the method exhibits great scalability: the higher the mobile peer density, the more queries can be answered by peers. Therefore, the query access latency can be markedly decreased as the number of clients increases.

Since SBNN and SBRQ mainly focus on processing spatial queries over static information, I have proposed two other algorithms, LANN and GANN, for utilizing dynamic information. In addition, I have presented the concept of a travel time network that is dynamically and continuously updated. The next generation of applications will require real-time information to be integrated to produce search results that reflect the most current network conditions. With travel time networks and the two novel algorithms, mobile users can obtain more accurate query results compared with existing solutions.

In order to protect mobile user privacy when executing spatial queries, I proposed two novel algorithms, PSNN and PSRQ. The main idea is to hide the exact mobile user location with a cloaked region. The cloaked region covers the query requester and at least K−1 other users based on the K-anonymity concept. Then the spatial queries are executed based on both the cloaked region and the underlying networks. A candidate result set is returned to the requesting user, who can then filter out the exact answer.

To demonstrate the practicability of the proposed algorithms, I implemented a distributed geotechnical information management architecture and illustrated the usefulness of this architecture with a relevant application that implements direct programmatic access to geotechnical data via our Web services.
The Web services are being presently tested in realistic work environments in collaboration with municipal, state, and federal agencies as well as international partners.

7.2 Future Work

I plan to discover more useful properties and related applications of these proposed novel data management methods in the future. Three possible future research directions are as follows.

7.2.1 Energy Consumption

The mobile data management algorithms proposed in this research can also be utilized to support hand-held devices (e.g., PDAs, cell phones). However, power management will become an important topic for study. Due to the slow advancement of battery technology, power management in wireless networks remains a critical issue. There are some existing solutions for decreasing power consumption in mobile environments, e.g., the asynchronous wake-up model [ZHS03]. I plan to integrate these existing power management solutions, with extensions, into the proposed algorithms to support devices with limited power supply.

7.2.2 Communication Protocols

The current algorithms utilize ad hoc communication for reaching peers within a single hop. Consequently, the number of peers that can be reached is limited, and geocast routing protocols [Mai04] can be a solution to extend the communication range of a mobile host. Geocasting is the delivery of a message to mobile hosts within a geographical region. Other than the destination's position, each peer needs to know only its own position and the position of its one-hop neighbors in order to forward messages. Since it is not necessary to maintain explicit routes, georouting scales well even if the network is highly dynamic. With geocast, new services and applications are feasible, such as finding friends who are nearby, geographic advertising, and road accident warning systems. Therefore, after integrating geocast techniques into the algorithms, many promising applications can be invented.
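To make the forwarding rule above concrete, the following sketch picks a next hop using only the peer's own position, its one-hop neighbors' positions, and the destination's position. It uses greedy forwarding, which is one common position-based strategy and is chosen here as an assumption; it is not claimed to be the protocol this dissertation would adopt, and delivery to every host inside a geocast region is not modeled. All names are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def next_hop(own: Point, neighbors: Dict[str, Point], dest: Point) -> Optional[str]:
    """Return the id of the neighbor closest to dest, provided it is strictly
    closer than this peer; return None if no neighbor makes progress."""
    best_id: Optional[str] = None
    best_dist = math.dist(own, dest)      # progress must beat our own distance
    for peer_id, position in neighbors.items():
        d = math.dist(position, dest)
        if d < best_dist:
            best_id, best_dist = peer_id, d
    return best_id


if __name__ == "__main__":
    neighbors = {"a": (1.0, 0.0), "b": (3.0, 1.0), "c": (-1.0, 0.0)}
    print(next_hop((0.0, 0.0), neighbors, dest=(10.0, 0.0)))
```

No routing table is kept between calls, which reflects the point made above that explicit routes need not be maintained. A peer that receives None has no neighbor closer to the destination, and a real protocol would need a recovery strategy for that case.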
7.2.3 Approximate Spatial Queries in Mobile Environments

Conventional spatial queries are usually meaningless in mobile environments since their results may be invalidated as soon as the query data objects move. Tao and Papadias [TP03] proposed solutions for the most common spatial queries (i.e., window queries, nearest neighbor queries, spatial joins) in dynamic environments. However, their mechanism is based on a centralized database server, which can incur long response times, and users have to submit predefined trajectories. Therefore, I intend to develop approximate (yet highly accurate) spatial query techniques that offer quick response times and let users move without any restraints. A preliminary solution for approximate nearest neighbor queries has been proposed in Section 3.3.2.

Reference List

[AA01] Ashraf Aboulnaga and Walid G. Aref. Window Query Processing in Linear Quadtrees. Distributed and Parallel Databases, 10(2):111–126, 2001.
[AGS99] AGS. Electronic Transfer of Geotechnical and Geoenvironmental Data. Association of Geotechnical and Geoenvironmental Specialists, 3rd Edition, United Kingdom, 1999.
[Bar99] Daniel Barbará. Mobile Computing and Databases - A Survey. IEEE Trans. Knowl. Data Eng., 11(1):108–117, 1999.
[BGS+02] Tom Barclay, Jim Gray, Eric Strand, Steve Ekblad, and Jeffrey Richter. TerraService.NET: An Introduction to Web Services. Technical Report MSR-TR-2002-53, Microsoft Research, June 2002.
[BKS93] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. Efficient Processing of Spatial Joins Using R-Trees. In SIGMOD Conference, pages 237–246, 1993.
[BKS01] Stephan Börzsönyi, Donald Kossmann, and Konrad Stocker. The skyline operator. In ICDE, pages 421–430, 2001.
[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. In SIGMOD Conference, pages 322–331, 1990.
[BMJ+98] Josh Broch, David A. Maltz, David B.
Johnson, Yih-Chun Hu, and Jorjeta Jetcheva. A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols. In Proceedings of the 4th ACM/IEEE MobiCom, pages 85–97, 1998.
[Bri02] Thomas Brinkhoff. A Framework for Generating Network-Based Moving Objects. GeoInformatica, 6(2):153–180, 2002.
[CKP03] Reynold Cheng, Dmitri V. Kalashnikov, and Sunil Prabhakar. Evaluating Probabilistic Queries over Imprecise Data. In SIGMOD Conference, pages 551–562, 2003.
[CLC04] Chi-Yin Chow, Hong Va Leong, and Alvin Chan. Peer-to-Peer Cooperative Caching in Mobile Environment. In ICDCS Workshops, pages 528–533, 2004.
[CSL01] Boris Y. L. Chan, Antonio Si, and Hong Va Leong. A Framework for Cache Management for Mobile Databases: Design and Evaluation. Distributed and Parallel Databases, 10(1):23–57, 2001.
[dBvKOS00] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications (2nd Edition). Springer, 2000.
[Dij59] Edsger W. Dijkstra. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1:269–271, 1959.
[EA04] Alon Efrat and Arnon Amir. Buddy Tracking - Efficient Proximity Detection Among Mobile Friends. In INFOCOM, 2004.
[FB74] Raphael A. Finkel and Jon Louis Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Inf., 4:1–9, 1974.
[FRM94] Christos Faloutsos, M. Ranganathan, and Yannis Manolopoulos. Fast Subsequence Matching in Time-Series Databases. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24-27, pages 419–429, 1994.
[GC04] Gregor Gaertner and Vinny Cahill. Understanding Link Quality in 802.11 Mobile Ad Hoc Networks. IEEE Internet Computing, 8(1):55–60, 2004.
[GG03] Marco Gruteser and Dirk Grunwald. Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking. In MobiSys, 2003.
[GL04] Marco Gruteser and Xuan Liu.
Protecting Privacy in Continuous Location-Tracking Applications. IEEE Security & Privacy, 2(2):28–34, 2004.
[GL05] Bugra Gedik and Ling Liu. Location Privacy in Mobile Systems: A Personalized Anonymization Model. In 25th International Conference on Distributed Computing Systems, pages 620–629, 2005.
[Gut84] Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD 1984, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, pages 47–57, 1984.
[Hol01] Tom Holtzer. Distribution of USGS CPT Data via the Web. In Proceedings of the COSMOS/PEER-LL Workshop on Archiving and Web Dissemination of Geotechnical Data, Richmond, CA, October 2001.
[HS99] Gísli R. Hjaltason and Hanan Samet. Distance Browsing in Spatial Databases. ACM Transactions on Database Systems, 24(2):265–318, 1999.
[HSW94] Yixiu Huang, Prasad Sistla, and Ouri Wolfson. Data Replication for Mobile Computers. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, United States, pages 13–24, 1994.
[Hun84] R.E. Hunt. Geotechnical Engineering Investigation Manual. McGraw-Hill, New York, 1984.
[IVB97] Tomasz Imielinski, S. Viswanathan, and B. R. Badrinath. Data on Air: Organization and Access. IEEE Transactions on Knowledge and Data Engineering, 9(3):353–372, 1997.
[Jag97] H. V. Jagadish. Analysis of the Hilbert Curve for Representing Two-Dimensional Space. Information Processing Letters, 62(1):17–22, 1997.
[JK96] Ravi Jain and Narayanan Krishnakumar. An Asymmetric Cost Model for Query Processing in Mobile Computing Environments. Kluwer Wireless Information Networks, pages 363–377, 1996.
[KRR02] Donald Kossmann, Frank Ramsak, and Steffen Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB, pages 275–286, 2002.
[KS04] Mohammad R. Kolahdouzan and Cyrus Shahabi. Voronoi-based K Nearest Neighbor Search for Spatial Network Databases.
In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada, pages 840–851, 2004.
[KZW07] Wei-Shinn Ku, Roger Zimmermann, and Haixun Wang. Location-based Spatial Queries with Data Sharing in Wireless Broadcast Environments. In ICDE, 2007.
[KZWN06] Wei-Shinn Ku, Roger Zimmermann, Haojun Wang, and Trung Nguyen. ANNATTO: Adaptive Nearest Neighbor Queries in Travel Time Networks. In Mobile Data Management, 2006.
[KZWW05] Wei-Shinn Ku, Roger Zimmermann, Haojun Wang, and Chi-Ngai Wan. Adaptive Nearest Neighbor Queries in Travel Time Networks. In 13th ACM International Symposium on Geographic Information Systems, ACM-GIS 2005, November 4-5, Bremen, Germany, Proceedings, pages 210–219, 2005.
[LC00] Shou-Chih Lo and Arbee L. P. Chen. Optimal index and data allocation in multiple broadcast channels. In ICDE, pages 293–302, 2000.
[LLZX06] Ken C. K. Lee, Wang-Chien Lee, Baihua Zheng, and Jianliang Xu. Caching complementary space for location-based services. In EDBT, pages 1020–1038, 2006.
[Mai04] Christian Maihöfer. A Survey of Geocast Routing Protocols. IEEE Communications Surveys and Tutorials, 6(2), 2004.
[MCA06] Mohamed F. Mokbel, Chi-Yin Chow, and Walid G. Aref. The New Casper: Query Processing for Location Services without Compromising Privacy. In VLDB, pages 763–774, 2006.
[Mok06] Mohamed F. Mokbel. Towards Privacy-Aware Location-Based Database Servers. In ICDE Workshops, page 93, 2006.
[MSBT03] Tanu Malik, Alex S. Szalay, Tamas Budavari, and Ani R. Thakar. SkyQuery: A Web Service Approach to Federate Databases. In Proceedings of the First Biennial Conference on Innovative Data Systems Research (CIDR 2003), pages 188–196, Asilomar, CA, January 5-8 2003.
[PTFS03] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD Conference, pages 467–478, 2003.
[PTFS05] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger.
Progressive skyline computation in database systems. ACM Trans. Database Syst., 30(1):41–82, 2005.
[PZMT03] Dimitris Papadias, Jun Zhang, Nikos Mamoulis, and Yufei Tao. Query Processing in Spatial Network Databases. In Proceedings of the 29th International Conference on Very Large Data Bases, September 9-12, Berlin, Germany, pages 802–813, 2003.
[RDK03] Qun Ren, Margaret H. Dunham, and Vijay Kumar. Semantic caching and query processing. IEEE Trans. Knowl. Data Eng., 15(1):192–210, 2003.
[RFH+01] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott Shenker. A scalable content-addressable network. In Proceedings of the ACM SIGCOMM Conference, pages 161–172, 2001.
[RKV95] Nick Roussopoulos, Stephen Kelley, and Frédéric Vincent. Nearest neighbor queries. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, pages 71–79, 1995.
[SBHN02] Jennifer Swift, Jean-Pierre Bardet, Jianping Hu, and Robert Nigbor. An Integrated RDBMS-GIS-IMS System for Dissemination of Information in Geotechnical Earthquake Engineering. Computers & Geosciences, 2002.
[SCB] Samuel Sheng, Anantha Ch, and R. W. Brodersen. A portable multimedia terminal for personal communications. IEEE Communications Magazine, 30:64–75, December.
[SHG03] Bill N. Schilit, Jason I. Hong, and Marco Gruteser. Wireless Location Privacy Protection. IEEE Computer, 36(12):135–137, 2003.
[SJLL00] Simonas Saltenis, Christian S. Jensen, Scott T. Leutenegger, and Mario A. Lopez. Indexing the Positions of Continuously Moving Objects. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, pages 331–342, 2000.
[SK] Thomas Seidl and Hans-Peter Kriegel. Optimal Multi-Step k-Nearest Neighbor Search. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, Seattle, Washington, USA.
[SKS02] Cyrus Shahabi, Mohammad R.
Kolahdouzan, and Mehdi Sharifzadeh. A Road Network Embedding Technique for k-Nearest Neighbor Search in Moving Object Databases. In Proceedings of the Tenth ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, USA, November 8-9, pages 94–10, 2002.
[SMK+01] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM Conference, pages 149–160, 2001.
[SMLN+03] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003.
[Spi04] Sarah Spiekermann. General Aspects of Location Based Services. In Location-Based Services, pages 9–26. 2004.
[SR01] Zhexuan Song and Nick Roussopoulos. K-Nearest Neighbor Search for Moving Query Point. In Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, Redondo Beach, CA, USA, pages 79–96, 2001.
[SRF87] Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. In VLDB, pages 507–518, 1987.
[Sur02] User Survey. Archiving and Web Dissemination of Geotechnical Data Website. COSMOS/PEER-LL 2L02, USC, 2002. URL http://geoinfo.usc.edu/gvdc/.
[SWCD97] A. Prasad Sistla, Ouri Wolfson, Sam Chamberlain, and Son Dao. Modeling and Querying Moving Objects. In Proceedings of the Thirteenth International Conference on Data Engineering, Birmingham U.K., pages 422–432, 1997.
[Swe02] Latanya Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[TP03] Yufei Tao and Dimitris Papadias. Spatial Queries in Dynamic Environments. ACM Trans. Database Syst., 28(2):101–139, 2003.
[TPS02] Yufei Tao, Dimitris Papadias, and Qiongmao Shen. Continuous Nearest Neighbor Search. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, pages 287–298, 2002.
[WZK05] Haojun Wang, Roger Zimmermann, and Wei-Shinn Ku. ASPEN: An Adaptive Spatial Peer-to-Peer Network. In Proceedings of the 13th ACM International Workshop on Geographic Information Systems, ACM-GIS 2005, November 4-5, Bremen, Germany, pages 230–239, 2005.
[ZBK+03] Roger Zimmermann, Jean-Pierre Bardet, Wei-Shinn Ku, Jianping Hu, and Jennifer Swift. Design of a Geotechnical Information Architecture Using Web Services. In Proceedings of the Seventh World Multi-Conference on Systemics, Cybernetics and Informatics (SCI 2003), Orlando, Florida, July 27–30, 2003.
[ZHS03] Rong Zheng, Jennifer C. Hou, and Lui Sha. Asynchronous Wakeup for Ad Hoc Networks. In Proceedings of the 4th ACM International Symposium on Mobile Ad Hoc Networking and Computing, MobiHoc 2003, Annapolis, Maryland, USA, pages 35–45, 2003.
[ZKC04] Roger Zimmermann, Wei-Shinn Ku, and Wei-Cheng Chu. Efficient Query Routing in Distributed Spatial Databases. In Proceedings of the 12th ACM International Symposium on Geographic Information Systems, ACM-GIS 2004, November 12-13, 2004, Washington, DC, USA, pages 176–183, 2004.
[ZKW04] Roger Zimmermann, Wei-Shinn Ku, and Haojun Wang. Spatial Data Query Support in Peer-to-Peer Systems. In Proceedings of the 28th International Computer Software and Applications Conference, Hong Kong, China, pages 82–85, 2004.
[ZKW+06] Roger Zimmermann, Wei-Shinn Ku, Haojun Wang, Amir Zand, and Jean-Pierre Bardet. A Distributed Geotechnical Information Management and Exchange Architecture. IEEE Internet Computing, 10(5):26–33, 2006.
[ZLL03] Baihua Zheng, Wang-Chien Lee, and Dik Lun Lee. Spatial Index on Air. In PerCom, pages 297–304, 2003.
[ZLL04] Baihua Zheng, Wang-Chien Lee, and Dik Lun Lee. Spatial Queries in Wireless Broadcast Systems.
Wireless Networks, 10(6):723–736, 2004.
Asset Metadata
Creator: Ku, Wei-Shinn (author)
Core Title: Location-based spatial queries in mobile environments
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2007-05
Defense Date: 01/30/2007
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: location-based services, mobile data management, OAI-PMH Harvest, peer-to-peer systems, spatial data management
Language: English
Advisor: Zimmermann, Roger (committee chair), Hwang, Kai (committee member), Krishnamachari, Bhaskar (committee member), McLeod, Dennis (committee member), Shahabi, Cyrus (committee member)
Creator Email: wku@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-m298
Unique identifier: UC1458547
Identifier: etd-Ku-20070227 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-205810 (legacy record id), usctheses-m298 (legacy record id)
Legacy Identifier: etd-Ku-20070227.pdf
Dmrecord: 205810
Document Type: Dissertation
Rights: Ku, Wei-Shinn
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name: Libraries, University of Southern California
Repository Location: Los Angeles, California
Repository Email: cisadmin@lib.usc.edu