Privacy in location-based applications: going beyond K-anonymity, cloaking and anonymizers
(USC Thesis Other)
PRIVACY IN LOCATION-BASED APPLICATIONS: GOING BEYOND K-ANONYMITY, CLOAKING AND ANONYMIZERS

by
Jaffar Khoshgozaran

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2010

Copyright 2010 Jaffar Khoshgozaran

Dedication

To my beloved wife and my dear family for their unconditional love.

Acknowledgments

I will never be able to find the words to thank my academic advisor, Professor Cyrus Shahabi, without whom this work would not be possible. Cyrus gave me the gift of education, taught me how to think, conduct research, approach a problem and present results. What I appreciate in you is your mentorship and your friendship. I will also never forget your comments on my paper drafts and will keep them all somewhere safe!

I am also greatly indebted to the folks at Infolab for their unconditional support, encouragement and friendship during the past five years. Among them, special thanks go to Dr. Mehdi Sharifzadeh for helping me write my first academic paper, Dr. Mehrdad Jahangiri for being my LaTeX guide and traditional Persian music resource, and Dr. Farnoush Banaei-Kashani, who deserves special recognition for always being there to listen and to give advice and for his contribution toward the completion of this thesis. I wish the world had more people like you! I am also indebted to Houtan Shirani-Mehr, with whom I explored many privacy research ideas throughout my studies at USC. I will remember those long nights at Infolab for the rest of my life.

It was an honor to have the Turing Award winner Professor Leonard Adleman along with Professor Shrikanth Narayanan, Professor Aiichiro Nakano and Professor Hossein Hashemi serving on my Ph.D. qualification and (the last two on) my dissertation committee. I am humbled by their support, guidance, and vision.
I would extend a special thanks to Professor Hashemi for his continuous support and mentorship throughout my Ph.D. studies at USC.

My numerous friends of many years from Sharif University of Technology deserve special credit and have my deepest recognition for inspiring me and for teaching me many new things. I wish I could list all their names here; among them, I owe a great debt of gratitude to Maysam Sayyadian, Ali Nouri and Mazda Ahmadi. I am by far most grateful to my friend of more than a decade, Mahmoudreza Amini, for believing in me, for pulling me into getting my Ph.D. and for pushing me along the way.

I would also like to express my enormous gratitude to my former teachers and advisors, among whom Professor Poorvi Vora and Professor Abdou Youssef from The George Washington University and Professor Raman Ramsin and Professor Hassan Mirian from Sharif University of Technology deserve special credit for many of the successes.

Last but not least, I sincerely appreciate my wife Senobar and my family for their incredible encouragement and support throughout all these years. I am quite grateful to you all and deeply admire and respect your love, patience and support.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation and Problem Statement
    1.1.1 Formal Problem Definition
    1.1.2 Trust and Threat Model
  1.2 Position Transformation
  1.3 Private Information Retrieval
  1.4 Oblivious Index Navigation
  1.5 Location Privacy in Emerging Social Applications
  1.6 Road Map

Chapter 2: Space Transformation
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Space Encoding
    2.2.2 Locality Preserving Space Filling Curves
    2.2.3 Making Hilbert Mapping Irreversible
  2.3 Privacy-Aware Query Processing
    2.3.1 Offline Space Encoding
    2.3.2 Online Query Processing - Approximate KNN
    2.3.3 Online Query Processing - Range
      2.3.3.1 Finding the Range Query Runs
  2.4 Online Query Processing - Exact KNN
  2.5 Dual Curve Query Resolution
    2.5.1 Approximate KNN Queries: Proximity in Hilbert Curves
    2.5.2 Range Queries: The Number of Query Runs
    2.5.3 A Dual Curve for Query Processing
    2.5.4 Exact KNN Queries: Smaller Regions with Fewer Runs
  2.6 Proposed End-to-End Architecture
  2.7 Summary

Chapter 3: Spatial Query Processing Using Private Information Retrieval
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Private Information Retrieval
  3.3 Privacy Aware Query Processing
    3.3.1 Querying Private Index Structures with PIR
    3.3.2 Private Range Queries
    3.3.3 Private KNN Queries
      3.3.3.1 Progressive Expansion
      3.3.3.2 Hierarchical Expansion
      3.3.3.3 Hilbert Expansion
  3.4 Optimizations
    3.4.1 Grid Granularity
    3.4.2 Secure Record Decomposition and Padding
  3.5 A Sample Implementation
    3.5.1 Hardware-Based PIR
    3.5.2 Preprocessing
    3.5.3 Query Processing
  3.6 Summary

Chapter 4: Oblivious Index Traversal
  4.1 Introduction
  4.2 Preliminaries
  4.3 Obfuscating Access Frequencies
    4.3.1 Probabilistic Uniform Node Access
    4.3.2 Object Permutation
  4.4 Security and Complexity Analysis
  4.5 Generalizations
    4.5.1 Partially Full R-trees
    4.5.2 Other Tree Structured Spatial Indexes
  4.6 Summary

Chapter 5: Experiments
  5.1 Datasets and Experimental Setup
  5.2 Space Transformation
    5.2.1 Experimental Setup
    5.2.2 Choosing the Right Curve Order (N)
    5.2.3 Evaluating Query Results' Quality
    5.2.4 Approximate KNN Query Evaluation
    5.2.5 Range Query Evaluation
    5.2.6 Exact KNN Query Evaluation
  5.3 Private Information Retrieval
    5.3.1 Experimental Setup
    5.3.2 Space Complexity Analysis
    5.3.3 The Effect of
    5.3.4 Choosing the Optimum Cut-off Value
    5.3.5 The Effect of K
    5.3.6 The Effect of Hilbert Curve Order
    5.3.7 End-to-End Performance
    5.3.8 Comparing with Other Approaches
      5.3.8.1 PIR-Based Approaches
      5.3.8.2 Transformation-Based Approaches
  5.4 Oblivious Index Navigation
    5.4.1 Experimental Setup
    5.4.2 Frequency Skewness
    5.4.3 Node Capacity
    5.4.4 Frequency Domain Range
    5.4.5 Security Analysis
    5.4.6 End-to-End Response Time

Chapter 6: Related Work
  6.1 Anonymity and Cloaking
  6.2 Space Transformation
  6.3 Private Information Retrieval
  6.4 Oblivious Tree Traversal

Chapter 7: Conclusions

Appendix
Chapter A: Location Privacy in Emerging Social Applications
  A.1 Introduction
  A.2 Trust and Threat Model
  A.3 Querying an Untrusted Server
    A.3.1 Client Side Query Processing with Space-Driven Indexing
    A.3.2 Plain Indexing: Aggregation and Isolation
  A.4 The PBS Framework
    A.4.1 Group Keys
    A.4.2 Server Side Indexes
  A.5 Private Spatial Queries with PBS
    A.5.1 Range Queries
    A.5.2 K-Nearest Neighbor Queries
    A.5.3 Buddy Tracking
  A.6 PBS Operations
    A.6.1 Group Related Operations
    A.6.2 User Related Operations
  A.7 Experimental Evaluation
    A.7.1 Datasets and Experimental Setup
    A.7.2 PBS Operations
    A.7.3 Spatial Queries
    A.7.4 End to End Query Processing
    A.7.5 Comparison with Other Approaches
  A.8 Security Analysis
  A.9 Related Work
  A.10 Conclusion and Future Work

References

List of Tables

Table 2.1: Sample ELT for Figure 2.3
Table 4.1: Notations and Symbols
Table 5.1: Storage Requirements of SC
Table 5.2: Experiment Parameters
Table 5.3: Response Time

List of Figures

Figure 1.1: Privacy Aware LBS
Figure 2.1: Space Encoding
Figure 2.2: (a) An $H_2^2$ Pass of the 2-D Space (b) Recursive Hilbert Curve Construction
Figure 2.3: Algorithms CreateIndex (top), KNN-Generate (middle) and KNN-Resolve (bottom)
Figure 2.4: A Range Query Decomposed into Six Maximal Blocks (a), with the Underlying Hilbert Curve (b)
Figure 2.5: Algorithms Range-Generate (top) and Range-Resolve (bottom)
Figure 2.6: Exact KNN Query Processing
Figure 2.7: Missed Sides of $H_2^1$ and $H_2^2$ Curves (a), Four Strips Around the Range Query (b), Shifted Range Query (c)
Figure 2.8: Proximity in the Original Curve (middle) vs. the Rotated (left) and Shifted (right) Dual Curves
Figure 2.9: DCQR Architecture for Spatial Query Processing
Figure 3.1: The Private Index Structures
Figure 3.2: The Range Query $R$ (a), Computing the Safe Region $R'$ for the KNN Query (b)
Figure 3.3: Progressive, Hierarchical and Hilbert KNN Algorithms
Figure 3.4: Record Size Distribution
Figure 3.5: Record Decomposition and Padding
Figure 4.1: R-tree
Figure 4.2: Probability Contributions of Node $N'_{i,3}$
Figure 4.3: Examples
Figure 4.4: Matrix Representation
Figure 4.5: Original Tree $R$
Figure 4.6: Shuffled Tree $R'$
Figure 4.7: Shuffling Steps for Tree $R$
Figure 4.8: Visiting vs. Expanding Nodes
Figure 5.1: Datasets
Figure 5.2: Finding the Right Value of Curve Order (N)
Figure 5.3: Precision vs. Curve Order (N)
Figure 5.4: Displacement vs. Curve Order (N)
Figure 5.5: Response Time vs. Curve Order (N)
Figure 5.6: Precision vs. K
Figure 5.7: Displacement vs. K
Figure 5.8: Response Time vs. K
Figure 5.9: Precision vs. Curve Order for Different Range Query Sizes
Figure 5.10: Number of Runs vs. Hilbert Order for a Query with Different Selectivity
Figure 5.11: Running Times vs. Curve Order (N) for Different Query Selectivity
Figure 5.12: Excessive Objects vs. Curve Order (N)
Figure 5.13: Response Time vs. Curve Order (N)
Figure 5.14: KNN Excessive Objects vs. K
Figure 5.15: KNN Response Time vs. K
Figure 5.16: Real-world, Uniform, Highly Skewed and Sparse Datasets
Figure 5.17: Effect of on Range Algorithm for the Real-World Dataset
Figure 5.18: Effect of on Progressive, Hilbert and Hierarchical Algorithms for the Real-World Dataset
Figure 5.19: Relative Overhead Reduction of Secure Padding for Datasets of Figure 5.16
Figure 5.20: Effect of for Skewed (top) and Sparse (bottom) Datasets
Figure 5.21: Effect of K for the Real World Dataset
Figure 5.22: Effect of N on C (left) and Time (right) for the Real World Dataset Using Algorithm 8
Figure 5.23: End-to-end Performance for Range (a,b) and KNN (c,d) Algorithms
Figure 5.24: Comparing with Other Approaches
Figure 5.25: Datasets
Figure 5.26: Skewness: $\{s, l, f_{max}, f'_{max}\}$ values for US
Figure 5.27: Skewness: $\{s, l, f_{max}, f'_{max}\}$ values for LA
Figure 5.28: Capacity: $\{c, f_{max}, f'_{max}\}$ values for US
Figure 5.29: Capacity: $\{c, f_{max}, f'_{max}\}$ values for LA
Figure 5.30: Frequency Range: $\{U_i, f_{max}, f'_{max}\}$ values for US
Figure 5.31: Frequency Range: $\{U_i, f_{max}, f'_{max}\}$ values for LA
Figure 5.32: Effects of each parameter on
Figure A.1: The PBS Trust Model
Figure A.2: (a) Object Space (b) Object List (c) Naive Encryptions (d) Aggregate (top) and Isolated (bottom) Indexing
Figure A.3: (a) ACI and IOI Structures (b) PBS Architecture
Figure A.4: KNN Algorithm
Figure A.5: Datasets
Figure A.6: PBS Operations
Figure A.7: (a) Range Queries (b) NN Queries
Figure A.8: Response Time for (a,b) 0.1%, (c,d) 0.5%, and (e,f) 1% Selectivity
Figure A.9: Comparison with Casper

Abstract

An obvious requirement for evaluating spatial queries in Location Based Services (LBS) is that the location of the query point needs to be shared with the location server responding to user queries. Spatial data such as points of interest are indexed at this potentially untrusted server (host) and queries are evaluated by navigating the underlying index structure used to partition the data. However, a user's location is highly sensitive information that, once compromised, can expose him to various threats such as stalking and inference about his health problems or political/religious affiliations. Such growing concerns about users' location privacy in LBS are considered to be the biggest impediment to the explosive growth and popularity of location-based services. The anonymity and cloaking-based approaches proposed to address this problem cannot provide stringent privacy guarantees without incurring costly computation and communication overhead. Furthermore, they require a trusted intermediate anonymizer to protect user locations during query processing.

In this dissertation, we identify the key challenges of enabling privacy in location-based services using an untrusted server model.
We propose three solutions to the location privacy problem. Our first solution employs a space transformation scheme to privately evaluate location queries in a space unknown to the untrusted server. The novel one-way transformation developed allows fast computation of location queries in the transformed space while respecting user privacy. We develop our second solution based on the theory of Private Information Retrieval to achieve yet stronger levels of privacy. This strong measure of privacy comes with more computational cost. Finally, we propose a more fundamental technique that enables oblivious traversal of tree-structured spatial indexes for query processing. With this technique, the original spatial index is replaced with an encrypted spatial index that is hosted at the server. While preserving user privacy, this technique allows a wide range of spatial queries to be efficiently evaluated over the encrypted index.

Chapter 1
Introduction

1.1 Motivation and Problem Statement

Location-based services (LBS) are becoming an important part of our everyday life. According to the Pew Internet & American Life Project survey [Pew], 47% of current cell-phone users say they would prefer to have mapping features on their phones, and another 24% say they would like to be able to browse the web for services, especially maps and directions, movie listings and weather reports.¹ The survey also shows that in October 2004, 66% of American adults subscribed to cell-phone services, a number that has undoubtedly risen since, to around 195 million subscribers [Jup]. Similar trends can be shown for other portable devices such as car navigation systems, PDAs and laptop computers. Meanwhile, with the advent of inexpensive GPS devices, many of these portable devices can incorporate (and many already have incorporated) GPS. Hence, the location of the user (or query point) can be accurately identified and reported to a location service provider.
Consequently, information customized to the user location, such as weather forecasts, nearest points of interest and the locations of nearby friends, can be provided. However, recent concerns over how such services can jeopardize users' private information have given rise to the newly coined term location privacy. There is a growing concern over how individuals and enterprises might maliciously exploit users' personal location data [NYC, USA, for, BBCa, BBCb, tax]. Several researchers and organizations have raised the need to explore the threats associated with location-based services [BS03, Ack, BD, WMM03]. Al-Muhtadi et al. [AMCK+] go as far as warning about the creation of "a ubiquitous surveillance system" if user privacy in ubiquitous mobile systems is not carefully addressed.

¹ The reader is referred to [CS] for a variety of innovative applications based on users' location information, realized by GPS-enabled cell phones.

In addition to revealing an individual's location, LBS queries, such as "find the nearest cancer treatment center", may disclose other sensitive information about individuals, including health condition, lifestyle habits, and political and religious affiliations, or may result in unsolicited advertisements (spam). It is important to note that hiding the user's identity alone without hiding his location would not address these privacy issues. An attacker or a (potentially malicious) location server can infer the identity of the query source from its (subsequent) location information. For example, a user's location information can be tracked through several stationary connection points (e.g., cell towers). After a while, the user leaves "a trail of packet crumbs" which can be associated with a certain residence or office location and can easily lead to determining the user's identity.
Several other types of surprisingly private information can also be revealed by just observing anonymous users' movement and usage patterns over time [Blu, GRB08].

Due to the increasing concern over the sensitivity of location data, location privacy has emerged as a new and very active field of interest to the database research community. Similar to many other existing approaches in areas such as data mining and databases, various techniques based on the K-anonymity principle [Swe02] have been extensively used to provide location privacy [GG, GL04, MCA, BWJ, GL, BS03]. With these approaches, a user's location becomes indistinguishable among $K-1$ other users, or is hidden in a larger cloaked region, before the query is sent to the server, which makes it harder for the untrusted server to locate the querying user.

The cloaking process is usually performed by a third party known as the anonymizer. Alternatively, users can generate the anonymity set in a decentralized fashion. However, with the former approach, all users are required to trust, and continuously report their private locations to, an intermediary that is as sophisticated as the location server itself. Similarly, with the latter approach, each user must trust and share his private location data with every other user in the system. Aside from such unrealistic assumptions of trust, these approaches suffer from several other drawbacks: (i) by design, all queries are directed to the anonymizer (or a subset of other users) during the system's normal mode of operation; (ii) users have to trade off their privacy against the quality of service or overall system performance; (iii) cloaking fails to protect user locations under certain distributions. These issues are detailed in various studies such as [KS, GKK+, KSMS, Hen], and we briefly study some of them in Chapter 6.

More recently, several multi-party computation schemes have been proposed that employ encryption techniques to enable location privacy. However, there is an inherent limitation in using traditional encryption techniques for blind evaluation of spatial queries. To illustrate, assume our server uses recently proposed encryption techniques to compute the encryption of the Euclidean distance between an encrypted point (i.e., the query origin) and each point of interest [IW]. These encrypted distances can then be sent back to the client, who can decrypt them and find the top $K$ results. Trivially, this protocol protects user privacy since the location of neither the query point nor the result set is revealed to the server (see Section 1.1.1). However, the main limitation here is that the distance between the query point and every point of interest must be both computed and transferred to the client, i.e., $O(n)$ computation and communication complexity, where $n$ is the size of the database. There are cryptographic binary search communication protocols that may reduce the communication complexity to logarithmic; however, the computation complexity at the server cannot be further reduced. This is because the points of interest are treated as vectors, with no exploitation of the fact that they are in fact points in space.

In this dissertation, we propose three fundamental approaches that go beyond the conventional techniques for location privacy. We devise frameworks that eliminate the need for an anonymizer in location-based services and satisfy significantly more stringent privacy guarantees than the anonymity/cloaking-based approaches. More specifically, we show how we use Space Encoding, Private Information Retrieval (PIR) and Oblivious Navigation of Tree-Structured Spatial Indexes to provide users with a more generalized, efficient and secure location privacy scheme while querying static data such as points of interest.
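
The cost argument for the encrypted-distance protocol above can be made concrete with a small data-flow simulation. This is only a sketch under loud assumptions: `Opaque` is a hypothetical stand-in for a ciphertext and performs no encryption at all, and the function names are invented for illustration. A real scheme such as the one cited as [IW] would compute each encrypted distance from ciphertexts alone; the sketch only shows that the server must touch every point of interest and return one value per point.

```python
import heapq
import math


class Opaque:
    """Toy stand-in for a ciphertext (NOT real encryption; illustration only)."""

    def __init__(self, value):
        self._value = value

    def reveal(self):
        # In the real protocol only the client could perform this step.
        return self._value


def server_encrypted_distances(pois, enc_query):
    # Simulation shortcut: we peek inside the wrapper to obtain the query
    # point. A real scheme would work on ciphertexts without decrypting.
    q = enc_query.reveal()
    # One distance per point of interest: O(n) work and O(n) values returned.
    return [Opaque(math.dist(p, q)) for p in pois]


def client_top_k(enc_distances, k):
    # The client decrypts all n values and keeps the indices of the k smallest.
    dists = [c.reveal() for c in enc_distances]
    return heapq.nsmallest(k, range(len(dists)), key=dists.__getitem__)


pois = [(0, 0), (5, 5), (1, 2), (9, 9)]
reply = server_encrypted_distances(pois, Opaque((1, 1)))
nearest = client_top_k(reply, 2)  # indices into pois, nearest first
```

The reply always has as many entries as the database, regardless of $K$, which is the linear communication cost the text refers to.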

1.1.1 Formal Problem Definition

Given a set of static objects $S = \{o_1, o_2, \ldots, o_n\}$ in 2-D space stored at the location server LS and a set of users $U = \{u_1, u_2, \ldots, u_M\}$ in an area $A$ which can be represented as a set of discrete locations $A = \{l_1, l_2, \ldots, l_{2^{2N}}\}$ (we discretize $A$ into a grid of $2^N \times 2^N$ cells), the KNN query with respect to query point $q_i$ finds a set $S' \subseteq S$ of $K$ objects where for any object $o' \in S'$ and $o \in S - S'$, $D(o', q_i) \le D(o, q_i)$, where $D$ is the Euclidean distance function. Similarly, a range query returns all objects that fall inside a rectangular query window represented as $w(x, y, n_1, n_2)$, where $x$ and $y$ are the coordinates of the lower left corner of the window and $n_1$ and $n_2$ are the height and width of the window, respectively. As range and KNN queries constitute the dominant spatial queries performed in location-based services, we focus on blind evaluation of these two types of queries and refer to them as Spatial Queries hereafter. In a typical spatial query scenario, the static objects represent points of interest (POI) and the query points represent user locations. Users subscribe to LS, which provides them with its location-based services.

To enable location privacy, a user $u_i$'s location should not be revealed to LS while it responds to $u_i$'s queries. Location information can be exposed to LS through a spatial query or its result set. A conventional range or KNN query sent to the server includes location information that can be exploited by the server to pinpoint a user's location. For instance, if $u_i$ is searching for his nearest gas station, his precise location information is needed to find the closest POI. Similarly, LS can deduce $u_i$'s location from the result of a spatial query. For instance, suppose $u_i$ manages to secretly receive the gas station $o_j$ as the result of his 1NN query from LS without revealing his location. Here, the server learns that $u_i$ is located somewhere in the Voronoi cell [OBSC00] of $o_j$.
Therefore, our goal is to prevent LS from learning user locations through a query or its result set. We refer to this as the blind evaluation criterion.

Definition 1. Blind evaluation of spatial queries: Suppose the result of a spatial query $q$, issued by user $u_i$ located at point $l_i$, evaluates to $R = (o_1, o_2, \ldots, o_K)$ at the server LS. We say $q$ is blindly evaluated if $l_i$ is not revealed to LS via $q$ or $R$, and LS is unable to narrow down $u_i$'s location to a region or to a subset of users.

Note that Definition 1 imposes stronger privacy requirements than the commonly used K-anonymity or cloaking [SS98, GL, GG, KGMP06, BS03, MCA] criteria, in which a user is indistinguishable among $K$ other users (where $K$ is usually a very small number) or his location is blurred in a small cloaked region.

Based on the above properties, we term a location server privacy aware if it is capable of blindly evaluating spatial queries. The challenge in blind evaluation of spatial queries is protecting the very piece of information (i.e., user location) that is needed to process spatial queries and form a result set.

The following example shows how the above properties should be satisfied in a typical KNN query. Suppose a user asks for his 3 closest gas stations. In this case the untrusted location server should acquire neither the location of the user, nor his identity, nor the actual location or identity of any of the 3 closest gas stations in the response set, while the user should receive the actual results matching his query.

For the rest of this dissertation, unless otherwise stated, we use the term user to refer to a client subscribed to a location-based service, located at point $l_i$ and issuing the query $q$. While we always encrypt client/server communications to protect the content of information, anonymous communication is orthogonal to our problem, and constructs such as the Onion Router (Tor) [DMS] can be used to protect users against traffic analysis or eavesdropping attacks.
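
The definitions in this section can be sketched in a few lines of Python. This is a plain, non-private baseline written for illustration, not the dissertation's method: the function names are hypothetical, window boundaries are assumed to be inclusive, and each grid cell is assumed to be represented by its integer coordinates. The last function illustrates the leak described above: from a 1NN result alone, the server can enumerate every grid cell whose nearest object is that result, which is the (discretized) Voronoi cell of the returned object.

```python
import math


def knn(objects, q, k):
    """Return the k objects closest to q (ties broken by input order)."""
    return sorted(objects, key=lambda o: math.dist(o, q))[:k]


def range_query(objects, x, y, n1, n2):
    """Objects inside w(x, y, n1, n2): lower-left corner (x, y),
    height n1, width n2. Inclusive boundaries are an assumption."""
    return [o for o in objects
            if x <= o[0] <= x + n2 and y <= o[1] <= y + n1]


def cells_consistent_with(objects, result, N):
    """All cells of the 2^N x 2^N grid whose nearest object is `result`,
    i.e. what a server can infer from seeing a 1NN answer."""
    side = 2 ** N
    return {(i, j)
            for i in range(side)
            for j in range(side)
            if knn(objects, (i, j), 1)[0] == result}


stations = [(0, 0), (3, 2)]
# Seeing (0, 0) as a 1NN answer narrows the user down to these cells.
leak = cells_consistent_with(stations, (0, 0), N=2)
```

With two gas stations on a $4 \times 4$ grid, the answer alone rules out more than half of the cells, even though the query point was never sent; blind evaluation in the sense of Definition 1 must prevent exactly this kind of narrowing.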

1.1.2 Trust and Threat Model

We consider a model in which users query a central untrusted server for various points of interest (Figure 1.1). While users trust their client devices to run legitimate software, they do not trust any other entity in the system, including the location server. Users might collude with LS against other users, and thus from each user's point of view, all other users as well as LS can be adversarial. We refer to any such entity as an adversary. LS owns and maintains a database of POIs and responds to users' queries as a service provider. Users subscribe to LS's services. As part of our threat model, we assume that the server's database is publicly accessible and available, and thus an adversary can perform the so-called known plaintext attack.

[Figure 1.1: Privacy Aware LBS]

Although users do not trust the location server, we take an honest but curious behavioral model for the location server. That is, LS does not deviate from the predefined protocols. However, it is curious to take advantage of any sensitive user data. This is a practical assumption in many disciplines such as database outsourcing [HL] and secure file sharing [KRS+]. We assume there is a secure communication channel between users and LS, and thus the connection cannot be sniffed by adversaries (obviously, users' information cannot be protected from being leaked to adversaries through other means such as physical observation). In other words, no adversary can learn about a user's location without colluding with the server. Therefore, hereafter, we only focus on the location server as the most powerful adversary. We also assume that adversaries are computationally bounded.

We finally stress that we assume unmolested program execution on users' client devices, which prevents adversaries from breaking into a client device. Untampered execution of a program on an untrusted client platform remains an interesting open problem [KRS+] beyond the scope of our work.

In the remainder of this chapter, we present a sketch of our three approaches to achieve location privacy.

1.2 Position Transformation

With our space encoding approach, we show how space filling curves can be treated like one-way transformations to encode the locations of both user(s) and points of interest into an unknown space and to evaluate a query in this transformed space. The transformed space maintains some of the spatial properties of the original space, which enables efficient evaluation of location queries. At the same time, our transformation can be viewed as an encoding of the space with a one-way transformation function that allows fast computation of its inverse given some extra knowledge, termed the trapdoor [Sch84] or transformation key. Subsequently, the client can encode the query using its key, and the server performs the query in the encoded space and returns the encoded answers to the client for fast decoding. Consequently, similar to conventional encryption schemes, we do not need any intermediary between the client and server to evaluate spatial queries blindly.

1.3 Private Information Retrieval

Our second approach is based on the theory of Private Information Retrieval (PIR) to protect sensitive information regarding user locations from malicious entities (Section 3.2.1). Using a PIR protocol, a client can retrieve a database item hosted at an untrusted server without revealing which item is retrieved from the host. Although PIR can be used to privately generate a query result set, avoiding a linear scan of the entire object space is challenging. This is due to the fact that the server owning the object information cannot be trusted to perform the query processing and choose what is to be returned as a response. Alternatively, moving this knowledge to the users is not possible due to the bulky nature of the data and costly data replication and maintenance.
Utilizing spatial partitionings based on PIR, we devise algorithms that significantly reduce the amount of information that is privately queried from an untrusted server during blind evaluation of spatial queries.

1.4 Oblivious Index Navigation

With many location-based services, spatial data is indexed using a spatial indexing technique and the indexed data is stored at the server for query processing. To enable blind evaluation of spatial queries, we replace the original indexed data with its privacy-aware variant. Although our space transformation technique allows efficient and blind evaluation of range and KNN queries, the developed techniques are susceptible to more sophisticated adversaries with strong prior knowledge about object distributions and how frequently they are being queried. As we detail in Chapter 2, ensuring privacy in such extreme cases is not a trivial task. On the other hand, with our PIR-based approach, stringent user privacy is achievable as the server does not learn anything about the content of data requests by users during the query processing. Therefore, no information about the frequency of object accesses is leaked to the untrusted server. However, as we see in Chapter 3, relying on PIR to achieve perfect secrecy is costly.

In Chapter 4, we propose a more generalized solution that consists of two techniques to hide the access frequency of the nodes of tree-structured spatial indexes (e.g., R-trees) from the untrusted server hosting the data. With our first technique, each access to an index node requires reading an extra node, chosen using a precomputed node-based probability distribution function, to guarantee uniform node access at all tree levels. Our second approach employs a tree permutation scheme that obfuscates the access frequencies of index nodes by shuffling the assignments of internal elements to nodes while fully preserving the integrity of the underlying index.

1.5 Location Privacy in Emerging Social Applications

The bulk of techniques proposed for location privacy assume users query location servers for static POI data such as restaurants and hospitals. However, with the abundance of location-aware client devices, users are becoming increasingly interested in querying the location information of their “buddies” in a social networking context. The nature of such queries is fundamentally different compared to their conventional static variant for several reasons. First, here users query the dynamically changing locations of other users, whereas in the static case queried data does not change with time. This dynamic nature of the data plays an important role in the choice of partitioning techniques used to efficiently index and query spatial data. More importantly, the different nature of information queried by users (i.e., other people’s locations as opposed to POI) and complex interactions among users change our assumptions of trusted and adversarial entities from the original setup for static location queries (see Section A.2).

In Appendix A, we propose a framework, called PBS (Private Buddy Search), which protects users’ profiles and locations from adversaries in a social networking environment. We aim to strike a compromise between efficiency and privacy by storing users’ aggregate information in various encrypted index structures at a centralized server and then pushing the query processing to the client side. Hence, the clients utilize the encrypted index structures to only retrieve a small portion of the database related to the query area. Consequently, all the location updates are only sent to and stored (encrypted) in a single centralized server while, by securely communicating with the (untrusted) server, clients receive enough information to answer most typical location queries.

We note, however, that the dynamic nature of user locations reduces the strong effect of the server’s prior knowledge about original object distributions that is present in the static case. In other words, with user locations continuously changing, associating certain encrypted node requests to locations in the 2-D space becomes more challenging. Therefore, in this dissertation we devote more attention to solving the location privacy problem in the context of location-based services and querying static data.

1.6 Road Map

The remainder of this dissertation is organized as follows. We review our Space Transformation, PIR-based and Oblivious Tree Traversal approaches in Chapters 2, 3 and 4, respectively. Chapter 5 presents our empirical evaluation of the above techniques. In Chapter 6, we present the related work. We conclude this dissertation with a summary of our contributions and future work in Chapter 7. Finally, in Appendix A we present our Private Buddy Search approach for achieving user location privacy in emerging location-based social networking applications.

Chapter 2
Space Transformation

2.1 Introduction

The most fundamental classes of location queries predominantly used in location-based services (LBS) are K-nearest-neighbor (KNN) and range queries, where a group of mobile users want to find the location of their K closest objects from a query point (KNN) or all objects located in a certain area of interest (range). An obvious requirement for evaluating KNN (range) queries is that the location of the query point (query window) needs to be shared with the location server (server for short) responding to user queries. However, as we stated in Chapter 1, a user’s location is highly sensitive information that, once compromised, can expose him to various threats such as stalking and inference about his health problems or political/religious affiliations.

Protecting users’ locations while responding to their location-dependent queries is challenging due to an interesting dilemma in resolving such queries: while precise information about query location is needed to generate the result set for a spatial query, the privacy constraints of the problem do not allow revealing users’ location information to the potentially untrusted server responding to such queries. In order to resolve this dilemma, we propose a fundamental approach based on utilizing the power of one-way functions and locality-preserving transformations to preserve users’ location privacy by encoding the space of all objects and queries and processing spatial queries blindly in the transformed space.

In this chapter, we utilize space filling curves and one-way hash functions to transform the locations of both user(s) and points of interest into an encoded space and to evaluate a query in this space. The transformed space maintains some of the distance properties of the original space, which enables efficient evaluation of location queries while remaining irreversible without the knowledge of a transformation key. Subsequently, the client can encode the query using its key; the server blindly responds to the query in the transformed space and returns the encoded answers to the client, where they are transformed back to the original space using the knowledge of the key. Consequently, similar to conventional encryption schemes, we do not need any intermediary between the client and server to evaluate the spatial queries blindly.

We propose an approximate KNN query algorithm that guarantees constant computation and communication complexity while providing a very close approximation of the original query results, and an exact range query algorithm which takes $O(n_l \log T) + |R|$ time, where $n_l = \max(n_1, n_2)$ for a query of size $n_1 \times n_2$, $T = 2^N$ for $N$ being the Hilbert curve order, and $|R|$ denotes the number of objects in the query window. Building on the above two algorithms, we also propose an exact KNN algorithm. Our exact KNN algorithm first computes a candidate result set using the approximate KNN algorithm and uses the results to form a region around the query point guaranteed to include the exact results to the KNN query. It then employs the range query algorithm to retrieve all objects within the computed region.

2.2 Preliminaries

In this section, we first discuss the sketch of our approach and its use of one-way transformations and also study the challenges associated with finding the right transformations. We review an important class of many-to-one dimensional mappings called space filling curves, which are used in our approach to achieve location privacy.

2.2.1 Space Encoding

In this section, we introduce our novel approach for protecting users’ locations from malicious location servers by transforming the static objects to a new space using a locality-preserving one-way transformation and also addressing the (transformed) query in this new space. As mentioned earlier, we address the issue of location privacy in the context of location-based services and thus focus on the 2-D space of static objects (i.e., points of interest) and dynamic query points (i.e., users). We need a one-way function to map each point from the original space to a point in the transformed space to prevent the server from obtaining the original results by reversing the transformation.
A transformation is one-way if it can be easily calculated in one direction (i.e., the forward direction) and is computationally infeasible to calculate in the other (i.e., backward) direction. Identifying the right one-way transformation is very challenging because many such mappings do not respect the notion of distance and proximity. The transformations that respect such properties are the only candidates enabling efficient query processing in an encoded space. Therefore, any one-way transformation which respects the proximity of the original space can replace the black boxes in Figure 2.1 to make the location server privacy aware. Transforming the original space with such a locality-preserving one-way mapping can be viewed as encrypting the elements of the 2-D space using a one-way transformation. In practice, many one-way transformations may be reversible even without the knowledge of the trapdoor, but the process must be too complex (equivalent to an exhaustive search) to make such a transformation computationally secure. In this chapter we use the properties of our mapping function as the trapdoor, provided only to clients, to reverse the encoded results back to their original format while giving the server enough information to process user queries in the encoded space.

Figure 2.1: Space Encoding.

In the following two sections, we first study the locality-preserving characteristics of space filling curves and proceed to detail our mapping function, which uses space filling curves to preserve locality among objects and one-way hashing to make the transformation irreversible.

2.2.2 Locality Preserving Space Filling Curves

We now study an important class of transformations called space filling curves as candidate space encoders for our framework and show how they can be treated as space encoders if certain properties of these curves are kept secret from malicious attackers. Introduced in 1890 by the Italian mathematician G.
Peano [Sag94], space filling curves belong to a family of curves which pass through all points in space without crossing themselves. The important property of these curves is that they retain the proximity and neighboring aspects of the data. Consequently, points which lie close to one another in the original space mostly remain close to each other in the transformed space.

Figure 2.2: (a) An $H_2^2$ Pass of the 2-D Space (b) Recursive Hilbert Curve Construction.

One of the most popular members of this class is the Hilbert curve [Hil91], since several studies show the superior clustering and distance preserving properties of these curves [LK01, Jag90, FR89, MvJFS01]. Similar to [MvJFS01], we define $H_d^N$ for $N \ge 1$ and $d \ge 2$ as the $N$th order Hilbert curve for a $d$-dimensional space. $H_d^N$ is therefore a linear ordering which maps a $d$-dimensional integer space $[0, 2^N - 1]^d$ into an integer set $[0, 2^{Nd} - 1]$ as follows: $H = \mathrm{Hilbert}(P)$ for $H \in [0, 2^{Nd} - 1]$, where $P$ is the coordinate of each point in the $d$-dimensional space. We call the output of this function its H-value throughout this chapter. Note that it is possible for two or more points to have the same H-value in a given curve. As mentioned above, our motivating application is location privacy and therefore we are particularly interested in 2-D space (and hence 2-D curves). Therefore, $H = \mathrm{Hilbert}(X, Y)$, where $X$ and $Y$ are the coordinates of each point in the 2-D space.

Figure 2.2a illustrates a sample scenario showing how a Hilbert curve can be used to transform a 2-D space into H-values. In this example, points of interest (POI) are traversed by a second order Hilbert curve and are indexed (transformed to the Hilbert space) based on the order in which they are visited by the curve (i.e., $H$ in the above formula). Therefore, in our example the points $a$, $b$, $c$, $d$ and $e$ are represented by their H-values 7, 14, 5, 9 and 4, respectively. For any desired resolution, more fine-grained curves can be recursively constructed (Figure 2.2b).
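
As a minimal sketch of the 2-D mapping just defined, the function below computes the H-value of a grid cell with the widely used iterative rotate-and-flip formulation of the Hilbert curve. It assumes a curve that starts at cell (0, 0) in the lower-left corner with a fixed orientation; the function name `hilbert_value` and this fixed orientation are illustrative assumptions, not the dissertation's secret-parameter curve.

```python
def hilbert_value(order, x, y):
    """Return the position of cell (x, y) along a Hilbert curve of the
    given order on a 2^order x 2^order grid (assumed start: cell (0, 0))."""
    side = 1 << order          # grid side length, 2^order
    d = 0                      # accumulated H-value
    s = side >> 1              # side of the quadrant examined at this level
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        # Each quadrant contributes a block of s*s consecutive H-values.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip so the sub-curve has the canonical orientation.
        if ry == 0:
            if rx == 1:
                x = side - 1 - x
                y = side - 1 - y
            x, y = y, x
        s >>= 1
    return d


# Every cell of a second order (4 x 4) grid receives a distinct H-value.
values = sorted(hilbert_value(2, x, y) for x in range(4) for y in range(4))
```

Because consecutive H-values always belong to adjacent cells, nearby cells tend to receive nearby H-values, which is the locality property the rest of the chapter relies on.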

The key benefit of employing Hilbert curves in our approach is that they act as locality preserving transformations when used in location-based services. This property suits our approach as our goal is to address spatial queries efficiently in a transformed space. However, as we discussed in Section 2.2.1, we still need to prevent a hostile entity from reversing such transformations to protect original user locations. We now discuss how we use cryptographic hash functions to achieve this.

2.2.3 Making Hilbert Mapping Irreversible

To index objects with a Hilbert curve, one should first choose various curve parameters such as the curve’s order (i.e., granularity), starting point, orientation, and scale factor. Let us denote these parameters with $N$, $(X_0, Y_0)$, $\theta$ and $\Gamma$, respectively. Knowing these parameters, one can effectively compute the H-value of an object with linear complexity (see Section 2.3.1). However, if these parameters are not known, a substantial brute force effort is required to exhaust all potential combinations to find the right curve parameters. Therefore, preventing the server from learning the curve parameters prevents it from reversing the transformation and obtaining original query parameters using the location information received from the clients. However, this scheme is still susceptible to more sophisticated attacks that employ the untrusted server’s prior knowledge of original object distributions. For instance, even without the knowledge of curve parameters, the server can identify a dense area in the transformed space by counting the number of points with similar (or nearby) Hilbert values. This problem arises from the locality-preserving properties of Hilbert curves and the fact that the untrusted server has access to original Hilbert values and the assignments of points to these values in the transformed space.
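
The density attack described above can be illustrated with a small sketch. It assumes the server sees unhashed H-values and simply buckets them; the function name `densest_bucket` and the bucket width are hypothetical choices made only for this illustration.

```python
from collections import Counter


def densest_bucket(h_values, bucket_size):
    """Group plain H-values into fixed-width buckets and return
    (first H-value, last H-value, object count) of the fullest bucket."""
    counts = Counter(h // bucket_size for h in h_values)
    # Prefer the highest count; break ties by the smaller bucket index.
    bucket, count = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return bucket * bucket_size, (bucket + 1) * bucket_size - 1, count


# Hypothetical H-values observed by a server: five of them cluster in [40, 47].
observed = [3, 40, 41, 42, 43, 44, 60]
dense = densest_bucket(observed, 8)
```

Since nearby H-values mostly correspond to nearby locations, such a dense bucket can be matched against a known dense area (for example a downtown), even though the curve parameters remain secret. Hashing the H-values removes this ordering information from the server's view.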

In this chapter, we apply cryptographic hash functions [Pre03] to our Hilbert mappings to make our locality-preserving mapping one-way (see footnote 1). With this approach, while the one-way hash function blinds the server from learning any information about object locations, the Hilbert curve ordering allows it to perform spatial queries efficiently in the transformed space and hence satisfies the blind evaluation criteria defined in Section 1.1.1. The Hilbert curve parameters, together with the hash key, form a space decoding/encoding key (SDK) that we use to construct an encoded index for query processing. The challenge, however, is to perform spatial queries efficiently over such an encrypted index without revealing sensitive information to the untrusted server hosting the index. We now elaborate on our approach to achieve privacy while enabling the server to efficiently respond to user queries.

Footnote 1: A cryptographic hash allows fast computation of a digest in the forward direction while making it infeasible to find the original message given the digest. Moreover, it is infeasible to find two different messages that share the same digest.

2.3 Privacy-Aware Query Processing

Making a query processing engine privacy-aware based on our idea of space transformation discussed above requires a two-step process consisting of an offline transformation of the original space (Section 2.3.1) followed by online query processing (Sections 2.3.2 and 2.3.3).

2.3.1 Offline Space Encoding

During this phase, we first choose the curve parameters (listed in Section 2.2.3), a nonce and a secret key. These values together form our transformation key $SDK = \{X_0, Y_0, \theta, N, \Gamma, \mathit{nonce}, \mathit{key}\}$. Next, assuming the entire area covering all points of interest is a square $S_1$, an $H_2^N$ Hilbert curve is constructed starting from $(X_0, Y_0)$ in a larger square $S_2$ surrounding $S_1$ until the entire $S_2$ is traversed (see Figure 2.2a).
Let us denote with $|C_{rc}|$ the number of objects in the grid cell $C_{rc}$ located on row $r$ and column $c$. For each object $o_j \in C_{rc}$, $1 \le j \le |C_{rc}|$, located at $(o_j.x, o_j.y)$, we construct one entry of an encoded look-up table ELT defined below.

$$\mathrm{ELT}\,\langle\, \mathrm{hash}(H \,|\, j \,|\, \mathit{nonce}),\ \varepsilon(o_i.t),\ \mathit{prev},\ \mathit{next} \,\rangle \quad \forall o_i \in S \qquad (2.1)$$

We now discuss in more detail each part of the ELT schema. The term $\varepsilon(o_i.t)$ denotes the encrypted value of an object’s textual attributes, such as name and address, represented by $o_i.t$. The $|$ symbol denotes the concatenation operator, and $H = \mathrm{Hilbert}(o_j.x, o_j.y)$ returns the H-value for the object $o_i$ using the curve parameters (see footnote 2). The function $\mathrm{hash}$ is a one-way cryptographic hash function such as SHA-1 or MD5 that takes $H$, an object’s rank within a cell and a nonce to compute a hash value for each object. This allows the clients to efficiently form an object request during the query processing using SDK. To enable navigation of the encoded index, each entry points to the previous and next object according to the Hilbert ordering (we assign a random order among objects belonging to the same cell). These two pointers are denoted by $\mathit{prev}$ and $\mathit{next}$ in ELT. For instance, in a cell $C_{rc}$ with three objects ($|C_{rc}| = 3$) and Hilbert value $H$, the $\mathit{next}$ value of the second object points to the third object’s entry of the same cell, while the $\mathit{next}$ value of the third object points to the first object’s entry from the next non-empty cell $C_{r'c'}$ with $H' > H$. Finally, to snap requests for empty cells to their closest non-empty neighbors, we store one entry for each empty cell and link its $\mathit{prev}$ and $\mathit{next}$ values to its preceding and succeeding non-empty neighbors in the Hilbert space, respectively.

Footnote 2: We use an efficient bitwise interleaving algorithm from [FR89] to compute the H-values for points of interest. Depending on the implementation, the cost of performing this operation varies between $O(n)$ and $O(n^2)$, where $n$ is the number of bits required to represent a Hilbert value.
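
The table construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the dissertation's implementation: SHA-256 stands in for the hash function, the `encrypt` argument is a caller-supplied placeholder for the encryption step, rank within a cell follows list order instead of a random order, and the names `entry_key` and `build_elt` are hypothetical.

```python
import hashlib


def entry_key(h_value, rank, nonce):
    """Digest of 'H | rank | nonce' (SHA-256 chosen only for illustration)."""
    message = f"{h_value}|{rank}|".encode() + nonce
    return hashlib.sha256(message).hexdigest()


def build_elt(cells, num_cells, nonce, encrypt):
    """cells maps an H-value to the list of textual payloads in that cell."""
    occupied = sorted(h for h in cells if cells[h])
    # Real entries, chained in Hilbert order.
    chain = [(entry_key(h, rank, nonce), encrypt(text))
             for h in occupied
             for rank, text in enumerate(cells[h], start=1)]
    elt = {}
    for pos, (key, blob) in enumerate(chain):
        elt[key] = {
            "data": blob,
            "prev": chain[pos - 1][0] if pos > 0 else None,
            "next": chain[pos + 1][0] if pos + 1 < len(chain) else None,
        }
    # One dummy entry per empty cell, linked to its non-empty neighbours.
    for h in range(num_cells):
        if h in occupied:
            continue
        before = [o for o in occupied if o < h]
        after = [o for o in occupied if o > h]
        elt[entry_key(h, 1, nonce)] = {
            "data": encrypt("dummy"),
            "prev": (entry_key(before[-1], len(cells[before[-1]]), nonce)
                     if before else None),
            "next": entry_key(after[0], 1, nonce) if after else None,
        }
    return elt


# H-values of objects a..e from Figure 2.2a; the lambda is NOT real encryption.
cells = {7: ["a"], 14: ["b"], 5: ["c"], 9: ["d"], 4: ["e"]}
nonce = b"example-nonce"
elt = build_elt(cells, 16, nonce, encrypt=lambda text: "enc(" + text + ")")
```

With these inputs, the entry for object d (H-value 9) links back to a's entry and forward to b's entry, and the table holds one entry for each of the 16 cells.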

This process is performed once for the entire object space, and at the end of this step the encoded look-up table ELT is stored at the server. Algorithm 1 details this offline space encoding process. The result of applying Algorithm 1 on object $d$ from Figure 2.2a is illustrated below.

Table 2.1: Sample ELT for Figure 2.3

computed offline      | hash(9 | 1 | nonce) | ε(d.t)      | prev: a’s entry | next: b’s entry
stored at the server  | 5e107d9d...         | UZnLpmN...  | YQta+/rl...     | kjFU4Fq!...

Given a set $S$ of $n$ static objects, the average number of objects with the same H-value in ELT is $n / 2^{2N}$ for a Hilbert curve of order $N$. Indexing the objects with a lower order curve is analogous to using a coarse-grained grid. Therefore, using a small $N$ (and hence a large average) potentially increases the number of false positives in a query result set. On the other hand, a very fine-grained grid (large $N$) results in many empty cells and hence a storage overhead. Therefore, for a given dataset, we increase $N$ until this average becomes smaller than a threshold. In Section 5.2, we experimentally study the effects of the curve order on evaluating range and KNN queries.

2.3.2 Online Query Processing - Approximate KNN

Using the encoded index ELT, we show how KNN queries are blindly evaluated in the transformed space. Algorithm 2 shows the KNN-Generate module, which is invoked while responding to a user’s KNN query. For each query point $q$ located at position $(q_x, q_y)$, KNN-Generate uses SDK to compute $H = \mathrm{Hilbert}(q_x, q_y)$. We then use KNN-Resolve (Algorithm 3) to expand our search in the encoded space starting from the first entry, $\mathrm{hash}(H \,|\, 1 \,|\, \mathit{nonce})$, for $K$ objects. Note that since Hilbert values are hashed before being stored at the server, we cannot compute the distance between points in the encoded space. Therefore, the search has to expand to $K$ objects in each direction.
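
The bidirectional expansion performed by KNN-Resolve can be sketched as below. To keep the example short, it assumes a toy table keyed by plain H-values with one object per cell (the real ELT is keyed by hashes and stores encrypted payloads); the names `build_toy_elt` and `knn_resolve` are hypothetical.

```python
def build_toy_elt(objects, num_cells):
    """objects maps an H-value to an object label (one object per cell).
    Every cell, empty or not, gets prev/next links to non-empty cells."""
    occupied = sorted(objects)
    elt = {}
    for h in range(num_cells):
        before = [o for o in occupied if o < h]
        after = [o for o in occupied if o > h]
        elt[h] = {
            "obj": objects.get(h),               # None marks an empty cell
            "prev": before[-1] if before else None,
            "next": after[0] if after else None,
        }
    return elt


def knn_resolve(elt, idx, k):
    """Collect up to k objects walking forward from idx and up to k more
    walking backward from idx's predecessor."""
    if elt[idx]["obj"] is None:                  # snap an empty cell forward
        idx = elt[idx]["next"]
    forward, backward = [], []
    cur = idx
    while cur is not None and len(forward) < k:
        forward.append(elt[cur]["obj"])
        cur = elt[cur]["next"]
    cur = elt[idx]["prev"] if idx is not None else None
    while cur is not None and len(backward) < k:
        backward.append(elt[cur]["obj"])
        cur = elt[cur]["prev"]
    return forward + backward


# H-values of objects a..e from Figure 2.2a, on a second order (16-cell) curve.
toy = build_toy_elt({7: "a", 14: "b", 5: "c", 9: "d", 4: "e"}, 16)
candidates = knn_resolve(toy, 8, 2)
```

For a query whose H-value is 8 and K = 2, the walk starts at the next non-empty entry (d) and returns the candidates d, b, a and c; the client then decrypts them and keeps the K closest.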
While storing original Hilbert values would allow comparisons among object distances (at least in the Hilbert space), as we discussed in Section 2.2.3, it would make our scheme vulnerable to attacks based on the server’s prior knowledge. The computation and communication burden of searching for $K$ more objects is the cost we pay to achieve more stringent privacy. Next, knowing SDK, KNN-Generate uses $\varepsilon^{-1}$ to decrypt the result set received from KNN-Resolve back to the original space. To illustrate, in our example, having $K = 2$ and $q = (2, 2)$, KNN-Generate computes $H = 8 = \mathrm{Hilbert}(2, 2)$ and calls KNN-Resolve($\mathrm{hash}(8 \,|\, 1 \,|\, \mathit{nonce})$, 2) to obtain $R = \{\varepsilon(d), \varepsilon(b), \varepsilon(a), \varepsilon(c)\}$. Next, $\varepsilon^{-1}$ is applied to objects in $R$ to obtain their original attributes. In Section 2.6, we show how appropriate placement of these modules on client and server sides enables blind evaluation of spatial queries.

We can now derive the complexity of the KNN-Resolve module, which represents the overall KNN query processing complexity. Using ELT instead of the curve itself, we only visit Hilbert values that contain at least one object. Therefore, Algorithm KNN-Resolve visits at most $2 \times K$ entries in ELT. Furthermore, our communication complexity is determined by the average number of objects read and transferred to the user, which is $2K$ and therefore also constant (see Section 5.2.2).

It is important to note that the generated query result set is approximate due to the dimension reduction technique used. For instance, in Figure 2.2a, while $e$ is closer to $q$ than $b$, it is not included in the result of the 2NN query. Later in Section 2.5, we propose an efficient technique that significantly reduces the approximation error for the method discussed above. We also present an exact KNN algorithm in Section 2.4, which is relatively more costly than our approximate method but returns the actual closest objects.

Figure 2.3: Algorithms CreateIndex (top), KNN-Generate (middle) and KNN-Resolve (bottom)

Algorithm 1 CreateIndex(S)
Require: $SDK = \{X_0, Y_0, \theta, N, \Gamma, \mathit{nonce}, \mathit{key}\}$
    for all $C_{rc} \in S_2$ do
        if $|C_{rc}| = 0$ then
            $H \leftarrow \mathrm{Hilbert}(r, c)$
            $ELT \leftarrow ELT \cup \langle \mathrm{hash}(H \,|\, 1 \,|\, \mathit{nonce}), \varepsilon(\text{"dummy"}), \mathit{prev}, \mathit{next} \rangle$
        end if
        for all $o_j \in C_{rc}$, $1 \le j \le |C_{rc}|$ do
            $H \leftarrow \mathrm{Hilbert}(o_j.x, o_j.y)$
            $ELT \leftarrow ELT \cup \langle \mathrm{hash}(H \,|\, j \,|\, \mathit{nonce}), \varepsilon(o_i.t), \mathit{prev}, \mathit{next} \rangle$
        end for
    end for

Algorithm 2 KNN-Generate($K$, $q_x$, $q_y$)
    $H \leftarrow \mathrm{Hilbert}(q_x, q_y)$
    $R \leftarrow$ KNN-Resolve($\mathrm{hash}(H \,|\, 1 \,|\, \mathit{nonce})$, $K$)
    for all $\varepsilon(o_i.t) \in R$ do
        $result \leftarrow result \cup \varepsilon^{-1}(\varepsilon(o_i.t))$
    end for
    return $result$

Algorithm 3 KNN-Resolve($\mathrm{hash}(H \,|\, 1 \,|\, \mathit{nonce})$, $K$)
    $idx \leftarrow \mathrm{hash}(H \,|\, 1 \,|\, \mathit{nonce})$
    if isEmpty($idx$) then
        $idx \leftarrow idx.next$
    end if
    $qIndexMore \leftarrow idx$
    $qIndexLess \leftarrow idx.prev$
    $R_1, R_2 \leftarrow \emptyset$
    while $|R_1| < K$ do
        $R_1 \leftarrow R_1 \cup ELT[qIndexMore]$
        $qIndexMore \leftarrow qIndexMore.next$
    end while
    while $|R_2| < K$ do
        $R_2 \leftarrow R_2 \cup ELT[qIndexLess]$
        $qIndexLess \leftarrow qIndexLess.prev$
    end while
    $R \leftarrow R_1 \cup R_2$
    return $R$

We conduct several experiments in Section 5.2 to evaluate the accuracy of our approximate KNN and the cost of our exact KNN methods. While the approximate results are fully practical in an LBS scenario, exact results can be derived with reasonable extra overhead.

Figure 2.4: A Range Query Decomposed into Six Maximal Blocks (a), with the Underlying Hilbert Curve (b).

2.3.3 Online Query Processing - Range

During the offline process described in Section 2.3.1, the objects in the original space are encoded and their H-values are stored in ELT. Our general approach to answer a 2-D range query $w(x, y, n_1, n_2)$ is to transform it into a series of 1-D ranges in the Hilbert transformed space; the challenge is how to do this transformation efficiently and accurately. Figure 2.4(b) shows an example of a range query $w(1, 2, 2, 6)$. The result of this query contains the highlighted objects $o_1$ and $o_2$ in Figure 2.4(b), whose H-values are contained in $w(1, 2, 2, 6)$. As the figure illustrates, $w(1, 2, 2, 6)$ can be transformed into two 1-D range queries, one for objects whose H-values range from 8 to 13 and another one for objects whose H-values range from 50 to 55.
Each of these 1-D range queries is called a range query run, or a run for short (a similar notion of runs is also defined by Moon et al. in [MvJFS01]). These runs are highlighted in Figure 2.4(b). The result of the range query consists of objects whose H-values belong to any of these two runs ($o_1$ and $o_2$, with H-values 12 and 50 in our running example).

2.3.3.1 Finding the Range Query Runs

In order to find the query runs, we adopt the method proposed in [CTH00] and show how this method can efficiently respond to a range query. To find the range query runs, the first step is to decompose the input range query into a set of square blocks according to the quadtree decomposition. During this process, the grid space is recursively divided into four equal partitions until a partition is completely contained in the range query. For example, in Figure 2.4(a), the square labeled $c$ is obtained after two recursions, while acquiring the squares $a$ and $b$ needs a third recursion.

After decomposing the range query into a set of square blocks according to the quadtree decomposition, the resulting square blocks are called maximal quadtree blocks, or maximal blocks for short [CTH00]. An interesting property of maximal blocks is that the H-values inside a maximal block form a continuously increasing sequence. The minimum and the maximum H-values of the sequence are denoted as $h_b$ and $h_e$, respectively. Later, we show that we can efficiently find the values of $h_b$ and $h_e$ and, consequently, find all the H-values inside a maximal block. Similar to [CTH00], we denote a maximal block as $MB(x, y, s)$, where $(x, y)$ and $s$ are the lower left coordinate and the side length of the maximal block, respectively. For instance, as illustrated in Figure 2.4(a), $MB(2, 2, 2)$ is a maximal block of the range query $w(1, 2, 2, 6)$ whose H-values form a continuously increasing sequence with $h_b = 8$ and $h_e = 11$. We denote each sequence as a pair in the form of $(h_b, h_e)$.
Hence, the H-values inside $MB(4, 2, 2)$ can be denoted as $(52, 55)$.

To find the maximal blocks of a range query, we use the strip-splitting-based optimal algorithm proposed by Tsai et al. in [TCC04]. The algorithm can find the maximal blocks of a $w(x, y, n_1, n_2)$ range query in $O(n_l)$ time, where $n_l = \max(n_1, n_2)$. The idea is to repeatedly split a strip of maximal blocks from the sides of the range query.

After decomposing the range query into a set of maximal blocks, we need to find the values of $h_b$ and $h_e$ for each maximal block. The calculation of these values is explained in detail in [CTH00] and is skipped here. The H-values obtained from the maximal blocks of a range query form a sequence called $Seq$. For example, the range query in Figure 2.4(b) results in the following sequence:

$$Seq = \{(13, 13), (12, 12), (50, 50), (51, 51), (8, 11), (52, 55)\}$$

The H-values in $Seq$ generated by the strip-based decomposition algorithm may not be in increasing (or decreasing) order. Therefore, we sort the pairs in $Seq$. With the sorted sequence, we can merge adjacent pairs which belong to the same run in order to decrease the total number of runs and, instead, increase their length. We can merge a pair $p$ in the sorted sequence with its successor $q$ if and only if the difference between $h_b$ of $q$ and $h_e$ of $p$ is one. The sequence obtained by merging the elements of the sorted sequence constitutes the runs of the range query, and we denote it as $runs(w)$. For the above example, $runs(w)$ is as follows:

$$runs(w) = \{r(8, 13), r(50, 55)\}$$

Each element of $runs(w)$ is a run in the original range query consisting of H-values from $x$ to $y$, denoted as $r(x, y)$. For example, in Figure 2.4(b), the query has the two runs $r(8, 13)$ and $r(50, 55)$. Sorting and merging the H-values greatly reduces the number of range query runs. For example, we achieved a 74% reduction in the number of runs for 200,000 range queries with various side lengths.
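
The steps above can be sketched end to end as follows. This sketch makes several assumptions: it uses a plain recursive quadtree decomposition rather than the strip-splitting algorithm of [TCC04], it obtains $(h_b, h_e)$ from the fact that a quadtree block of side $s$ covers $s^2$ consecutive H-values rather than from the method of [CTH00], it uses a fixed-orientation Hilbert curve starting at cell (0, 0), and it takes the window as explicit width and height (6 and 2 cells for the running example, as inferred from the quoted H-values). All function names are hypothetical.

```python
def hilbert_value(order, x, y):
    """H-value of cell (x, y) on a 2^order grid (assumed start: cell (0, 0))."""
    side, d = 1 << order, 0
    s = side >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = side - 1 - x, side - 1 - y
            x, y = y, x
        s >>= 1
    return d


def maximal_blocks(order, x0, y0, width, height):
    """Quadtree-decompose the window [x0, x0+width) x [y0, y0+height)."""
    blocks = []

    def visit(bx, by, s):
        if (bx >= x0 + width or bx + s <= x0 or
                by >= y0 + height or by + s <= y0):
            return                                   # disjoint from the window
        if (x0 <= bx and bx + s <= x0 + width and
                y0 <= by and by + s <= y0 + height):
            blocks.append((bx, by, s))               # fully contained
            return
        half = s // 2
        for dx in (0, half):
            for dy in (0, half):
                visit(bx + dx, by + dy, half)

    visit(0, 0, 1 << order)
    return blocks


def block_range(order, block):
    """(h_b, h_e) of an aligned block: s*s consecutive H-values."""
    bx, by, s = block
    h = hilbert_value(order, bx, by)
    hb = h - h % (s * s)
    return hb, hb + s * s - 1


def merge_runs(seq):
    """Sort the (h_b, h_e) pairs and merge pairs that are exactly adjacent."""
    runs = []
    for hb, he in sorted(seq):
        if runs and hb - runs[-1][1] == 1:
            runs[-1] = (runs[-1][0], he)
        else:
            runs.append((hb, he))
    return runs


blocks = maximal_blocks(3, 1, 2, 6, 2)       # running example on an 8 x 8 grid
seq = [block_range(3, b) for b in blocks]
runs = merge_runs(seq)
```

With this orientation, the sketch yields six maximal blocks for the running example, including $MB(2, 2, 2)$ with range $(8, 11)$ and $MB(4, 2, 2)$ with range $(52, 55)$, and merging produces the two runs $r(8, 13)$ and $r(50, 55)$.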

We have so far shown how a range query is converted to a set of runs, each consisting of a continuously increasing sequence of Hilbert values. To privately query the objects in each run, we need to convert each run to its private representation. In order to do that, each run $r(H_s, H_e)$ is first converted to $r(\mathrm{hash}(H_s \,|\, 1 \,|\, \mathit{nonce}), \mathrm{hash}(H_e.next \,|\, 1 \,|\, \mathit{nonce}))$. With this representation, we guarantee that all objects whose H-values fall within $r$ are returned as part of the range query. While evaluating a range query, we need to include all cells that overlap with the query window. As some of these cells only partially overlap with the query, this process might introduce some excess objects in the query result set. However, we show in Section 5.2 that excess objects constitute a marginal fraction of the objects retrieved from ELT and therefore can be easily filtered by the users without affecting the client/server communication cost.

We now discuss the time complexity of the quadtree decomposition algorithm. As mentioned above, the process to find the runs of a range query $w(x, y, n_1, n_2)$ consists of four parts: (1) decomposing $w$ into its maximal blocks, (2) finding the values of $h_b$ and $h_e$ for each maximal block and forming the set $Seq$, (3) sorting the elements of $Seq$, and (4) merging the elements of the sorted sequence and forming $runs(w)$. The first step can be done in $O(n_l)$ as discussed earlier. The second step can be done in $O(n_l \log T)$, where $T = 2^N$ for $N$ being the curve order [CTH00]. The third step can be done in $O(n_l \log n_l)$ using a sorting algorithm such as quicksort. Finally, the fourth step can be done in $O(n_l)$. Consequently, the total running time to find the runs of the range query $w(x, y, n_1, n_2)$ is $O(n_l \log T)$.

The last step of the range query processing is to retrieve relevant objects from ELT. On average, the number of objects in a range query of size $n_1 \times n_2$ is proportional to $n_1 \times n_2$, which results in an equal number of reads from ELT.
Figure 2.5: Algorithms Range-Generate (top) and Range-Resolve (bottom)

Algorithm 4 Range-Generate(x, y, n_1, n_2)
Require: SDK
  result ← ∅
  w_runs ← FindRuns(x, y, n_1, n_2)
  while w_runs ≠ ∅ do
    <run_s, run_e> ← w_runs.GetNextRun()
    H_s, H_e ← H-values corresponding to run_s and run_e
    R ← RangeResolve(private representations of H_s and H_e.next)
    while R ≠ ∅ do
      o_i.t ← decryption of R.GetNextObj()
      result ← result ∪ o_i.t
    end while
  end while
  return result

Algorithm 5 Range-Resolve(s, e)
  R ← ∅
  idx ← s
  if isEmpty(idx) then
    idx ← idx.next
  end if
  while idx ≠ e do
    R ← R ∪ ELT[idx]
    idx ← idx.next
  end while
  return R

We proceed to discuss Range-Generate and Range-Resolve, the two modules involved in online range query processing (Algorithms 4 and 5). For each range query $w(x, y, n_1, n_2)$, Range-Generate first finds the runs of w by calling FindRuns(x, y, n_1, n_2). FindRuns uses the techniques discussed in Section 2.3.3.1 to return a set of runs corresponding to the range query. For each run, Range-Generate privately retrieves the objects corresponding to the run using the Range-Resolve module. The Range-Resolve module privately returns all objects that belong to a run. Range-Generate then decrypts the objects retrieved in the response using SDK and generates the actual query result set.

2.4 Online Query Processing - Exact KNN

As we already discussed in Section 2.3.2, our proposed private KNN query processing algorithm is approximate. To illustrate how we can enable exact KNN queries using the approximate KNN and range query processing modules defined in the previous sections, consider the example in Figure 2.6. Given a KNN query centered at q, we first compute the approximate answer using Algorithms 2 and 3. In our running example, for K = 1 this step results in result = {a, b} or 1NN(q) = a. However, this result is approximate since the actual nearest neighbor of q is c. Let us denote by $r_k$ the distance between q and its K-th closest object according to the approximate KNN algorithm.
Let us use $C_k$ to represent the circle centered at q with radius $r_k$. To obtain the exact KNN result, all objects in $C_k$ should be retrieved, as their distance to q is less than $r_k$. Therefore, we form a $2r_k \times 2r_k$ square centered at q that is guaranteed to include all objects in $C_k$ and use Algorithms 4 and 5 to retrieve the objects it encloses. In Figure 2.6, this process results in retrieving 2 new objects, including c as the actual nearest neighbor of q. In summary, to achieve exact KNN results, an approximate KNN query is followed by a range query that is guaranteed to return all objects that might be closer to q than the result set of the approximate KNN algorithm. This process can, however, be costly for two reasons. First, the approximate nature of the KNN algorithm might result in a relatively large $C_k$. Second, such a large $C_k$ can result in multiple query runs that need to be further processed. In the following section, we introduce our proposed dual curve query processing technique that reduces both of these costs. By improving the KNN query accuracy, we first reduce $r_k$, which results in a smaller region to be passed to our range query module. In addition, we use our proposed technique to reduce the number of query runs in the range query window. Therefore, our proposed technique (i) improves the accuracy of approximate KNN results, (ii) reduces the communication cost of a range query and therefore (iii) improves the computation and communication cost of our exact KNN algorithm. We proceed to present the details of our approach and how it improves the performance of the spatial query processing techniques discussed so far.

Figure 2.6: Exact KNN Query Processing

2.5 Dual Curve Query Resolution

Using a single Hilbert curve as a space encoder for processing the approximate KNN and range queries discussed in Sections 2.3.2 and 2.3.3 has several drawbacks for each algorithm. In this section, we first discuss these drawbacks and then introduce our Dual Curve Query Resolution approach, or DCQR, which overcomes the weaknesses of the former scheme and generates significantly more satisfactory results for both range and approximate KNN queries. As we noted before, such improvements directly affect the exact KNN algorithm as well. We study how DCQR improves the approximate KNN, exact range and, hence, exact KNN algorithms in Sections 2.5.1, 2.5.2 and 2.5.4, respectively.

2.5.1 Approximate KNN Queries: Proximity in Hilbert Curves

A closer study of Hilbert curves reveals two important properties of such curves that can significantly affect the performance of the approximate KNN algorithm proposed in Section 2.3.2 (for ease of reading, throughout the remainder of Section 2.5 we simply omit the word "approximate" when referring to approximate KNN query processing and instead explicitly use the term "exact" to refer to the non-approximate KNN algorithm). First, consider the first-degree curve of Figure 2.7(a). The curve is constructed by traversing a U-shaped pattern. Regardless of its orientation, such a curve fills the space in a specific direction at any given time, sweeping the space in a clockwise fashion. Starting from the first-degree curve of Figure 2.7(a), the curve misses one side in its first traversal. As the curve order grows, the number of missed sides grows exponentially as well, so that an $H_2^N$ curve misses $M = 2^{2N} - 2^{N+1} + 1$ sides of a $(2^N - 1) \times (2^N - 1)$ grid. The above property makes the H-values of certain points grow farther apart as N increases. For instance, the Euclidean distance between points a and d is similar to that of points b and c in the original 2-D space; however, as N grows, due to the above property the difference between the H-values of a and d grows exponentially larger than that of b and c.
Therefore, points closer to two quadrants of the space (i.e., the first and last quadrants filled by the curve) will be spatially furthest from one another in the transformed space.

The second drawback of using a single Hilbert curve for KNN queries is due to the fact that such space-filling curves essentially reduce the dimensionality of the space from 2 (or, in the general case, N) to 1. Naturally, each element in the 1-D space constructed by the Hilbert curve will have two nearest neighbors, compared to the original case where each element (except those at the edges) has four (or, in the general case, 2N) nearest neighbors. Therefore, as [Jag97] suggests, in the best case scenario only half of these nearest neighbors in 2-D space will remain a nearest neighbor of the same point in the transformed 1-D space. It is clear to see how each of these issues plays a negative role in the quality of the result set, given the sensitivity of KNN to how the underlying index structure preserves object proximity.

Figure 2.7: Missed Sides of $H_2^1$ and $H_2^2$ Curves (a), Four Strips Around the Range Query (b), Shifted Range Query (c).

2.5.2 Range Queries: The Number of Query Runs

In Section 2.3.3, we showed how a range query is blindly evaluated by generating the range query runs and looking up ELT for all encoded objects whose H-values are located inside each run. Since ELT is stored at the server side, all requests for runs have to be transferred to the server. Therefore, it is desirable to reduce the number of runs in a range query to minimize both the communication cost of a range query request and the server's load in responding to client requests. Consider the following example. Given a range query $w(x, y, n_1, n_2)$, assume x and y to be odd and $n_1$ and $n_2$ to be even numbers. The range query has four strips, each consisting of only maximal blocks of size $1 \times 1$, i.e., four strips intersecting in the four corners of the range query.
An example of this situation is shown in Figure 2.7(b) for w(1, 1, 4, 4), where the strips consist of 12 maximal blocks of size $1 \times 1$. It is easy to verify that if at least one of x and y becomes even while $n_1$ and $n_2$ remain the same, the number of maximal blocks decreases substantially as some of the strips disappear. For example, in Figure 2.7(c) both x and y are even numbers and there are no strips in the range query. A formal proof of the relationship between the number of quadtree blocks and the query window location can be found in [FJM97]. As mentioned in Section 2.3.3, range query runs are obtained by merging the H-values resulting from the maximal blocks of the range query. This gives us the intuition that a larger number of maximal blocks results in more query runs. We experimentally investigated this, and the result confirms our intuition. Therefore, any technique that reduces the number of maximal blocks will consequently result in fewer query runs.

Finally, it is easy to observe how the combination of the dimension reduction properties of Hilbert curves in evaluating approximate KNN queries and the excessive number of runs generated from a range query can negatively affect the performance of our exact KNN algorithm as well. We now present a solution to rectify the aforementioned shortcomings.

2.5.3 A Dual Curve for Query Processing

Using a single curve to encode the objects in space results in a loss of precision (and thus a negative effect on the overall quality of the returned results) for KNN queries and in server overhead for range queries. We mitigate both of these issues by creating a dual curve, which is a replica of the original curve rotated by 90 degrees and shifted by one unit in both the x and y directions, and by indexing the objects using both curves. As we show in this section, applying the rotation operator improves KNN query results substantially more than the shift operation does.
Conversely, applying the shift operator to the original curve reduces the server load by decreasing the number of query runs, while the rotation operation does not significantly affect the efficiency of our range query processing algorithm.

Figure 2.8: Proximity in the Original Curve (middle) vs. the Rotated (left) and Shifted (right) Dual Curves.

The rotation operator has a positive effect on KNN query evaluation since, by rotating the degree-N curve, all lower-degree curves constructing the main curve are also rotated. At each curve order, the curve rotation ensures that the missed sides generated by the discontinuation of the curve (such as the missed sides between points a and d in Figure 2.7(a)) will be covered by the rotated curve. Therefore, the points deemed spatially far from each other in one curve will be indexed correctly in the other curve. This addresses the first issue of using a single Hilbert curve for KNN queries. A rotated dual curve also mitigates the effect of the second property discussed above by transforming the 2-D space into two 1-D spaces. Therefore, each point will now have two nearest neighbors in each curve (note that these two neighbor pairs can, and often do, overlap). Nevertheless, adding the second curve results in more accurate KNN query results. Although a shifted-only dual curve can also cover many gaps in the original curve, it cannot solve the first shortcoming discussed for KNN queries, because a one-unit shift cannot bring objects with large H-value differences closer to each other in the dual curve, as a shift does not change the curve orientation (see Figure 2.8).

Similar to the above discussion, it is easy to show the positive effect of the shift operator on range query processing. Since applying the rotation operator to the dual curve does not change the number of quadtree blocks (and thus the number of query runs), we only need to study the effect of the shift operator on the number of query runs.
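To make the dependence of the number of maximal blocks on the window position concrete, the following sketch counts quadtree maximal blocks by recursive decomposition. It is only an illustration, not the strip-splitting algorithm of [TCC04]: the function name `count_maximal_blocks`, the 8 x 8 grid, the 0-based coordinates, and the reading of w(x, y, n_1, n_2) as a window with lower-left cell (x, y) and side lengths n_1 and n_2 are all assumptions of the example.

```python
def count_maximal_blocks(x, y, n1, n2, grid=8):
    """Return {block_side: count} for the quadtree blocks covering the window
    [x, x + n1) x [y, y + n2) inside a grid x grid space (grid is a power of 2)."""

    def visit(bx, by, size):
        # Block lies completely outside the window.
        if bx >= x + n1 or bx + size <= x or by >= y + n2 or by + size <= y:
            return {}
        # Block lies completely inside the window: it is a maximal block.
        if x <= bx and bx + size <= x + n1 and y <= by and by + size <= y + n2:
            return {size: 1}
        # Partial overlap: split into four quadrants and recurse.
        half = size // 2
        counts = {}
        for dx in (0, half):
            for dy in (0, half):
                for side, c in visit(bx + dx, by + dy, half).items():
                    counts[side] = counts.get(side, 0) + c
        return counts

    return visit(0, 0, grid)


odd_window = count_maximal_blocks(1, 1, 4, 4)   # x and y odd, as in w(1, 1, 4, 4)
even_window = count_maximal_blocks(2, 2, 4, 4)  # same size, x and y even
print(odd_window, even_window)
```

Under these assumptions the odd-positioned window decomposes into one 2 x 2 block plus twelve 1 x 1 blocks, whereas the even-positioned window of the same size needs only four 2 x 2 blocks, which is the effect the shift operator exploits.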
Shifting the original curve with the translation vector (1, 1) can cause the strips with maximal blocks of size $1 \times 1$ to disappear. Note that, in some situations, a shifted curve might remove some strips around the range query while introducing new ones. One can verify that the above scheme will reduce the total number of maximal blocks in almost 5/16 of the cases for square range queries (roughly similar improvements can be achieved for arbitrary rectangular range queries). This ratio is obtained by considering the 16 possible configurations of even or odd numbers representing a window query's x, y, $n_1$ and $n_2$ values. Therefore, using a shifted and rotated dual curve, we can reduce the number of maximal blocks of a range query, which causes a decrease in the number of query runs.

2.5.4 Exact KNN Queries: Smaller Regions with Fewer Runs

It is now easy to observe how exact KNN query processing benefits from DCQR. The advantages are twofold. For one, DCQR substantially improves the probability of finding a closer K-th nearest object (see Section 5.2.4), which in turn results in a smaller region that needs to be fetched using the range query. Moreover, given a region formed by the approximate KNN algorithm, DCQR frequently reduces the number of runs resulting from the region by identifying the curve that yields fewer runs. Therefore, the combination of the effects of DCQR on the approximate KNN and range algorithms significantly improves the overall performance of exact KNN while avoiding a significant performance penalty. We study the performance of all three classes of algorithms in Section 5.2.

Using the DCQR approach, we need to slightly modify the offline index construction (Section 2.3.1) and accordingly adjust the spatial query processing logic for the range and nearest neighbor algorithms. During the offline phase, we now need a new key SDK′ for the dual curve, which is derived from SDK; in particular, $X_0$ and $Y_0$ are each incremented by one and the curve orientation is rotated by 90 degrees. Consequently, two Hilbert curves $H_2^N$ and $H_2^{\prime N}$ are constructed based on SDK and SDK′, and, visiting each point, we construct two records for ELT and ELT′ as detailed in Section 2.3.1.

Figure 2.9: DCQR Architecture for Spatial Query Processing.

The query processing also follows the logic from Sections 2.3.2, 2.3.3 and 2.4. While evaluating KNN queries, for each query point q we compute its H-values H and H′ on the two curves using SDK and SDK′, respectively. We then initiate two parallel query resolution schemes using both ELT and ELT′ simultaneously to retrieve the K closest matches for each curve separately. Similar to Section 2.3.2, we decode the two result sets and choose the K best candidates (based on their Euclidean distance to q). As for a range query, we generate two parallel queries, one for each curve, and pick the one that results in fewer runs. Furthermore, with regard to complexity, the DCQR query processing computation and communication cost for KNN queries is almost twice that of Algorithm 3, due to traversing two curves and returning twice as many points in the result set. For range queries, using the DCQR approach slightly increases the computation cost (only slightly, owing to the considerable overlap in computing the runs for the two curves) while reducing the communication cost (due to selecting the curve resulting in fewer runs). We experimentally measure these costs and benefits in Section 5.2.

2.6 Proposed End-to-End Architecture

We are now ready to explain how our space encoding technique and the careful placement of the KNN and range query evaluation modules at the client and server sides enable blind evaluation of spatial queries in location-based services. The client (e.g., a user's portable device) issues a spatial query and provides the necessary parameters for each query type.
In order to make the location server privacy-aware, we assume the architecture of Figure 2.9, which details the offline indexing and online query processing of a spatial query, and use the algorithms discussed in Section 2.5 to modify the classic location-based services architecture in the following three ways:

1) A trusted entity is added to the architecture to perform the CreateIndex module once and to create and populate ELT and ELT′ with the encoded information of the POIs. A second function of the trusted entity is to provide users with the (SDK, SDK′) key pair required to encrypt the queries and decrypt the results. To avoid collusion attacks among malicious users and the server, the keys are stored in inexpensive smartcards placed in subscribers' client devices [BP], allowing encoding/decoding of data without sharing the key with the actual end-users. Finally, the trusted entity provides the location server with the two encoded indexes ELT and ELT′ instead of the original data set and keeps the key pair secret from the location server. Note that, as opposed to the anonymization and cloaking techniques discussed in Section 6.2, the trusted entity performs the above actions only once, offline, and thus is not involved in the online query processing schemes.

2) To enable blind evaluation of range and KNN queries, we let clients execute the KNN-Generate and Range-Generate modules, which require the knowledge of SDK and SDK′ to form encrypted requests for the server. The server is in charge of executing the KNN-Resolve and Range-Resolve modules, both of which only require access to the encrypted indexes ELT and ELT′.

3) To execute exact KNN queries, clients first issue an approximate KNN query and use its results to form a subsequent range query that retrieves a set of objects guaranteed to contain the exact results of the original query (Section 2.5.4). Once the results are returned to the client, it uses SDK and SDK′ to decrypt the results and obtain the original object information.
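The client-side logic of step 3, combined with the DCQR selection of the K best candidates from the two curves, can be sketched as follows. This is a simplified illustration under stated assumptions: objects are plain (x, y) tuples that have already been decrypted, `range_query` is a hypothetical callable standing in for the private range query of Algorithms 4 and 5, and the function names are invented for the example.

```python
import math


def k_best(q, candidates, k):
    """Keep the k distinct candidates closest to q by Euclidean distance."""
    return sorted(set(candidates), key=lambda p: math.dist(q, p))[:k]


def refinement_window(q, approx_knn):
    """Square of side 2 * r_k centred at q, where r_k is the distance from q
    to the farthest object of the approximate result."""
    r_k = max(math.dist(q, p) for p in approx_knn)
    return (q[0] - r_k, q[1] - r_k, q[0] + r_k, q[1] + r_k)


def exact_knn(q, k, from_curve, from_dual_curve, range_query):
    """Approximate KNN over both curves, followed by a refining range query."""
    approx = k_best(q, from_curve + from_dual_curve, k)
    window = refinement_window(q, approx)
    # range_query(window) is assumed to return every object inside the window.
    return k_best(q, list(range_query(window)) + approx, k)


# Toy data set and a non-private stand-in for the range query module.
points = [(3, 0), (0, 4), (1, 1), (5, 5)]

def plain_range_query(win):
    x1, y1, x2, y2 = win
    return [p for p in points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]

print(exact_knn((0, 0), 1, [(3, 0)], [(0, 4)], plain_range_query))  # [(1, 1)]
```

In this toy run the two curves return (3, 0) and (0, 4) as approximate candidates, and the refining range query over the resulting square uncovers the true nearest neighbor (1, 1).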
2.7 Summary

In this chapter, we discussed the problem of location privacy in location-based services. We studied the challenges of achieving location privacy and introduced a novel way of blindly evaluating spatial queries by using one-way space transformations to map objects and query points into an unknown space and addressing the query in this new space. The major contributions of our work can be summarized as follows. (i) We proposed blind evaluation of queries using Hilbert curves as space encoders, introduced DCQR, our proposed Dual Curve Query Resolution approach, and showed how it improves the performance of blind range and KNN queries. (ii) We showed how our proposed space transformation employs space-filling curves and cryptographic one-way hash functions to achieve stronger and more generalized measures of privacy than those of the commonly used K-anonymity and spatial cloaking techniques, without incurring the prohibitive costs of private information retrieval-based approaches. (iii) We conducted extensive experiments on our range, approximate KNN and exact KNN algorithms using real-world and synthetic datasets to evaluate the performance of our proposed scheme (see Chapter 5).

Chapter 3
Spatial Query Processing Using Private Information Retrieval

3.1 Introduction

In this chapter, we use Private Information Retrieval (PIR) techniques to provide a more generalized and powerful way of blinding the untrusted location server by converting spatial query processing into several private database retrievals from the location server. A PIR protocol allows a client to secretly request a record stored at an untrusted server without revealing the retrieved record to the server. Therefore, instead of only blurring user queries (as in anonymity and cloaking approaches), or processing the query in a transformed space (as in our space transformation technique), we use PIR to protect the queried content.
This way, no information is leaked to adversaries who examine the records requested from the server. In contrast, anonymity and cloaking approaches, by design, cannot avoid this information leakage to achieve perfect secrecy, regardless of the underlying information-hiding technique used.

The use of PIR techniques as a fundamentally novel approach to protect user privacy in location-based services was first proposed in [KSMS] and [GKK+]. However, as we show in this chapter, the approach proposed in [GKK+] has three important restrictions compared to our approach: (i) it is limited to blind evaluation of first nearest neighbor queries, (ii) it cannot avoid a linear scan of the entire database for processing each query and (iii) the communication complexity of each query is also very high (roughly $\sqrt{n}$, where n represents the size of the dataset). We extend our initial work in [KSMS] and propose several algorithms to efficiently process range and K-nearest neighbor (KNN) queries. Therefore, our framework supports a more general class of queries fundamental to many location-based services while satisfying stringent privacy guarantees similar to those provided by [GKK+] against the most powerful adversaries. We empirically compare our proposed approaches with [GKK+] for 1NN queries and elaborate on why they incur significantly less communication and computation complexity. The above benefits come at the cost of utilizing a secure coprocessor to execute the PIR scheme. Although we utilize secure coprocessors and a specific PIR scheme, the PIR module can be treated as a black box throughout the query resolution process, and thus any other practical PIR approach (that is either proposed or will emerge) can replace our current PIR scheme.

While PIR can be used to privately generate a query result set, avoiding a linear private scan of the entire database is challenging.
This is due to the fact that the server owning the object information cannot be trusted to perform the query processing and choose what is to be queried. Alternatively, moving this knowledge to the users will require the query processing to happen at the client side, which results in a high communication cost for transferring the queried information [GKK+]. Utilizing PIR, we employ three alternative index structures that greatly reduce the amount of information that is privately queried from the untrusted server. We have both analytically studied the performance of these index structures and performed extensive sets of experiments on real-world and synthetic datasets to empirically verify our analytical results. We show that although PIR constitutes around 90% of the overall query processing time, the total query response time is still in the order of milliseconds with proper system parameters.

A preliminary version of this work appeared in [KSMS, KSSM10]. This chapter subsumes [KSMS, KSSM10] by presenting two new private index structures as well as three KNN query processing algorithms that use these index structures. Finally, we conducted a series of experiments with three datasets to experimentally compare our techniques with several related studies (see Section 5.3).

The remainder of this chapter is organized as follows. Section 3.2 provides a background on location privacy preliminaries and PIR. In Section 3.3, we propose our private index structures and several algorithms to privately evaluate range and KNN queries. In Section 3.4, we discuss how to derive optimal system parameters for our framework, and Section 3.5 shows how the aforementioned algorithms incorporate a practical PIR scheme to enable location privacy. We conclude this chapter in Section 3.6 with a summary of our contributions and our future work.
3.2 Preliminaries

As we discussed in Chapter 1, a drawback inherent in cloaking and anonymity-based approaches to location privacy is the information leak whereby an adversary with strong prior knowledge is able to infer sensitive user location information based on the information queried from the server. Our motivation is to achieve the significantly more stringent privacy guarantees of PIR by converting spatial query processing into several private database retrievals. Therefore, one of the key challenges behind such a framework is devising spatial algorithms that enable query evaluation using a privacy-aware server that only responds to private object information retrievals and does not possess any location information. In the following section, we provide an overview of several PIR schemes and the one we utilize in our framework.

3.2.1 Private Information Retrieval

Suppose Bob owns a database DB of n objects and Alice is interested in DB[i]. Although Bob might know the entire content of DB, Alice is not willing to disclose i to Bob. A Private Information Retrieval (PIR) protocol allows Alice to privately retrieve DB[i] from Bob while ensuring that he does not learn the value of i.

The PIR problem was first proposed by Chor et al. [CKGS98] in an information-theoretic setting; their work also proves that any theoretical PIR scheme has a lower communication bound equal to the database size, although it provides perfect secrecy against an adversary with unbounded computational power. In order to mitigate the communication cost, computational PIR schemes consider a computationally bounded adversary, where the security of the approaches relies on the intractability of a computationally complex mathematical problem, such as the Quadratic Residuosity Assumption [KO]. However, similar to information-theoretic PIR, this class of approaches cannot avoid a linear scan of all database items per query.
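As a point of reference for what a private read looks like, the following toy sketch implements the classic XOR-based information-theoretic scheme with two replicated servers that are assumed not to collude. This is a swapped-in textbook construction shown only for intuition; it is not the hardware-based scheme employed in this chapter, and the names `server_answer` and `pir_read` as well as the integer-valued records are assumptions of the example.

```python
import secrets


def server_answer(db, subset):
    """What each server computes: the XOR of the records at the requested indices."""
    acc = 0
    for idx in subset:
        acc ^= db[idx]
    return acc


def pir_read(db_copy1, db_copy2, i):
    """Client side: retrieve record i without either server learning i."""
    n = len(db_copy1)
    # A uniformly random subset of indices goes to the first server ...
    s1 = {j for j in range(n) if secrets.randbits(1)}
    # ... and the same subset with index i toggled goes to the second server.
    s2 = s1 ^ {i}
    # Every record except record i appears in both answers or in neither,
    # so XOR-ing the two answers cancels everything but db[i].
    return server_answer(db_copy1, s1) ^ server_answer(db_copy2, s2)


db = [5, 17, 42, 8]
print(pir_read(db, list(db), 2))  # 42
```

Each server sees a uniformly random subset of indices, so neither learns i on its own, yet each must touch a number of records that grows linearly with n, which illustrates the per-query cost that motivates the alternatives discussed in this section.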
To obtain perfect privacy while avoiding the high cost of the approaches discussed above, a new class of hardware-based PIR approaches has recently emerged which places the trust in a tamper-resistant hardware device. These techniques benefit from highly efficient computations at the cost of relying on a hardware device to provide privacy [Aso04, AF, SS00, IS04]. Placing a trusted module very close to the untrusted host allows these techniques to achieve optimal computation and communication cost compared to the computational and theoretical PIR approaches.

Therefore, we employ hardware-based PIR techniques as the building block for our privacy-aware location server to achieve acceptable communication and computation complexity. However, it is important to note that we treat the PIR module as a black box throughout the query resolution process, and thus any other practical PIR scheme can be incorporated into our current framework. For now, we assume that the operation read(DB, i) privately retrieves the i-th element of the encrypted database DB stored at the untrusted server using the hardware-based PIR scheme. We use $t_{read}$ to denote the time it takes to perform a private read from the database. In Section 3.5, we detail how a secure coprocessor is used to enable hardware-based PIR and provide more details about DB as well as the complexity of the read operation performed on such a database.

3.3 Privacy Aware Query Processing

So far we have enabled private retrieval from an untrusted server. However, we have not focused on how spatial queries can be evaluated privately. Section 3.2.1 enables replacing a normal database in conventional query processing with its privacy-aware variant. However, the query processing needs to be able to utilize this new privacy-aware database as well.
Note that what prevents us from using encrypted databases is the impossibility of blindly evaluating a sophisticated spatial query on an encrypted database without a linear scan of all encrypted items.

In this section, we employ private index structures that enable blind evaluation of spatial queries efficiently and privately. Using these index structures, we devise a sweeping algorithm to process range queries and three algorithms, namely Progressive, Hierarchical and Hilbert-based (or Hilbert for short), to privately evaluate KNN queries. We assume mobile users subscribe to the untrusted location server to query points of interest (POI) data such as restaurants and hospitals.

The key idea behind our approach is to use PIR to privately query the index structures (Section 3.3.1) stored at the untrusted server to perform spatial queries. The algorithms discussed in this section employ the read() function (Section 3.2.1) to privately retrieve relevant records from the server's database. The choice of where these algorithms are executed depends on the underlying PIR protocol employed. As we elaborate in Section 3.5, we place trust in a secure coprocessor (SC) residing at the server side to execute the range and KNN algorithms. We defer the discussion of the roles of the users, the server and the SC in our hardware-based implementation of PIR to Section 3.5. Note that a PIR protocol guarantees that the server cannot gain any information about what records are actually being retrieved for query processing. In the following section, we show how we employ private spatial index structures to avoid a linear scan of all database records that are hosted at the untrusted server.

3.3.1 Querying Private Index Structures with PIR

In this section, we present three index structures that allow blind evaluation of range and KNN queries. Each index structure is constructed, encrypted and stored at the untrusted host during a preprocessing step.
These index structures allow us to efficiently scan only a subset of the records stored in DB at the untrusted server while processing spatial queries. Later in Section 3.3.3, we show how each KNN query reduces to a range query, which itself reduces to several private reads from the database.

The range algorithm and the three KNN algorithms discussed below utilize a regular grid structure to construct the indexing used by the query processing module. The key reason behind using a grid structure in our framework is that, while being efficient, grids simplify the query processing. While other spatial indexes such as R-trees and kd-trees provide efficient spatial query processing, they require the client to incrementally retrieve the underlying tree structure to further guide the search. This approach is very costly when using PIR since, aside from the data (i.e., object information), the tree navigation must also be performed privately using PIR. As we show in Section 5.3.7, PIR significantly dominates the cost of query processing. Knowing the grid granularity, we can quickly convert a spatial query into private retrievals of certain records without having to privately query a tree-based representation of the objects first. Several studies have shown the significant efficiency of using the grid structure for evaluating range, KNN and other types of spatial queries [YPK, XMA, KPH04].

Figure 3.1: The Private Index Structures

The spatial characteristics of indexing objects with regular grids make it challenging to readily use PIR to query grid cells. Later in Section 3.4, we detail how we modify the original private indexes developed in this section for performance and privacy reasons. More specifically, Section 3.4.1 discusses how to derive the optimum grid granularity according to the costs of different PIR and non-PIR operations, and in Section 3.4.2 we propose a secure padding scheme to prevent a potential information leakage while using PIR to query grid cells.
Without loss of generality, we assume the entire area enclosing all objects is represented by a unit square. We uniformly partition the unit square into an $M \times M$ grid where each grid cell has side length $\delta$ ($0 < \delta \le 1$) such that $M = 1/\delta$ is a power of 2 (see Figure 3.1). Each cell is identified by its sequential cell ID ($c_{id}$) and may contain several objects, each being represented by the triplet $<x_i, y_i, obj_{id}>$. These cells are then used to construct the listDB, countDB and hilbDB index structures.

The listDB index stores the objects and their location information for each cell. The listDB schema represents a flat grid and looks like $<c_{id}, list>$, where list is a sequence of triplets representing the objects falling in each grid cell. Note that listDB does not store a hierarchical representation of aggregate information regarding the object distributions (e.g., cell density). Such a hierarchical representation is beneficial in pruning a large portion of the search space during query processing. The next index structure is introduced to capture such aggregate cell information.

The countDB index maintains a multi-level hierarchical index storing the number of objects in each cell at each depth. A cell at depth i is constructed by merging four adjacent cells of depth $i - 1$ and adding up their object counts. The intuition behind countDB is to keep count of the number of objects in each region (note that multi-level storage of all object coordinates for all depths in listDB causes redundancy and incurs a significant space overhead). The schema of countDB looks like $<c_{id}, depth, count>$, where each tuple represents the number of objects falling in cell $c_{id}$ at each depth. Since this schema is not consistent with our definition of PIR, in which the key for any retrieval is only a single number (i.e., the i-th element), we design a mapping to convert countDB into a plain key-value relation. Observe that, starting from an $M \times M$ grid, at each depth d the grid has $\frac{M}{2^d} \times \frac{M}{2^d}$ cells, where $d \le \log_2 M$.
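Because depth d contributes $(M/2^d)^2$ cells, one way to obtain a single integer key for every (cell, depth) pair is to offset the cell IDs of each depth by the total number of cells at all shallower depths, in the spirit of the convert mapping of Equation 3.1. The sketch below assumes 0-based cell IDs within each depth; that convention and the argument names are assumptions made for the illustration.

```python
def convert(c_id, depth, m):
    """Flatten (c_id, depth) into one key for an m x m base grid (m a power of 2).

    Depth j holds (m / 2**j) ** 2 cells, so the keys of depth `depth` start
    right after all cells of depths 0 .. depth - 1.
    """
    offset = sum((m >> j) ** 2 for j in range(depth))
    return offset + c_id


# For a 4 x 4 base grid: 16 cells at depth 0, 4 at depth 1 and 1 at depth 2.
print(convert(5, 0, 4), convert(0, 1, 4), convert(3, 1, 4), convert(0, 2, 4))
# 5 16 19 20
```

With this layout the keys of a 4 x 4 grid occupy the contiguous range 0 to 20 without collisions, so countDB can be stored as a plain sequence addressable by a single index.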
Therefore, we devise the function convert (Equation 3.1) to rewrite countDB's schema as $\langle i, count \rangle$, where $i$ is a unique cell identifier.

$$ i = \begin{cases} c_{id} & d = 0 \\ convert(c_{id}, d) = \sum_{j=0}^{d-1} \left( \frac{M}{2^j} \times \frac{M}{2^j} \right) + c_{id} & d \ge 1 \end{cases} \tag{3.1} $$

Given a skewed dataset, a KNN algorithm utilizing the grid structure will experience performance degradation caused by the numerous empty cells in the grid. Increasing $\delta$ does not solve this problem, as it results in coarse-grained cells containing many objects that have to be queried and processed, while even a linear decrease in $\delta$ incurs at least a quadratic increase in the number of empty cells. The hilbDB index uses Hilbert space-filling curves to avoid these shortcomings of processing KNN queries using regular grids (in particular for skewed datasets). The main intuition behind using Hilbert curves is to use their locality-preserving properties to efficiently approximate the nearest objects to a query point by indexing and querying only the non-empty cells. As we show in Section 5.3, this property significantly reduces the query response time for skewed datasets.

We define $H_2^N$ ($N \ge 1$), the $N$th order Hilbert curve in a 2-dimensional space, as a linear ordering which maps an integer set $[0, 2^{2N}-1]$ into a 2-dimensional integer space $[0, 2^N-1]^2$, defined as $H = \mathcal{H}(P)$ for $H \in [0, 2^{2N}-1]$, where $P$ is the coordinate of each point. The output of this function is denoted by H-value. To create the hilbDB index, an $H_2^N$ Hilbert curve is constructed traversing the entire space. After visiting each cell $C$, its $c_{id} = \mathcal{H}(C)$ is computed using the center of $C$. We use an efficient bitwise interleaving algorithm from [FR89] to compute the H-values (the cost of performing this operation is $O(n)$, where $n$ is the number of bits required to represent a Hilbert value). Next, similar to the listDB index, the $c_{id}$ values are used to store object information for each cell.
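The two mappings introduced above can be sketched in Python as follows. This is only an illustrative sketch: `convert` follows Equation 3.1 directly, while `hilbert_value` uses the widely known iterative rotate-and-flip formulation rather than the interleaving algorithm of [FR89], and it takes integer cell coordinates instead of cell centers. Both function names and their parameters are assumptions made for this example.

```python
def convert(c_id, depth, m):
    """Flat key of cell c_id at the given depth of an m x m grid (Equation 3.1)."""
    offset = sum((m >> j) ** 2 for j in range(depth))  # cells in all finer levels
    return offset + c_id


def hilbert_value(order, x, y):
    """H-value of the integer cell (x, y), with 0 <= x, y < 2**order."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the sub-curve is oriented correctly
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d
```

With $M = 4$, the 16 cells of depth 0 keep the keys 1 to 16, the four cells of depth 1 receive the keys 17 to 20, and the single cell of depth 2 receives the key 21, so every countDB tuple obtains a distinct single-number key suitable for a PIR read.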
Finally, in order to guide the next retrieval, each hilbDB record also keeps the indices of its non-empty $c_{id}$ neighbors in hilbDB, stored in the Prev and Next columns, respectively. These two values allow us to find out which cell to query next from hilbDB hosted at the server. Figure 3.1 illustrates the original object space and the above three index structures. The circled numbers denote each cell's $c_{id}$ constructed by $H_2^2$ for the hilbDB index. For clarity, we have kept the original 3-column format for countDB and we have not shown that all records are in fact stored in an encrypted format.

3.3.2 Private Range Queries

Using the listDB index, processing range queries is straightforward. We use a sweeping algorithm to privately query the server for all cells which overlap with the specified range. A range query $R$ is defined as a rectangle¹ of size $l \times w$ ($0 < l, w \le 1$). Therefore, to answer each range query using listDB, we must first find the set of cells $R'$ that encloses $R$. $R'$ forms an $L \times W$ rectangular grid where $L \le \lceil l/\delta \rceil + 1$ and $W \le \lceil w/\delta \rceil + 1$ (see Figure 3.2a). The function $read(listDB, c_{id})$ privately queries listDB and performs the necessary processing to return a list of all objects enclosed in cell $c_{id}$. Algorithm 6 provides the pseudocode for evaluating range queries by sweeping over the cells in $R'$.

Algorithm 6 Range(R)
Require: P_ll = R's lower left point, P_ur = R's upper right point
  S ← ∅;
  for (col = ⌈P_ll.x / δ⌉; col ≤ ⌈P_ur.x / δ⌉; col++) do
    for (row = ⌈P_ll.y / δ⌉; row ≤ ⌈P_ur.y / δ⌉; row++) do
      c_id = (row − 1)/δ + col;
      L = read(listDB, c_id);
      for all o_i ∈ L do
        if ((P_ll.x ≤ o_i.x ≤ P_ur.x) and (P_ll.y ≤ o_i.y ≤ P_ur.y)) then
          S = S ∪ {o_i};
        end if
      end for
    end for
  end for
  return S;

The time complexity of Algorithm 6 can be written as follows.
$$ t_{range} = \left( \left\lceil \frac{l}{\delta} \right\rceil + 1 \right) \times \left( \left\lceil \frac{w}{\delta} \right\rceil + 1 \right) \times t_{read} = O\left( \frac{l \, w \, t_{read}}{\delta^2} \right) \tag{3.2} $$

Since $R \ne R'$, for the range $R$ of size $l \times w$, Algorithm 6 queries $L \times W$ cells, which is linear with respect to $area(R)$. For a uniform distribution of $n$ objects, each cell on average contains $n\delta^2$ items. Therefore, the total number of items queried is $O(LW\delta^2 n)$, which is also linear with respect to $n$.

¹ Without loss of generality, we assume the range query has a rectangular shape; circular regions can be queried by retrieving their enclosing rectangles and filtering the false positives at the client side.

3.3.3 Private KNN Queries

The main challenge in evaluating KNN queries arises from the fact that the distribution of points can affect the size of the region $R$ that contains the result set (and hence the cells that should be retrieved). In other words, no region is guaranteed to contain the $K$ nearest objects to a query point (except in a uniform distribution), which implies that $R$ has to be progressively computed based on the object distribution. Therefore, it is important to minimize the total number of cells that should be privately queried. In Sections 3.3.3.1 through 3.3.3.3, we examine three variants of evaluating KNN queries and discuss how each index structure allows us to query only a small subset of the entire object space. Note that due to the strong similarity of these algorithms with their first nearest neighbor counterparts, we directly consider the more general case of KNN.

The following general approach is used by each of our KNN algorithms: (i) create a region $R$ and set it to the cell containing the query point $q$, (ii) expand $R$ until it encloses at least $K$ objects, (iii) compute the safe region $R'$ as the region guaranteed to enclose the result set, and (iv) find the actual $K$ nearest objects in $R'$ using $range(R')$ defined above. The main difference among the three algorithms lies in how they perform step (ii).
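Since step (iv) of every KNN variant relies on the range query of Section 3.3.2, a minimal Python sketch of Algorithm 6 may be helpful. It is only an illustration under several assumptions: cells are numbered row-major from 1, listDB is a plain dictionary, and `plain_read` is a hypothetical, non-private stand-in for the private $read(listDB, c_{id})$ operation that merely logs which cells are retrieved.

```python
import math


def range_query(list_db, m, lower_left, upper_right, read):
    """Return the ids of objects inside the rectangle, reading one grid cell at a time."""
    (x1, y1), (x2, y2) = lower_left, upper_right

    def index(v):  # 1-based row/column of coordinate v, clamped to the grid
        return min(m, max(1, math.ceil(v * m)))

    result = []
    for col in range(index(x1), index(x2) + 1):
        for row in range(index(y1), index(y2) + 1):
            c_id = (row - 1) * m + col
            for x, y, obj_id in read(list_db, c_id):
                if x1 <= x <= x2 and y1 <= y <= y2:  # drop objects outside R
                    result.append(obj_id)
    return result


reads = []  # cells retrieved so far, in order


def plain_read(db, c_id):
    """Non-private stand-in for read(listDB, c_id) that logs each retrieval."""
    reads.append(c_id)
    return db.get(c_id, [])


list_db = {
    1: [(0.1, 0.1, "a")],
    2: [(0.3, 0.1, "b"), (0.3, 0.2, "d")],
    16: [(0.9, 0.9, "c")],
}
hits = range_query(list_db, 4, (0.05, 0.05), (0.35, 0.15), plain_read)
```

In this example the query rectangle overlaps two cells, so exactly two reads are issued, and object "d" is filtered out because it lies in a retrieved cell but outside the rectangle.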
Regardless of the approach, $R$ is not guaranteed to contain the actual $K$ nearest neighbors of $q$ because the circular region around $q$ is approximated with the rectangular region $R$ (see Figure 3.2b, where $O_7 \in 2NN(q)$ but $O_7 \notin R$, and $O_2 \in R$ although $O_2 \notin 2NN(q)$).

Figure 3.2: The Range Query $R$ (a), Computing the Safe Region $R'$ for the KNN Query (b)

Therefore, $R$ has to be expanded to a safe region $R'$, which is the rectangle enclosing the circular region that contains the actual $K$ nearest objects to the query point. As shown, if the relative location of $q$ and its furthest neighbor in $R$ is known, a safe region can be constructed. It is easy to verify that $R'$ is a square with sides $2\lceil \| c_q - far_q(K) \| \rceil$, where $c_q$ is the cell containing $q$, $far_q(K)$ is the cell containing $q$'s $K$th nearest object in $R$, and $\| \cdot \|$ is the Euclidean norm [YPK]. Once the safe region is computed, the objects located in $R'$ are retrieved and added to the result set. We slightly modify the range algorithm to avoid querying cells previously checked during the computation of $R$. We now elaborate on how different expansion strategies for step (ii) generate different results and discuss the pros and cons of each strategy.

3.3.3.1 Progressive Expansion

With this approach, if $K$ objects are not found in the cell containing the query point, we expand the region $R$ in a concentric pattern until it encloses $K$ objects. The most important advantage of this method is its simple and conservative expansion strategy. This property guarantees that the progressive expansion minimizes $R$. However, the very same conservative strategy might also become its drawback, because for non-uniform datasets it takes more time until the algorithm reaches a valid $R$. Algorithm 7 details the progressive expansion strategy. Once the querying cell is identified, objects within that cell are privately retrieved and added to the result set (lines 2-4).
Next, the region starting from the query cell is expanded progressively (the Expand function) and objects in the examined cells are added to $S$ until $K$ objects are found (lines 5-10). Then, the safe region $R'$ is computed and queried, and the final result set is constructed (lines 11-12). Note that Algorithm 7 only uses the listDB index to evaluate a KNN query. The time complexity of this algorithm can be written as follows.

$$ t_{progressive} = O\left( \frac{K}{n\delta^2} \, t_{read} + t_{range} \right) \tag{3.3} $$

The first term in Equation 3.3 corresponds to the average number of private cell retrievals needed to find the first $K$ objects, and the second term denotes the time it takes to query the cells added by the safe region and construct the actual result set. In contrast to $t_{range}$, the time complexity of processing KNN queries varies with $n$ (i.e., the number of objects in the database). Also note that due to the conservative and symmetric expansion, $|R' - R|$ is relatively small compared to $|R|$.

Algorithm 7 KNN-Progressive(K, q)
Require: K, q
 1: S ← ∅; cnt ← 0;
 2: cell.x = ⌈q.x / δ⌉; cell.y = ⌈q.y / δ⌉;
 3: c_id = (cell.y − 1)/δ + cell.x;
 4: S = read(listDB, c_id);
 5: let region ← cell c_id;
 6: cnt = |S|;
 7: while (cnt < K) do
 8:   region = Expand(region);
 9:   S = S ∪ read(listDB, region.cellids_new);
10:   cnt = |S|;
11: end while
12: R′ = safeRegion(S);
13: return Range(R′);

3.3.3.2 Hierarchical Expansion

The hierarchical approach takes a completely different strategy to construct the region $R$. Given the query point $q$, at each step, if $K$ items are still not found, the algorithm moves one level higher in the cell hierarchy and picks $q$'s parent, grandparent, etc. until $R$ encloses $K$ objects. The advantage of moving from a cell to its parent lies in the fact that constructing $R$ reduces to very few reads of the countDB index instead of progressively and privately counting the number of objects in each regular cell.
Furthermore, this approach trivially overcomes the limitation of the progressive expansion and converges very quickly. However, it suffers from another drawback: the exponential increase in the size of $R$ results in a relatively large $R'$, which in turn increases the overhead of privately retrieving the numerous cells in $R'$. The time complexity of the hierarchical algorithm is given in Equation 3.4. Note that due to its expansion strategy, the $t_{range}$ term here is significantly higher than $t_{range}$ from Algorithm 7. Furthermore, in contrast to Algorithm 7, the hierarchical algorithm requires countDB to find the safe region and listDB to finally find all objects enclosed in the safe region. Due to the simplicity of the hierarchical expansion approach and its similarity to Algorithm 7, we do not list its pseudocode here.

$$ t_{hierarchical} = O\left( \left( \frac{K}{n\delta^2} + \log_4 \frac{4K}{n\delta^2} \right) t_{read} + t_{range} \right) \tag{3.4} $$

Figure 3.3: Progressive, Hierarchical and Hilbert KNN Algorithms

3.3.3.3 Hilbert Expansion

The progressive expansion strategy, and in general similar linear expansion strategies such as the one proposed in [YPK], are usually very efficient in finding the safe region for the query $q$ due to their simplicity. However, as noted in Section 3.3.1, the time it takes to find the region that includes at least $K$ objects can be prohibitively large for non-uniform datasets (see Section 5.3). The Hilbert expansion overcomes this drawback by navigating the hilbDB index, which only stores cells that include at least one object. The main advantage of using hilbDB is that once the cell with the closest H-value to $\mathcal{H}(q)$ is found, expanding the search in either direction requires only $K - 1$ more private reads to generate a safe region. This 1-dimensional search gives a large performance gain at the cost of generating a larger search region due to the asymmetric and 1-dimensional Hilbert expansion. In Section 5.3, we extensively study these trade-offs.
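The 1-dimensional search over non-empty cells can be sketched in Python as below. This is a simplified illustration and not the exact procedure of this work: hilbDB is assumed to be a dictionary from the H-value of each non-empty cell to its object list, a sorted key list stands in for following the Prev and Next links, whole cells are added at once, and the query's H-value `h_q` is assumed to be already computed. The name `hilbert_expand` is hypothetical.

```python
from bisect import bisect_left


def hilbert_expand(hilb_db, h_q, k):
    """Collect at least k candidates by walking outward from H-value h_q.

    hilb_db maps the H-value of each non-empty cell to its object list.
    Returns (candidates, number_of_cell_reads).
    """
    keys = sorted(hilb_db)       # stands in for following Prev/Next links
    hi = bisect_left(keys, h_q)  # first non-empty cell with H-value >= h_q
    lo = hi - 1                  # its predecessor on the curve
    found, reads = [], 0
    while len(found) < k and (lo >= 0 or hi < len(keys)):
        if hi < len(keys):       # one read in the increasing direction
            found.extend(hilb_db[keys[hi]])
            reads += 1
            hi += 1
        if len(found) < k and lo >= 0:  # one read in the decreasing direction
            found.extend(hilb_db[keys[lo]])
            reads += 1
            lo -= 1
    return found, reads


hilb_db = {2: ["a"], 7: ["b", "c"], 9: ["d"], 14: ["e"]}
candidates, cell_reads = hilbert_expand(hilb_db, 8, 3)
```

Because empty cells never appear in the index, the number of reads depends only on how many non-empty cells are needed to gather $K$ objects, not on how sparse the surrounding space is. The candidates still have to be turned into a safe region and refined by a range query.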
Algorithm 8 illustrates the (simplified) Hilbert expansion approach. The search begins by computing the Hilbert value of $q$ and identifying its immediate neighboring cells in the Hilbert space (lines 1-4). Next, these cells are privately read, the objects within them are added to the result set (using the HasMoreObjects and NextPOI functions), and the search expands in Hilbert space until $K$ objects are found (lines 5-16). The rest is similar to Algorithm 7. Equation 3.5 gives its complexity (the curve order is a very small number, $N < 10$). $t_{Hilbert}$ consists of the time to compute $\mathcal{H}(q)$, the time to find its closest $c_{id}$'s (accessing at most $K - 1$ other records) in order to construct the safe region, and finally the time to perform a range query, respectively.

$$ t_{Hilbert} = O\left( \log 2^{2N} + K \times t_{read} \times \log 2^{2N} + t_{range} \right) = O\left( K \times t_{read} + t_{range} \right) \tag{3.5} $$

Algorithm 8 KNN-Hilbert(K, q)
Require: K, q
 1: c_id ← ℋ(q.x, q.y);
 2: qIndexMore ← min{c_id | c_id ∈ hilbDB & c_id ≥ ℋ(q.x, q.y)};
 3: G ← read(hilbDB, qIndexMore);
 4: qIndexLess = G.prev;
 5: cnt ← 0; S ← ∅;
 6: while (cnt < K) do
 7:   G ← read(hilbDB, qIndexMore);
 8:   while (G.HasMoreObjects() && (cnt < K)) do
 9:     S = S ∪ G.NextPOI();
10:     cnt++;
11:   end while
12:   qIndexMore = G.next;
13:   G ← read(hilbDB, qIndexLess);
14:   while (G.HasMoreObjects() && (cnt < K)) do
15:     S = S ∪ G.NextPOI();
16:     cnt++;
17:   end while
18:   qIndexLess = G.prev;
19: end while
20: R′ = safeRegion(S);
21: return Range(R′);

Figure 3.3 illustrates the three variants of evaluating KNN queries privately. The numbers in each cell show the step at which the cell is examined. For readability, cells examined in step 3 are shaded instead of being numbered in the first two images.

3.4 Optimizations

In this section, we first discuss how the granularity of the underlying grid can affect query processing and how the right granularity can be computed. Next, we discuss our secure padding scheme which, while protecting data privacy, incurs up to 90% less overhead compared to a naive padding approach.
3.4.1 Grid Granularity

Choosing the right value of $\delta$ can significantly improve the overall efficiency of the above algorithms. There is a trade-off between two competing factors in choosing the right value of $\delta$. As $\delta$ grows, a coarser grid (having fewer cells) decreases the total number of cell retrievals. This is desirable given the relatively high cost of each private read from the database. However, large cells result in retrieving more excess (unneeded) objects which coexist in the cell being retrieved. These excess objects result in higher computation and communication complexity, which increases the overall response time. Small values of $\delta$ result in the inverse scenario. The following theorem shows how the optimal value of $\delta$ can be calculated offline (a simplified version of this theorem can be found in [YPK]).

Theorem 1. For a grid of size $1 \times 1$ which contains $n$ objects, $\delta = \sqrt[3]{\frac{t_c}{t_o}} \cdot \frac{1}{\sqrt{n}}$ is the optimal grid granularity to minimize the total query time for all three algorithms, where $\frac{t_c}{t_o}$ denotes the relative time complexity of identifying a cell and querying it, compared to processing an object.

PROOF: For all three algorithms, the overall query processing time $T$ is dominated by the safe region (i.e., $R'$) computation, which is the time required to query $n_c$ cells and to process the $n_o$ objects located in them; thus $T = t_c n_c + t_o n_o$. We show how the optimal value of $\delta$ can be derived for the progressive expansion strategy. The proof for the other two algorithms is similar. Computing $R'$ for a KNN query involves finding the circle $C(o, r)$, centered at point $o$ with radius $r$, which includes the $K$th nearest object to the query point. Therefore, $r \approx \sqrt{\frac{K}{\pi n}}$. A square $S$ of $n_c = \frac{(2r+\delta)^2}{\delta^2}$ cells bounding $C$ is guaranteed to contain the result set. $S$ on average includes $n_o \approx (2r+\delta)^2 n$ objects. Replacing $n_c$ and $n_o$ in $T = t_c n_c + t_o n_o$ with the above values and setting $\frac{\partial T}{\partial \delta} = 0$ yields $\delta^3 = \frac{t_c \, r}{t_o \, n}$, or $\delta = \sqrt[3]{\frac{t_c}{t_o} \sqrt{\frac{K}{\pi}}} \cdot \frac{1}{\sqrt{n}}$.
For $K \ll n$, the above formula can be simplified to $\delta = \sqrt[3]{\frac{t_c}{t_o}} \cdot \frac{1}{\sqrt{n}}$. ∎

In Section 5.3, we empirically find $\delta$ using our measured values of $\frac{t_c}{t_o}$ and show that it is very close to the value calculated above.

3.4.2 Secure Record Decomposition and Padding

A closer look at the listDB, countDB and hilbDB indexes, stored in an encrypted form at the untrusted server, reveals that the records of the listDB and hilbDB indexes do not all have the same length; the size of each record is in fact determined by the object distribution in the 2-D space. The reason for this difference is that countDB only stores aggregate object information, as opposed to the other two indexes, which store exact object locations. The unequal record sizes of listDB and hilbDB can result in several security vulnerabilities while using the PIR algorithm, because Algorithm 9 assumes all database records have the same size. If the records have different lengths, it becomes possible for the server to identify a subset of records, one of which is the ciphertext of a plaintext known to the attacker, simply by correlating the length of the encrypted records with the length of the plaintext (e.g., using the publicly available set of all restaurants).

One obvious solution is to pad each record with junk data until its size matches the length of the largest record. However, this results in an explosive growth in the size of the dataset. For instance, in our real-world dataset (refer to Section 5.3 for the characteristics of our four datasets), padding all the listDB records results in a 1700% increase in the size of the index. This is because the highly dense cells are vastly outnumbered by cells with many fewer objects (e.g., one cell containing over 200 objects vs. an average object/cell density of around 10).
Figure 3.4 illustrates the object density for each cell (represented by its $c_{id}$) for our experimental datasets. These graphs show the inefficiency of using a naive padding approach to secure data.

Figure 3.4: Record Size Distribution. (a) Real-World (b) Uniform (c) Highly Skewed (d) Sparse Uniform

In order to overcome this drawback, we devise a record decomposition and padding scheme which first breaks records larger than a certain cut-off threshold in order to reduce the size of the largest record. The key advantage of this approach is that by choosing the right cut-off value, the space and complexity overhead of padding can be significantly reduced. Figure 3.5 illustrates how our scheme works. Given a cut-off threshold $\theta$ and a record $r$ whose size is larger than $\theta$, we first recursively divide $r$ into two smaller records $r_1$ and $r_2$ where $|r_1| = \theta$. We refer to $r$ as the original record and to $r_1$ and $r_2$ as its decomposed records. To maintain the logical correspondence among decomposed records, every record is assigned a pointer $p$ which points to the index of its respective $r_2$ during the record decomposition and to an invalid dummy location otherwise (i.e., if $|r| \le \theta$). For each record, this process is recursively continued until $|r_2| \le \theta$. Next, all records smaller than $\theta$ are padded to make them equal in size. Figure 3.5a shows the object space and its corresponding listDB index is shown in Figure 3.5b. Using the above process, the index is first decomposed, then the links are added and finally the smaller cells are padded, as shown in Figure 3.5c.

There are two important issues to be addressed when using the decomposition technique discussed above. First, choosing the right cut-off threshold is of paramount importance. Note that $1 \le \theta \le \max(|r|)$, with $\theta = \max(|r|)$ representing the naive padding case, which results in a huge increase in the average record size. Also, $\theta = 1$ represents decomposing every single record recursively such that we end up with each record storing only one object.
Any value of $\theta$ in between these two extremes results in adding $n'_c$ more records as well as padding an equivalent of $n'_o$ objects (note that this padding does not actually add objects to the records; it instead adds dummy data at the end of each record, whose size is always a multiple of an object's size). The overhead incurred by the padding can then be denoted by $Overhead = n'_c t_c + n'_o t_o$. While the first term makes PIR more costly (more records are retrieved privately for the same query), the second term negatively affects the client-server communication cost (due to retrieving larger records for the same query). Therefore, based on the characteristics of the PIR technique and the overhead equation above, the value of $\theta$ that minimizes the overhead can be computed. In Section 5.3, we compare the overhead of our padding scheme with the naive padding technique.

Second, with our proposed padding technique, a new information leak concern may arise due to adding pointers between decomposed records. Hence, we need to ensure that the pointers connecting different pieces of an original record do not leak any information to an adversary. This is important because while querying any original record $r$ that is larger than the cut-off threshold, $r_1$ and $r_2$ will be requested sequentially. Fortunately, the read() function discussed in Section 3.2.1 guarantees perfect secrecy regardless of the sequence of the records being queried. In other words, any PIR algorithm is by design resilient to correlation attacks, and thus the attacker cannot learn any information from the sequence of records being retrieved from the untrusted server. Therefore, while record encryption protects content (i.e., objects and pointers), PIR protects access patterns from leaking any sensitive information to adversaries. In Section 3.5, we elaborate on how an original record is queried by recursively retrieving its decomposed elements.
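The decomposition and padding steps of this section can be sketched in Python as follows. This is an illustrative sketch only: record sizes are measured in objects, `None` serves as one object's worth of dummy data, decomposed pieces are appended at the end of the table, and the names `decompose_and_pad`, `reassemble`, `PAD` and `NO_LINK` are assumptions made for this example. The two returned counters correspond to $n'_c$ and $n'_o$.

```python
PAD = None     # placeholder for one object's worth of dummy data
NO_LINK = -1   # "invalid dummy location" for records without a continuation


def decompose_and_pad(records, cutoff):
    """Split records longer than cutoff, link the pieces, pad all to cutoff objects.

    Returns (table, extra_records, padded_slots); each table entry is
    (objects, pointer).
    """
    table = [None] * len(records)
    extra_records = padded_slots = 0
    for i, objs in enumerate(records):
        chunks = [objs[j:j + cutoff] for j in range(0, len(objs), cutoff)] or [[]]
        slot = i
        for n, chunk in enumerate(chunks):
            if n == len(chunks) - 1:
                link = NO_LINK
            else:
                table.append(None)  # reserve a slot for the next piece
                link = len(table) - 1
                extra_records += 1
            padded_slots += cutoff - len(chunk)
            table[slot] = (chunk + [PAD] * (cutoff - len(chunk)), link)
            slot = link
    return table, extra_records, padded_slots


def reassemble(table, i):
    """Follow the pointers from record i and return the original object list."""
    objs = []
    while i != NO_LINK:
        piece, i = table[i]
        objs.extend(o for o in piece if o is not PAD)
    return objs


records = [[1, 2, 3, 4, 5], [6], []]
table, extra, padded = decompose_and_pad(records, 2)
naive = decompose_and_pad(records, 5)  # cut-off equal to the largest record
```

On this toy input a cut-off of 2 adds two records and four dummy objects, whereas the naive case (cut-off 5) adds no records but nine dummy objects, which mirrors the trade-off between the two terms of the overhead equation.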
Figure 3.5: Record Decomposition and Padding. (a) Object Space (b) Database Records (c) Padded Records

3.5 A Sample Implementation

In Section 3.2, we discussed the notion of privately retrieving an object from a database using the read operation. Here, we detail how we implemented such a protocol using an almost optimal PIR scheme proposed in [Aso04, AF], which uses secure coprocessors as a trusted platform to execute an efficient PIR protocol.

3.5.1 Hardware-Based PIR

A Secure Coprocessor (SC) is a general purpose computer designed to meet rigorous security requirements that assure unobservable and unmolested running of the code residing on it, even in the physical presence of an adversary [SS00]. These devices are equipped with hardware cryptographic accelerators that enable efficient and fast implementation of cryptographic algorithms such as DES and RSA [SC]. Recent advances in hardware technology have enabled the successful implementation of several real-world applications, such as data mining [BAG+] and trusted web servers [JSM], on trusted computing environments. The use of secure coprocessors has also been proposed to increase the security of outsourced data and reduce the bandwidth requirements in the database-as-a-service model [MT]. In particular, two studies have designed and implemented privacy-enhancing features using IBM 4758 Secure Coprocessors. In [IS05], the authors implement a privacy-enhanced X.509 certificate directory to enhance client privacy while accessing server data. Similarly, [IS] details how IBM's secure coprocessors are used as trusted third parties to implement a secure function evaluation scheme.

Note that trusting a secure coprocessor is fundamentally different from trusting an anonymizer.
This is because, while anonymization-based approaches reduce the exposure of a user's query location by blurring it or making it indistinguishable, our PIR-based approach addresses an additional and orthogonal problem of query content privacy. In other words, regardless of the anonymization technique used, the server can still infer sensitive information about the querying location by monitoring the requested information. With our hardware-based PIR, however, the server is entirely blinded from learning any information from its communication with the secure coprocessor. Therefore, while both techniques need to rely on some entity to ensure privacy, anonymizers attempt to protect user privacy yet still leak information through access patterns, whereas in our approach secure coprocessors fully protect sensitive user location information by addressing both user (i.e., query) and access (i.e., response) privacy.

Trusting SC is also substantially different from trusting a location server in several respects. Aside from being built as a tamper-resistant device, secure coprocessors are specifically programmed to perform a given task, while location servers consist of a variety of applications using a shared memory. Therefore, unlike the secure coprocessor, for which the users only have to trust the designer, using a location server requires users to trust the server admin and all applications running on it, as well as its designer. Last but not least, in our setting the secure coprocessor is mainly a computing device that receives its necessary information per session from the server, as opposed to a server which both stores location information and processes spatial queries.

The idea behind using SC is to place a trusted entity as close as possible to the untrusted host to disguise the selection of desired records within a black box.
In order to avoid the linear cost of going through each record in the host or sending the entire dataset to the user (i.e., $O(n)$ computation and communication cost, respectively), we use the technique proposed by Asonov et al. [AF] to achieve optimal (i.e., constant) query computation and communication complexity at the cost of performing as much offline precomputation as possible. Since [AF] uses shuffling techniques, we first offer a brief overview of how shuffling can be efficiently performed.

Definition 2. Random Permutation: For a database $DB$ of $n$ items, the random permutation $\pi$ transforms $DB$ into $DB_\pi$ such that $DB[i] = DB_\pi[\pi[i]]$.

For example, for $DB = \{o_1, o_2, o_3\}$ and $DB_\pi = \{o_3, o_1, o_2\}$, the permutation $\pi$ represents the mapping $\pi = \{2, 3, 1\}$. Therefore $DB[1] = DB_\pi[\pi[1]] = DB_\pi[2] = o_1$, $DB[3] = DB_\pi[\pi[3]] = DB_\pi[1] = o_3$, etc. It is easy to verify that the minimum space required to store a permutation of $n$ records is $O(n \log n)$ bits.

The basic idea behind using a secure coprocessor is to use $\pi$ to privately shuffle and then encrypt the items of the entire dataset $DB$. While this encrypted shuffled dataset $DB_\pi$ is written back to the server, SC keeps $\pi$ for itself. Later, a user interested in the $i$th element of $DB$ encrypts his query using SC's public key and sends it to SC through a secure channel. SC can then retrieve and decrypt $DB_\pi[\pi[i]]$, re-encrypt it with the user's public key and send it back (hereinafter we distinguish between a queried item, which is the item of interest requested by the user, and a retrieved/read record, which is the item SC reads from $DB_\pi$). Although the server is blindly retrieving an encrypted record and returning it to SC, the scheme is not yet private. By launching a chosen-plaintext attack, the untrusted dataset owner can still learn whether queries $q$ and $q'$ asked for the same item or not. In order to avoid this problem, SC maintains a list $L$ which contains the indices of all items retrieved so far.
SC also caches the records retrieved since the beginning of each session. In order to answer the $k$th query, SC first searches its cache. If the item does not exist in the cache, SC retrieves $DB_\pi[\pi[k]]$, stores it in its cache and adds $k$ to $L$. However, if the element is already cached, SC randomly reads a record not present in its cache and caches it. With this approach, each record of the database is read at most once regardless of what items are being queried. This way, an adversary monitoring the database reads can obtain no information about the record being retrieved. Once each record is retrieved and decrypted, we check whether it points to another record in the database (this happens if a decomposed record is retrieved), in which case we recursively call the read() function to retrieve the next record.

The problem with the above approach is that after $T_{threshold}$ retrievals, SC's cache becomes full. At this time a reshuffling is performed on $DB_\pi$, which clears the cache and $L$. Note that since $T_{threshold}$ is a constant independent of $n$, query computation and communication costs remain constant if several instances of reshuffled datasets are created offline [AF]. Alternatively, the shuffling can be performed regularly on the fly, which makes the query processing complexity equal to the complexity of the recurring reshuffling, which takes $O(n)$ for a database of size $n$ [WDDB]. Algorithm 9 details how the read operation is performed privately. Note that between the shufflings, it reads a different record per query and thus ensures each record is accessed at most once.

We have now developed the necessary operations behind a read request of the form $read(DB_\pi, i)$. All details regarding the shuffling and the permutation are hidden from the entity interacting with $DB_\pi$. The average time it takes to privately retrieve a record from $DB_\pi$ can be written as Equation 3.6.
We assume a single SC and no use of parallelism (if multi-threading is used in SC or if more than one SC is available, $t_{read} \approx 0$ since an unused permuted and encrypted instance of $DB$ can always be made available). Note that $t_{read}$ is inversely proportional to $T_{threshold}$.

$$ t_{read} = O\left( \frac{n}{T_{threshold}} \right) \tag{3.6} $$

Algorithm 9 read(DB_π, i)
Require: DB_π, T {Threshold}, L {Retrieved Items}
  if (|L| ≥ T) then
    DB_π ← Reshuffle DB using a new random permutation π;
    L ← ∅;
    Clear SC's cache
  end if
  if i ∉ L then
    record ← DB_π[π[i]];
    Add record to SC's cache
    L = L ∪ {i};
    if valid(record.p) then
      read(DB_π, record.p);
    end if
  else
    Read record from SC's cache
    if valid(record.p) then
      read(DB_π, record.p);
    end if
    r ← random index from DB_π \ L;
    temp ← DB_π[π[r]];
    Add temp to SC's cache
    L = L ∪ {r};
  end if
  return record;

Theorem 2. Algorithm 9 does not leak information.

PROOF: Proved by Asonov [Aso04] and Wang et al. [WDDB], who show that the joint entropy of a sequence of queries is maximal.

During the execution of the read operation, we might reach $T_{threshold}$, the maximum number of elements allowed to be retrieved before shuffling (Section 3.2.1), in which case either swapping to an unused instance of $DB_\pi$ is required or reshuffling is performed online to generate a new instance before the range and KNN algorithms can continue.

We can now discuss how the PIR scheme (Section 3.5.1) and private spatial queries (Section 3.3) are integrated. The entire query processing can be divided into the following phases.

3.5.2 Preprocessing

During this phase, the server first creates the listDB, countDB or hilbDB index (while listDB is required by range queries, any of the above indexing schemes can be chosen depending on the KNN algorithm of choice). To simplify notation, in this section we denote them as $DB^1$, $DB^2$ and $DB^3$, respectively. Next, SC generates a random permutation $\pi$, privately shuffles the above indexes and encrypts them.
The encrypted shuffled databases $DB^1_\pi$, $DB^2_\pi$ and $DB^3_\pi$ are then written back to the server. Note that depending on the storage availability of SC, several instances of $DB^i_\pi$ can be created offline and stored at the server as long as their corresponding permutation indices are kept in SC. The server might try to deviate from the protocol by manipulating the construction of the permuted databases. However, Theorem 2 states that the server cannot infer any patterns by monitoring retrievals. Also, any deviation from the protocol is quickly noticed by the users, who will abandon the protocol immediately.

3.5.3 Query Processing

During the query processing phase, users first establish a secure channel with SC through an SSL tunnel and submit their queries to SC. A query $q$ initiated by a user $u_i$ is received by SC as $Enc_{spk}(q, s_{id})$, where $q$ is one of the algorithms discussed in Section 3.3 and $s_{id}$ is the session id for the communication between SC and the user. Using $Enc$, each user encrypts his query with SC's public key $spk$ and sends it along with his $s_{id}$. Depending on the value of $q$, SC invokes one of the algorithms discussed in Section 3.3. While executing the queries, all of SC's interaction with the server is via the private read() requests, by which SC privately retrieves a sequence of encrypted records from $DB^1_\pi$, $DB^2_\pi$ or $DB^3_\pi$ (depending on the type of query and the KNN processing algorithm used) that contains the result set for $q$. Each member of the result set is first decrypted by SC and then re-encrypted with the user's public key, and the compiled result set is transferred to the user (to thwart replay attacks, SC binds a nonce to the result set). Note that the result set $R'$ returned by each algorithm might include some extra objects. Although SC can remove them before sending the final results to the user, we assume the filtering step is performed by the user, mainly to reduce SC's computation overhead.
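To make the private read() requests issued during query processing more concrete, the following toy Python model mimics the shuffle-and-cache behavior of Algorithm 9. It is a sketch under strong simplifying assumptions and not the protocol implementation: there is no encryption, indices are 0-based, the plain copy kept in `self.plain` stands in for whatever SC re-reads in order to reshuffle, pointers between decomposed records are not followed, and the class name `SecureCoprocessorSim` and its `server_log` attribute are hypothetical.

```python
import random


class SecureCoprocessorSim:
    """Toy model of the shuffle-and-cache read protocol (no encryption)."""

    def __init__(self, db, threshold, seed=None):
        assert 1 <= threshold <= len(db)
        self.plain = list(db)      # stands in for what SC re-reads to reshuffle
        self.threshold = threshold
        self.rng = random.Random(seed)
        self.server_log = []       # what the untrusted server observes
        self._reshuffle()

    def _reshuffle(self):
        n = len(self.plain)
        self.perm = list(range(n))
        self.rng.shuffle(self.perm)
        self.shuffled = [None] * n
        for i, record in enumerate(self.plain):
            self.shuffled[self.perm[i]] = record  # DB[i] == DB_pi[pi[i]]
        self.cache = {}            # index -> record; its keys play the role of L
        self.server_log.append("reshuffle")

    def _fetch(self, i):
        self.server_log.append(self.perm[i])      # server sees only this position
        self.cache[i] = self.shuffled[self.perm[i]]

    def read(self, i):
        if len(self.cache) >= self.threshold:
            self._reshuffle()
        if i not in self.cache:
            self._fetch(i)
        else:                                     # cached: fetch an unread dummy
            unread = [j for j in range(len(self.plain)) if j not in self.cache]
            self._fetch(self.rng.choice(unread))
        return self.cache[i]


sc = SecureCoprocessorSim(list("abcdefgh"), threshold=4, seed=1)
answers = [sc.read(i) for i in [2, 2, 2, 5, 2, 2]]
```

Even though the same item is requested repeatedly, every read causes exactly one retrieval from the shuffled database, and no position is retrieved twice between two reshuffles, which is the property the server-side observer would otherwise exploit.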
The main success factor behind this technique is the anonymization of user queries by converting them into a sequence of PIR reads from the server, all performed by the secure coprocessor. Therefore, the server is no longer able to trace database records back to separate user queries. In other words, by breaking the correspondence between user queries and database reads, and detaching user identities from their queries (through SC), the server cannot infer any information about the query (e.g., size of $R$ or value of $K$). This is critical to hinder adversaries from learning user locations by gaining information about their query parameters. Note that a conventional PIR scheme would have to blindly query the entire database for each grid cell being queried. Our main motivation behind designing the private index structures in Section 3.3.1 is to avoid this processing cost.

Using the above framework, range and KNN queries are blindly evaluated. This is because SC is the only entity interacting with the server and no user identity information is passed to the server; therefore, from the server's point of view, the query could have been initiated by any $u_j$ for $j \in \{1 \ldots m\}$. Furthermore, according to the query processing logic, SC decomposes $q$ into a set of private read() requests for the records (cells) $r_1, r_2, r_3, \ldots, r_s$ from the server's encrypted and shuffled databases. Note that the $r_i$'s are highly correlated to $q$ and that read($DB_\pi$, $i$) operations are monitored by the server. However, Theorem 2 guarantees that read($DB_\pi$, $i$) leaks no information about $i$, $r_i$ and thus about $q$. Therefore, no location information is revealed to the server through cell retrievals.

3.6 Summary

In this chapter, we proposed a framework to evaluate private location-dependent queries utilizing the techniques developed in the private information retrieval literature.
Using PIR, we satisfy more stringent user privacy guarantees compared to anonymity and cloaking-based approaches by blinding the server from learning any information about user locations as well as the content of their queries. We carried out extensive sets of experiments to empirically verify the effectiveness of our proposed approach (see Chapter 5).

Chapter 4
Oblivious Index Traversal

4.1 Introduction

The ever increasing demand for querying spatial data through location-based applications has dramatically increased the popularity of location-based services (LBS). With LBS, location data such as points of interest are hosted at a location server LS and queried by users $U = \{u_1, \ldots, u_s\}$ who subscribe to the service. The location server LS (or server for short) usually employs a tree-structured spatial indexing scheme such as a kd-tree or an R-tree [Gut] to index data. However, as we mentioned in Chapter 1, LS is potentially untrusted and therefore a common concern in LBS is achieving privacy. User queries are usually location-dependent and therefore the focus is on achieving user location privacy. Protecting the location information embedded in user queries as well as the data accessed to respond to user queries are the key objectives in the location privacy domain. The second goal is also referred to as content privacy [SS01].

With many proposed solutions for privacy-aware processing of spatial queries, encryption has been employed as the de facto solution to achieve content privacy. However, even with encryption, severe information leakage to LS can happen by revealing how the underlying spatial index structure is accessed. Consider the following scenario. Assume $N = \{o_1, o_2, \ldots, o_n\}$ static objects (e.g., restaurants) are first encrypted and then indexed by a tree-structured index such as an R-tree $R$, which is then stored at LS and queried by a subscriber $u \in U$.
Clearly, each $o_i$ and hence its enclosing nodes (i.e., MBRs) have a certain level of "popularity" represented by how frequently the indexed objects are requested by users. Encrypting the contents of each MBR, however, does not affect how frequently it is being accessed. Therefore, LS can associate the immediate MBR (and potentially all its ancestor MBRs) indexing the most popular object $o_i$ with the most frequently queried node in $R$ with high probability. Exploiting its prior knowledge of $o_i$'s popularity (e.g., the location of the hottest restaurant in LA) or of how objects in $N$ are distributed, LS can guess $u$'s location with good enough accuracy. Protecting information about how frequently nodes are accessed is also referred to as access privacy [SS01]. It is known that without access privacy, content privacy is not fully achievable [LC04] and hence, to protect user privacy, both content and access privacy are required at the same time.

Achieving access privacy by preventing adversaries from learning any sensitive information from the patterns of accessing data is not trivial. The private information retrieval techniques aim to achieve access privacy by entirely blinding the server from learning any information about what records are being accessed and hence how frequently they are requested. However, as we pointed out in Chapter 3, PIR is very costly. Using information-theoretic PIR, one can achieve perfect secrecy at the cost of a linear client-server communication [CKGS98]. To reduce this prohibitively expensive cost, computational PIR schemes bring the communication cost down to $O(\sqrt{n})$ by assuming a computationally bounded server, where the security of the approaches relies on the intractability of a computationally complex mathematical problem, such as the Quadratic Residuosity Assumption [KO]. However, similar to information-theoretic PIR, this class of approaches cannot avoid a linear scan of all database items per query.
Finally, the class of hardware-based PIR approaches places trust in a tamper-resistant hardware device [Aso04]. These techniques achieve almost optimal computation and communication overhead at the cost of relying on a hardware device (with severely limited computing and storage resources) to provide perfect secrecy. Therefore, the PIR-based approaches to location privacy [KSSM10, GKK+] are still relatively costly to employ in practice despite achieving strong user confidentiality.

In this chapter we protect access privacy by proposing two novel techniques with significantly less cost than the PIR approaches mentioned above while still achieving strong measures of user, content and access privacy. Both of our schemes are based on the observation that if access privacy is achieved by preventing information leakage from the patterns of accessing index nodes, encryption and client-side query processing are enough to achieve content and user location privacy, respectively. Therefore, our contribution is proposing two schemes to achieve access privacy. Our first technique flattens node access frequencies for each tree level under a practical and relaxed assumption about the variation between node access frequencies, and our second technique employs a permutation scheme to disguise the original assignments of elements to tree nodes while preserving the tree structure. Obviously, to achieve privacy, both methods incur extra computation, communication and storage overhead compared to processing queries on the original index. However, we analytically and experimentally verify that the costs remain acceptable for LBS service providers and subscribers.

Our proposed techniques are query-independent. In other words, they manipulate the underlying structure of the tree and abstract away the details of tree navigation from the query processing module.
Therefore, they require minimal extra computation or storage overhead at the client side, only requiring encryption and decryption of nodes. In Section 5.4 we experimentally study the performance of our approaches using two real-world datasets with an actual client/server setup communicating over the network. Our empirical evaluations verify the effectiveness of our proposed techniques in protecting the node access frequencies while incurring acceptable overhead in navigating the index. The overall end-to-end response time for processing user queries in all experiments allows processing of multiple queries per second.

4.2 Preliminaries

To serve data efficiently, LS uses a tree-like index such as an R-tree $R$ to index objects in $N$. Without loss of generality and to simplify discussion, we focus our attention on R-trees due to their popularity for indexing static spatial data. Since all objects are available during the offline index (i.e., $R$) construction, we also assume all tree nodes are filled with data. We relax both of these assumptions in Section 4.5 and show how similar techniques can be applied to R-trees with partially full nodes and to other tree-structured spatial indexes such as kd-trees and quadtrees. Assuming $R$'s capacity is $c$, each node contains $c$ children represented by $m_1 \ldots m_c$, where each $m_i$ is itself an MBR of $c$ objects (by using the term "objects" hereafter, we refer to internal elements of each node, which could be MBRs of lower nodes or the actual POI data for leaf nodes; to avoid confusion, we explicitly use "leaf objects" to refer to the latter case). We use the notation $N_{i,j}$ to refer to the $j$-th node of $R$ at depth $i \in \{1, \ldots, h\}$. Figure 4.1 illustrates such a tree, where leaf nodes are represented by $N_{h,j}$ indexing $o_1 \ldots o_n$, represented by their location and identifiers. Table 4.1 summarizes the notations used throughout the chapter.
Using its prior knowledge of past queries, LS, which hosts $R$, knows how frequently internal and leaf nodes of $R$ are requested. We use $f(N_{i,j})$, or its short form $f_{i,j}$, to show the normalized ($0 \le f_{i,j} \le 1$) access frequency of the node $N_{i,j}$ over $T$ queries. We also assume the strongest adversarial case, where LS is aware of the spatial distribution of elements in $N$.

Figure 4.1: R-tree

To achieve privacy, we replace $R$ with its privacy-aware variant $R'$ and have LS serve queries using $R'$, whose nodes and their access frequencies are denoted by $N'_{i,j}$ and $f'_{i,j}$, respectively. We assume the trend in which objects are queried (i.e., their popularity) does not vary significantly with time and in particular after replacing $R$ with $R'$. Each $N'_{i,j}$ in $R'$ is assigned a node identifier $id = H(i,j)$ computed as a one-way hash function of $i, j$ (e.g., SHA512) and is encrypted with a private key inaccessible to LS to protect content privacy. Similar to [LC04, LC05, YGJK], we assume the secret key is shared by subscribers of LS to decrypt R-tree nodes. To avoid collusion attacks, the cryptographic operations can be performed with the assistance of inexpensive smartcards placed in subscribers' client devices [BP].

Since nodes are encrypted, LS cannot traverse $R'$ and tree navigation becomes an interactive scheme between the user $u$ and the server. To perform any spatial query such as range or KNN, $u$ privately requests a series of nodes $N_{i,j}$ chosen based on the query processing logic. At each step, $u$ first requests the next node $N'_{i,j}$ by its $id = H(i,j)$. After decrypting $N'_{i,j}$, $u$ identifies one of the node's children $N'_{i+1,j'}$ for further expansion and sends a subsequent request to the server for the node represented by $id = H(i+1, j')$.
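The interactive navigation just described can be sketched as follows. This is an illustration under several assumptions that are not part of the protocol text: MBRs are reduced to one-dimensional intervals, the tree is complete so that the $k$-th child of $N_{i,j}$ is $N_{i+1,(j-1)c+k}$, and the XOR routine is a stand-in for a real cipher (it is not secure). The names `build_server`, `lookup`, `seal` and `KEY` are hypothetical. The server only ever sees hashed identifiers and opaque blobs.

```python
import hashlib
import json

KEY = b"shared-subscriber-key"  # hypothetical key shared by subscribers


def node_id(depth, index):
    """id = H(i, j): one-way hash of a node's depth and position (SHA-512)."""
    return hashlib.sha512(f"{depth},{index}".encode()).hexdigest()


def xor_cipher(data: bytes, key: bytes = KEY) -> bytes:
    """Placeholder cipher for illustration only; NOT secure."""
    return bytes(b ^ key[k % len(key)] for k, b in enumerate(data))


def seal(obj):
    return xor_cipher(json.dumps(obj).encode())


def unseal(blob):
    return json.loads(xor_cipher(blob).decode())


def build_server(c=2, h=3, span=8):
    """Server-side store: hashed node id -> encrypted list of child intervals."""
    store = {}
    for i in range(1, h + 1):
        width = span // c ** (i - 1)          # interval covered by a depth-i node
        for j in range(1, c ** (i - 1) + 1):
            lo, step = (j - 1) * width, width // c
            children = [[lo + k * step, lo + (k + 1) * step] for k in range(c)]
            store[node_id(i, j)] = seal(children)
    return store


def lookup(store, x, c=2, h=3):
    """Client side: request a node by id, decrypt it, pick a child, repeat."""
    j, requested = 1, []
    for i in range(1, h + 1):
        nid = node_id(i, j)
        requested.append(nid)                 # all the server learns is this id
        children = unseal(store[nid])
        k = next(n for n, (lo, hi) in enumerate(children) if lo <= x and hi > x)
        j = (j - 1) * c + k + 1               # position of the chosen child
    return children[k], requested
```

Note that the identifiers alone do not hide how often each node is requested, which is the leakage the remainder of the chapter addresses.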
This read, decrypt, encrypt and request process is repeated by $u$ until the leaf nodes (likely) containing the query result are retrieved. If several MBRs at level $i$ intersect with a query, they are each retrieved separately using the scheme discussed above. Note that encrypting each MBR in a node, as opposed to encrypting the whole node data, would result in an information leakage by exposing the ordering among the elements of each node.

Table 4.1: Notations and Symbols
Symbol | Meaning
$N$ | POI data
$R$ | R-tree
$R'$ | Privacy-aware R-tree
$N_{i,j}$ ($N'_{i,j}$) | $j$-th node of $R$ ($R'$) at depth $i$
$c$ | node capacity
$H$ | One-way hash
$RS$ | query result set
$m_1 \ldots m_c$ | $N_{i,j}$'s children, each an MBR
$f_{i,j}$ ($f'_{i,j}$) | normalized access frequency of $N_{i,j}$ ($N'_{i,j}$)
$F_i$ ($F'_i$) | histogram of $f_{i,j}$ ($f'_{i,j}$) values
$\bar{F}_i$ ($\bar{F}'_i$) | average $f_{i,j}$ ($f'_{i,j}$) values at depth $i$
$N_{i,j}.m_k \to N_{i,j'}$ | assigning $m_k$ to $N_{i,j'}$
$r_{i,k \to i,k'}$ | probability of including $N_{i,k'}$ in $N_{i,k}$'s $RS$
$S_i$ | sequence of siblings at depth $i$

Since both $R$ and $R'$ are hosted by the location server LS, it knows the $f_{i,j}$ values while serving $R$ and later learns the $f'_{i,j}$ values by serving $R'$. The location server is not trustworthy and is curious to exploit this knowledge to infer user locations from $R$ and $R'$ node access frequencies. To do this, LS employs a frequency variation attack. That is, the server tries to correlate the $f_{i,j}$ and $f'_{i,j}$ values, or exploits the variations among $f'_{i,j}$ values in $R'$, and combines this with its prior knowledge to infer the contents of $N'_{i,j}$. Our two approaches presented in Sections 4.3.1 and 4.3.2 show how we generate a privacy-aware tree $R'$ whose $f'_{i,j}$ values are meaningless to LS, thus achieving user location privacy.

4.3 Obfuscating Access Frequencies

Consider the scenario where LS hosts $N$ and the client (we hereafter use the terms client and user interchangeably) $u$ forms a query $Q$ to find a nearby object $o_i$.
As we discussed in Section 4.1, one extreme solution (in terms of both privacy and efficiency) is for $u$ to use PIR to privately navigate the index structure hosted by LS, preventing LS from learning $u$'s location. While remaining perfectly secret, this approach is very expensive (see Section 6.3). At the other extreme, K-anonymity can be used to confuse LS by sending it $K$ queries to make $u$ indistinguishable among a redundancy set of size $K$. Although efficient, this technique offers significantly weaker privacy guarantees since, in the best case, it makes $u$ indistinguishable among a usually small set of $K-1$ other users. Moreover, recent studies [GKK+, YJHLa, KS, KSSM10] show how sophisticated attacks can be mounted against such schemes.

We strike a compromise by utilizing the hierarchical nature of tree-structured spatial indexes to enable efficient yet oblivious traversal of the tree by making node access frequencies meaningless to the untrusted server. With both techniques, the original R-tree $R$ is replaced by its privacy-aware variant $R'$ such that the index navigation is performed on $R'$ nodes denoted by $N'_{i,j}$. To process $Q$, $u$ interactively requests a series of nodes from LS that allow him to find $o_i$. We show that with both of our approaches, the server does not learn the frequency (and hence content) of nodes accessed to execute $Q$.

The basic idea behind both of our proposed schemes is to perform redundant reads from the server for each node requested by $u$. We obfuscate node access frequencies by grouping "less popular" nodes with more frequently accessed ones in a redundancy set $RS$. With our first technique, we compute for each node $N'_{i,j}$ a probability distribution function stored at its parent node. This function instructs the client to form a redundancy set $RS$ which includes $N'_{i,j}$ as well as another node of the same height chosen according to a certain precomputed probability function to make node access entirely uniform.
Although secure and efficient for query processing, as we show in Section 4.3.1, this approach has certain limitations when there are abnormal variations between access frequencies among the nodes of the same tree depth. Our second scheme takes $R$ and permutes the assignments of internal children (i.e., the $m_i$'s in Figure 4.1) to nodes at each tree depth to make node access frequency "very close" to a uniform access. We preserve the tree structure by keeping extra information at parents to maintain the child-parent relationship. To navigate the tree and request a node $N'_{i,j}$, a redundancy set is formed according to the offline permutation scheme, including all nodes that contain one of $N_{i,j}$'s original children. As we discuss in Section 4.3.2, the latter approach relaxes the frequency constraint of the first approach at the cost of replacing a perfectly uniform access frequency distribution with a nearly uniform one. More formally, with both approaches our goal is to break the correlation between the original node access frequencies $\{f_{i,1}, f_{i,2}, \ldots, f_{i,c^{i-1}}\}$ and the modified access frequencies of $R'$ nodes $\{f'_{i,1}, f'_{i,2}, \ldots, f'_{i,c^{i-1}}\}$ for $i \in \{1 \ldots h\}$ by making node access frequencies uniform.

To achieve uniformity in node access frequencies, one can think of creating a redundancy set $RS$ for each $N'_{i,j}$ access in $R'$ with capacity $c$. In particular, each $RS$ would include the original node requested as well as $K$ other elements chosen uniformly at random from the $c^{i-1}$ nodes with the same height to protect node access frequencies. In the following lemma, we show that this naive protocol does not prevent information leakage because the resulting node access frequencies $f'_{i,j}$ are not uniform and are in fact highly correlated with the $f_{i,j}$ values. More formally, we show that the histogram of access frequencies would uniformly increase for all nodes and thus the new frequency histogram will not be uniform, since the original frequencies are not uniform.

Lemma 1.
Using the above naive scheme, $f'_{i,j} = \alpha + \beta f_{i,j}$ where $\alpha = p_i \sum_{1 \le k \le c^{i-1}} f_{i,k}$ and $\beta = 1 - p_i$ for $p_i = \frac{K}{c^{i-1} - 1}$.

Proof: By randomly choosing elements of $RS$, at depth $i$, each $N_{i,k}$, $k \ne j$, has a $p_i = \frac{K}{c^{i-1} - 1}$ chance of being included in $RS$. Conversely, each request for another node $N_{i,k}$ includes $N_{i,j}$ in its redundancy set with probability $p_i$, and therefore adds $f_{i,k}$ to $f'_{i,j}$ with probability $p_i$. In general,

\[ f'_{i,j} = f_{i,j} + p_i \sum_{k \ne j} f_{i,k} \tag{4.1} \]
\[ f'_{i,j} = f_{i,j} + p_i \left( \sum_{1 \le k \le c^{i-1}} f_{i,k} - f_{i,j} \right) \tag{4.2} \]
\[ f'_{i,j} = (1 - p_i)\, f_{i,j} + p_i \sum_{1 \le k \le c^{i-1}} f_{i,k} \tag{4.3} \]
\[ f'_{i,j} = \alpha + \beta f_{i,j} \tag{4.4} \]

Corollary: For $N'_{i,j_1}, N'_{i,j_2}$: $f'_{i,j_1} - f'_{i,j_2} = \beta (f_{i,j_1} - f_{i,j_2})$.

The above observation states that members of the redundancy set cannot be uniformly picked if the original nodes are accessed at different frequencies. Therefore, we need to take the values of $f_{i,j}$ into account while constructing the redundancy set. In the following sections, we present two protocols to achieve this goal.

4.3.1 Probabilistic Uniform Node Access

In this section, we present our first technique to achieve access privacy. Consider the tree of Figure 4.5, where each $N_{i-1,j}$ access results in a subsequent request for one of the nodes in the next level of the tree (i.e., $i$) at different frequencies. Our goal is to add one redundant node to each original node request in such a way that the overall node access frequencies $f'_{i,j}$ for all nodes at depth $i$ become equal (hereafter, we refer to this criterion as the uniform node access frequency). To achieve this, we define for each internal node $N_{i-1,j}$ a probability table $pt_{i-1,j}$ whose values are of the form $r_{i,j \to i,k}$, denoting the probability of accessing each $N_{i,k}$ whenever $N_{i,j}$ is originally requested. In other words, for each $N_{i,j}$ request at level $i-1$, one $N_{i,k}$ is added to a redundancy set $RS$ in a random order according to the $r_{i,j \to i,k}$ values in $pt_{i-1,j}$.
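Drawing the redundant node from a probability table row can be sketched as below. This is one possible realization, assuming the row for the requested node is available as a mapping from sibling index $k$ to $r_{i,j \to i,k}$ and that its entries sum to 1; the function names `pick_redundant` and `redundancy_set` are hypothetical. A value $x \in (0, 1]$ is compared against the running sums of the row, and the requested and redundant nodes are returned in random order so that their positions in the request reveal nothing.

```python
import random
from itertools import accumulate


def pick_redundant(row, x):
    """row: {sibling index k: r_{i,j -> i,k}}, assumed to sum to 1; x in (0, 1]."""
    keys = sorted(row)
    bounds = list(accumulate(row[k] for k in keys))   # running sums of the row
    for k, upper in zip(keys, bounds):
        if upper >= x:
            return k
    return keys[-1]   # guard against floating-point round-off in the last bound


def redundancy_set(j, row, rng=random):
    """Return the requested node j and one redundant sibling, in random order."""
    x = 1.0 - rng.random()          # random() lies in [0, 1), so x lies in (0, 1]
    pair = [j, pick_redundant(row, x)]
    rng.shuffle(pair)
    return pair
```

Because $x$ is drawn afresh for every request, the same requested node can be paired with different redundant nodes over time.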
In order to enforce one and only one redundant read per node access, we require that $\sum_{k \in S_i \setminus \{j\}} r_{i,j \to i,k} = 1$, which means each node $N_{i,j}$ is paired with one $N_{i,k}$ as its redundant read with probability $r_{i,j \to i,k}$. Figure 4.2 illustrates how a redundant node is chosen. Each node picks a random variable $x \in (0, 1]$ and drops it into the large rectangle whose length is 1. The small rectangle containing $x$ identifies the redundant node to be read. Note that even for a fixed node $N_{i,j}$, redundant nodes can be different each time, as the value of $x$ is determined randomly for each $N_{i,j}$ request.

To store each $pt_{i-1,j}$ for an internal node, we form a $c \times S_i$ matrix where $S_i = \{1, \ldots, c^{i-1}\}$ and each (encrypted) entry represents $r_{i,j \to i,k}$. In order to maintain the original capacity (fan-out) of the tree structure, we store these tables in a separate data structure where each $pt_{i-1,j}$ is identified by $N_{i-1,j}$'s id. These probability tables result in a storage overhead to prevent leakage from the pattern in which the index is traversed. With our implementation, maintaining these tables incurred a 14% to 24% increase (14MB to 23MB for a dataset of 94MB) in the total amount of space needed to store the index for different values of $c$.

We proceed to the offline calculation of the probability tables. Since the size of the redundancy set $RS$ is always 2, using the above scheme each node access adds to the access frequency of one and exactly one other node included in its $RS$. To compute the modified node access frequency $f'_{i,j}$, we add to its original access frequency $f_{i,j}$ the weighted probabilistic frequency contributions of all other nodes of depth $i$. This contribution is a function of each node's popularity, as well as the probability of that node picking $N_{i,j}$ as its redundant node. More formally, performing the above scheme increases $f_{i,j}$ to $f'_{i,j}$ for $j \in \{1, \ldots, c^{i-1}\}$:

Figure 4.2: Probability Contributions of Node $N'_{i,3}$

Lemma 2.
The new access frequency of a node $N_{i,j}$ will increase by the sum of all other nodes' probabilistic contributions to $N_{i,j}$ at depth $i$. Or, $f'_{i,j} = f_{i,j} + \sum_{k \ne j} f_{i,k}\, r_{i,k \to i,j}$.

Using Lemma 2, we now prove the following property.

Theorem 3. The above scheme increases the sum of access frequencies at each depth by a factor of 2. Or, $\sum_{j \in S_i} f'_{i,j} = 2 \sum_{j \in S_i} f_{i,j}$.

Proof: Using Lemma 2 and setting $\sum_{k \in S_i \setminus \{j\}} r_{i,j \to i,k} = 1$, we write:

\[ \sum_{j \in S_i} f'_{i,j} = \sum_{j \in S_i} f_{i,j} + \sum_{j \in S_i} \Big( f_{i,j} \sum_{k \in S_i \setminus \{j\}} r_{i,j \to i,k} \Big) \tag{4.5} \]
\[ \sum_{j \in S_i} f'_{i,j} = \sum_{j \in S_i} f_{i,j} + \sum_{j \in S_i} \{ f_{i,j} \times 1 \} = 2 \sum_{j \in S_i} f_{i,j} \tag{4.6} \]

To clarify, let us take the following examples for the simple case of a tree with height $h = 2$. For $c = 2$ (Figure 4.3a), this scheme is straightforward. We have $f'_{2,1} = f_{2,1} + f_{2,2}\, r_{2,2 \to 2,1} = f_{2,1} + f_{2,2} = 1$. Similarly, $f'_{2,2} = 1$. In other words, each request for one node includes the other with probability 1. Therefore, the probabilities of accessing both nodes are equal. For higher values of $c$ such as $c = 3$, the case is slightly more complicated (see Figure 4.3b).

Figure 4.3: Examples. (a) $h = 2$, $c = 2$; (b) $h = 2$, $c = 3$

To obtain new node access frequencies, we need to solve the following system of equations.

\[ \left\{ \begin{array}{l}
f'_{2,1} = f_{2,1} + f_{2,2}\, r_{2,2 \to 2,1} + f_{2,3}\, r_{2,3 \to 2,1} \\
f'_{2,2} = f_{2,2} + f_{2,1}\, r_{2,1 \to 2,2} + f_{2,3}\, r_{2,3 \to 2,2} \\
f'_{2,3} = f_{2,3} + f_{2,1}\, r_{2,1 \to 2,3} + f_{2,2}\, r_{2,2 \to 2,3} \\
r_{2,1 \to 2,2} + r_{2,1 \to 2,3} = 1 \\
r_{2,2 \to 2,1} + r_{2,2 \to 2,3} = 1 \\
r_{2,3 \to 2,2} + r_{2,3 \to 2,1} = 1
\end{array} \right\} \tag{4.7} \]

The above system has 9 unknowns (i.e., 3 new $f'$ and 6 new $r_{i,j \to i,k}$ unknowns for $k \in S_i \setminus \{j\}$) and 6 equations. However, a closer look at the protocol gives us the remaining information required to solve the above system. Based on our objective of equalizing modified node access frequencies, we have $f'_{2,1} = f'_{2,2} = f'_{2,3}$. Using Theorem 3, $f'_{2,1} + f'_{2,2} + f'_{2,3} = 2$.
This property gives us three fewer unknowns:

\[ \left\{ \begin{array}{l}
f'_{2,1} = \frac{2}{3} \\
f'_{2,2} = \frac{2}{3} \\
f'_{2,3} = \frac{2}{3}
\end{array} \right\} \tag{4.8} \]

Therefore, for the general case, in order to achieve uniform access for each tree depth $i$ we set $f'_{i,j} = \frac{2 \sum_{j \in S_i} f_{i,j}}{|S_i|}$ (observe that $|S_i|$ is equal to the number of nodes at level $i$) and solve a linear system of $e$ equations and $e$ unknowns offline. The unknowns are the $|S_i| - 1$ probability contributions for each of the $|S_i|$ nodes. Thus, $e = |S_i| \times (|S_i| - 1) = |S_i|^2 - |S_i|$.

To execute a query $Q$, at each step the user $u$ receives the encrypted original ($N'_{i,j}$) and redundant ($N'_{i,j'}$) MBRs along with their probability tables. Next, $u$ discards $N'_{i,j'}$ and its table (which needs to be transferred to $u$ to prevent LS from identifying the redundant MBR), decrypts $N'_{i,j}$, picks the next MBR from level $i+1$ of $N'_{i,j}$ to be expanded, and uses the probability table to pick the redundant MBR for the next original node request. This process is repeated for every node $u$ requests as part of processing $Q$. Note that even for the first query at time $T = t_0$, the likelihood of any two nodes being included in the user request is $1 / \binom{|S_i|}{2}$. Let us elaborate a bit more on this topic.

The probability contributions of each node (derived from solving the system of equations for each probability table) determine which redundant node will accompany each originally requested node in the client's request to the server. However, one might wonder if the actual node access frequencies will converge to our calculated values only after a certain number of node requests at $T = t_1$, $t_1 \gg t_0$. This is in fact not the case. Consider the example of Figure 4.3b and let $f_{2,1} = 0.6$, $f_{2,2} = 0.1$ and $f_{2,3} = 0.3$.
Solving a system of 6 equations and unknowns will yield

\[ \left\{ \begin{array}{l}
r_{2,1 \to 2,2} = 0.54 \\
r_{2,1 \to 2,3} = 0.45 \\
r_{2,2 \to 2,1} = 0.07 \\
r_{2,2 \to 2,3} = 0.93 \\
r_{2,3 \to 2,1} = 0.20 \\
r_{2,3 \to 2,2} = 0.80
\end{array} \right\} \tag{4.9} \]

For the first query submitted at $T = t_0$, we calculate $p(N_{2,1}, N_{2,2})$, which represents the probability of nodes $N_{2,1}$ and $N_{2,2}$ belonging to the user's request:

\[ p(N_{2,1}, N_{2,2}) = f_{2,1}\, r_{2,1 \to 2,2} + f_{2,2}\, r_{2,2 \to 2,1} = 0.6 \times 0.54 + 0.1 \times 0.07 \approx \frac{1}{3} = \frac{1}{\binom{3}{2}} \]

Similarly, $p(N_{2,1}, N_{2,3}) = p(N_{2,2}, N_{2,3}) \approx \frac{1}{3} = 1 / \binom{3}{2}$, which means any two nodes could be the first nodes requested from the server. To see why, observe that in Equation 4.8 we set equal values for the expected node access frequency $f'_{i,j}$. Therefore, each node is equally likely to be present in a user's request.

It is important to note that although node access frequencies are equal, one cannot take away the attacker's prior knowledge of object popularity. For instance, consider an extreme case where the client downloads the entire database and processes his query locally to achieve perfect secrecy. Even though all nodes are accessed once (and in fact transferred to the client), the server can still assign a higher probability to a certain object being the object the user actually intended. However, this knowledge is not acquired from monitoring the client/server interactions. Similarly, our goal is to ensure our methods do not leak any extra information to the attacker through the way objects are accessed. Finally, aside from the storage overhead, the security of this method comes at the cost of transferring two probability tables for each node request to the client.

Although the above system of equations can be easily solved using Gaussian elimination, there are cases where the solution is not valid.
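The worked example for $c = 3$ can be checked numerically. The short sketch below plugs the original frequencies and the probability contributions of Equation 4.9 into Lemma 2 and into the pair probability $p(\cdot,\cdot)$; the dictionary layout and function names are illustrative choices, not part of the scheme. Since the published contributions are rounded to two decimals, the results match $\frac{2}{3}$ and $\frac{1}{3}$ only approximately.

```python
# Original access frequencies f_{2,j} and contributions r_{2,j -> 2,k} (Eq. 4.9).
f = {1: 0.6, 2: 0.1, 3: 0.3}
r = {(1, 2): 0.54, (1, 3): 0.45,
     (2, 1): 0.07, (2, 3): 0.93,
     (3, 1): 0.20, (3, 2): 0.80}


def new_frequency(j):
    """Lemma 2: f'_j = f_j + sum over k != j of f_k * r_{k -> j}."""
    return f[j] + sum(f[k] * r[(k, j)] for k in f if k != j)


def pair_probability(a, b):
    """Probability that nodes a and b appear together in one request."""
    return f[a] * r[(a, b)] + f[b] * r[(b, a)]


if __name__ == "__main__":
    for j in f:
        print(j, round(new_frequency(j), 3))
    for a, b in [(1, 2), (1, 3), (2, 3)]:
        print((a, b), round(pair_probability(a, b), 3))
```

The three modified frequencies also sum to approximately 2, in line with Theorem 3.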
To see why the solution can be invalid, note that

\[ f'_{i,j} = \frac{2 \sum_{j \in S_i} f_{i,j}}{|S_i|} = f_{i,j} + \sum_{k \ne j} f_{i,k}\, r_{i,k \to i,j} \]

If $\exists\, i,j$ s.t. $f_{i,j} > \frac{2 \sum_{j \in S_i} f_{i,j}}{|S_i|}$, there will exist at least one value of $r_{i,k \to i,j}$ smaller than 0 for some $k$, which is an invalid probabilistic contribution for a node. This situation occurs if there are large variations among the access frequencies of nodes of a certain tree level. For instance, if $c = 3$, an optimal solution can be found only if $\forall i,j : f_{i,j} \le \frac{2}{3}$. One way to solve this problem is to replicate a node whose $f_{i,j}$ is more than the above threshold into two identical nodes, each with half the original frequency. To maintain the structure of the tree, it suffices to randomly choose one of the replicated nodes at its parent with probability $\frac{1}{2}$. However, replication introduces new complexities to our approach. Therefore, this approach is effective whenever there are normal variations among $f_{i,j}$ values at each depth. We believe this is a practical assumption for POI data (this was roughly the case for over 70% of node access frequencies in our two datasets). However, applying this technique to cases where this property does not hold is an open question of our first approach. In the following section, we present our second approach, based on an object permutation technique that does not assume restrictions on node access frequency variations and works well even in the presence of outliers.

4.3.2 Object Permutation

In this section, while still using the access redundancy approach, we take an entirely new direction to hide access frequencies. We present an object permutation scheme that obfuscates the histogram of node access frequencies in the original tree (henceforth denoted by $F_i$ for each depth $i$) by converting it to a semi-uniform distribution (henceforth denoted by $F'_i$ for each depth $i$). This is achieved by iteratively regrouping the children of tree nodes according to their original access frequencies.
Our intuition is to find a regrouping of elements that gets us as close as possible to the average node frequency for each tree height. In doing this, we show how we preserve the spatial relationship among R-tree nodes by introducing the notion of auxiliary links.

To illustrate the process, consider the tree $R$ of Figure 4.5, whose node capacity is denoted by $c$. A tree node $N_{i,j}$ includes $c$ children $m_1 \ldots m_c$, each of which is an MBR containing all objects of its respective sub-tree. To ease explanation, the MBRs of each $m_i$ are labeled differently (with $m_i$, $m'_i$ and $m''_i$, respectively). The variation between the $f_{i,1}, f_{i,2}, \ldots, f_{i,c}$ values of $R$ is due to the $o_i$'s being queried with different frequencies (each POI has a certain level of popularity). Therefore, the immediate and hierarchical MBRs of each group of $c$ objects have different access frequencies. To resolve this issue, we permute the assignments of objects to nodes at each depth to achieve a "flattened" distribution. For instance, given $R$ in Figure 4.5, we can permute $R$ according to the following set of transformations:

\[ \pi = \left\{ \begin{array}{l}
N_{i,1}.m_1 \to N_{i,c} \\
N_{i,1}.m_{c-1} \to N_{i,2} \\
N_{i,2}.m_3 \to N_{i,1} \\
\ldots \\
N_{i,c}.m''_c \to N_{i,2}
\end{array} \right\} \tag{4.10} \]

The $N_{i,j_1}.m_k \to N_{i,j_2}$ notation implies that the object $m_k$ of the node $N_{i,j_1}$ is transferred (assigned) to the node $N_{i,j_2}$ on the right-hand side of the $\to$ symbol.

Figure 4.4: Matrix Representation

We can also use the matrix representation to illustrate the permutation process. Let $M$ denote the matrix where the $k$-th row is filled with $N_{i,k}$ from Figure 4.5 for $1 \le k \le c$. Here, $M$ is a $c \times c$ matrix. We take $M$ and permute the elements of each row. Once this is done for all rows, we use $M' = \pi(M)$ to denote the permuted matrix using the permutation $\pi$. Let $(M', k)$ denote the $k$-th column of $M'$. We set $N'_{i+1,k} = (M', k)$. Each permutation results in a different assignment of MBRs to permuted nodes.
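The matrix view of the permutation can be sketched in a few lines. The example below is only an illustration: it uses cyclic row shifts as the permutation, which is an assumption made for simplicity and not the frequency-driven permutation the scheme actually selects, and the helper names `permute_rows` and `columns` are hypothetical. Row $k$ of `M` holds the children of the $k$-th original node, and each column of the permuted matrix becomes one permuted node.

```python
def permute_rows(M, shifts):
    """Cyclically shift row k of M to the right by shifts[k]; returns M'."""
    c = len(M)
    return [[row[(col - s) % c] for col in range(c)]
            for row, s in zip(M, shifts)]


def columns(M_prime):
    """Column k of M' is the content of the k-th permuted node."""
    return [list(col) for col in zip(*M_prime)]


if __name__ == "__main__":
    # Rows: children of three original nodes (labels are illustrative).
    M = [["a1", "a2", "a3"],
         ["b1", "b2", "b3"],
         ["c1", "c2", "c3"]]
    for node in columns(permute_rows(M, shifts=[0, 1, 2])):
        print(node)
```

Because each row is only rearranged, every permuted node receives exactly one child from each original node, so no two children of the same original node end up together.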
As we elaborate soon, we do not permute objects randomly and instead choose a permutation that flattens node access frequencies by minimizing an objective function. However, before getting into the details of identifying good permutations, we tackle a more important issue, which is maintaining the original spatial relationship among objects of the tree. This is critical to ensure correctness in navigating $R'$. With each transformation, we maintain the tree structure by adding an auxiliary link from the original parent of an object to its new location. These extra links ensure that whenever an MBR $m_i$ is triggered to be expanded, all of its children are accessible by following the original as well as the auxiliary links to find all "pieces" of the right child MBR. Figure 4.6 shows how the original tree of Figure 4.5 can be permuted using the permutation function discussed above. As illustrated, for each object movement, an auxiliary link is generated for $N_{i-1,j}$. For instance, the link from $N_{i-1,j}.m_1$ to $N'_{i,c}$ corresponds to the transformation $N_{i,1}.m_1 \to N_{i,c}$.

A closer look at the matrix representation of the permutation reveals an important constraint we impose on shuffling objects among nodes. This constraint, which we refer to as the shuffling constraint, states that if two objects $m_i$ and $m_j$ belong to the same node $N_{i,j}$ in the original tree, they should belong to two different nodes in the permuted tree. The shuffling constraint ensures that $(c - 1)$ MBRs of each node are moved out of their original node and are transferred to $c - 1$ new nodes. Therefore, moving the objects of each node results in exactly $c - 1$ new auxiliary links.

Similar to the discussion in Section 4.3.1, to avoid reducing the tree fan-out, we store the auxiliary links in a separate data structure. In our implementation, each MBR is represented by 4 float numbers of 4 bytes each and a 4-byte pointer. With our proposed approach, for each MBR we need to store $c - 1$ additional auxiliary pointers.
Each node has $c$ child MBRs and hence the space required to store the auxiliary links for each node is $4 \times c \times (c-1)$ bytes. The storage overhead of this approach for a system with 4K disk blocks is around 2 to 4 extra page accesses to retrieve the auxiliary links for each node. In Section 5.4 we experimentally evaluate how this cost affects the overall response time.

Figure 4.5: Original Tree $R$

We can now show how tree navigation is performed in $R'$. Similar to the read, decrypt, encrypt, request protocol discussed in Section 4.2, the client $u$ first receives the node $N'_{i,j}$ and its children's auxiliary links. After decrypting the node, $u$ identifies the relevant MBR (say $m_k$) for the next round of the protocol. However, in contrast to the original tree where all of $m_k$'s children can be accessed by reading a single subsequent node, here $u$ needs to construct the redundancy set $RS$, which includes the original and auxiliary links to all nodes that contain $m_k$'s children in the permuted tree. Using the tree of Figure 4.6 as an example, suppose that after examining the content of $N_{i-1,j}$, $u$ identifies $m_2$ for further expansion, whose children could be retrieved by requesting $N_{i,2}$ in the original tree (see Figure 4.5). However, in order to retrieve all of $m_2$'s children, $RS = \{N'_{i,k} \mid k \in \{1, \ldots, c\}\}$ includes a request for $c$ (encrypted) nodes. This process is repeated for further expanding one of $m_2$'s children (e.g., $m'_1$). Note that in many cases, more than one MBR might overlap with (say) a range query. Each MBR is requested separately in the above fashion. Given the shuffling constraint, any node request results in a redundancy set of size $c$, except for the tree root. This might result in some nodes being sent to the client multiple times. However, this redundancy is needed to ensure node access frequencies remain balanced. The transfer of $c-1$ extra nodes and the auxiliary pointers to the client is the communication overhead this approach incurs to obfuscate node access frequencies.
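The two quantities discussed above, the redundancy set of an expanded MBR and the per-node auxiliary-link storage, can be sketched as follows. This is a minimal illustration and not the thesis implementation: the dictionary-based `placement` table, the child labels and both function names are assumptions made for the example.

```python
def redundancy_set(children, placement):
    """Permuted nodes that must be requested to see every child of one MBR."""
    return sorted({placement[child] for child in children})


def aux_link_bytes(c, pointer_bytes=4):
    """Auxiliary-link storage per node: c MBRs, each with c - 1 extra pointers."""
    return pointer_bytes * c * (c - 1)


# Hypothetical example with c = 3: the three children of MBR m_2 were
# scattered over three different permuted nodes (shuffling constraint).
placement = {"m2.a": 0, "m2.b": 2, "m2.c": 1}
rs = redundancy_set(["m2.a", "m2.b", "m2.c"], placement)
overhead = aux_link_bytes(3)
```

Under the shuffling constraint every child lands in a different permuted node, so the set has exactly $c$ members, which is why each expansion costs $c$ node transfers instead of one.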
What is left to discuss is the challenge of devising an optimal permutation algorithm. We now discuss our objective function used for object permutation.

Figure 4.6: Shuffled Tree $R'$

We first review an important property of the constrained object permutation tree discussed in Section 4.3.2 that helps us in designing the permutation algorithm. To illustrate this property (as well as others presented later in this section), we use the tree $R$ of Figure 4.7a as a running example, where the upper numbers shown inside the leaf nodes of $R$ are object identifiers and the lower numbers denote POI access frequencies (for clarity, we do not show that node contents are in fact encrypted). Similarly, for internal nodes, children are denoted by $m_i$ and their frequencies are calculated as the sum of their children's frequencies. To ease explanation and for clarity, we do not use normalized access frequencies during this discussion and use whole numbers as node frequencies instead.

Aggregate Access Frequency Property: The new access frequency (or frequency for short) $f'_{i,j}$ of each node is the sum of the frequencies of its parent MBRs at level $i-1$.

Proof: Each node $N'_{i,j}$ has one original as well as several incoming auxiliary links. Each time any of the MBRs from level $i-1$ needs to be expanded, all its outgoing links should be queried to ensure the correct child MBR (or object) is found. Therefore, a child node is accessed each time one of its parents at level $i-1$ is expanded, and each such expansion adds to the frequency of $N'_{i,j}$.

Corollary: After each permutation, the average node access frequency is multiplied by $c$. In other words, $\bar{F}'_i = c\,\bar{F}_i$ for $i > 1$.

Consider the leaf node permutation of $R$ in Figure 4.7b. The node $N'_{4,1}$ has two incoming links: one from $N'_{3,1}.m_2$ and another from $N'_{3,4}.m_1$. Therefore, $f'_{4,1} = 23 + 5 = 28$.

Figure 4.7: Shuffling Steps for Tree $R$. Panels: (a) Original Tree, (b) $i = 4$, (c) $i = 3$, (d) $i = 2$.
Also, note that the average node access frequency has increased from $\frac{100}{8} = 12.5$ to $\frac{56+54+44+46}{8} = 25 = c \times 12.5$ for $c = 2$.

Let us denote the histogram of node access frequencies at each depth $i$ of the tree $R$ by $F_i$. Intuitively, a good permutation is one that groups popular objects with the ones queried less frequently to get as close as possible to a fully uniform frequency distribution. We now define the “closeness” objective more formally. Let $\bar{F}'_i = \frac{1}{c^{i-1}} \sum_{k=1}^{c^{i-1}} f'_{i,k}$ denote the average node frequency for each depth $i$ of the permuted tree. We define the best permutation as the one that most closely satisfies the following goal:

Minimum Standard Deviation: Minimize the deviation among the $f'_{i,k}$ values for each depth $i$. In other words,

$$\sigma_i = \sqrt{\frac{1}{c^{i-1}} \sum_{k=1}^{c^{i-1}} \left(f'_{i,k} - \bar{F}'_i\right)^2}, \qquad \text{minimize}(\sigma_i).$$

Observe that this goal is designed to prevent the server from mounting a frequency variation attack, defined earlier in Section 4.2. Since an optimal solution assigns equal frequencies to each node, $\sigma_i = 0$ for all tree depths of the optimal solution. Therefore, the closeness to $\sigma_i = 0$ represents the security strength of a permutation. As we discuss below, constructing a permutation that results in a fully flat frequency distribution is not a trivial task.

Let us define the maximum range at depth $i$ as $U_i - L_i$, where $U_i = \max(f_{i,j})$ and $L_i = \min(f_{i,j})$. To achieve minimum standard deviation, we need to minimize this maximum range. However, this problem is an instance of the uniform $k$-partitioning problem. The $k$-partitioning problem, also known as the Minimum Makespan Scheduling Problem [Sah76], aims to partition $m$ sets, each of cardinality $k$, into $k$ sets of cardinality $m$ such that each of the produced $k$ sets contains exactly one element from each original set, while optimizing a certain “set uniformity” measure (e.g., minimizing the largest sum of cell elements or, in our case, the maximum range defined above).
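The minimum standard deviation goal can be evaluated directly for any candidate permutation. The sketch below is a minimal Python illustration; the function name `level_sigma` is hypothetical, and the sample leaf frequencies are an assumption chosen to be consistent with the sums quoted in the running example (they are not read off Figure 4.7 itself).

```python
import math


def level_sigma(freqs):
    """Population standard deviation of the node frequencies at one depth."""
    mean = sum(freqs) / len(freqs)
    return math.sqrt(sum((f - mean) ** 2 for f in freqs) / len(freqs))


# Original leaf-node frequencies (sum 100, mean 12.5) and one permuted
# assignment (sum 200, mean 25) for c = 2.
original = [13, 23, 21, 6, 8, 10, 5, 14]
permuted = [28, 28, 27, 27, 22, 22, 23, 23]

# Dividing by the mean makes the two levels comparable, since the
# permutation multiplies the average frequency by c.
relative_before = level_sigma(original) / (sum(original) / len(original))
relative_after = level_sigma(permuted) / (sum(permuted) / len(permuted))
```

A perfectly flat level gives `level_sigma(...) == 0`, which corresponds to the optimal case described above.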
While many instances of the problem (such as minimizing the largest size) are trivial and have polynomial solutions, minimizing the maximum range is equivalent to one of the few instances of the $k$-partitioning problem (out of a total of 32) which is NP-hard [DHPS05] (even for $c = 2$ this problem becomes the subset sum problem, which is still NP-hard). This is because in our case the number of bins is fixed. Therefore:

Theorem 4. Finding the optimal object permutation is NP-hard.

Given the hardness of finding the optimal solution that satisfies the minimum standard deviation goal, we devise an efficient permutation technique that finds a sub-optimal solution using heuristics that aim to minimize the objective function defined above.

Algorithm 10 Object Permutation
Require: Set $F' = \{F'_1, \ldots, F'_h\} = \emptyset$
  for each tree level $i = \{h, \ldots, 2\}$ do
    for $j = \{1, \ldots, c^{i-1}\}$ do
      $N'_{i,j} \leftarrow \emptyset$
      $f'_{i,j} \leftarrow 0$
    end for
    for $j = \{1, \ldots, c^{i-2}\}$ do
      for $N_{i-1,j}.m_k$, $k \in \{1, \ldots, c\}$ do
        replicate $N_{i-1,j}.m_k$ frequency $c$ times
        add the replicated value to $Ftmp'_{i-1}$
      end for
    end for
    sort $Ftmp'_{i-1}$ descending
    while $Ftmp'_{i-1} \neq \emptyset$ do
      $N'_{i,j} \leftarrow$ node with minimum $f'_{i,j}$
      $N_{i-1,j}.m_k \leftarrow$ object with largest frequency in $Ftmp'_{i-1}$
      let $N_{i,j}.m'_k$ be a random unmarked child of $N_{i,j}.m_k$
      if (($N'_{i,j}$ satisfies shuffling constraint) and (is not full)) then
        $N_{i,j}.m'_k \rightarrow N'_{i,j}$
        remove an instance of $f_{i-1,j}.m_k$ from $Ftmp'_{i-1}$
        mark $m'_k$
        add an auxiliary link from $N_{i-1,j}.m_k$ to $N'_{i,j}$
        update $F'_i$ with $f_{i-1,j}.m_k$ frequency
      end if
    end while
  end for
  return $F'$

According to our grouping strategy, our goal is to pair objects from either end of the frequency spectrum together to achieve uniformity. At first glance, it seems that to find such a shuffling for depth $i$, the frequency values for MBRs of $N_{i,j}$ (i.e., nodes belonging to the same level) should be considered.
However, the aggregate access frequency property stated that the access frequency of each node is a summation of its parents' frequencies. Therefore, to permute the tree at level $i$, we need to consider the frequency values of MBRs at level $i-1$. As we discussed in Section 4.3.2, during the reshuffling of the tree at depth $i$, each object of each node at depth $i-1$ generates $c-1$ new outgoing auxiliary links for its children in addition to its original child/parent link in $R$. Each such auxiliary link (as well as the original link) contributes the parent's frequency to the nodes containing its original children. In other words, each parent (i.e., object) of depth $i-1$ contributes its frequency $c$ times to the frequency patterns of its children at depth $i$. Therefore, at depth $i$, we have $c^{(i-1)-1}$ nodes, each with $c$ children whose frequencies should be replicated $c$ times. In other words, we need to flatten $c^{i-2} \times c \times c$ frequencies by assigning them to $c^{i-1}$ destination nodes (placeholders) of level $i$ according to the shuffling constraint. For instance, in Figure 4.7b, to shuffle the leaf nodes ($i = 4$), we need to find a grouping of $2^2 \times 2 \times 2 = 16$ values 13, 13, 23, 23, 21, 21, 6, 6, 8, 8, 10, 10, 5, 5, 14, 14 into $2^3 = 8$ permuted leaf nodes. As shown, (say) node $N_{4,4}$ adds the frequencies of its parents $N_{3,2}.m_1$ and $N_{3,2}.m_2$. Hence, $f'_{4,4} = 21 + 6 = 27$.

To find the right frequency assignments to destination nodes (placeholders) we use the following greedy heuristic. At each step of the algorithm, we want to assign the “largest” unassigned frequency share to the “emptiest” node. To achieve this, during an offline process, all destination nodes are kept sorted in ascending order based on their “remainder space” that could be assigned. Similarly, remaining unassigned frequencies are sorted in descending order. Next, the first available node on the sorted list (i.e., the emptiest) receives the first frequency from the other sorted list (i.e., the largest).
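The greedy step just described (largest remaining share to the emptiest admissible placeholder) can be sketched for a single tree level as follows. This is a simplified illustration, not the thesis algorithm: it ignores which child object is moved and the creation of auxiliary links, it raises an error instead of backtracking if the greedy choice gets stuck, and the name `greedy_flatten` is hypothetical.

```python
def greedy_flatten(parent_freqs, c):
    """Assign c shares of every parent MBR frequency to placeholder nodes.

    parent_freqs: one frequency per MBR at level i-1.
    Returns one list of (mbr_id, freq) shares per destination node at level i.
    """
    # Each parent MBR contributes its frequency c times.
    shares = [(f, mbr) for mbr, f in enumerate(parent_freqs) for _ in range(c)]
    shares.sort(key=lambda s: -s[0])  # largest share first (stable sort)
    dests = [[] for _ in range(len(parent_freqs))]
    for freq, mbr in shares:
        # Shuffling constraint: at most one share per MBR in a destination,
        # and a destination holds at most c shares.
        candidates = [d for d in dests
                      if len(d) < c and all(m != mbr for m, _ in d)]
        if not candidates:
            raise ValueError("greedy choice got stuck; no admissible node")
        emptiest = min(candidates, key=lambda d: sum(f for _, f in d))
        emptiest.append((mbr, freq))
    return dests


# Level-3 MBR frequencies of the running example (c = 2).
dests = greedy_flatten([13, 23, 21, 6, 8, 10, 5, 14], c=2)
sums = [sum(f for _, f in d) for d in dests]
```

On the running example this pairs 23 with 5, 21 with 6, 14 with 8 and 13 with 10, each pair occurring twice.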
Each such assignment triggers an object transformation and hence the creation of an auxiliary link from the parent to the destination node. This process is continued until all $c^i$ frequencies are partitioned evenly among their $c^{i-1}$ placeholder nodes. This process is repeated bottom up to shuffle the tree. Note that performing this process in a top down fashion would destroy the even distribution of the previous phase. Algorithm 1 details the shuffling process and Figure 4.7 shows how hierarchical shuffling is performed on an entire R-tree. For instance, the (replicated) level 3 frequencies are grouped as $\{23, 5\}$, $\{5, 23\}$, $\{21, 6\}$, $\{6, 21\}$, $\ldots$, $\{13, 10\}$, $\{13, 10\}$ in level 4 (leaf).

4.4 Security and Complexity Analysis

In the remainder of this section we review the server overhead in employing our proposed probabilistic uniform node access and object permutation schemes, henceforth denoted by $PU$ and $OP$, respectively, for spatial query processing using the privacy-aware tree $R'$. We also present the security properties of our techniques.

Theorem 5. Let $C_O(n)$, $C_{PU}(n)$, $C_{OP}(n)$ denote the computational cost of processing a spatial query $Q$ over $n$ objects using the original, $PU$ and $OP$ methods, respectively. Then $C_{PU}(n) \approx 2\,C_O(n)$ and $C_{OP}(n) \approx c\,C_O(n)$, where $c$ is the R-tree capacity.

Proof: Processing $Q$ using $R$ requires, at each step, a “visit” phase to inspect a node $N_{i,j}$ and an “expand” phase where the server identifies the next MBR to be visited. This MBR is selected based on the nature of $Q$ (for instance, an MBR overlapping with $Q$ if $Q$ is a range query). Replacing $R$ with $R'$ results in two changes. First, the MBR inspection process is shifted to the client $u$, as nodes of $R'$ are encrypted. Furthermore, each “visit” to a single node is replaced by requesting a redundancy set whose size is 2 for $PU$ and $c$ for $OP$. However, the expansion phase remains intact because, of all nodes requested by $u$, only one MBR would trigger the next client-server interaction (see Figure 4.8).
Therefore, the complexity of processing $Q$ is approximately $C_O(n)$ multiplied by the redundancy set size, which is a constant. The approximation accounts for the (constant) extra cost of retrieving the probability tables and the auxiliary links for the $PU$ and $OP$ methods, respectively.

As we noted in Section 4.3.1, with regard to security, the $PU$ method achieves fully uniform node access frequency, where the probability of any two nodes being requested is $|S_i|^{-2}$. Moreover, according to Lemma 2, the server overhead for the $PU$ technique is twofold due to an increase in the size of the result set. To further understand the security strength of the $OP$ method, consider the following properties.

Figure 4.8: Visiting vs. Expanding Nodes. Panels: (a) Original Tree, (b) Permuted Tree.

Frequency Anonymity Property: At each tree level, there are at least $c$ nodes with equal access frequency.

Proof: The replication step in Algorithm 1 guarantees that there are at least $c$ instances of each frequency to be assigned to placeholders. Given that all placeholders are initially empty, each group of identical frequencies in $Ftmp'_{i-1}$ is assigned to a group of $c$ placeholders with identical remaining capacity. Therefore, according to the aggregate access frequency property, this process is repeated $c$ times and the aggregate frequency of $c$ nodes will be identical at the end of Algorithm 1.

Path Anonymity Property: There are at least $c^2$ leaf objects that share the same navigation path.

Proof: To access a leaf object $N'_{h,j}.m_k$, its parent $N'_{h-1,j'}.m'_k$ should be read first. However, $N'_{h-1,j'}.m'_k$ points to $c-1$ other nodes as well, all of which need to be accessed and whose $c$ children should be inspected to find the correct leaf object. This argument holds for recursively reading higher intermediate nodes as well. These $c^2$ leaf objects cannot be determined by the server's prior knowledge of MBRs in $R$.
This is because the objects that are paired in $R'$ do not share any proximity with each other and are grouped solely based on the randomized frequency flattening strategy of Algorithm 1. However, as expected, the $OP$ method is still less secure compared to $PU$. In Section 5.4, we evaluate the security and efficiency properties of the $OP$ method.

4.5 Generalizations

We assumed in Section 4.2 that leaf objects in $N$ are indexed using an R-tree with capacity $c$ where each node contains exactly $c$ elements, and deferred the generalization of our techniques to partially full R-trees as well as the application of our proposed schemes to oblivious navigation of other tree-structured spatial indexes such as kd-trees and quadtrees. In the following sections, we discuss these two generalizations, respectively.

4.5.1 Partially Full R-trees

To prove several properties of our proposed R-tree variants, we assumed each internal node $N'_{i,j}$ of the R-tree includes exactly $c$ child MBRs and each leaf node groups exactly $c$ objects into a leaf MBR. Although having the entire dataset available offline enables the construction of a balanced tree where most of the nodes are in fact full, several nodes at each level might not contain exactly $c$ objects. The partially full structure of the tree, even though nodes are encrypted, can potentially leak information to the server about the distribution of the actual static objects. To deal with this issue, during an offline process, we traverse the tree from level $h-1$ up to level 1 and for each parent node $N'_{i,j}$ with $c' < c$ children, we pad the missing $c - c'$ children with dummy data and an access frequency of zero and link them to $N'_{i,j}$. This process guarantees that no information is leaked to the server from any asymmetry of the tree structure. Moreover, one can easily verify that using the above approach, all tree properties and proofs of correctness and security still hold.
Finally, observe that all proofs of complexity assumed a worst case scenario where the tree nodes are all full and therefore, padding the nodes with dummy data does not exacerbate the complexity analysis of earlier sections.

4.5.2 Other Tree Structured Spatial Indexes

In previous sections, we detailed our schemes for privacy-aware navigation of R-trees. Although we focused our attention on R-trees, we did not make any assumptions specific to R-trees that do not hold in other tree structured spatial indexes. Both the $PU$ and $OP$ techniques discussed in Sections 4.3.1 and 4.3.2 can be employed with any other tree-structured index that respects the notion of recursively grouping lower level objects in higher nodes. Furthermore, the tree should have the same number of objects in each node (this is similar to the discussion in Section 4.5.1). Several spatial indexes such as kd-trees and quadtrees satisfy all the above properties. Finally, since tree nodes are encrypted, the client should be capable of performing the query processing interactively with the server. This requires the client to be aware of how the underlying spatial index is used for query processing.

4.6 Summary

To protect the location privacy of users who subscribe to location-based services, encryption is not sufficient for hiding the contents of the underlying spatial index. The potentially untrusted server hosting spatial data can obtain sensitive information by monitoring “how” (i.e., with what frequency) the encrypted index nodes are being accessed. Combining this information with its prior knowledge about the data, the server can easily deduce the location of the user querying the data. In this chapter, we proposed two techniques that enable oblivious navigation of tree-structured spatial indexes while incurring acceptable communication overhead and server side complexity and imposing minimal burden on the client side.
We analytically and empirically studied the security and efficiency of our approaches and verified their practicality for privately executing multiple spatial queries per second (see Chapter 5).

Chapter 5

Experiments

In this chapter, we experimentally evaluate the effectiveness of our proposed approaches. In particular, we extensively study the privacy and efficiency characteristics of our space transformation, PIR-based and oblivious tree traversal schemes with various synthetic and real-world datasets.

5.1 Datasets and Experimental Setup

We have used a total of 9 real-world and synthetic datasets to evaluate our frameworks. While our real-world datasets vary in size and location, for our synthetic datasets we have used uniform and skewed datasets of varying sizes. Each uniform dataset contains a collection of randomly generated locations, while for each skewed dataset 99% of the objects form four Gaussian clusters (with $\sigma = 0.05$ and randomly chosen centers) and the other 1% of the objects are uniformly distributed.

Based on the availability of each dataset at the time of performing the empirical studies for each approach, we have used a subset of these datasets to evaluate each technique. The datasets used for each approach are mentioned before each set of experiments for our space transformation, PIR-based and oblivious tree traversal approaches. We have generated random range and KNN queries with varying sizes and for each approach measured the privacy costs associated with the developed techniques as well as the end-to-end response time in milliseconds. Experiments were carried out on an Intel P4 3.20 GHz with 2 GB of RAM and are implemented in Java.

5.2 Space Transformation

We have conducted extensive sets of experiments to evaluate the performance of our space transformation technique from Chapter 2.
The effectiveness of our dual curve query resolution technique (DCQR) is determined in terms of (i) the effect of the curve order $N$ on range and KNN queries, (ii) the accuracy and efficiency of our proposed methods for approximate KNN and range queries and (iii) the performance of exact KNN query processing.

5.2.1 Experimental Setup

Our experiments are performed on three different datasets: (i) a synthetically generated uniform dataset, (ii) a real-world dataset of restaurants obtained from NAVTEQ (www.navteq.com) covering a 26 mile by 26 mile area surrounding the city of Los Angeles and (iii) a synthetically generated skewed dataset where 99% of the objects form four Gaussian clusters (with $\sigma = 0.05$ and randomly chosen centers) and the other 1% of the objects are uniformly distributed (Figure 5.1). All three datasets contain around 10000 objects ($n = 10000$). Experiments were carried out on an Intel P4 3.20 GHz with 2 GB of RAM and are implemented in Java.

Figure 5.1: Datasets. Panels: (a) Uniform, (b) Real-world, (c) Skewed.

5.2.2 Choosing the Right Curve Order (N)

As our first set of experiments, we evaluate the effectiveness of our proposed indexing technique by analyzing the curve behavior for different values of $N$ (i.e., curve order) and choosing the right value for the rest of our experiments. Using two $H_2^N$ curves for indexing objects, we measure the average number of objects which are assigned the same H-value for each value of $N$ for our three datasets. Ideally we want this average to be close to 1. Figure 5.2 shows how the average changes with $N$ and quickly approaches 1 for $N \ge 10$ for all three of our datasets. Due to the extra overhead of an unnecessary increase in $N$ for range and KNN queries, we choose the smallest value of $N$ that brings the average close to 1. Therefore, for the rest of our experiments, we set $N = 12$ unless otherwise stated.

Figure 5.2: Finding the Right Value of Curve Order (N).
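The measurement behind this choice, the average number of objects sharing one H-value, can be sketched as follows. The code is only an illustration under stated assumptions: it uses a single, untransformed Hilbert curve computed with the widely used bitwise coordinate-to-index conversion (not the thesis's dual-curve construction), it assumes points normalized to the unit square, and the function names are hypothetical.

```python
from collections import Counter


def hilbert_index(order, x, y):
    """Hilbert index of grid cell (x, y) on a 2^order by 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def avg_objects_per_hvalue(points, order):
    """Average number of points mapped to the same occupied H-value."""
    n = 1 << order
    counts = Counter(
        hilbert_index(order, min(int(px * n), n - 1), min(int(py * n), n - 1))
        for px, py in points
    )
    return len(points) / len(counts)


# Two co-located points and one distant point on a 2 x 2 grid.
ratio = avg_objects_per_hvalue([(0.1, 0.1), (0.1, 0.1), (0.9, 0.9)], order=1)
```

Raising `order` refines the grid, so fewer objects collide on one H-value and the ratio drifts toward 1, which is the behavior Figure 5.2 reports.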
5.2.3 Evaluating Query Results’ Quality

Suppose the ground truth result of a query $q$, issued by a user located at point $l_i$, is $R = (o_1, o_2, \ldots, o_K)$ if no privacy is required. Let us denote by $R' = (o'_1, o'_2, \ldots, o'_{K'})$ the query result obtained using our proposed techniques. Given the nature of our approximate KNN algorithm and the fact that our range algorithm might append extra objects to the result set, we use the following two metrics to evaluate the quality of results.

Precision: By precision, we measure what fraction of the points retrieved are relevant for range and approximate KNN queries. This metric is defined in Equation 5.1 ($|R|$ denotes the cardinality of the set $R$).

$$\text{Precision} = \frac{|R \cap R'|}{|R'|} \qquad (5.1)$$

For our approximate KNN algorithm, we introduce an additional displacement metric defined below (note that we do not approximate range queries).

Displacement: Measures how closely $R$ is approximated by $R'$ using our KNN algorithm:

$$\text{Displacement} = \frac{1}{K}\left(\sum_{j=1}^{K} \|l_i - o'_j\| - \sum_{j=1}^{K} \|l_i - o_j\|\right) \qquad (5.2)$$

where $\|l_i - o_j\|$ is the Euclidean distance between the fixed query point $l_i$ and an object $o_j$. Since $R$ is the ground truth, displacement $\ge 0$. In the remainder of this section, we use the above metrics to experimentally evaluate our range and KNN algorithms.

5.2.4 Approximate KNN Query Evaluation

The next set of experiments evaluates the performance of our approximate KNN query processing algorithm for different values of $N$ and $K$ for our three datasets by comparing the single curve approach against DCQR. In the remainder of this section, each experiment is performed for 1000 randomly generated queries and the results are averaged.

As illustrated in Figure 5.3, DCQR results in an average 15% improvement in precision over the single curve approach (from 50% to 65%).
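The precision and displacement metrics of Equations 5.1 and 5.2 can be sketched as follows. This is a minimal illustration: object identifiers are assumed to be hashable, points are assumed to be 2D coordinate pairs, both result lists are assumed to have length $K$ for the displacement, and the function names are hypothetical.

```python
import math


def precision(truth_ids, result_ids):
    """Equation 5.1: fraction of retrieved objects that are in the ground truth."""
    truth, result = set(truth_ids), set(result_ids)
    return len(truth & result) / len(result)


def displacement(query, truth_points, result_points):
    """Equation 5.2: average extra distance of approximate over exact neighbors."""
    k = len(truth_points)

    def total(points):
        return sum(math.hypot(px - query[0], py - query[1]) for px, py in points)

    return (total(result_points) - total(truth_points)) / k


# Two of the four retrieved objects are true results.
p = precision([1, 2, 3], [2, 3, 4, 5])
# One approximate neighbor is one unit farther than the exact one (K = 2).
d = displacement((0, 0), [(1, 0), (0, 1)], [(2, 0), (0, 1)])
```

A displacement of zero means the approximate neighbors are, in total, as close to the query point as the exact ones, even when the precision is below 1.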
It is also clear from the figure that for all three datasets the precision quickly reaches 70% for $N = 8$ (i.e., where the average number of objects per H-value is below 2) and slightly oscillates for larger values of $N$. Finally, as expected, the uniform dataset achieves slightly higher precision compared to our real-world and skewed datasets.

Figure 5.3: Precision Vs. Curve Order (N). Panels: (a) Uniform, (b) Real-world, (c) Skewed.

Figure 5.4: Displacement Vs. Curve Order (N). Panels: (a) Uniform, (b) Real-world, (c) Skewed.

Figure 5.4 also illustrates how DCQR reduces the displacement error by 50% for all three datasets. Also, as expected, the DCQR displacement is slightly higher in the skewed dataset. Note that the displacement is a more reasonable measure for KNN query evaluation than the precision because it shows how closely the query results are approximated rather than just showing the percentage of exact neighbors in the result set. For instance, even a 0% precision with 0.05 mile displacement means that although no exact match for the query is found, each approximate result is on average less than 0.05 mile farther from the query point than the actual result.

Finally, we measured the DCQR overhead in overall approximate KNN query response time. As Figure 5.5 illustrates, the overhead stays around 2 milliseconds on average for the three datasets. Also, the response time stays around 6 milliseconds for all three datasets.

Figure 5.5: Response Time Vs. Curve Order (N). Panels: (a) Uniform, (b) Real-world, (c) Skewed.

Figure 5.6: Precision Vs. K. Panels: (a) Uniform, (b) Real-world, (c) Skewed.

The next sets of experiments are identical to what we discussed above except that we now study how approximate KNN query processing is affected by varying $K$ and fixing the curve order at $N = 12$. As Figure 5.6 illustrates, DCQR improves precision by more than 25% for all three datasets, averaging 67%. The displacement also increases linearly with $K$ while DCQR results in more than 50% error reduction (Figure 5.7).
Similar to the former set of experiments, the skewed dataset results in relatively larger displacements compared to the other datasets. Finally, Figure 5.8 shows how response time grows linearly with $K$, confirming our query complexity derivation of Section 2.3.2. The overhead caused by DCQR is less than 6 milliseconds on average for all three datasets.

Figure 5.7: Displacement Vs. K. Panels: (a) Uniform, (b) Real-world, (c) Skewed.

Figure 5.8: Response Time Vs. K. Panels: (a) Uniform, (b) Real-world, (c) Skewed.

5.2.5 Range Query Evaluation

We now study the performance of our range query algorithm with the single curve and DCQR approaches. In contrast to the above experiments, our range query evaluation is exact. However, the query result set might include some excessive objects that should be filtered out by the client. Clearly, it is desirable to minimize the number of these excessive objects. Here, we first demonstrate the effect of Hilbert curve order ($N$) and the range query size on the number of excessive objects. We use precision (i.e., the percentage of objects falling in the range out of all objects retrieved) to quantify the amount of excessive objects. Figure 5.9 shows the effect of Hilbert curve order on the value of precision for our real-world dataset. This effect is evaluated for four different square range queries with the selectivity of 0.05%, 0.1% and 0.5%. Clearly, increasing the value of curve order results in higher precision values (reaching almost 1 for $N = 12$). Very similar trends were observed for the other two datasets.

Figure 5.9: Precision Vs. Curve Order for Different Range Query Sizes.

The second observation is that for a given $N$, larger range queries achieve better precision. This is because the excessive objects can only be in the grid cells which partially overlap with the range query. Clearly, the number of such cells is proportional to the perimeter of the range query.
On the other hand, the number of objects retrieved from the location server is on average proportional to the area of the query. Therefore, increasing the query side length increases the total number of retrieved objects more than the number of excessive objects.

As our second set of experiments we evaluate the effect of Hilbert curve order on the average number of query runs. It is desirable to have a small number of runs to reduce the communication cost as well as the number of requests to the server (and hence to increase the server throughput). Figure 5.10 shows the average number of runs for 1000 range queries over the real-world dataset with the selectivity of 0.05%, 0.1% and 0.5%. For both the original and DCQR approaches, as $N$ increases, the average number of query runs increases as well. This is due to the fact that the average number of query runs is linearly proportional to the query side length in terms of the number of grid cells. Therefore, there is a trade-off in choosing the right curve order to minimize the number of query runs while maximizing precision. Considering different curve orders for a specific query size in Figure 5.10, we get an average improvement of around 21% when we use the DCQR approach over the original curve. Moreover, it is clear that larger range queries have more runs. Note that the average number of runs is independent of the dataset type and only depends on the coordinates of the query.

Figure 5.10: Number of Runs Vs. Hilbert Order for a Query with Different Selectivity. Panels: (a) 0.05% Selectivity, (b) 0.1% Selectivity, (c) 0.5% Selectivity.

Figure 5.11: Running Times Vs. Curve Order (N) for Different Query Selectivity. Panels: (a) 0.05% selectivity, (b) 0.01% selectivity, (c) 0.5% selectivity.

In the last set of our range query experiments we study the running time of the proposed range query algorithm.
Figure 5.11(a)-(d) shows the running time of both the original and DCQR approaches for different Hilbert curve orders for our real dataset with four different query sizes. For a fixed query side length, increasing the value of curve order increases the running time. This is consistent with the range query complexity derived in Section 2.3.3.1. Also, fortunately the overhead of the DCQR approach is marginal compared to the original approach (around 6 milliseconds on average). This is important since it confirms that we can use DCQR to evaluate a range query without significant time overhead. Note that both approaches retrieve the same number of objects from the location server and the difference between DCQR and the original approach is the step required to find the runs in the dual curve. As the grid becomes more fine-grained, the time to retrieve objects from the location server dominates the time to calculate query runs and hence increases the overhead of DCQR. Similar trends were observed for the other two datasets. Although the distribution of objects varies for different datasets, the number of runs and the average number of objects retrieved for a unique query size are independent of the dataset. Therefore, the running time is almost independent of the data distribution.

5.2.6 Exact KNN Query Evaluation

In the previous sections, we empirically evaluated the performance of the approximate KNN and range algorithms and observed how DCQR improves the precision and displacement for KNN queries and reduces the number of runs for range queries. In this section, we use these constructs with DCQR to experimentally measure the performance of our exact KNN algorithm from Section 2.4. Similar to Section 5.2.4, we first study the effect of the curve order $N$ on the quality of results.
Since our exact KNN algorithm employs the range algorithm for retrieving all objects in a square region centered at the query point, the server's response for this range request includes several objects that do not belong to the actual KNN result set. In our next experiment, we measure how the curve order affects the number of such excessive objects that are returned to the client. As Figure 5.12 suggests, the average number of such objects quickly drops as $N$ grows and stays approximately constant for larger values of $N$ (where the average number of objects per H-value approaches 1). Moreover, relatively more unnecessary objects are returned for the skewed dataset. This is due to the fact that the average size of the circle centered at $q$ is larger for skewed datasets, as the approximate KNN algorithm has to expand beyond larger gaps in the Hilbert space to obtain the approximate result set.

Figure 5.12: Excessive Objects Vs. Curve Order (N). Panels: (a) Uniform, (b) Real-world, (c) Skewed.

We proceed to measure the overall response time for the exact KNN algorithm. The results are presented in Figure 5.13, where the contributions of each query processing phase are highlighted. The Aprx piece denotes the time it takes to find the circular region that includes the $K$ approximate results and the Exact piece represents the time it takes to retrieve all objects within the square range that encompasses the above region to compute the exact results. Several observations can be made from Figure 5.13. The first phase of exact KNN query processing accounts for a significantly smaller portion of the total running time of the algorithm. Moreover, as $N$ grows, both the Aprx and Exact portions of the response time grow and so does the total query time. This is consistent with the observations from Figures 5.5 and 5.11. However, the total query time stays below 20 milliseconds for our real-world and uniform datasets and below 120 milliseconds even for the skewed dataset.
[Figure 5.13: Response Time vs. Curve Order (N). Panels: (a) Uniform, (b) Real-world, (c) Skewed]

As our last set of experiments, we evaluate the effect of increasing K on the number of excessive objects and the total time of our exact KNN algorithm. The results are shown in Figures 5.14 and 5.15, respectively. As expected, increasing K almost linearly increases the number of excessive objects returned to clients, as larger regions need to be queried with the range algorithm to find the exact results. The second observation from Figure 5.14 is that the number of excessive objects for the skewed dataset is significantly larger than that of the real-world and uniform datasets. This can be explained by revisiting Figure 5.7. The skewness of the data results in more displacement errors in finding the candidate KNN results, which in turn leads to a relatively larger region to query with the range algorithm. With regard to overall response time, Figure 5.15 illustrates how increasing K results in an increase in overall response time. As we illustrated in Figure 5.8, larger values of K result in more time spent finding the candidate result set and in a larger region. Querying this relatively larger region is the cause of the increase in the time contribution of the second phase of exact KNN query processing.

[Figure 5.14: KNN Excessive Objects vs. K. Panels: (a) Uniform, (b) Real-world, (c) Skewed]

[Figure 5.15: KNN Response Time vs. K. Panels: (a) Uniform, (b) Real-world, (c) Skewed]

5.3 Private Information Retrieval

In this section we empirically examine the overall efficiency of our PIR-based framework from Chapter 3. We conducted extensive experiments to determine the effectiveness of our framework in terms of (i) the effect of system parameters such as the grid size, the cache size, etc., (ii) the overall response time of the range and three KNN algorithms for different datasets and (iii) the overall end-to-end performance of the framework.
5.3.1 Experimental Setup

Our experiments are performed on four different datasets (Figure 5.16), three of which contain around 40,000 objects. These datasets are as follows: (i) a real-world dataset obtained from NAVTEQ (www.navteq.com) covering the restaurants in a 1280 km by 1280 km area in the central United States, (ii) a uniform distribution, (iii) a synthesized highly skewed dataset where 99% of the objects form four Gaussian clusters (with σ = 0.05 and randomly chosen centers) and the other 1% of the objects are uniformly distributed, and (iv) a sparse dataset of 4,000 uniformly distributed objects. Experiments were run on an Intel P4 3.20 GHz with 2 GB of RAM emulating a secure coprocessor.

[Figure 5.16: Real-world, Uniform, Highly Skewed and Sparse Datasets]

One of the key challenges in hardware-based PIR is dealing with the relatively low computation power and available storage space of secure coprocessors. For example, IBM’s PCIXCC secure coprocessor runs at 266 MHz and supports 80MB of main memory, an Ethernet connection and a Linux OS [AvD04]. Although such processors are slower than their non-secure counterparts, they are equipped with cryptographic accelerators that significantly outperform a typical high-speed processor in performing cryptographic operations. The most recent IBM 4764 PCI-X Cryptographic Coprocessor [TIPXCC] generates up to 470 2048-bit RSA signatures per second, compared to a P4 @ 3.4 GHz that generates around 40 signatures during the same time (i.e., more than one order of magnitude improvement). Due to the very frequent encryption (decryption) operations required in our algorithms, such accelerators can significantly improve the overall response time. Furthermore, as we show in this section, for optimal system parameters (i.e., grid, cut-off threshold and cache size), our response times are mostly on the order of milliseconds, and even a secure coprocessor that runs almost 10 times slower still achieves very satisfactory response times.
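For readers who wish to reproduce a dataset with the same shape as the highly skewed one described in the setup, the following is a minimal sketch. Several choices are assumptions that the text does not state: coordinates are normalized to the unit square, the standard deviation of 0.05 is taken relative to that square, clustered points are split evenly among the four clusters, and points falling outside the square are not clipped. The function name skewed_dataset is hypothetical.

```python
import random


def skewed_dataset(n, clusters=4, sigma=0.05, clustered_fraction=0.99, seed=0):
    """Generate n points: most in Gaussian clusters, the rest uniform.

    All defaults mirror the numbers quoted in the setup; the unit-square
    normalisation and the even split across clusters are assumptions.
    """
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(clusters)]
    n_clustered = round(n * clustered_fraction)

    points = []
    for i in range(n_clustered):
        cx, cy = centers[i % clusters]          # round-robin over the clusters
        points.append((rng.gauss(cx, sigma), rng.gauss(cy, sigma)))

    # The remaining objects are spread uniformly over the unit square.
    points.extend((rng.random(), rng.random()) for _ in range(n - n_clustered))
    return points


if __name__ == "__main__":
    data = skewed_dataset(40_000)
    print(len(data), data[:2])
```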
With regard to space, running the range and KNN algorithms on SC does not strain its relatively limited memory, because the large index structures are stored at the untrusted server and the proposed range and KNN algorithms are designed to fit within the limited amount of space available in SC.

In Section 3.3.1 we discussed why we chose to use regular grids as opposed to other indexing variants such as R-trees or kd-trees. In addition, here we observe that using such techniques requires the tree information to be either stored at SC or privately and incrementally queried from the server per query. While the former approach is infeasible due to the highly restricted storage and computation resources of SC, the latter approach also incurs significantly more query processing cost due to the private retrieval of tree nodes prior to retrieving the actual query result set from the server.

5.3.2 Space Complexity Analysis

As our first set of experiments, we briefly analyze the space requirements of our entire framework. Table 5.1 summarizes the required space for storing each item in L (the list storing the index of previously retrieved items in PIR), SC’s cache and the two permutations, respectively. The second column shows the exact values for n = 10^5 (i.e., the total number of objects), M = 2^7 (the reciprocal of the grid cell size), T_threshold = 10^3 and s = sizeOf(double) in bits. These values closely represent the bulk of our experiments. L represents the number of bits needed to store each c_id of an M × M grid. For each c_id, the cache stores on average n/M^2 objects, each represented by the triplet <longitude, latitude, id>. As expected, the most significant space requirement is imposed by the two permutations, where each of the M^2 rows of a permutation stores a mapping of one c_id to another. However, as discussed in Section 3.3.1, their sizes are determined by the granularity of the underlying grid structure and are invariant of n (as we are indexing the cells instead of the objects located in them).
We note that since our records are relatively small, the storage requirements of SC’s cache are fairly nominal. However, if enough space is not available in SC’s cache to store significantly larger records, we can use a variant of the PIR scheme proposed by [Aso04] which does not require caching at all: it reads all k − 1 previously accessed records when evaluating the k-th query instead of using its cache to read only a single element.

Table 5.1: Storage Requirements of SC

Item            Storage (Byte)                               Value (KByte)
L               log(M)/4                                     2
Cache           log(M)/4 + ((2s + log(n))/8) · (n/M^2)       26
Permutations    M^2 · log(M)                                 230

[Figure 5.17: Effect of the grid cell size on the Range Algorithm for the Real-World Dataset. Panels: (a) C and E, (b) Time]

5.3.3 The Effect of the Grid Cell Size

As we discussed in Section 3.3, choosing the right grid cell size can significantly affect query performance. In this section, we evaluate how the performance of our proposed algorithms changes for different cell sizes. To focus on the overhead of our range and KNN algorithms, for now we assume database reads are not private (i.e., t_read ≈ 0). Later, in Section 5.3.7, we replace database reads with the private read function developed in Section 3.2.1 to study the overall query response time, which also includes the PIR overhead.

Unless otherwise stated, the results of each experiment are averaged over 200 randomly generated range queries of size 100 km^2 and KNN queries with K = 100 (these are much larger than the range and KNN queries typically used in many LBS). For each experiment, we measure (i) the average number of cells (i.e., records) queried from any of our private index structures, hereafter represented by C, (ii) the average number of excessive objects queried (i.e., |R'| − |R| for range and |R'| − K for KNN queries), hereafter represented by E, and (iii) the overall query response time T in milliseconds.

[Figure 5.18: Effect of the grid cell size on the Progressive, Hilbert and Hierarchical Algorithms for the Real-World Dataset. Panels: (a) C and E, (b) Time]
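To make the measures C and E concrete for a range query, the following minimal sketch counts the grid cells a query rectangle overlaps and the objects retrieved from those cells that fall outside the rectangle. It assumes a unit-square data space, a regular grid with the same number of cells on each axis and no PIR; the function name grid_range_query and its parameters are hypothetical and do not correspond to the thesis's range algorithm.

```python
def grid_range_query(points, cells_per_axis, rect):
    """Evaluate a range query by fetching every grid cell the rectangle touches.

    points         -- objects as (x, y) tuples in the unit square
    cells_per_axis -- number of grid cells along each axis
    rect           -- query rectangle (x0, y0, x1, y1)
    Returns (C, E, result): cells fetched, excessive objects, exact answer.
    """
    x0, y0, x1, y1 = rect
    m = cells_per_axis

    def cell(v):
        # Index of the grid cell containing coordinate v (clamped at the border).
        return min(int(v * m), m - 1)

    cx0, cx1, cy0, cy1 = cell(x0), cell(x1), cell(y0), cell(y1)
    cells_fetched = (cx1 - cx0 + 1) * (cy1 - cy0 + 1)            # C

    # |R'|: everything stored in the fetched cells.
    retrieved = [
        p for p in points
        if cx0 <= cell(p[0]) <= cx1 and cy0 <= cell(p[1]) <= cy1
    ]
    # |R|: the objects that actually satisfy the query.
    result = [p for p in retrieved if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
    return cells_fetched, len(retrieved) - len(result), result    # E = |R'| - |R|


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.3, 0.3), (0.45, 0.45), (0.9, 0.9)]
    query = (0.2, 0.2, 0.4, 0.4)
    print(grid_range_query(pts, 2, query))   # coarse grid: few cells, more excess
    print(grid_range_query(pts, 8, query))   # fine grid: more cells, less excess
```

In this toy example the coarse grid fetches one cell with two excessive objects, while the finer grid fetches nine cells with one excessive object, which is the tension between C and E that the experiments in this section quantify.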
Figure 5.17 illustrates the effect of the grid cell size on these three parameters for range queries, and Figure 5.18 illustrates how these values change in the three KNN query processing algorithms (note that the Y-axis values are on a logarithmic scale). All the above experiments are performed on our real-world dataset. As expected, the uniform dataset produced trends very similar to the real-world data (both for the above experiments and for the rest of our experiments in this section). Therefore, we only report the results for the three remaining datasets.

Obviously, it is desirable to minimize all three factors C, E and T simultaneously to achieve optimal performance. However, as discussed before, small values of C are achieved with very coarse grids, which clearly increases E (i.e., false positives) and hence the overall response time for both range and KNN query processing. Alternatively, a small value of E can only be guaranteed with very fine-grained cells (i.e., small cell sizes), which comes at the cost of a quadratic increase in t_range and also a quadratic increase in the t_read and t_range terms of t_progressive. However, shrinking the cell size only increases the t_range term for the Hilbert algorithm, which hence incurs up to two orders of magnitude less overall response time compared to the progressive algorithm (see the derivations of t_range, t_progressive and t_Hilbert in Section 3.3.3). Furthermore, observe that, as discussed in Section 3.3.3.2, the exponential growth of the examined region in KNN-Hierarchical results in a huge increase in C and E, so that it takes significantly longer to execute KNN queries compared to the other two algorithms. Therefore, for the rest of our experiments, we only report the efficiency of the range algorithm and the two superior KNN Algorithms 7 and 8.

Observe that if t_c/t_o = 1, the cell size at the intersection of C and E would correspond to T’s global minimum. However, for our experimental setup, we obtained t_c/t_o ≈ 35.
Utilizing Theorem 1, the optimal cell size is approximately 1/61 for n = 40000. As Figures 5.17 and 5.18 illustrate, the measured values of T confirm our analytical derivation of the optimal cell size (whose reciprocal is the number of grid cells per axis). Hence, the cell size can be chosen in advance using Theorem 1. When dealing with our first three datasets, we set the cell size to 1/64 for our remaining experiments.

In our next sets of experiments, we evaluated the performance of the progressive and Hilbert algorithms on the highly skewed and sparse datasets. Figure 5.20 (a and b) illustrates the results for the skewed dataset. Although T increases in both algorithms, Algorithm 8 still significantly outperforms Algorithm 7. Also, the average number of cells queried by the progressive algorithm (compared to the real-world dataset) grows twice as fast as that of the Hilbert algorithm. This confirms the superiority of Hilbert indexing for non-uniform datasets. Figure 5.20 (c and d) shows how our KNN algorithms perform on the sparse dataset. Setting n = 4000 in Theorem 1 yields a cell size of approximately 1/19, which conforms to our empirical local minimum of T at a cell size of 1/16. Furthermore, compared to the real-world dataset, the progressive and Hilbert algorithms incur 15% and 10% increases in C, respectively, along with a one order of magnitude increase in T. This is expected given the derivations of t_progressive and t_Hilbert from Section 3.3.3. Performing the same experiments for Algorithm 6 demonstrates similar behavior, which is not shown here.

[Figure 5.19: Relative Overhead Reduction of Secure Padding for the Datasets of Figure 5.16. Panels: (a) 90% (cut-off value 18), (b) 78% (cut-off value 23), (c) 89% (cut-off value 88), (d) 61% (cut-off value 7)]

5.3.4 Choosing the Optimum Cut-off Value

Using the values of t_o and t_c derived in Section 5.3.3, we proceed to compute the optimal cut-off value for each distribution. Figure 5.19 illustrates the padding overhead for different cut-off values. It also illustrates how our secure padding scheme reduces the padding overhead compared with naive padding without compromising security.
For each distribution, we choose the cut-off value that results in the smallest padding overhead. For instance, setting the cut-off value to 18 results in a 90% reduction in the cost of padding compared to the naive padding approach for our real-world dataset.

5.3.5 The Effect of K

We proceed to examine the effect of K on the KNN query processing time T. As Figure 5.21 illustrates, the values of C, E and T increase linearly with K for both algorithms. As expected, the Hilbert method outperforms the progressive method for all values of K (see Section 3.3.1).

[Figure 5.20: Effect of the grid cell size for the Skewed (top) and Sparse (bottom) Datasets. Panels: (a) C and E, (b) Time, (c) C and E, (d) Time]

5.3.6 The Effect of Hilbert Curve Order

As our next set of experiments, we analyze how our Hilbert-based KNN algorithm behaves for different values of the Hilbert curve order (i.e., N). Figure 5.22 (a) shows how an increase in N quickly reduces the value of C, which shows the curve’s effect in more effectively finding objects adjacent to the query point. However, as the granularity of the Hilbert curve moves beyond the granularity of the underlying grid cell, the Hilbert nearest neighbor approximation reaches its maximum accuracy and stays constant. Similarly, an increase in N quickly reduces the response time due to the smaller size of the safe region. However, a further increase in N increases the t_read term of t_Hilbert (which grows with both K and N) up to the point where it is dominated by the t_range term in the overall query processing time.

[Figure 5.21: Effect of K for the Real-World Dataset. Panels: (a) C and E, (b) Time]

[Figure 5.22: Effect of N on C (left) and Time (right) for the Real-World Dataset Using Algorithm 8. Panels: (a) C, (b) Time]

5.3.7 End-to-End Performance

So far we focused on the performance of our private spatial query processing schemes in the absence of PIR. We now measure the overall response time of our framework, which includes the overhead t_pir introduced by Algorithm 9 for private evaluation of the above queries.
Figures 5.23 (a,b) and (c,d) illustrate the effect of PIR overhead on the end-to-end performance for 1000 randomly distributed range and KNN queries, respectively. Note that response time is inversely proportional to the cache size (represented in terms of the number of objects), and thus a larger cache reduces the overall query processing time due to fewer private reads and less frequent reshuffling. Figure 5.23 also shows how the overall query processing time is dominated by the relatively expensive PIR modules for all three algorithms.

The superior performance of Algorithm 7 over Algorithm 8 might initially look counterintuitive, because Figure 5.18 (b) showed how Hilbert-based KNN query processing outperformed the progressive expansion in the absence of PIR. While Algorithm 8 very efficiently computes the safe region, Algorithm 7 spends significantly more time to compute its (slightly smaller) safe region, which includes fewer cells. However, adding PIR greatly increases the time to retrieve a cell due to the cost of reshuffling. Therefore, the cost of reading extra cells is significantly higher than in the non-PIR case. A careful examination of Figure 5.18 (a) reveals that Algorithm 8 retrieves more cells than Algorithm 7, which negatively affects t_Hilbert compared to t_progressive. In other words, the shuffling time imposed by the PIR routine outweighs the savings from the quick identification of the safe region by Algorithm 8. We finally note that if approximate results were needed, no safe region would have to be computed and Algorithm 8 would still significantly outperform Algorithm 7.

5.3.8 Comparing with Other Approaches

In Section 3.5, we analytically showed the superiority of our approach in satisfying significantly more stringent privacy guarantees compared to the cloaking/anonymity-based approaches.
However, an empirical comparison between our techniques and these studies is not straightforward, because anonymity and cloaking approaches mostly evaluate performance based on the size of the K-anonymity set or the cloaked region and the effectiveness of the anonymization techniques used. More importantly, contrary to our approach, the level of privacy attained through anonymization highly depends on the number of users as well as their distribution. Therefore, in the following two sections, we only consider the PIR and transformation-based approaches for an empirical comparison.

[Figure 5.23: End-to-end Performance for Range (a,b) and KNN (c,d) Algorithms. Panel labels: (a) = 64, (b) = 128, (c) = 64, (d) = 128]

5.3.8.1 PIR-Based Approaches

We proceed to compare our approach with the recent work of Ghinita et al. [GKK+] which, similar to our work, utilizes PIR to evaluate 1NN queries while ensuring perfect privacy. Figure 5.24a summarizes the differences between our progressive and Hilbert algorithms and the exactNN algorithm proposed by [GKK+] in terms of the communication and computation cost, as well as the number of excessive objects disclosed to users. The results show that our algorithms outperform the proposed exactNN algorithm in the amount of communication and computation. This is because, by avoiding a secure coprocessor, the theoretical PIR protocol used in [GKK+] incurs significantly more computation cost compared to hardware-based PIR techniques. Moreover, in order to provide a fair comparison, we reduced our CPU clock to a tenth of its original value to simulate the slower SC speed and used a very moderate cache (32Kb) with M = 64. Finally, all three methods disclose roughly the same number of extra objects to clients while evaluating 1NN queries.

[Figure 5.24: Comparing with Other Approaches. Panels: (a) PIR-Based, (b) Transformation-Based]
Note that the exactNN algorithm proposed in [GKK+] is only applicable for K = 1, and thus we were unable to further compare our approach with [GKK+] for range queries and for KNN queries with K > 1.

5.3.8.2 Transformation-Based Approaches

We also compare our work against the two transformation-based approaches discussed in Section 6.3. We compare the communication cost and privacy guarantees of our approach with those of SpaceTwist [YJHLb] and the Dual Hilbert Curve technique of [KS], denoted by DHC. Since neither approach provides the end-to-end computation cost, we were unable to compare the three approaches based on query processing time. However, we anticipate our approach to be clearly more computationally intensive due to the high cost of private record retrieval.

Similar to SpaceTwist, we used the dataset of 172,188 school locations from USGS¹ to compare our approach against SpaceTwist and DHC. We generated 100 random KNN queries and studied the effect of varying K on the communication cost, measured in terms of the total number of points transmitted to the client. Our observations are illustrated in Figure 5.24b. For all four techniques, larger values of K linearly increase the communication cost. Although the communication costs of our approaches are comparable, DHC and SpaceTwist both outperform our PIR-based approaches. However, it is important to note that this is achieved at the cost of generating approximate responses and leaking query location information (see Section 6.3). In fact, for both techniques, we used their recommended parameter settings to optimize performance (vs. privacy or result accuracy). For instance, requiring SpaceTwist to return exact results in the above experiment results in a ninety-fold increase in the size of the result set.

5.4 Oblivious Index Navigation

In this section, we study the security and performance characteristics of our object permutation approach (OP for short) from Chapter 4.
We analytically studied the security and complexity of the PU method and discussed its restrictions in Section 4.3.2. Here, we conduct numerous experiments to evaluate OP with regard to (i) the effects of skewness, node capacity and range of the original frequency domain values on the frequency distribution of the permuted tree R', (ii) the flatness of the permutation and (iii) the overhead in terms of the server’s load (measured by the number of page accesses) and the client’s response time compared against the query processing cost on the original tree R. Table 5.2 summarizes our varying parameters (default values are boldfaced).

¹ http://geonames.usgs.gov/index.html

5.4.1 Experimental Setup

Our experiments are performed on two real-world datasets obtained from Navteq². The first dataset includes 50000 restaurant locations in the United States and the second dataset covers 10000 restaurants in the city of Los Angeles. These datasets are shown in Figure 5.25, denoted by US and LA, respectively.

[Figure 5.25: Datasets. Panels: (a) US, (b) LA]

As we stated earlier, our goal is to make it impossible for the server to deduce information about POI data from their respective node access frequencies in the permuted tree R'. While the physical distribution of objects in space affects the R-tree structure, it does not directly contribute to the histogram of node access frequencies in the original tree R. In other words, the effectiveness of our frequency flattening strategy is directly affected by variations in POI data popularity (i.e., access frequencies). We were unable to obtain a real-world dataset of access frequencies for POI data. Therefore, we assigned access frequencies to each object of our datasets from a uniform and a Zipfian distribution with varying degrees of skewness s, where 0.5 ≤ s ≤ 5 determines the frequency variations among objects. These numbers simulate access frequencies to each node of R in T number of queries.
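One way to assign such synthetic access frequencies is sketched below. The exact parameterization used in the experiments is not given in the text, so this is an assumption: the object of popularity rank r receives a share of the T queries proportional to 1/r^s, and the function name zipf_frequencies is hypothetical.

```python
def zipf_frequencies(num_objects, s, total_queries):
    """Split `total_queries` accesses over objects with Zipfian skewness s.

    The object with popularity rank r (1 = most popular) gets a share
    proportional to 1 / r**s. Fractional frequencies are kept as floats.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, num_objects + 1)]
    norm = sum(weights)
    return [total_queries * w / norm for w in weights]


if __name__ == "__main__":
    for s in (0.5, 1, 5):
        freqs = zipf_frequencies(10_000, s, total_queries=100_000)
        # Share of all accesses that goes to the single most popular object.
        print(s, round(freqs[0] / 100_000, 4))
```

With larger s, a handful of objects absorb almost all accesses while the vast majority share nearly identical, very small frequencies.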
We used the R*-tree [BKSS] code from R-tree Portal (rtreeportal.org) with the default settings (70% fill factor and cache size of 128) on a machine with a disk page size of 4096 bytes to index the objects of our two datasets. Experiments were written in Java and were run on an actual LAN with an Intel Core 2 Quad CPU 2.40 GHz with 3.25 GB of RAM acting as the server and an Intel Pentium 4 with 2.00 GB of RAM acting as the client. We used sockets to enable the actual client-server communications over the TCP/IP protocol and to measure the overall end-to-end response time of processing queries. We employed DES for symmetric key encryption/decryption of the nodes and SHA512 for one-way functions and pseudorandom number generation. Therefore, all client requests are first encrypted before being transferred to the server through the TCP/IP protocol.

² http://www.navteq.com

Table 5.2: Experiment Parameters

Parameter                        Values
s: frequency skewness            0.5, 1, ..., 4.5, 5
c: node capacity                 30 (LA), 40, 50 (US), 60, 70
U_i: original frequency range    50, 60, 70, 80
query selectivity                5%, 15%, 25%

In reality, various factors (such as the number of POI data, the R-tree fill factor and the node capacity) will cause R to have several nodes that are only partially full. Therefore, we first construct the tree with a realistic fill factor of less than 100% and then pad the tree nodes according to the method described in Section 4.5 before performing the object permutation. To study the effects of other factors on the construction of R, we varied the skewness (s), node capacity (c), maximum original object frequency U_i and query selectivity according to Table 5.2, where U_i defines the range {0, ..., U_i} from which original node frequencies are obtained and the selectivity is defined as the ratio of the result set size over the dataset size. We used the guidelines in [Gut, BKSS] to derive acceptable ranges for the node capacity c.
[Figure 5.26: Skewness: {s, l, f_max, f'_max} values for US. Panels: (a) 1, 0, 0.5K, 6.4K; (b) 5, 0, 0.08K, 1.5K; (c) 1, 1, 10K, 321K; (d) 5, 1, 2.4K, 76K]

[Figure 5.27: Skewness: {s, l, f_max, f'_max} values for LA. Panels: (a) 1, 0, 0.3K, 2.1K; (b) 5, 0, 50, 513; (c) 1, 1, 3.3K, 63.1K; (d) 5, 1, 0.81K, 15.2K]

5.4.2 Frequency Skewness

In our first set of experiments, we vary the skewness of the original object frequencies. Since f_{i,j} values are computed as the summation of their children’s frequencies, they can remarkably affect the N_{i,j} access frequencies in F_i. As we discussed before, our goal is to get as close as possible to an optimal solution where all node access frequencies in the permuted tree are equal to the average node access frequency F'_i. Figures 5.26 and 5.27 illustrate the effect of the original object frequency skewness on the resulting histogram F'_i for our US and LA datasets, respectively. Due to the large number of varying parameters (skewness and tree depth), we have only illustrated the results for two values of skewness, s = 1 and s = 5, and for the lowest two levels l of the tree (l = 0 and l = 1).

To clearly visualize the frequency variations in f_{i,j} and f'_{i,j}, we hereafter normalize node frequencies (Y axis) by dividing the f_{i,j} and f'_{i,j} values by their maximum values f_max and f'_max (in T number of queries) and denote these normalized frequencies by f and f', respectively. This allows us to plot small variations among f_{i,j} and f'_{i,j} values that are not visible without frequency normalization. The X axis represents individual nodes of tree depth i in R and R' unless otherwise stated. Therefore, f = 0 denotes an empty node in R. Also, the horizontal line y = 1, denoted by “uniform”, shows the optimal case where f' = F'_i. Therefore, the optimality of a solution can be interpreted as its closeness to the horizontal line. Several observations can be made from Figures 5.26 and 5.27.
First, the maximum node access frequencies in R' (f'_max) are up to c times higher than those of R (f_max). Note that average node access frequencies can be directly computed according to the corollary of the aggregate access frequency property in Section 4.3.2. Second, increasing s in fact has a positive effect on achieving uniformity in f' values. This is because more skewed distributions assign similar frequencies to a larger number of nodes, which makes it easier to level frequencies during the permutation. This results in an approximately 75% reduction in f'_max values from s = 1 to s = 5. We also repeated this experiment replacing the Zipfian with a uniform distribution, and the results were very similar to the above experiment for small values of s (not shown here). Finally, in all cases f' closely follows the uniform distribution. We defer the security analysis and optimality considerations of each experiment to Section 5.4.5.

5.4.3 Node Capacity

We proceed to study the effect of node capacity c on uniformity. In Section 4.4, we proved several important properties of the resulting tree R' and how they are affected by c. Since choosing a small (large) c results in a very deep (wide) tree, for each dataset we picked a range of reasonable values that result in a well-balanced tree [Gut, BKSS]. We study how node access frequencies are affected by the choice of c for our two datasets using the default values for the other parameters. Figures 5.28 and 5.29 illustrate the results of these experiments. As c increases, with both datasets, the resulting f' values become less jagged and more plateaus start to form. This is because increasing c allows a more uniform assignment of f' values to the destination placeholders and hence more nodes end up with the same f' values.

[Figure 5.28: Capacity: {c, f_max, f'_max} values for US. Panels: (a) 50, 1.4K, 25.7K; (b) 55, 1.5K, 23.3K; (c) 60, 1.7K, 21.5K; (d) 65, 1.8K, 20K; (e) 70, 2K, 18.4K]
This observation also justifies why the values of f'_max decrease slowly as c grows. However, more uniformity comes at a cost. This cost is hidden in the query processing overhead for larger redundancy sets. We study these privacy/efficiency trade-offs in more detail in Sections 5.4.5 and 5.4.6. While we only showed the results for the leaf nodes (l = 0), very similar trends were observed for higher tree nodes. To save space, for the remainder of this section we only focus on leaf nodes.

[Figure 5.29: Capacity: {c, f_max, f'_max} values for LA. Panels: (a) 30, 0.8K, 8.5K; (b) 35, 1K, 7.4K; (c) 40, 1.1K, 6.5K; (d) 45, 1.2K, 5.8K; (e) 50, 1.3K, 5.4K]

5.4.4 Frequency Domain Range

In this set of experiments, we vary the range of initial frequencies assigned to each POI data from 0 to U_i for different values of U_i to study how this range affects the values of f'. The major observation from this experiment (Figures 5.30 and 5.31) is that enlarging the original domain range does not significantly affect the overhead, and f'_max values are always less than c times f_max. This observation can be explained by the fact that an increase in U_i uniformly increases the original (and hence the permuted) node frequencies. Furthermore, the effect of increasing U_i for the LA dataset is slightly higher than for the US dataset, since the default node capacity for LA is smaller than that of US, which reduces the number of placeholders (nodes) Algorithm 1 can use to flatten f values.

[Figure 5.30: Frequency Range: {U_i, f_max, f'_max} values for US. Panels: (a) 50, 1.4K, 25.8K; (b) 60, 1.7K, 30.7K; (c) 70, 2K, 35.8K; (d) 80, 2.3K, 40.9K]

[Figure 5.31: Frequency Range: {U_i, f_max, f'_max} values for LA. Panels: (a) 50, 0.8K, 8.5K; (b) 60, 1K, 10.2K; (c) 70, 1.2K, 11.9K; (d) 80, 1.3K, 13.6K]

5.4.5 Security Analysis

So far we visually showed how various parameters affect the uniformity of the f' values. In this section, we use the standard deviation to quantify the uniformity of node frequencies with respect to s, c and U_i.
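As a small illustration of this measure, the sketch below computes the standard deviation of node frequencies after normalizing them by their maximum, following the normalization introduced in Section 5.4.2. Whether the reported values were computed on normalized or raw frequencies is not stated in the text, so the normalization step is an assumption, and the function name normalized_std is hypothetical.

```python
import statistics


def normalized_std(node_frequencies):
    """Population standard deviation of frequencies scaled to [0, 1].

    A perfectly flat histogram (all nodes accessed equally often) yields 0;
    larger values indicate a less uniform access pattern.
    """
    f_max = max(node_frequencies)
    normalized = [f / f_max for f in node_frequencies]
    return statistics.pstdev(normalized)


if __name__ == "__main__":
    print(normalized_std([5, 5, 5, 5]))      # flat histogram
    print(normalized_std([3, 3, 3, 4]))      # nearly flat
    print(normalized_std([1, 2, 3, 4]))      # clearly non-uniform
```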
We repeated the above experiments, this time measuring the standard deviation of the resulting node frequencies. The results are summarized in Figures 5.32a to 5.32d. As discussed in Section 5.4.2, increasing skewness has a positive effect on uniformity, and this is verified by the decreasing values of the standard deviation in Figure 5.32a. Also, as we saw in Section 5.4.2, the larger dataset (US) results in higher values of f'_max, which explains the relatively higher standard deviation for the US dataset. To understand the trends of the standard deviation in Figures 5.32b and 5.32c, note that it generally increases with the number of empty nodes. As node capacity c grows, the number of empty nodes quickly increases, which in turn increases the standard deviation. However, as soon as c passes a certain threshold, the height of the tree R (and hence R') is decremented by one, which suddenly reduces the number of empty cells. This restructuring of R happens at c = 55 and c = 45 for our US and LA datasets, respectively. Finally, as U_i is increased in Figure 5.32d, node frequencies start to deviate more from each other, which in turn increases the standard deviation of their permuted nodes’ frequencies.

The nonzero values of the standard deviation in the above experiments indicate that perfect uniformity is not achieved during the object permutation. However, the frequency anonymity property ensures that this nonzero value is not generated by a few outlier frequencies. In our experiments, we observed large groups of nodes with very similar frequencies. More importantly, even the nodes with slightly larger frequencies cannot be related to more popular nodes in the original tree, as the permutation has fully destroyed the original object assignments to nodes. Finally, as shown in Figures 5.26 through 5.29, the variation among values of f' is always significantly lower than that of the original f values.
This is illustrated by how close the f' values are to the horizontal line representing the optimal solution.

[Figure 5.32: Effects of each parameter on the standard deviation. Panels: (a) vs. s, (b) vs. c: US, (c) vs. c: LA, (d) vs. U_i]

Table 5.3: Response Time

c     Selectivity   t(R)    t(R')     pa(R)   pa(R')
30    5%            0.31    27.79     24      768
30    15%           0.39    73.93     77      2464
30    25%           0.50    129.32    151     4832
35    5%            0.39    27.46     25      925
35    15%           0.50    81.15     78      2886
35    25%           0.84    145.54    137     5096
40    5%            0.42    30.84     19      798
40    15%           0.45    77.70     58      2436
40    25%           0.56    131.53    125     5250
45    5%            0.39    29.15     34      1598
45    15%           0.47    79.61     52      2444
45    25%           0.59    136.42    89      4183
50    5%            0.39    30.65     23      1196
50    15%           0.46    82.07     65      3380
50    25%           0.56    135.75    105     5460

5.4.6 End-to-End Response Time

In our final set of experiments, we study the overhead of our proposed technique with regard to the end-to-end response time at the client side for performing spatial queries. We generated 1000 random range queries of varying selectivity and capacity c and measured how they affect the server’s overhead (measured in number of page accesses) and the client’s response time (measured in milliseconds). Table 5.3 summarizes the results for our LA dataset. We denote by pa(R) and pa(R') the total number of page accesses while processing range queries on R and R', respectively. Similarly, t(R) and t(R') represent the response time of processing queries on R and R', respectively. Therefore, the differences between the pa(R), pa(R') values and between the t(R), t(R') values indicate the overhead of our proposed approach against the original R-tree.

Several observations can be made from Table 5.3. The performance of our object permutation scheme is consistent with the theoretical analysis in Sections 4.3.2 and 4.4. As expected, the response time for R' increases linearly with selectivity. However, varying c does not significantly affect t(R').
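The page-access overhead in Table 5.3 can be checked with simple arithmetic. The sketch below copies the pa(R) and pa(R') columns from the table and computes their ratio for each node capacity; with these values every ratio exceeds c and rounds to c + 2. The variable and function names are hypothetical.

```python
# Page-access counts copied from Table 5.3 (LA dataset):
# node capacity c -> [(pa(R), pa(R')) for selectivity 5%, 15%, 25%]
TABLE_5_3_PAGE_ACCESSES = {
    30: [(24, 768), (77, 2464), (151, 4832)],
    35: [(25, 925), (78, 2886), (137, 5096)],
    40: [(19, 798), (58, 2436), (125, 5250)],
    45: [(34, 1598), (52, 2444), (89, 4183)],
    50: [(23, 1196), (65, 3380), (105, 5460)],
}


def page_access_ratios(table):
    """Return, for each capacity c, the list of ratios pa(R') / pa(R)."""
    return {
        c: [pa_permuted / pa_original for pa_original, pa_permuted in rows]
        for c, rows in table.items()
    }


if __name__ == "__main__":
    for c, ratios in page_access_ratios(TABLE_5_3_PAGE_ACCESSES).items():
        print(c, [round(r, 2) for r in ratios])
```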
More importantly, while $pa(R')$ incurs slightly more than a $c$-fold increase over the original $pa(R)$ values (due to the retrieval of auxiliary nodes and $c$ redundant nodes per node access), $t(R')$ has a higher overhead due to the encryption/decryption delays in addition to transferring redundant nodes and auxiliary pointers to the client. However, $t(R')$ still remains below a few hundred milliseconds for all values of $c$ and selectivity. Slightly higher results with similar trends were obtained for our US dataset.

Chapter 6
Related Work

To protect against various location privacy threats while using LBS, the research community has proposed a wide range of approaches to protect the privacy of users while interacting with potentially untrusted location servers. In this chapter, we present a taxonomy of the dominant approaches proposed for the location privacy problem. In one classification, these approaches can be divided into Anonymity/Cloaking, Transformation, Private Information Retrieval and Oblivious Tree Traversal techniques.

6.1 Anonymity and Cloaking

Inspired by anonymization techniques in privacy-preserving data mining, a large body of work in location privacy is based on the concept of $K$-anonymity or location cloaking [GG, MCA, GL, KGMP06, BLPW08, BS03]. With this approach, a trusted anonymizer blurs raw user locations by (for example) extending them from a point location to an area (spatial extent) and sending a region containing several other users to the untrusted server. However, aside from the well-known privacy issues of anonymization in data mining [WFY07, QLW08, TD09], location anonymization and cloaking suffer from several drawbacks. First, all users have to trust the anonymizer with their private location information, although anonymizers can be as sophisticated as the location servers.
More importantly, there are certain scenarios in which the private location information of users leaks to malicious entities [KGMP06] or the cloaking process fails for certain user distributions or privacy preferences [BLPW08]. Furthermore, the quality of service or overall system performance degrades significantly as users choose stricter privacy preferences [KS]. For example, if the user requires stronger $K$-anonymity, the system needs to increase $K$ for that user, which results in a larger cloaked area and hence a less accurate query response. Alternatively, to maintain the quality of service, the location server has to resolve the spatial query for each and every point in the cloaked region and send the entire bulky result to the anonymizer to be filtered. This affects the overall system performance, communication bandwidth and server throughput, and results in more sophisticated query processing. Finally, users have to interact with the anonymizer for every query during the system's normal mode of operation.

To alleviate the drawbacks of centralized cloaking, some studies propose a decentralized approach to constructing the cloaking region. The most notable work is [GKS07], which utilizes a hierarchical overlay network resembling a distributed B+ tree for constructing the cloaked region. However, aside from very slow response times, such approaches assume that all users trust one another and can communicate with each other in real time to construct the cloaking region, both of which are impractical assumptions for real-world scenarios.

6.2 Space Transformation

Similar to our space transformation technique detailed in Chapter 2, a relatively new class of transformation-based approaches avoids some of the shortcomings of cloaking-based techniques.
Most recently, [YJHLb] proposed a framework termed SpaceTwist to blind an untrusted location server by incrementally retrieving points of interest in ascending order of their distance from a transformed query point, termed the anchor point, which is a fake location near the query point. We note that while [YJHLb] mitigates some of the shortcomings of anonymity and cloaking-based approaches, it introduces a new set of drawbacks. First, it is only practical when approximate results are desired, as providing exact answers results in severe leakage of query location information in [YJHLb]. Furthermore, SpaceTwist relies on a query processing technique only suitable for proximity queries such as KNN and does not discuss range queries. Finally, achieving perfect secrecy in the absence of PIR is an open problem even when approximate answers are sufficient, and space transformation techniques are no exception.

Space-filling curves in general and Hilbert curves in particular have been proposed for range and nearest neighbor query evaluation. For example, [FR89] evaluates the efficiency of Hilbert curves in terms of the number of disk accesses for range and nearest neighbor queries. Range queries on images encoded by Hilbert curves are also studied in the image processing community: Chung et al. [CTH00] studied the problem of answering range queries on Hilbert-scan-based compressed images. However, what distinguishes our work from the above studies is our stringent privacy constraints; we want to evaluate spatial queries blindly (Section 2.2). In other words, our main challenge is protecting the very same piece of information (i.e., location information) that is needed by a potentially adversarial entity (e.g., the location server) to effectively respond to location queries.
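For readers unfamiliar with Hilbert curves, the following Python sketch shows the textbook conversion from a 2-D grid cell to its 1-D position along the curve. It is only an illustration of the locality-preserving mapping discussed above, not the transformation of Chapter 2; the function name `hilbert_index` and the grid size are assumptions made for this example.

```python
def hilbert_index(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its
    position along the Hilbert curve, using the standard bitwise method."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the cells of a 4x4 grid by their Hilbert value.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: hilbert_index(4, c[0], c[1]))
print(cells)
```

Cells that are consecutive along the curve are always neighbors in the grid, which is the locality property that makes such curves useful for range and nearest neighbor queries.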
6.3 Private Information Retrieval

Recently, Private Information Retrieval (PIR) techniques have been proposed to achieve high degrees of privacy while responding to location-dependent queries by entirely blinding the server from learning any information about which records are being accessed. The use of PIR to enable location privacy is also proposed by two other studies. Ghinita et al. [GKK+] utilize computational PIR to enable private evaluation of first nearest neighbor queries. Although this work does not extend to KNN or range queries, we have compared our PIR framework with [GKK+] for 1NN queries in Section 5.3. Due to its hardware independence, the PIR protocol used in [GKK+] incurs a much higher processing penalty than our hardware-based PIR approach. In fact, processing each private read requires a linear scan of database items at the server. Furthermore, the underlying PIR scheme incurs a very high communication cost for each object retrieval. Also, Hengartner [Hen] presents an architecture that uses PIR and trusted computing to protect user locations from an untrusted server. However, the proposed architecture is not yet implemented.

6.4 Oblivious Tree Traversal

As we stated in Chapter 4, the bulk of existing work on location privacy does not consider access privacy and the attacks associated with it. Therefore, these approaches mostly fall short of protecting user location confidentiality when the access frequency of indexed data is available to the untrusted server. The server can combine the information gathered from analyzing the frequency of nodes requested during query processing with its prior knowledge about the indexed objects to deduce user locations with high probability. This approach is also referred to as the correlation attack.
While PIR-based approaches to location privacy are resilient to correlation attacks, they incur very high computation and communication costs [GKK+] or rely on trusted hardware with severe computation and storage limitations [KSSM10].

Perhaps closest to our oblivious index traversal work, which aims to thwart such correlation attacks, are the techniques proposed by Lin and Candan in [LC04, LC05]. Although the focus is on traversing XML and other structured documents rather than performing spatial queries, the proposed approaches can be applied to our setup. The authors devise schemes that use a redundancy set with a node swapping technique to obfuscate tree navigation. With both techniques, the complexity of tree traversal is $O(dm)$, where $d$ and $m$ represent the tree height and the redundancy set size, respectively.

However, two fundamental differences between our approach and the techniques discussed in [LC04, LC05] are the read-only nature of our protocols and the relaxed burden on the client side. With our approach, clients iteratively perform read, decrypt and request cycles with the server. In contrast, even a read request for a certain data element in [LC04, LC05] requires clients to write to or modify the underlying tree structure for each node access during the index traversal. This requires a concurrency control mechanism to maintain the integrity of the tree while avoiding deadlocks. Moreover, the read/write nature of client operations during the index traversal and the locking mechanism both increase client response time when the server is interacting with multiple clients concurrently.
Finally, while [LC05] assumes that the query load is uniformly distributed among nodes and [LC04] does not consider exact original node access frequencies as the server's prior knowledge, we assume the server is aware of the non-uniform query access frequencies as prior knowledge and devise schemes that prevent the server from gaining any useful knowledge from access frequencies to our proposed privacy-aware indexes.

Another relevant study is the recent work of Williams et al. [WSC08], which employs Oblivious RAM (ORAM) to enable private retrieval of a data item. The novel reshuffling protocol proposed improves the costly computation complexity of an ORAM protocol and yields an amortized cost of $O(\log N \log\log N)$ per query. There are three key differences between this work and our proposed techniques. First, [WSC08] places significant demands on the client with regard to storage (an $O(\sqrt{N})$ temporary client-side storage) and computation (the construction of an encrypted Bloom filter, an $O(N \log\log N)$ oblivious scramble algorithm, etc.). Moreover, the proposed ORAM protocol incurs an expensive communication cost: an $O(N \log\log N)$ shuffling coupled with several round trips for online query processing results in response times of up to 100 seconds for some queries on a dataset of 100 MB (similar in size to ours), even under a simulated network latency setup. Lastly, the ORAM protocol is designed to retrieve a single data item. Using it in our LBS setup to process range or KNN queries requires an index traversal, which results in the retrieval of multiple items, further exacerbating the response time.

Chapter 7
Conclusions

In this dissertation, we identified the key underlying impediment in achieving privacy while evaluating spatial queries and introduced novel frameworks for protecting user privacy in location-based services.
We first developed a space transformation technique that allows fast computation of spatial queries in a transformed space, unknown to the untrusted server. With our second approach, we devised a framework based on the theory of Private Information Retrieval to convert a spatial query into a sequence of completely private retrievals from the encrypted index hosted at the untrusted server. Finally, we proposed a novel spatial index traversal scheme that allows spatial queries to be privately evaluated by obliviously navigating the underlying encrypted index structure. We have performed extensive sets of experiments to obtain a deeper insight into the privacy guarantees of our approaches and the overhead they introduce to conventional query processing in the absence of PIR. In a broader context, we contributed several approaches that occupy different points on a privacy/efficiency spectrum, allowing for adoption of the right technique depending on the real-world privacy and efficiency needs of users in location-based services.

As part of our future work, for our Space Transformation technique, we intend to explore new privacy-aware, locality-preserving space mappings that incur less overhead and allow blind evaluation of a wider class of spatial queries. Moreover, we focused on the Euclidean space in Chapter 2 and plan to extend our approach to network-based spatial queries. Similarly, for our PIR-based approach, we plan to study novel techniques for enabling location privacy that are less costly than PIR-based techniques while still providing the same level of privacy. Finally, while efficient for querying static data, our oblivious tree traversal techniques are not suitable for processing dynamic data. As part of our future work, we are investigating how to improve our proposed methods to deal efficiently with dynamic data.
Appendix A
Location Privacy in Emerging Social Applications

A.1 Introduction

We are witnessing the emergence of a new killer application at the crossroads of two popular paradigms: Location-Based Services (LBS) and Social Networking (SN). People make friends and create buddy lists in virtual worlds using social networking sites such as Facebook (http://www.facebook.com) and then use their mobile devices to locate their virtual buddies in the real world. Enabling this emerging application, however, has serious privacy ramifications. The users of mobile devices may be willing to reveal their locations and/or profiles to their buddies but not to the LBS+SN server and other users. Hence, the challenge is how to support buddy searches without revealing information to the server and other users in the system.

To illustrate the difficulty of this challenge, let us consider two extreme solutions. One solution is to first encrypt and then store all the users' location and profile information at a centralized server. The advantage of this approach is that all the frequent updates to users' locations (and profiles) will only be communicated to a single server. The disadvantage is that the server cannot support any querying/searching on the data efficiently because it is all encrypted. That is, for every client query, the entire database needs to be sent to the client for searching, resulting in a high query overhead. The other extreme solution is to eliminate the central server altogether and push the server's tasks to the clients. The problem with this approach is that either every update to a user's location needs to be sent to all the members of the user's buddy list (push) or the user needs to communicate with all of his buddies at query time (pull). Some studies try to make this approach more practical by searching only the area around the user via wireless broadcasting [CDM].
While this approach can find a user's buddies around him, it cannot answer whether the user has buddies beyond the short range of his cellphone's Bluetooth signal. Furthermore, the proposed techniques to protect location privacy in LBS [KGMP06, KS, GKK+, YJHLa] do not apply to this problem because they focus on searching for the location of static objects (e.g., hospitals), while here the buddies are very dynamic and continuously move in the environment. Meanwhile, the approaches to protect privacy in SN [CSM+, CDM, GLC+] do not work either, because they do not consider spatial query processing using an untrusted server and mainly focus on protecting a user from those not in his buddy list. Here, however, no entity beyond a user's buddy list is trusted, while our goal is to enable users to query their buddy lists privately.

In this chapter, we propose a framework, called Private Buddy Search (PBS), which protects the users' profiles and locations from both the server and other non-buddy users. Our approach strikes a compromise between the two above-mentioned extremes by storing users' aggregate information in various encrypted index structures at a centralized server and then pushing the querying to the clients. Hence, the clients utilize the encrypted index structures to retrieve only a small portion of the database related to the query area. Consequently, our approach benefits from the advantages of both worlds. First, all the location (and profile) updates are only sent to and stored (encrypted) in a single centralized server. Second, at query time, by securely communicating with the (untrusted) server, the client receives enough information to answer most typical LBS+SN queries.
Another major contribution of this chapter is that we discuss our approach as part of a complete end-to-end framework for PBS, which includes proposed cryptographic protocols to certify users, enable secure communication within groups, support various group operations (e.g., join, leave) and enable private spatial queries such as range and k-nearest neighbor (KNN) search.

To evaluate our PBS framework, we performed extensive sets of experiments measuring client and server overhead for various PBS operations and queries. The results confirm that by distributing the cryptographic and querying workload between the clients and the server, PBS supports very efficient interactions for a large number of mobile users.

The remainder of the chapter is organized as follows. Section A.2 sets out our trust and adversary models. Section A.3 reviews the key design decisions made in PBS and Section A.4 details PBS constructs that collectively provide strong user privacy. In Section A.5, we present PBS's spatial query processing techniques, whereas Section A.6 covers the client/server computation protocols which enable secure user interactions with the server and their peers. Our experimental evaluations are detailed in Section A.7 and Section A.8 reviews the security issues of PBS. Finally, Section A.9 surveys the related work and we conclude the chapter in Section A.10.

A.2 Trust and Threat Model

We model a social networking framework as a central location server LS (or server for short) and a set of users $U = \{u_1, u_2, \dots, u_n\}$. Each user carries a client device (e.g., cellphone, laptop or PDA) equipped with a positioning technology such as GPS, Wi-Fi or GSM. All users communicate with LS, which acts as a central repository for users' data and querying needs. For the rest of the chapter, by referring to a user, we imply his client device by which he communicates with LS or other users.

Figure A.1: The PBS Trust Model
Each user belongs to a group from a set of groups $G = \{g_1, g_2, \dots, g_m\}$. We assume users are partitioned into $m$ disjoint groups and defer the discussion of multiple group affiliations to Section A.8. We also define a user's buddy list or peers as all users belonging to his group. Users trust members of their buddy list with their sensitive information. This trust is established when a user invites another user to become part of his buddy list. We assume the trust relationship between two users is symmetric (see Figure A.1). Users are willing to share their current location and other non-spatial information with their peers and query their buddies' information if they so choose. Users trust neither anyone outside their buddy list nor the LS. We use the term adversary to refer to any such entity. Although users trust their peers, recent studies have highlighted that users might not be willing to share their location information even with their peers at certain times and prefer to maintain their "social boundaries" [CSM+]. Therefore, it is important to allow users to temporarily stop sharing their location information even with their trusted buddy lists.

We assume users do not trust the location server. However, we adopt an honest-but-curious behavioral model for the location server. That is, LS does not deviate from PBS protocols, but it is curious and will try to take advantage of any sensitive user data. This is a practical assumption in many disciplines such as database outsourcing [HL] and secure file sharing [KRS+].

To interact with the location server and other peers, each user $u_j$ creates a random pseudonym as his confidential identity. User $u_j$ also creates the asymmetric public/private key pair $u_j.pub$, $u_j.pri$ and securely stores $u_j.pri$ on his client device. The mapping between $u_j$'s real identity and his pseudonym is only revealed to $u_j$'s buddies during the group invitation process (Section A.6.1).
Therefore, while $u_j.pub$ is used to verify $u_j$ during his communication with the server and other peers, his pseudonym cannot be resolved to his real identity by an adversary. We use public key digital signatures to allow peers to efficiently authenticate each other and henceforth denote $u_j$'s Verifiable Pseudonym by $vp$.

Each user $u_j$ owns a profile denoted by $u_j.t$ which contains non-spatial attributes such as gender, age and self-description. We refer to $u_j$'s current position by the pair $\langle u_j.x, u_j.y \rangle$. To enable querying users' profile and location information within a group, any sharable user information has to be stored and maintained on the untrusted location server. In this chapter, we do not consider a peer-to-peer architecture with no central repository and assume the existence of a centralized server to resemble real-world social network and location-based service architectures. In addition, a P2P solution provides a limited set of features such as location discovery of peers within a user's proximity [CDM] and cannot efficiently support spatial queries over a user's buddy list. Finally, while P2P approaches enable two-party computation schemes such as "where is Bob now?", they cannot respond to queries such as "which one of my friends are now in New York?", which is a fundamentally different type of query also supported by PBS.

Figure A.2: (a) Object Space (b) Object List (c) Naive Encryptions (d) Aggregate (top) and Isolated (bottom) Indexing

While querying other peers, a user's communication with LS should not reveal sensitive information about either the identity or the location of the querying user or the user being queried to any adversary [KS]. Obviously, PBS or other privacy-enabling frameworks cannot protect users' information from being leaked to adversaries through other means such as physical observation.
A.3 Querying an Untrusted Server

In this section, we review the major design decisions we made in PBS to support two conflicting objectives, namely query efficiency and user privacy. We show how indexing aggregate data at the server side and shifting the query evaluation to the client side enable efficient yet private query processing. While we mainly focus on querying spatial data, non-spatial data can be queried in a similar manner.

A.3.1 Client Side Query Processing with Space-Driven Indexing

Storing encrypted user information in a central location server enables users to share and query location information of their buddies. However, as we discussed in Section A.1, the immediate effect of encrypting such information is that LS can no longer process queries efficiently. To address this drawback, we push query processing to the client side and demote the server's responsibility to a central storage and retrieval module for our encrypted indexes. However, client side query processing imposes certain restrictions on the choice of the partitioning used to index spatial data.

One of the most conventional ways of efficiently querying spatial data is to partition objects according to their distribution into sets of nearby objects and to maintain these sets in a tree structure. The most well-known example of such data-driven index structures [RSV00] is the R-tree index [Gut] and its variants, where each set is the MBR of a group of nearby objects. However, data-driven indexes are not viable solutions for our framework. First, these indexes maintain a hierarchical tree representation of objects at the server. Since the server is not trusted in our model, the encrypted tree which indexes the objects (MBRs) should be maintained by the clients and be communicated to them for query processing.
This approach requires users to go through several rounds of downloading, decrypting, modifying, re-encrypting and communicating tree nodes with the server for each query or location update request. More importantly, while efficient with static data, maintaining data-driven indexes is costly in the presence of highly dynamic data [YPK, KPH04]. Therefore, using encrypted data-driven indexing is not an attractive approach in our setting.

We avoid these drawbacks by using fixed grids, which are an example of space-driven index structures [RSV00]. With this class of indexes, objects are mapped to a certain cell independent of other objects and solely based on some geometric criterion. Knowing the grid granularity, users can directly query a certain region without having global knowledge of the object distribution. For instance, the client can quickly identify a set of cells that (partially or fully) overlap with his range query. This is a clear advantage over data-driven indexing in terms of the complexities of maintaining and querying a centralized encrypted index. While we use grids as our primary index structure to achieve efficiency, we stress that PBS can also utilize data-driven index structures such as quad-trees at the cost of more complex client side query processing and higher communication cost. We leave the design of an encrypted data-driven index with reasonable cost for decentralized maintenance as part of our future work.

A.3.2 Plain Indexing: Aggregation and Isolation

We showed how client side query processing and space-driven indexing are necessary to ensure privacy and efficiency. However, simply encrypting space-driven indexes in general, and grids in particular, does not guarantee privacy and efficiency. Consider the object distribution of Figure A.2a, where the server stores for each grid cell $C_{xc,yc}$ its enclosing objects as shown in Figure A.2b (for simplicity, we have only shown object identifiers).
To ensure privacy, we can encrypt each record as $e(C_{xc,yc}) : \langle e(u_1), e(u_2), \dots, e(u_r) \rangle$, where $e()$ represents an encryption. However, looking at the top of Figure A.2c, it is obvious that the server can roughly obtain the user distribution and mobility patterns from the size of each tuple. Alternatively, one can treat all objects in a cell as a whole during encryption. As illustrated in the bottom of Figure A.2c, this process results in the encrypted index $e(C_{xc,yc}) : \langle e(u_1, u_2, \dots, u_r) \rangle$. However, such encryption does not resolve the information leak, and it further increases the communication cost of each location update and buddy tracking request, as an entire row has to be queried, downloaded and decrypted by the client to access each object.

We address the security threats and inefficiencies associated with storing raw encrypted data by breaking the non-uniform encrypted index discussed above into an encrypted Aggregate Cell Index (ACI) and an encrypted Isolated Object Index (IOI). Both of these indexes are plain, meaning that all encrypted tuples in either index have the same size regardless of the object distribution. The plain structure of ACI is achieved by storing aggregate data for each cell, while the plain structure of IOI is achieved by indexing each individual object independently and in isolation from other objects. Figure A.2d illustrates a simplified example to show how breaking the object information into two plain indexes prevents an adversary from learning the object distribution. We defer more details regarding the ACI and IOI structures to Section A.4.2.

A.4 The PBS Framework

Figure A.3: (a) ACI and IOI Structures (b) PBS Architecture

We now proceed to provide more details about PBS. We introduce the concept of group keys and detail how our proposed secure server side indexes enable users to efficiently and privately query their peers in a social networking environment.
A.4.1 Group Keys

To ensure the privacy of users, PBS should support two privacy features: (i) enabling peers to execute location queries on their buddy list, and (ii) preventing an adversary from obtaining sensitive information about users during query processing. To achieve these goals, members of each group share a symmetric secret group key which enables users to query the current location or other information of users in their buddy list. For a group $g_i$, all communications between the members, as well as any sharable user information stored at LS, are encrypted by $g_i$'s group key, denoted by $k_i$. We use $e_{k_i}()$ and $d_{k_i}()$ to denote encryption and decryption of a value with $k_i$, respectively.

A.4.2 Server Side Indexes

In Section A.3.2, we briefly discussed how to distribute users' raw encrypted location data into an encrypted Aggregate Cell Index (ACI) and an encrypted Isolated Object Index (IOI) and presented simplified versions of these indexes. We now provide more details about these two encrypted indexes. We recognize two fundamentally different query types supported in PBS. Data-driven queries allow a user to query another user in his buddy list for his current location or other profile information. Space-driven queries such as range and KNN queries, on the other hand, allow users to query a region (as opposed to an object) for the presence of other peers. In PBS, space-driven queries are supported by ACI while data-driven queries are supported by IOI. Figure A.3a illustrates the ACI and IOI indexes stored at LS. Each tuple in ACI stores aggregate user information for each cell per group and is represented by

$ACI = \{ e_{k_i}(xc, yc, g_i), \langle e_{k_i}(g_i.cnt) \rangle \}$,

where $cnt$, or the object count, denotes the number of $u_j$'s peers co-located in $u_j$'s cell. Each object in PBS owns a record in IOI with the schema

$IOI = \{ e_{k_i}(xc, yc, g_i, u_j.rnk), u_j.vp, \langle e_{k_i}(u_j.x, u_j.y, u_j.t) \rangle \}$.
A user's rank $u_j.rnk$ is a sequence number assigned to each object denoting an ordering between peers of a group in each cell based on their arrival time. Therefore, $u_j.rnk \in \{1, \dots, cnt\}$. Note that the bold-faced columns are indexed and searchable. Aside from the above indexes, the server maintains each user's public key certificate in a Key Relation $R_K = \{ u_j.vp, \langle u_j.pkc \rangle \}$ for authentication purposes. Figure A.3b illustrates a global view of how private information is distributed between different PBS entities.

The benefits of breaking the object information into the ACI and IOI indexes are threefold. First, our two proposed indexes prevent adversaries from learning any information about the object distribution from the size of the encrypted indexed data. This property is achieved via the plain nature of the ACI and IOI indexes. Second, the ACI and IOI indexes efficiently support various types of queries. With space-driven queries, knowing the grid granularity, users first identify the cells which are likely to contain information about their buddies. Next, by learning each cell's aggregate information from ACI, they form a request packet, which is a set of requests for IOI tuples, each containing information about one of the user's buddies. For data-driven queries, users directly query their peers in IOI using their pseudonyms to locate or track them without knowing their current location. We detail query processing in Section A.5. Third, as we discuss in Section A.6, PBS supports a wide range of features that allow users to interact with their peers and other users in a social networking environment.

A.5 Private Spatial Queries with PBS

We now discuss how the ACI and IOI indexes enable private evaluation of space-driven and data-driven queries in a social network.
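The two-step lookup enabled by ACI and IOI can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the PBS implementation: a keyed HMAC tag stands in for the encrypted searchable columns $e_{k_i}(\cdot)$, payload encryption is omitted entirely, empty cells are simply absent from the toy ACI, the cell side length `cell_size` plays the role of the grid granularity, and all function names are hypothetical.

```python
import hashlib
import hmac
import json


def token(key, *fields):
    """Deterministic keyed tag standing in for e_k(...) on searchable columns."""
    msg = json.dumps(fields).encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def cell_of(x, y, cell_size):
    return int(x // cell_size), int(y // cell_size)


def build_indexes(key, group, users, cell_size):
    """Build toy ACI (cell tag -> count) and IOI (cell+rank tag -> record)."""
    aci, ioi = {}, {}
    for name, x, y in users:
        xc, yc = cell_of(x, y, cell_size)
        cell_tag = token(key, xc, yc, group)
        rank = aci.get(cell_tag, 0) + 1          # rank follows arrival order
        aci[cell_tag] = rank
        ioi[token(key, xc, yc, group, rank)] = (name, x, y)  # payload left in the clear here
    return aci, ioi


def range_query(key, group, rect, cell_size, aci, ioi):
    x1, y1, x2, y2 = rect
    lo, hi = cell_of(x1, y1, cell_size), cell_of(x2, y2, cell_size)
    cells = [(xc, yc) for xc in range(lo[0], hi[0] + 1)
             for yc in range(lo[1], hi[1] + 1)]
    # Round 1: ask ACI for the count of every overlapping cell.
    counts = {c: aci.get(token(key, c[0], c[1], group), 0) for c in cells}
    # Round 2: request one IOI tuple per (cell, rank) pair.
    req2 = [token(key, xc, yc, group, rnk)
            for (xc, yc), cnt in counts.items() for rnk in range(1, cnt + 1)]
    found = [ioi[t] for t in req2]
    # Client-side filtering of objects in overlapping cells but outside the range.
    return sorted(u for u in found if x1 <= u[1] <= x2 and y1 <= u[2] <= y2)


key = b"group-key-k_i"  # hypothetical shared group key
users = [("a", 1.0, 1.0), ("b", 2.5, 0.5), ("c", 7.0, 7.0), ("d", 3.9, 3.9)]
aci, ioi = build_indexes(key, "g1", users, cell_size=2)
print(range_query(key, "g1", (0, 0, 3, 3), 2, aci, ioi))
```

In this toy setup the server only ever sees opaque tags; without the group key it cannot tell which cells or ranks a request refers to.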
A.5.1 Range Queries

A range query $Range(R, g_i)$ allows a user $u_j$ of group $g_i$ to find the location of his peers in the rectangular region $R$ (circular ranges are approximated by their surrounding rectangles, with excessive results filtered later at the client side). Algorithm 11 illustrates how range queries are supported with PBS. Users first query ACI for aggregate information of the cells overlapping with $R$. Next, the server's response is decrypted and a second request for IOI tuples, each corresponding to one of the user's buddies in $R$, is formed. Processing each range query thus involves two rounds of client/server communication. These are the two steps in Algorithm 11 that read from LS.

Algorithm 11 Range Queries
Require: R, g_i, grid granularity {range, group and grid granularity info}
  for all C = <xc, yc> overlapping with R do
    req_1 ← req_1 ∪ e_{k_i}(xc, yc, g_i);
  end for
  res_1 ← LS.ACI[req_1]; {server processing req_1}
  for all T = <e_{k_i}(xc, yc, g_i), e_{k_i}(cnt)> ∈ res_1 do
    C<xc, yc> ← d_{k_i}(e_{k_i}(xc, yc, g_i)); cnt ← d_{k_i}(e_{k_i}(cnt));
    if (cnt ≠ 0) then
      U ← U ∪ C<xc, yc>; {find non-empty cells}
    end if
  end for
  for all C ∈ U do
    for (rnk ← 1; rnk ≤ C.cnt; rnk++) do
      req_2 ← req_2 ∪ e_{k_i}(xc, yc, g_i, rnk); {add objects ∈ R}
    end for
  end for
  res_2 ← LS.IOI[req_2]; {server processing req_2}
  for all T' = <e_{k_i}(u_j.x, u_j.y, u_j.t)> ∈ res_2 do
    <u_j.x, u_j.y, u_j.t> ← d_{k_i}(T');
    res ← res ∪ <u_j.x, u_j.y, u_j.t>
  end for
  return (res);

Figure A.4: KNN Algorithm: (a) Expansion (b) Safe Region

A.5.2 K-Nearest Neighbor Queries

Resolving KNN queries is similar to range queries except that here, the region containing a user's $k$ nearest peers is not known in advance. Therefore, users progressively form concentric rectangular regions and query ACI until enough cells are found that include at least $K$ objects. Figure A.4a illustrates this progressive expansion strategy. The cells are shaded and numbered according to the step in which they are visited.
Next, a second request queries IOI for the objects located in the expanded region. The server’s response will hence include location information of $K$ nearby objects. However, approximating a circular region with rectangular regions might result in some false negatives (points such as $O_7$ in Figure A.4b, located inside the circle but outside the rectangle) that are part of the result set. Therefore, once the $k$-th object is found, users expand the queried region $R$ to a safe region $R'$ that also covers these false negatives in order to guarantee query accuracy. It is easy to verify that $R'$ is a square with sides $2\lceil \| c_q - far_q(k) \| \rceil$, where $c_q$ is the cell containing $q$, $far_q(k)$ is the cell containing $q$’s $k$-th nearest object in $R$, and $\|\cdot\|$ is the Euclidean norm [YPK]. This process is performed by the addSafeRegion() function in Algorithm 12, which details KNN query processing (the steps annotated as server processing represent client/server communication).

A.5.3 Buddy Tracking

In addition to the space-driven queries discussed above, an important functionality in a mobile social networking environment is to enable users to query a specific peer’s location or profile information. To enable these data-driven queries, users keep the list of their buddies’ $vp$’s on their client devices (denoted by blvp in Figure A.3b). A user $u_j$ trying to track $u_v$’s location (or to view $u_v.t$) queries IOI with $u_v.vp$. As part of their invitation, users have received the inviter’s $vp$ as well as his real identity, and hence they keep this mapping in blvp. Note that the server cannot verify whether $u_j$ and $u_v$ belong to the same group and hence cannot prevent $u_v$’s information from being queried by adversaries. However, this is not an issue, as an adversary cannot decrypt the server’s response if he is not part of $u_v$’s buddy list.
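The safe-region side length from Section A.5.2, which addSafeRegion() relies on, can be worked through with a short Python sketch. The names `safe_region` and `extra_cells` are hypothetical, coordinates are assumed to be integer cell indices, and the square is assumed to be centered on the query cell $c_q$ with inclusive bounds; the source only states the side length.

```python
import math


def safe_region(cq, far_k):
    """Side length and bounding box (in cells) of the safe region R'.

    cq: cell containing the query point; far_k: cell of the k-th nearest object.
    The side is 2 * ceil(||cq - far_k||), as stated in Section A.5.2.
    """
    radius = math.ceil(math.dist(cq, far_k))
    side = 2 * radius
    box = (cq[0] - radius, cq[1] - radius, cq[0] + radius, cq[1] + radius)
    return side, box


def extra_cells(box, queried):
    """Cells inside the safe region that were not covered by the first pass."""
    x0, y0, x1, y1 = box
    return [(x, y)
            for x in range(x0, x1 + 1)
            for y in range(y0, y1 + 1)
            if (x, y) not in queried]
```

For example, with $c_q = (5, 5)$ and $far_q(k) = (6, 6)$ the distance is $\sqrt{2}$, so the side is $2\lceil\sqrt{2}\rceil = 4$ cells, and a client that had already queried the 3-by-3 block around $c_q$ would still need the 16 cells of the surrounding ring.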
Algorithm 12 KNN Queries
Require: $q$, $k$, $g_i$, grid granularity {KNN center, k, group and granularity}
$ct \leftarrow 0$ {object count}
$xc \leftarrow \lfloor q.x \rfloor$; $yc \leftarrow \lfloor q.y \rfloor$
$req_1 \leftarrow e_{k_i}(xc, yc, g_i)$ {adding the querying cell}
Let $region \leftarrow C = \langle xc, yc \rangle$
while ($ct < k$) do
    $region = expand(region)$ {stripe surrounding $region$}
    for all $C = \langle xc, yc \rangle \in region$ do
        $req_1 \leftarrow req_1 \cup e_{k_i}(xc, yc, g_i)$
    end for
    $res_1 \leftarrow LS.ACI[req_1]$ {server processing $req_1$}
    for all $T = \langle e_{k_i}(cnt) \rangle \in res_1$ do
        $C.cnt \leftarrow d_{k_i}(T)$
        if ($C.cnt \neq 0$) then
            $ct \leftarrow ct + C.cnt$
            for ($rnk = 1$; $rnk \leq C.cnt$; $rnk{+}{+}$) do
                $req_2 \leftarrow req_2 \cup e_{k_i}(xc, yc, g_i, rnk)$ {tag cell’s objects}
            end for
        end if
    end for
    $res_2 \leftarrow LS.IOI[req_2]$ {server processing $req_2$}
    for all $T' = \langle e_{k_i}(u_j.x, u_j.y, u_j.t) \rangle \in res_2$ do
        $u = \langle u_j.x, u_j.y, u_j.t \rangle \leftarrow d_{k_i}(T')$
        $res_2 \leftarrow res_2 \cup u$
    end for
    $req_1 \leftarrow \emptyset$; $req_2 \leftarrow \emptyset$
end while
return ($res_2 \cup addSafeRegion(u_j.x, u_j.y, res_2)$)

A.6 PBS Operations

PBS supports a range of functionalities that enable various user interactions with other peers. In this section, we provide the two-party computation protocols between the users and the server that enable such interactions.

A.6.1 Group Related Operations

Initiating Groups: To initiate a group $g_i$, a user $u_j$ creates a secret symmetric group key $k_i$ and computes the respective ACI and IOI tuples by setting $g_i.cnt = 1$ and $u_j.rnk = 1$. He then signs this group init request and sends it to the server. The server verifies the signature and processes the request. Note that for all client/server communications, users bind a nonce to their request to thwart replay attacks.

Joining Groups: A user can be invited to a group by any of the group members. In order for a user $u_j \in g_i$ to form an invitation request to $u_v$, he first needs to learn $u_v.vp$ (this step is analogous to asking $u_v$ for his email address or phone number, except that here $vp$ is anonymous). Next, $u_v$ receives the invitation $e_{u_v.pub}\big(u_j.vp, u_j, e_{u_j.pri}(invitation, k_i)\big)$. The invitee (i.e., $u_v$) first decrypts the invitation with $u_v.pri$.
This step ensures that no one else can take advantage of the invitation or learn anything by snooping on the communication. Next, $u_v$ tries to decrypt the inner part of the invitation, $e_{u_j.pri}(invitation, k_i)$, with $u_j$’s public key. If successful, this step guarantees that the invitation was sent by $u_j$ and transfers the group key to $u_v$. Having $g_i$ and $k_i$, $u_v$ queries ACI and learns the object count of his cell (i.e., $cnt$). He then sets $u_v.rnk = cnt + 1$ and constructs a signed join request to the server, which updates ACI (incrementing the $cnt$ of $u_v$’s cell) and adds one tuple to IOI storing $u_v$’s information. Note that the server does not learn any information about the group or the identity of either user.

User Revocation: Revoking users from a group is challenging, as a revoked user can share the secret group key with adversaries to snoop on future group communications. One solution to revocation is for the remaining users of each group to negotiate a new group key and to re-encrypt all relevant ACI and IOI tuples. Unfortunately, this is a costly approach. Instead, we use the lazy revocation with key rotation scheme that [KRS+] proposed for revoking users’ access to shared files. Following the eviction of a user from $g_i$, the remaining users negotiate a new group key, which will be used to encrypt all future ACI and IOI tuples during write operations such as location updates. This scheme is called lazy revocation because the revoked users still have access to the content they had access to prior to their eviction. However, this is not a security flaw: such members could have cached the data, and hence blocking their access to unmodified data does not have any advantages. As new users join PBS or during updates to the two indexes, one always uses the most recent group key.
However, the key rotation scheme proposed by [KRS+] guarantees that after each eviction: (i) given the most recent key, it is easy for all existing group members to rotate a key backward and obtain previous keys that are still in use for certain tuples, while (ii) it is computationally infeasible for any user expelled from the group to compute future keys given their current version of the key. This scheme allows group members to keep only the most recent version of the key and to compute previous key versions only if needed.

A.6.2 User Related Operations

Location and Profile Updates: Depending on their types, location and profile updates can be divided into two groups. For intra-cell movements of user $u_j$ from $(x, y)$ to $(x', y')$ in the cell $C_{xc,yc}$ and for updates on $u_j.t$, only a single change in IOI is required. The user $u_j$ sends a signed update request to LS for his record indexed by $e_{k_i}(xc, yc, g_i, u_j.rnk)$ to be updated to $e_{k_i}(u_j.x', u_j.y', u_j.t')$. LS verifies the signature and updates the IOI tuple. For inter-cell movements, three rounds of more complex communication between the client and the server are needed. The basic steps are querying the old and new cell information for $u_j$, filling $u_j$’s position with the last user who joined $u_j$’s old cell, and finally updating the ACI and IOI records affected by $u_j$’s move in the old and new cells.

Although the cell count $cnt$ remains encrypted in ACI and IOI during the above process, the server knows the initial value of $cnt$ for each cell $C$. Therefore, each new (repeated) encrypted value of $cnt$ would imply an increase (decrease) in the number of objects in $C$. To avoid this vulnerability, we always attach a timestamp to each $cnt$ before encryption. This makes an increase or a decrease equiprobable (during updates) from the server’s point of view. Therefore, after $i$ updates, the server can guess $C.cnt$ only with probability $\frac{1}{\lceil i/2 \rceil}$, which quickly declines after a short time.
Note that even a correct guess of $cnt$ reveals neither the actual location of $C$ nor the identity of its enclosed objects at any time.

Privacy Mode Request: As we discussed in Section A.1, users sometimes prefer to stop sharing their location information even with their peers, for a variety of reasons [CSM+]. PBS allows users to efficiently enter a Privacy Mode by sending an IOIremove request to temporarily remove their location information from IOI, as well as an ACIupdate request to adjust the count value of their current cell accordingly, after querying the server for the cell’s current information. To switch back from Privacy Mode and resume sharing location information, the above process is simply reversed.

A.7 Experimental Evaluation

In this section, we empirically examine the overall efficiency of PBS. We conducted extensive experiments to determine the effectiveness of our framework in terms of (i) the overhead of PBS operations, (ii) the effect of different datasets and grid granularity on spatial queries, (iii) the client/server computation and communication overhead for Algorithms 11 and 12, and (iv) a comparison of PBS with similar approaches.

[Figure A.5: Datasets. (a) Oldenburg (b) Hennepin]

A.7.1 Datasets and Experimental Setup

We used the widely accepted network-based generator for moving objects [Bri02] to generate our datasets. The generator takes the road map of a region (e.g., a city) and outputs, for each object, a set of locations along the road network of the given region. We used as input the city of Oldenburg in Germany and Hennepin County in Minnesota. Figure A.5 illustrates the simulated user locations for these two datasets. The second dataset is also used to compare PBS with Casper [MCA], which also enables querying moving users, but through a trusted anonymizer. For Oldenburg, we generated three user datasets $O_1$, $O_2$ and $O_3$ containing 500, 5K and 50K users, respectively.
We fixed the number of groups to 5 in these three datasets to generate groups with an average size of 100, 1K and 10K users, respectively. Similarly, we used 50K users in 5 groups for the Hennepin County dataset, denoted by $HC$.

Experiments were run on two different Intel P4 2.66 GHz machines with 4 GB of RAM, acting as the clients and the server. We used sockets for the client/server communications over the TCP/IP protocol to measure the actual network latency, DES for symmetric key encryption/decryption, 1024-bit DSA for public key cryptography and authentication, and SHA1 for one-way functions and pseudorandom number generation.

From the algorithms and operations introduced in previous sections, it is clear that most of the computational complexity is transferred to the client side in order to achieve both security and scalability. Therefore, throughout the following experiments, we focus on the average end-to-end response time from the client side, denoted by $T_C$, in milliseconds, as the key metric for PBS efficiency. Later, in Section A.7.4, we provide a comprehensive breakdown of the response time in terms of client and server computation and communication time.

A.7.2 PBS Operations

As our first set of experiments, we measure the overall response time for the joining groups ($t_{join}$), location update ($t_{update}$) and buddy tracking ($t_{track}$) operations. Results for $t_{join}$ and $t_{track}$ were averaged over 1K requests. As for $t_{update}$, we averaged the overhead for 1K, 100K, 1M and 1.5M location updates for the $O_1$, $O_2$, $O_3$ and $HC$ datasets, respectively. The results are shown in Figure A.6.

[Figure A.6: PBS Operations]

We observe that the overhead of all three operations is almost invariant to the choice of the dataset and stays around 15 to 20 ms for $t_{join}$ and $t_{update}$ and less than 1 ms for location tracking (due to its simplicity). Also, due to the plain structure of ACI and IOI and the non-spatial nature of these operations, they are not significantly affected by grid granularity.
A.7.3 Spatial Queries

In this section, we evaluate the effect of the grid granularity on the performance of Algorithms 11 and 12. We first study the effect of the grid granularity on the response time for 100 randomly selected range queries. Figure A.7a illustrates the overall response time $t_{range}$ for different datasets with 1% selectivity (i.e., relative range size). While the trends were very similar, shrinking the selectivity to 0.5% and 0.1% resulted in smaller values of $t_{range}$.

There is a trade-off in choosing the right grid granularity for space-driven queries. For coarse grids, users have to needlessly query numerous objects from IOI simply because they are co-located in a large cell with other objects relevant to the query. This increases the communication and processing overhead. Alternatively, fine-grained grids result in numerous cells overlapping with range queries (or expanded with KNN queries) whose information has to be queried from ACI, which in turn increases the client/server communication overhead.

One important observation from Figure A.7a is that while the optimum grid granularity is a function of the number of users [YPK], $t_{range}$ is not affected by the number of objects for highly fine-grained grids and converges to very similar values across all datasets. This is because $t_{range} = n_1 t_{ACI} + n_2 t_{IOI}$ and, for very fine-grained grids, the number of cell queries from ACI significantly dominates the number of object queries from IOI (i.e., $n_1 \gg n_2$). Given that both indexes are plain, $t_{ACI} \approx t_{IOI}$ and hence $t_{range}$ will be dominated by the common value of $n_1 t_{ACI}$ across different datasets. Finally, it is clear from Algorithm 11 and Figure A.7a that $t_{range}$ increases for denser datasets due to an increase in the average number of IOI tuples requested from the server per query.

[Figure A.7: (a) Range Queries (b) NN Queries]

Next, we examined the response time of Algorithm 12 for evaluating KNN queries ($t_{KNN}$).
Figure A.7b illustrates $t_{KNN}$ for 100 randomly generated nearest neighbor queries for all datasets (similar trends were observed for higher values of $k$). Similar to range queries, there is a trade-off in choosing the right grid granularity for optimum query performance. There is, however, a distinct trend observed for processing KNN queries, which is caused by the fundamental difference between these two query types. While processing range queries in densely populated areas requires more accesses to IOI tuples, KNN queries are evaluated more efficiently in dense areas simply because the region containing the result set is relatively small. This explains the trend change in Figure A.7b for fine-grained grids. While the number of IOI requests remains the same, sparse datasets examine significantly more ACI tuples to find the result set. Finally, the slight variation between the response times of $HC$ and $O_3$ is caused by the different mobility patterns of simulated users in these two cities.

[Figure A.8: Response Time for (a,b) 0.1%, (c,d) 0.5%, and (e,f) 1% Selectivity. Panels (a), (c), (e): $O_1$ and $O_2$; panels (b), (d), (f): $O_3$ and $HC$]

A.7.4 End to End Query Processing

As our next set of experiments, we measure the overall efficiency of PBS based on (i) $t_s$, the time it takes to process a query at the server, (ii) $t_c$, the client-side processing time, (iii) $t_{cs}$, the client-to-server communication time and (iv) $t_{sc}$, the server-to-client communication time. For this experiment, we generated 100 randomly chosen range queries with 0.1%, 0.5% and 1% selectivity and measured the above four values.

Several observations can be made from our findings, summarized in Figure A.8. The first noticeable trend is an increase in the client’s and server’s overhead for larger datasets, as well as for higher selectivity (notice the different scales of the Y axes). To explain these trends we first note that $t_{range} = n_1 t_{ACI} + n_2 t_{IOI}$.
For a fixed selectivity, increasing the dataset size will only increase $n_2$ in the above equation, while higher selectivity for a fixed dataset causes both $n_1$ and $n_2$ to increase. The graphs also show that both the client’s and the server’s overhead are reasonable in all 6 cases, always staying below 60 milliseconds. We also see that the $HC$ dataset consistently yields better results than $O_3$ despite having the same number of objects. This is caused by the more uniform distribution of objects in the $HC$ dataset. Finally, $t_{cs}$ in Figure A.8 increases more rapidly than $t_c$ as the dataset size grows, because the higher server overhead for larger datasets causes the client to spend more time communicating with the server.

[Figure A.9: Comparison with Casper]

A.7.5 Comparison with Other Approaches

The closest work to PBS in terms of enabling private queries over dynamic user locations is the Casper system [MCA], which allows users to perform range and KNN queries to request other users’ locations. However, privacy in Casper is achieved by relying on a trusted anonymizer to cloak users, per query, in an anonymity set which contains at least $K - 1$ other users.

We used the $HC$ dataset with 50K moving users to compare PBS with Casper for evaluating NN queries. Aside from the drawbacks of relying on an anonymizer (detailed in Section A.9 and [KS, GKK+, YJHLa, KGMP06]), as illustrated in Figure A.9, Casper suffers from a costly privacy/efficiency trade-off. To achieve performance comparable to PBS, Casper provides significantly lower privacy guarantees by making a user indistinguishable only among a small anonymity set of ($K < 50$) users.

A.8 Security Analysis

In this section, we briefly review some key security strengths and weaknesses of PBS.

Multiple Group Affiliation and Variable Privacy: So far we have assumed a binary notion of trust between two users (i.e., buddies vs. adversaries). However, users might have a more flexible approach towards privacy.
For instance, while a user is willing to continuously share his exact location with his family or close friends, he might prefer to share much coarser information during certain times with his co-workers. This notion of variable privacy can be supported in PBS by allowing users to join multiple groups with different levels of privacy, where the members of each group negotiate a group-specific grid granularity (i.e., spatial resolution) based on their common privacy preferences. Users with multiple group affiliations use a different $vp$ for each of their group memberships. This technique, however, exposes PBS to a powerful attack in which a user $u_j \in g_i, g'_i$ shares $g_i$’s secret key $k_i$ with someone in his buddy list from $g'_i \neq g_i$. Addressing this attack is challenging and existing approaches do not provide a solution for it. One solution is to store each group key in a tamper-resistant device on the client, to prevent users from accessing the keys and hence from being able to share them with their peers from other groups.

Server Collusion with Adversaries and Trusted Users: While group keys prevent the server (or multiple adversaries) from colluding against a user, PBS cannot protect user privacy against an adversary colluding with a seemingly trusted user in one’s buddy list who might share the group key with an outsider. This powerful attack remains an open problem in our system as well as in other privacy management systems such as [CDM]. However, it can only affect users of the compromised group.

Statistical Cryptanalysis: The server can compile query frequencies for different IOI tuples to find the most frequently queried user, or the user with the largest number of location updates. While such known-plaintext attacks are powerful when querying static data (e.g., restaurants), the server cannot infer the mapping between original and encrypted objects to identify or locate users, due to the highly dynamic nature of users.
Similarly, the server might infer relative cell positions from the sequence in which ACI tuples are requested by clients in Algorithms 11 and 12. To thwart this attack, users can randomly break their request packets into two or more sub-requests to protect the expansion sequences of cells.

A.9 Related Work

Privacy issues in mobile social networking systems have been the focus of several social and technical studies [CSM+, CDM, GLC+]. Perhaps the most relevant study to our work is the SmokeScreen framework [CDM] proposed for private location sharing. SmokeScreen supports presence sharing with trusted users as well as with strangers, which is a feature PBS does not support. However, this work differs strongly from our approach. First, SmokeScreen operates under a model with users “periodically broadcasting their identity via short-range wireless technology such as BlueTooth or WiFi”. Second, it does not study query processing. Third, it employs a complex trusted broker which maintains a user interest graph and other sensitive information about user relationships.

Numerous research studies have also examined user privacy in location based services [KGMP06, KS, GKK+, YJHLa]. However, they mostly rely on cloaking or trusted anonymizers to blur a user’s location and do not focus on querying dynamic user locations. The only study in this group which addresses querying dynamic user data is the Casper framework proposed in [MCA]. Although Casper supports range and KNN queries, it suffers from several privacy issues shared among cloaking-based approaches. For instance, under certain distributions, cloaking might reveal exact user locations to malicious entities [KGMP06]. Furthermore, the quality of service degrades significantly for users with strict privacy preferences. Finally, Casper does not address the issue of trust among users and assumes that all users trust each other and a central anonymizer.
To the best of our knowledge, PBS is the first work to address the privacy issues of enabling mobile users to execute a set of spatial queries predominantly used in social networks. By utilizing decentralized and self-maintaining encrypted index structures stored at a central untrusted server, PBS supports these queries without suffering from the privacy implications of cloaking techniques or their costly query overhead.

A.10 Conclusion and Future Work

In this chapter we presented the Private Buddy Search (PBS) framework, which enables users to privately perform a variety of queries and interactions with other users in a highly dynamic social network. Our experimental evaluation verified that PBS is highly scalable due to its distributed query workload. PBS provides various user interactions currently supported in social networks while protecting the privacy of its users.

As part of our future work, we are performing an in-depth study of the security of the PBS components. We are also extending PBS to employ more complex indexing schemes to achieve more scalability for various user distributions. We also plan to focus on the social aspects of PBS, such as relaxing the single group affiliation assumption.

References

[Ack] Wireless location privacy: Law and policy in the U.S., EU and Japan. http://www.isoc.org/briefings/015/briefing15.pdf.
[AF] Dmitri Asonov and Johann Christoph Freytag. Almost optimal private information retrieval. In PET’02, San Francisco, CA.
[AMCK+] Jalal Al-Muhtadi, Roy H. Campbell, Apu Kapadia, M. Dennis Mickunas, and Seung Yi. Routing through the mist: Privacy preserving communication in ubiquitous computing environments. In ICDCS’02, Austria.
[Aso04] Dmitri Asonov. Querying Databases Privately: A New Approach to Private Information Retrieval, volume 3128 of Lecture Notes in Computer Science. Springer, 2004.
[AvD04] Todd W. Arnold and Leendert van Doorn. The IBM PCIXCC: A new cryptographic coprocessor for the IBM eServer. IBM Journal of Research and Development, 48(3-4):475–488, 2004.
[BAG+] Bishwaranjan Bhattacharjee, Naoki Abe, Kenneth Goldman, Bianca Zadrozny, Vamsavardhana R. Chillakuru, Marysabel del Carpio, and Chid Apte. Using secure coprocessors for privacy preserving collaborative data mining and analysis. In DaMoN’06, Chicago, IL.
[BBCa] Privacy fears over google tracker. http://news.bbc.co.uk/.
[BBCb] Privacy worry over location data. http://news.bbc.co.uk/.
[BD] Louise Barkhuus and Anind K. Dey. Location-based services for mobile telephony: a study of users’ privacy concerns. In INTERACT’03, Zurich, Switzerland.
[BKSS] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The r*-tree: an efficient and robust access method for points and rectangles. In SIGMOD’90, pages 322–331.
[BLPW08] Bhuvan Bamba, Ling Liu, Péter Pesti, and Ting Wang. Supporting anonymous location queries in mobile environments with privacygrid. In 17th International Conference on World Wide Web, WWW’08, pages 237–246, Beijing, China, 2008.
[Blu] Bluetooth tracking. http://www.bluetoothtracking.org/.
[BP] Luc Bouganim and Philippe Pucheral. Chip-secured data access: confidential data on untrusted servers. In VLDB ’2002, pages 131–142.
[Bri02] Thomas Brinkhoff. A framework for generating network-based moving objects. GeoInformatica, 6(2):153–180, 2002.
[BS03] Alastair R. Beresford and Frank Stajano. Location privacy in pervasive computing. IEEE Pervasive Computing, 2(1):46–55, 2003.
[BWJ] Claudio Bettini, Xiaoyang Sean Wang, and Sushil Jajodia. Protecting privacy against location-based personal identification. In SDM’05, Trondheim, Norway.
[CDM] Landon P. Cox, Angela Dalton, and Varun Marupadi. Smokescreen: flexible privacy controls for presence-sharing. In MobiSys’07, pages 233–245.
[CKGS98] Benny Chor, Eyal Kushilevitz, Oded Goldreich, and Madhu Sudan. Private information retrieval. J. ACM, 45(6):965–981, 1998.
[CS] Scott Counts and Marc Smith. Where we were: Communities for sharing space-time trails. In ACM GIS’07, Seattle, WA.
[CSM+] Sunny Consolvo, Ian E. Smith, Tara Matthews, Anthony LaMarca, Jason Tabert, and Pauline Powledge. Location disclosure to social relations: why, when, & what people want to share. In CHI’05, pages 81–90.
[CTH00] Kuo-Liang Chung, Yao-Hong Tsai, and Fei-Ching Hu. Space-filling approach for fast window query on compressed images. IEEE Transactions on Image Processing, 9(12):2109–2116, 2000.
[DHPS05] Paolo Dell’Olmo, Pierre Hansen, Stefano Pallottino, and Giovanni Storchi. On uniform k-partition problems. Discrete Applied Mathematics, 150(1-3):121–139, 2005.
[DMS] Roger Dingledine, Nick Mathewson, and Paul F. Syverson. Tor: The second-generation onion router. In USENIX’04, pages 303–320.
[FJM97] Christos Faloutsos, H. V. Jagadish, and Yannis Manolopoulos. Analysis of the n-dimensional quadtree decomposition for arbitrary hyperrectangles. IEEE Transactions on Knowledge and Data Engineering, 9(3):373–383, 1997.
[for] Online data gets personal: Cell phone records for sale. http://www.washingtonpost.com/.
[FR89] C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In PODS ’89: Proceedings of the eighth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 247–252, New York, NY, USA, 1989. ACM Press.
[GG] Marco Gruteser and Dirk Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In MobiSys’03.
[GKK+] Gabriel Ghinita, Panos Kalnis, Ali Khoshgozaran, Cyrus Shahabi, and Kian-Lee Tan. Private queries in location based services: anonymizers are not necessary. In SIGMOD’08, pages 121–132.
[GKS07] Gabriel Ghinita, Panos Kalnis, and Spiros Skiadopoulos. Prive: anonymous location-based queries in distributed mobile systems. In 16th International Conference on World Wide Web, WWW’07, pages 371–380, Banff, Canada, May 2007.
[GL] Bugra Gedik and Ling Liu. A customizable k-anonymity model for protecting location privacy. In ICDS’05, Columbus, OH.
[GL04] Marco Gruteser and Xuan Liu. Protecting privacy in continuous location-tracking applications. IEEE Security & Privacy, 2(2):28–34, 2004.
[GLC+] Shravan Gaonkar, Jack Li, Romit Roy Choudhury, Landon Cox, and Al Schmidt. Micro-blog: sharing and querying content through mobile phones and social participation. In MobiSys’08, pages 174–186.
[GRB08] M. C. Gonzalez, Cesar A. Hidalgo R., and Albert-László Barabási. Understanding individual human mobility patterns. CoRR, abs/0806.1256, 2008.
[Gut] Antonin Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD’84, pages 47–57.
[Hen] Urs Hengartner. Hiding location information from location-based services. In MDM’07, pages 268–272.
[Hil91] David Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. In Math. Ann. 38, pages 459–460, 1891.
[HL] Susan Hohenberger and Anna Lysyanskaya. How to securely outsource cryptographic computations. In TCC’05, pages 264–282.
[IS] Alexander Iliev and Sean Smith. More efficient secure function evaluation using tiny trusted third parties. Technical report, Dartmouth College, TR2005-551.
[IS04] Alexander Iliev and Sean W. Smith. Private information storage with logarithm-space secure hardware. In International Information Security Workshops, Toulouse, France, 2004.
[IS05] Alexander Iliev and Sean W. Smith. Protecting client privacy with trusted computing at the server. IEEE Security & Privacy, 3(2):20–28, 2005.
[IW] Piotr Indyk and David P. Woodruff. Polylogarithmic private approximations and efficient matching. In TCC’06.
[Jag90] H. V. Jagadish. Linear clustering of objects with multiple attributes. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, pages 332–342, Atlantic City, NJ, 1990. ACM Press.
[Jag97] H. V. Jagadish. Analysis of the hilbert curve for representing two-dimensional space. Inf. Process. Lett., 62(1):17–22, 1997.
[JSM] S. Jiang, S. Smith, and K. Minami. Securing web servers against insider attack. In ACSAC’01, Washington, DC, USA.
[Jup] Newly emerged advertising tactics and technologies short-term strategies for automotive advertisers. http://www.jupiterresearch.com/bin/item.pl/research:concept/93/id=96787/.
[KGMP06] Panos Kalnis, Gabriel Ghinita, Kyriakos Mouratidis, and Dimitris Papadias. Preserving anonymity in location based services. A Technical Report, 2006.
[KO] E. Kushilevitz and R. Ostrovsky. Replication is not needed: single database, computationally-private information retrieval. In FOCS’97, pages 364–373.
[KPH04] Dmitri V. Kalashnikov, Sunil Prabhakar, and Susanne E. Hambrusch. Main memory evaluation of monitoring queries over moving objects. Distrib. Parallel Databases, 15(2):117–135, 2004.
[KRS+] Mahesh Kallahalla, Erik Riedel, Ram Swaminathan, Qian Wang, and Kevin Fu. Plutus: Scalable secure file sharing on untrusted storage. In FAST’03.
[KS] Ali Khoshgozaran and Cyrus Shahabi. Blind evaluation of nearest neighbor queries using space transformation to preserve location privacy. In SSTD’07, pages 239–257.
[KSMS] Ali Khoshgozaran, Houtan Shirani-Mehr, and Cyrus Shahabi. SPIRAL, a scalable private information retrieval approach to location privacy. In PALMS’08, In conjunction with MDM’08.
[KSSM10] Ali Khoshgozaran, Cyrus Shahabi, and Houtan Shirani-Mehr. Location privacy: going beyond k-anonymity, cloaking and anonymizers. to appear, March 2010. Knowledge and Information Systems.
[LC04] Ping Lin and K. Selçuk Candan. Secure and privacy preserving outsourcing of tree structured data. In Secure Data Management, VLDB Workshop, SDM, pages 1–17, 2004.
[LC05] Ping Lin and K. Selçuk Candan. Hiding tree structured data and queries from untrusted data stores. Information Systems Security, 14(4):10–26, 2005.
[LK01] Jonathan K. Lawder and Peter J. H. King. Querying multi-dimensional data indexed using the hilbert space-filling curve. SIGMOD Record, 30(1):19–24, 2001.
[MCA] Mohamed F. Mokbel, Chi-Yin Chow, and Walid G. Aref. The new casper: Query processing for location services without compromising privacy. In VLDB’06, pages 763–774.
[MT] E. Mykletun and G. Tsudik. Incorporating a secure coprocessor in the database-as-a-service model. In IWIA05, pages 38–44, College Park, MD.
[MvJFS01] Bongki Moon, H. V. Jagadish, Christos Faloutsos, and Joel H. Saltz. Analysis of the clustering properties of the hilbert space-filling curve. IEEE Transactions on Knowledge and Data Engineering, 13(1):124–141, 2001.
[NYC] Cabbies Threaten Strike over GPS Systems. www.cnn.com/2007/TECH/08/01/gps.taxi.strike.ap/index.html.
[OBSC00] Atsuyuki Okabe, Barry Boots, Kokichi Sugihara, and Sung Nok Chiu. Spatial Tessellations, Concepts and Applications of Voronoi Diagrams. John Wiley and Sons Ltd., 2nd edition, 2000.
[Pew] Pew internet & american life project, first cut at cell survey data march 17, 2006. http://www.pewinternet.org/pdfs/pip cell phone study.pdf.
[Pre03] Bart Preneel. Analysis and design of cryptographic hash functions. PhD thesis, 2003.
[QLW08] Ling Qiu, Yingjiu Li, and Xintao Wu. Protecting business intelligence and customer privacy while outsourcing data mining tasks. Knowl. Inf. Syst., 17(1):99–120, 2008.
[RSV00] Philippe Rigaux, Michel Scholl, and Agnès Voisard. Introduction to Spatial Databases: Applications to GIS. Morgan Kaufmann, 2000.
[Sag94] Hans Sagan. Space-Filling Curves. Springer-Verlag, 1994.
[Sah76] Sartaj K. Sahni. Algorithms for scheduling independent tasks. J. ACM, 23(1):116–127, 1976.
[SC] Radu Sion and Bogdan Carbunar. On the computational practicality of private information retrieval. In NDSS’07, San Diego, CA.
[Sch84] Manfred Robert Schroeder. Number Theory in Science and Communication. Springer-Verlag, 1984.
[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, 1998.
[SS00] Sean W. Smith and Dave Safford. Practical private information retrieval with secure coprocessors. Technical report, IBM, August 2000.
[SS01] S. W. Smith and D. Safford. Practical server privacy with secure coprocessors. IBM Syst. J., 40(3):683–695, 2001.
[Swe02] L. Sweeney. k-Anonymity: A Model for Protecting Privacy. Int. J. of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[tax] Taxi strike begins; city says it is prepared. http://cityroom.blogs.nytimes.com/.
[TCC04] Yao-Hong Tsai, Kuo-Liang Chung, and Wan-Yu Chen. A strip-splitting-based optimal algorithm for decomposing a query window into maximal quadtree blocks. IEEE Transactions on Knowledge and Data Engineering, 16(4):519–523, 2004.
[TD09] Zhouxuan Teng and Wenliang Du. A hybrid multi-group approach for privacy-preserving data mining. Knowl. Inf. Syst., 19(2):133–157, 2009.
[TIPXCC] April 2008. The IBM 4764 PCI-X Cryptographic Coprocessor. http://www-03.ibm.com/security/cryptocards/pcixcc/overperformance.shtml. Technical report.
[USA] Californian gets 16 months for stalking by satellite. http://www.usatoday.com/tech/news/surveillance/2005-01-29-gps-stalking x.htm.
[WDDB] Shuhong Wang, Xuhua Ding, Robert H. Deng, and Feng Bao. Private information retrieval using trusted hardware. In ESORICS’06, Germany.
[WFY07] Ke Wang, Benjamin C. M. Fung, and Philip S. Yu. Handicapping attacker’s confidence: an alternative to k-anonymization. Knowl. Inf. Syst., 11(3):345–368, 2007.
[WMM03] J. Warrior, E. McHenry, and K. McGee. They know where you are. In IEEE Spectrum, pages 20–25, 2003.
[WSC08] Peter Williams, Radu Sion, and Bogdan Carbunar. Building castles out of mud: practical access pattern privacy and correctness on untrusted storage. In CCS’08, pages 139–148, 2008.
[XMA] Xiaopeng Xiong, Mohamed F. Mokbel, and Walid G. Aref. Sea-cnn: Scalable processing of continuous k-nearest neighbor queries in spatio-temporal databases. In ICDE’05.
[YGJK] Man Lung Yiu, Gabriel Ghinita, Christian S. Jensen, and Panos Kalnis. Outsourcing search services on private spatial data. In ICDE’09, pages 1140–1143.
[YJHLa] Man Lung Yiu, Christian S. Jensen, Xuegang Huang, and Hua Lu. Spacetwist: Managing the trade-offs among location privacy, query performance, and query accuracy in mobile services. In ICDE’08, pages 366–375.
[YJHLb] Man Lung Yiu, Christian S. Jensen, Xuegang Huang, and Hua Lu. Spacetwist: Managing the trade-offs among location privacy, query performance, and query accuracy in mobile services. In ICDE’08, pages 366–375, Cancún, México.
[YPK] Xiaohui Yu, Ken Q. Pu, and Nick Koudas. Monitoring k-nearest neighbor queries over moving objects. In ICDE’05, pages 631–642.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Ensuring query integrity for spatial data in the cloud
Location privacy in spatial crowdsourcing
Privacy-aware geo-marketplaces
Mechanisms for co-location privacy
Generalized optimal location planning
Location-based spatial queries in mobile environments
Differentially private learned models for location services
MOVNet: a framework to process location-based queries on moving objects in road networks
Query processing in time-dependent spatial networks
Enabling spatial-visual search for geospatial image databases
Partitioning, indexing and querying spatial data on cloud
Responsible AI in spatio-temporal data processing
Efficient updates for continuous queries over moving objects
Practice-inspired trust models and mechanisms for differential privacy
A data integration approach to dynamically fusing geospatial sources
Scalable processing of spatial queries
Efficient crowd-based visual learning for edge devices
Gradient-based active query routing in wireless sensor networks
Dynamic pricing and task assignment in real-time spatial crowdsourcing platforms
Scalable evacuation routing in dynamic environments
Asset Metadata
Creator
Khoshgozaran, Jaffar (author)
Core Title
Privacy in location-based applications: going beyond K-anonymity, cloaking and anonymizers
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
08/10/2010
Defense Date
05/12/2010
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
anonymity, anonymizers, cloaking, encryption, geospatial information management, location-based services, OAI-PMH Harvest, privacy, security, social networking, spatial databases
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Shahabi, Cyrus (committee chair), Hashemi, Hossein (committee member), Nakano, Aiichiro (committee member)
Creator Email
jafkhosh@usc.edu, khoshgozaran@yahoo.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m3367
Unique identifier
UC1130731
Identifier
etd-Khoshgozran-3833 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-383951 (legacy record id), usctheses-m3367 (legacy record id)
Legacy Identifier
etd-Khoshgozran-3833.pdf
Dmrecord
383951
Document Type
Dissertation
Rights
Khoshgozaran, Jaffar
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu