Privacy-aware geo-marketplaces
PRIVACY-AWARE GEO-MARKETPLACES

by Kien Duy Nguyen

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2021

Copyright 2021 Kien Duy Nguyen

Acknowledgements

First and foremost, I would like to thank my advisor, Professor Cyrus Shahabi, for his advice and support before and during my PhD journey. From the time I met him as a friend in Hanoi, I was already receiving his advice on research and life. During my PhD study, he was always patient with my progress, guiding me through my most difficult times. He would give me perspectives that summarized perfectly what I had learned but could not put into words. Our formal and informal meetings have always been productive and encouraging. He helped me make invaluable connections for my research. I have learned so much from his way of doing research, defining and promoting ideas, and validating and presenting work in comprehensive and persuasive ways. It has been my great pleasure to work with him.

I would also like to thank Dr. John Krumm, who has given me a refreshing, interesting, and productive point of view on doing research. We have always talked to each other more like friends than mentor and mentee. There has never been any pressure in our talks, and he can handle hours of me sharing all my confusing thoughts before we find anything concrete. Also, his humor really follows him anywhere he goes.

I would also like to give my special thanks to Professor Peter Kuhn, who gave me a chance to work with an amazing team at USC CSI-Cancer and with clinical trials, where I could bring and learn many different skill sets. His ideas have led to several of my successful projects. Also, the team at CSI-Cancer is really amazing, smart, hard-working and friendly. Thank you, Anand, Sara, Paul, Elvia, Liz, Allie, Stephanie, and many others.
So many people there work on so many different things that I believe anyone joining CSI-Cancer can find something interesting and valuable to work on. It has been my pleasure working with Peter and CSI-Cancer.

I am very proud and grateful to be a member of InfoLab. I had a fun time talking about research, work and life with all members of the lab. I have received tremendous support from them throughout the years: Luan Tran, Mingxuan Yue, Hien To, Ritesh Ahuja, Giorgos Constantinou, Abdullah Alfarrarjeh, Dimitrios Stripelis, Chrysovalantis Anastasiou, Haowen Lin, Jiao Sun, Sepanta Zeighami, Arvin Hekmati, Dingxiong Deng, Ying Lu, Minh Nguyen, Tian Xie, Chaoyang He, Sina Shaham, Yaguang Li, and Daisy Tang. I want to thank Yao-Yi Chiang and Lucina Nocera for their support for various projects.

Last but not least, my sincere thanks to my family for their understanding and support before and during my graduate study. I could not have come this far without them. I want to thank my wife privately.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Motivations
    1.2 Thesis Statement
    1.3 Contributions
    1.4 Thesis Outline
Chapter 2: Background and Related Work
    2.1 Background
    2.2 Related Work
        2.2.1 Geo-Marketplaces
        2.2.2 Location Privacy
Chapter 3: Differentially Private Data Point Release with Free Query
    3.1 Introduction
    3.2 Problem Definition
    3.3 Preliminaries
        3.3.1 Shannon Entropy
        3.3.2 Differential Privacy
    3.4 Related Work
        3.4.1 Location entropy
        3.4.2 Location entropy
    3.5 Private Publication of Location Entropy
        3.5.1 Global Sensitivity of Location Entropy and Baseline Algorithm
            3.5.1.1 Global Sensitivity of Location Entropy
            3.5.1.2 Baseline Algorithm
        3.5.2 Reducing the Global Sensitivity of LE
            3.5.2.1 Limit Algorithm
            3.5.2.2 Privacy Guarantee of the Limit Algorithm
    3.6 Relaxation of Private Location Entropy
        3.6.1 Relaxation with Smooth Sensitivity
            3.6.1.1 LIMIT-SS Algorithm
            3.6.1.2 Privacy Guarantee of LIMIT-SS
            3.6.1.3 Precomputation of Smooth Sensitivity
        3.6.2 Relaxation with Crowd-Blending Privacy
            3.6.2.1 LIMIT-CB Algorithm
            3.6.2.2 Privacy Guarantee of LIMIT-CB
    3.7 Performance Evaluation
        3.7.1 Experimental Setup
            3.7.1.1 Datasets
            3.7.1.2 Metrics
        3.7.2 Experimental Results
            3.7.2.1 Overall Evaluation of the Proposed Algorithms
            3.7.2.2 Privacy-Utility Trade-off (Varying ε)
            3.7.2.3 The Effect of M and C
            3.7.2.4 Results on the Gowalla Dataset
            3.7.2.5 Recommendations for Data Releases
    3.8 Chapter Summary
Chapter 4: Encrypted Data Point Release with Fixed-Price Query
    4.1 Introduction
    4.2 Background
        4.2.1 Searchable Symmetric Encryption
        4.2.2 Hidden Vector Encryption
        4.2.3 Vector Digital Commitments
        4.2.4 Blockchain, Smart Contracts, and Bulk Storage
    4.3 Related Work
    4.4 System Model
        4.4.1 Evaluation Metrics
        4.4.2 Private Geo-marketplace with SSE
        4.4.3 Private Geo-marketplace with HVE
    4.5 Technical Approach
        4.5.1 Symmetric Encryption Search
        4.5.2 Asymmetric Encryption Search
        4.5.3 Owner Accountability and Spam Resilience
    4.6 Experimental Evaluation
        4.6.1 Experiment Setup
        4.6.2 SSE Approach Evaluation
        4.6.3 HVE Approach Evaluation
        4.6.4 Financial Cost Evaluation
    4.7 Chapter Summary
Chapter 5: Noisy Data Point Release with Variable-Price Query
    5.1 Introduction
    5.2 Problem Setting
        5.2.1 Users' Privacy Valuation
        5.2.2 Privacy Pricing
        5.2.3 The Buyer's Profit Maximization Problem
        5.2.4 Formal Definition
    5.3 Related Work
    5.4 Methodology
        5.4.1 The Buyer's Strategy to Make Open/Cancel Decision
        5.4.2 The Expected Incremental Profit (EIP)
        5.4.3 The Spatial Information Probing Algorithms
            5.4.3.1 The SIP Algorithm
            5.4.3.2 The SIP-T Algorithm
    5.5 Experimental Evaluation
        5.5.1 Datasets
        5.5.2 Evaluation Metrics
        5.5.3 Baselines
        5.5.4 Parameter Setup
        5.5.5 Experimental Results
        5.5.6 The effect of profit model parameters
            5.5.6.1 The minimum customer threshold n_0
            5.5.6.2 The profit per user b
        5.5.7 The effect of query parameters and user's data
            5.5.7.1 The target region L
            5.5.7.2 The scale d of users' privacy distributions
        5.5.8 The effect of algorithmic parameters
            5.5.8.1 The starting price q_0
            5.5.8.2 The price increment factor h
    5.6 Discussions and Future Work
    5.7 Chapter Summary
Chapter 6: Degraded Trajectory Release with Variable-Price Query
    6.1 Introduction
    6.2 Problem Setting
        6.2.1 Trajectory Characteristics
        6.2.2 Prior Knowledge
        6.2.3 Trajectory Degradation
    6.3 Baselines
        6.3.1 Fixed Value Method
        6.3.2 Size-based Method
        6.3.3 Duration-based Method
        6.3.4 Travel Distance Method
        6.3.5 Entropy-based Method
        6.3.6 Spatial Privacy Pricing-based Method
        6.3.7 Correctness-based Method
        6.3.8 Other Potential Quantities
    6.4 The Information Gain Framework
        6.4.1 Trajectory Information Gain
        6.4.2 Reconstruction Method
    6.5 Evaluations of the IG framework
        6.5.1 Dataset
        6.5.2 Experiment Setup
        6.5.3 IG Capturing Trajectory Characteristics
        6.5.4 IG Capturing Prior Knowledge and Degradation
    6.6 Related Work
    6.7 Chapter Summary
Chapter 7: Conclusion and Future Work
References
Appendices
    A Proof of Theorems
        A.1 Proof of Theorem 1
        A.2 Proof of Theorem 2
        A.3 Proof of Theorem 3

List of Tables

3.1 Summary of notations.
3.2 Statistics of the datasets.
4.1 HVE token generation time and size
4.2 Parallel execution speedup for HVE matching
4.3 On-chain cost of one-time, setup operations
4.4 On-chain cost breakdown for every transaction
5.1 The average execution time (in seconds) of SIP and SIP-T for different values of the price increment factor h
6.1 Potential methods to quantify the intrinsic VOI of a trajectory and the characteristics they can or cannot capture. Our proposed method based on information gain can capture almost all of the desirable characteristics.

List of Figures

1.1 An overview of the general structure of geo-marketplaces
3.1 Global sensitivity bound of location entropy when varying C.
3.2 Noise magnitude in natural log scale.
3.3 Visit statistics from Gowalla dataset in New York.
3.4 Sparsity of location visits (Gowalla, New York).
3.5 Global sensitivity bound when varying n.
3.6 Comparison of the distributions of noisy vs. actual location entropy on the dense and sparse datasets.
3.7 Published ratio of LIMIT-CB when varying k (K = 20).
3.8 Varying ε
3.9 Varying M
3.10 Varying C
3.11 Comparison of the distributions of noisy vs. actual location entropy on Gowalla, M = 5.
3.12 Varying ε and M (Gowalla).
4.1 SSE-based System Workflow
4.2 HVE-based System Workflow
4.3 Mapping 1D domain using best range cover
4.4 Data and query encoding with HVE
4.5 HXT index generation performance
4.6 HXT query performance
4.7 Analysis of query restriction effect on performance
4.8 HVE encryption time and ciphertext size per object
4.9 HVE matching time per ciphertext
5.1 The true location x (black dot) can be sold as one of the noisy data points z_1, z_2 (white dots). The standard deviation σ_1 of the noise of z_1 is smaller than that of z_2. Hence, z_1 is more expensive than z_2.
5.2 Check-ins of Gowalla users in Los Angeles converted to local Euclidean coordinates, 1 check-in per user.
5.3 The effect of the minimum user threshold n_0
5.4 Illustrations for the amount spent by the SIP
5.5 The effect of the gross margin per user b
5.6 The effect of the size L of the target region
5.7 The effect of the scale d of the privacy distributions
5.8 The effect of the starting price q_0
5.9 The effect of the price increment factor h
6.1 Illustrations of information gain at a single timestamp and over a time period. The reduced uncertainty is shown in the red area minus the blue area.
6.2 An illustration of how IG can capture trajectory characteristics. The figures from left to right show the effect of measurement uncertainty, duration, size, temporal, and spatial distribution of a trajectory.
6.3 Histogram of IG and size, duration, spatial, temporal entropy, and their correlation coefficients.
6.4 Absolute IG and percentage IG change compared to original trajectories for different values of total noise and different types of prior knowledge.
6.5 Absolute IG and percentage IG change compared to original trajectories for different truncation ratios and types of prior knowledge.
6.6 Absolute IG and percentage IG change compared to original trajectories for different subsampling ratios and types of prior knowledge.
6.7 An example of VOI equivalence classes of different types of degradation for a trajectory.

Abstract

The advance of modern mobile communication devices has enabled people to easily create, consume, and share information about every aspect of their lives from almost everywhere at any time. The location-tracking capability of these devices allows individuals to generate real-time geospatial data, which enables or plays an important role in many different applications. However, the current practice of geospatial data collection and sharing is often that individuals provide their data for free in order to use services. The massive amount of collected geospatial data is then used to pay for service development through other methods such as targeted advertising or even selling the data to other businesses.

However, information about the locations of individuals can have serious privacy implications: by linking locations with other data sources, important and potentially sensitive information about individuals can be revealed, such as home location, health condition, or religious or political preferences. Without proper privacy protection, malicious adversaries can execute a wide range of physical and virtual attacks such as physical surveillance or stalking. While there has been some regulation of location data collection and sharing, current practice is far from ideal for individuals to safely share their location information with service providers.

An emerging alternative to current practice is to allow individuals to offer their location data through data marketplaces. We call these marketplaces geo-marketplaces. Geo-marketplaces raise a number of interesting issues about data ownership, utility, pricing and privacy. In this thesis, we focus on the interplay between utility, privacy and pricing of geospatial data in various settings of geo-marketplaces. More specifically, two important aspects of geo-marketplaces are at the center of interest: location privacy and pricing for various types of location data.
Location privacy is essential for geo-marketplaces due to the sensitivity of location data. Pricing is crucial to make geo-marketplaces viable, especially when geospatial data can come in many different forms. Thus, geo-marketplaces require efficient algorithms for selling different types of geospatial data with alternative pricing strategies while protecting sellers' location privacy.

This thesis aims to enable geo-marketplaces with those requirements by investigating different settings of data types and pricing strategies in a geo-marketplace along with privacy considerations. These settings include (a) differentially private data point release with free query, where some quantities derived from individuals' data points can be released for free with strong privacy protection; (b) encrypted data point release with fixed-price query, where location data points or geo-tagged data objects are sold for a fixed price with their locations advertised in encrypted space; (c) noisy data point release with variable-price query, where a location data point can be sold at different prices depending on how much noise is added; and (d) degraded trajectory release with variable-price query, where a trajectory can be released or sold at different prices depending on how it is degraded. In each setting, we design the marketplace and develop principled methods to price data, protect the privacy of owners, and maintain the utility of data for buyers in order to enable a viable geo-marketplace. In all settings, our proposed designs and methods are evaluated by extensive experiments on large real-world datasets to show their practicality.

With a wide range of novel settings and practical applications of geo-marketplaces developed in this thesis, we make important steps toward the realization and adoption of geo-marketplaces.
This will enable new ways for individuals to interact with Internet and mobile services, take control of the collection, usage, and sharing of their geospatial data, and receive appropriate rewards for their contributions; and for service providers and other organizations to leverage individuals' valuable data while respecting their privacy.

Chapter 1

Introduction

1.1 Motivations

The availability of mobile devices with location-tracking capability has enabled individuals to generate an ever-increasing amount of geospatial data (e.g., check-ins, trajectories, or geo-tagged images) from various types of location signals (e.g., GPS, Wi-Fi, or cell service). Geospatial data plays an important role in various applications. Examples include spatial crowdsourcing, where individuals need to visit certain locations to perform certain actions such as taking photos of an event or picking up and then dropping off a person [63, 119]; transportation, such as traffic prediction, where movements of people are collected over time to make real-time predictions [15]; urban planning, such as classifying land or pollution reduction [89]; disease control, such as contact tracing or hotspot detection [25]; marketing, such as location-based advertising [28]; and social networks [136].

In spite of its importance, geospatial data of individuals is currently often provided for free to service providers in exchange for free usage of services. For example, weather apps obtain users' locations to report the weather at their current location; social networks obtain users' locations to recommend nearby friends; and car sharing apps obtain users' current locations to find nearby drivers. Location data can be collected for free because the current practice of many major technology companies is to provide services to their users for free in return for users' data, which often involves geospatial data.
These massive and freely acquired datasets are then often used for user profiling, which is an important piece of information for various targeted advertising purposes. Revenue from advertising is then used to fund the services developed and offered by the company. In addition, companies sometimes simply collect, aggregate, and sell the users' data to third parties for a profit. These third parties may be interested in gathering information from certain locations for various purposes. For example, public health authorities can use the data to identify potential pandemic clusters; city authorities are interested in travel patterns during heavy traffic; and advertisers are interested in the popularity of various locations at different times.

However, location data can reveal sensitive information about individuals and thus can have serious privacy implications [45, 27, 117]. For example, the location trace of a person can reveal that the person often stays for a long time at a certain place, or visits a certain hospital, a certain church, or a certain restaurant. These points of interest can then reveal the home location of the person, her health situation, her religious belief, or her diet. A malicious adversary can leverage location data that lacks proper privacy protection to perform a wide range of attacks (e.g., physical surveillance, stalking) or infer different sensitive information (e.g., health issues, political or religious preferences). Addressing the privacy implications of location data is challenging because location data often includes the actual physical coordinates of individuals.

Regulators have recognized some of the privacy threats from geospatial data, and some regulation has been put in place to protect the location privacy of individuals. One prominent example is the Location Privacy Protection Act of 2012 [1], which aimed to address this data collecting and sharing practice.
The act requires any company that obtains location information from a customer's smartphone to obtain that customer's express consent before collecting the location data, as well as before sharing the location data with third parties. However, the current practice is that if a user wants to hide her location data from a service provider, she has to turn off her location-detection device and (temporarily) unsubscribe from the service.

Recently, an alternative framework has been emerging: data marketplaces through which data owners offer their location data to potential buyers [61], dubbed geo-marketplaces [83]. An overview of the general structure of geo-marketplaces is shown in Figure 1.1. In these geo-marketplaces, a data owner is often the person whose devices collected the location data, and can become a seller who offers their data to interested parties through the geo-marketplace. Potential buyers can be anyone interested in the data, such as service providers, public agencies or researchers, and may want to receive data for a price or for free if possible. The marketplace processes any transactions and returns rewards from those transactions to owners, often in the form of monetary rewards.

Figure 1.1: An overview of the general structure of geo-marketplaces. Data owners (or sellers) use the marketplace to advertise various types of location data to interested buyers and receive rewards.

Data marketplaces raise a number of interesting issues about data ownership, utility, pricing and privacy. Focusing on geo-marketplaces, this thesis studies the interplay between location data utility, privacy and value (i.e., pricing) in various settings of geo-marketplaces. More specifically, this thesis focuses on two important aspects of geo-marketplaces: location privacy and pricing for various types of location data. Location privacy is crucial for geo-marketplaces because of the sensitivity of location data.
Thus, geo-marketplaces need to consider location privacy protection as a priority in the design. However, this creates conflicting requirements from data owners and buyers. On the one hand, data owners do not want to reveal their location data before receiving rewards because of the privacy implications of their data. On the other hand, buyers need some location information to guide their purchases because buying all data would be too expensive. It is challenging to balance these requirements while keeping practical performance for the marketplace.

Another essential part of geo-marketplaces is pricing, i.e., setting a price for data. Pricing location data is challenging because location data can come in different forms such as GPS points, trajectories, or geo-tagged images, and a geo-marketplace may require different types of pricing strategies, such as a fixed-price strategy where all data have the same price or a variable-price strategy where the price changes according to specific characteristics of the data.

We aim to make geo-marketplaces a viable alternative to current business models of geospatial data collection. In this thesis, we present important steps towards that goal.

1.2 Thesis Statement

Given the availability, importance, and sensitivity of location data, along with the potential and challenges of geo-marketplaces, geo-marketplaces require efficient algorithms for selling different types of geospatial data with alternative pricing strategies while protecting sellers' location privacy. This thesis aims to enable geo-marketplaces to satisfy these requirements.

1.3 Contributions

With the goal of enabling geo-marketplaces to meet those requirements, this thesis investigates different settings of geospatial data types and pricing strategies of a geo-marketplace, presenting their challenges and proposed approaches, which are all evaluated with extensive experiments on real-world datasets.
These geo-marketplace settings include:

• Differentially Private Data Point Release with Free Query. We first consider a setting similar to traditional data marketplaces or data release, where some quantities derived from individuals' data points can be released for free, and show that important quantities can be released while still protecting individuals' privacy. In this setting, location data points of data owners are aggregated at a central marketplace, which can release different quantities derived from the data with strong privacy protection. These data points show locations that their owners visited. We focus on releasing location popularity, which is an important quantity in various domains such as public health, criminology, urban planning, policy, and social studies. Location entropy (LE) is a well-known metric to measure location popularity, so we aim to release the location entropies of locations in a dataset. We also want to provide strong privacy protection for individuals in the dataset. Thus, we chose to release LE under differential privacy (DP) [32] protection, which is the state-of-the-art privacy protection framework with strong theoretical guarantees. To achieve DP protection, LE can be perturbed by noise with magnitude proportional to its sensitivity, which is intuitively the maximum amount by which the data of one individual can change the value of LE. In general, the higher the sensitivity, the more noise must be injected. The main challenge in achieving DP for LE is that, because an individual can visit many locations and visit one location many times, adding or dropping an individual from the dataset would impact multiple entries of the database, resulting in a high sensitivity of LE. Such high sensitivity can require extremely large noise to be added to the actual value of LE. Thus, straightforward approaches to apply DP to LE would result in extremely noisy output.
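The perturbation described above can be sketched in a few lines. This is a minimal illustration and not the thesis's algorithm: it computes the Shannon entropy of one location's visits across users (using the natural logarithm, which is an assumption here) and adds Laplace noise with scale sensitivity/ε, the standard Laplace mechanism. The function names, the toy check-in log, and in particular the value `ASSUMED_SENSITIVITY` are hypothetical placeholders; a real privacy guarantee would require a valid sensitivity bound such as the one derived later in the thesis.

```python
import math
import random
from collections import Counter, defaultdict


def location_entropy(visits_by_user):
    """Shannon entropy (natural log) of one location's visits across users."""
    total = sum(visits_by_user.values())
    entropy = 0.0
    for count in visits_by_user.values():
        if count > 0:
            p = count / total  # fraction of this location's visits made by one user
            entropy += p * math.log(1.0 / p)
    return entropy


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_location_entropy(visits_by_user, sensitivity, epsilon, rng):
    """Laplace mechanism: exact value plus noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return location_entropy(visits_by_user) + laplace_noise(sensitivity / epsilon, rng)


# Toy check-in log of (user, location) pairs; purely illustrative data.
checkins = [
    ("u1", "cafe"), ("u2", "cafe"), ("u3", "cafe"),
    ("u1", "gym"), ("u1", "gym"), ("u1", "gym"),
]
visits = defaultdict(Counter)
for user, location in checkins:
    visits[location][user] += 1

ASSUMED_SENSITIVITY = 1.0  # hypothetical placeholder, NOT a derived bound
rng = random.Random(0)
for location, counts in sorted(visits.items()):
    exact = location_entropy(counts)
    noisy = noisy_location_entropy(counts, ASSUMED_SENSITIVITY, epsilon=1.0, rng=rng)
    print(f"{location}: exact={exact:.3f} noisy={noisy:.3f}")
```

In this toy log, the cafe is visited once each by three users and so has a higher exact entropy than the gym, whose visits all come from a single user. Because the noise scale grows linearly with the sensitivity, a loose sensitivity bound can easily swamp differences of this size, which is the difficulty the passage above points to.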
Our contributions include computing a non-trivial tight bound for the global sensitivity of LE, which makes DP for LE possible. Even with that tight bound, a naive approach would still produce excessively noisy LE output. Therefore, we developed several approaches to efficiently achieve DP protection for LE publication with accurate output, with different trade-offs for different characteristics of the dataset. With our proposed approaches, location entropy can be released with useful information about popularity of locations while still providing strong privacy protection for data owners.

• Encrypted Data Point Release with Fixed-Price Query. Next, we consider a setting where data owners (or sellers) can sell their location data points or geo-tagged data objects for a fixed price. Examples of geo-tagged data objects are an image or video with location data. The marketplace helps owners advertise locations, prices, and other metadata of their geo-tagged data objects to potential buyers who can pay the advertised price to get the actual data from the owner. However, because of privacy concerns, the owner would not want to reveal locations of their data before the buyer purchases, which means actual locations of their data should not be advertised in raw format. On the other hand, the buyer also wants to make sure that they receive data objects satisfying their spatial requirements, e.g., geo-tagged photos should be within a specified area. Furthermore, the marketplace also needs to provide strong disincentives to prevent spam behavior. It is challenging to simultaneously achieve these requirements. Therefore, we proposed a novel geo-marketplace setting that achieves privacy, accountability, and spam resilience by combining searchable encryption, digital commitments, and blockchain.
For privacy, we use state-of-the-art searchable encryption techniques to protect locations, and the matching between buyer interests and advertised objects is efficiently performed in the encrypted space. Because searchable encryption techniques were not designed for geospatial data, we used different techniques to transform locations into other formats that searchable encryption techniques can handle. We also utilized different searchable encryption techniques for different market designs with different trade-offs of privacy, security, and efficiency. For accountability, cryptographic commitments and blockchain technology hold data owners accountable for their advertisements, which means they cannot change their locations after advertising them. For spam resilience, a public blockchain, where writing to the ledger requires a transaction fee, is leveraged to discourage spam behavior. To the best of our knowledge, this is the first attempt to design and implement such a geo-marketplace.

• Noisy Data Point Release with Variable-Price Query. Next, we considered another geo-marketplace setting where a data owner is allowed to sell their location data point at different prices, where the price is determined by the magnitude of noise added to the data point. We believe such flexibility would enable new ways of advertising and purchasing location data. So, we study the interplay between utility of buyers, privacy of data owners and value (i.e., pricing) of location data in this geo-marketplace by considering a specific application: a buyer is interested in checking if the number of people inside a target region is large enough to, say, open a restaurant in that target region. The buyer can purchase data from multiple owners.
Unlike traditional settings where the buyer must either buy or not buy the data of a certain seller, in our new setting the buyer can buy the location of an owner multiple times at different levels of accuracy, each at a different price. The decision of opening or not opening a restaurant gives the buyer a certain profit based on their own profit model, minus the cost of purchasing data. The buyer's objective is to maximize this profit. This profit maximization problem is challenging because the locations of owners are uncertain, the purchasing actions are irrevocable, and the set of possible owner locations is extremely large. Thus, the buyer needs to balance the amount spent on purchasing data against the profit or loss they may get. To help the buyer make reasonable decisions, we develop adaptive algorithms that are capable of taking into account multiple obstacles such as uncertainty of the data, the irrevocability of the collection process and the large range of possible locations of owners. Our algorithms adaptively buy different data points at different prices based on principled guidance and the geometry of the purchased data, effectively balancing cost and benefit in the buyer's decision-making problem.

• Degraded Trajectory Release with Variable-Price Query. Finally, we consider another geo-marketplace where a trajectory can be released or sold at different prices depending on how it is degraded. There can be different ways to degrade a trajectory such as perturbation (i.e., adding noise to each point in the trajectory), truncation (i.e., only keeping the first data points and removing the rest), or subsampling (i.e., randomly removing a subset of points). One big challenge is how the price should be set so that it is possible to derive reasonable prices and allow comparison for different trajectories with vastly different characteristics such as short vs. long, dense vs. sparse, accurate vs. noisy.
This brought us to a broader, more generic problem of quantifying the value of information (VOI) of trajectories. Quantifying the VOI of a trajectory is not only directly related to the price of the trajectory if it is sold, but is also an important step towards understanding the privacy implications of a trajectory (e.g., a trajectory with higher VOI tends to contain more information about the owner and thus potentially has more serious privacy implications). While the VOI can vary depending on the use and the entity that uses the data, a trajectory also contains an intrinsic, formulaic VOI, since it contains quantified locations of an individual over space and time. Thus, our goal is to find a principled approach to quantify the intrinsic VOI of trajectories from the owner's perspective. This is a challenging problem because a trajectory has many characteristics contributing to its VOI, can be degraded for different purposes, and the owner may assume different prior knowledge about the trajectory, which may change its VOI from the owner's perspective. We then define characteristics contributing to the intrinsic VOI that should be captured by an appropriate quantification method. Realizing that trajectories in their discrete form are not appropriate to quantify their intrinsic VOI, we introduce a method to transform each trajectory to a canonical representation, i.e., continuous in time with continuous mean and variance. Such a canonical representation allows us to examine the VOI with a continuous location reconstruction and to compare trajectories with widely different characteristics. Finally, we develop an information gain framework over transformed trajectories as a principled approach that is capable of effectively capturing various trajectory characteristics, prior knowledge, and degradation. With the proposed framework, we enable a deeper understanding of trajectories and their utility for a wide variety of purposes.
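The three degradation operations named in the last setting (perturbation, truncation, and subsampling) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tuple layout `(lon, lat, t)`, the function names, and the choice of Gaussian noise for perturbation are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical representation: a point is (lon, lat, t); a trajectory is a
# time-ordered list of points.
Point = Tuple[float, float, float]


def perturb(traj: List[Point], sigma: float, rng: random.Random) -> List[Point]:
    """Add independent Gaussian noise (std. dev. sigma, an assumed noise model)
    to the coordinates of every point; timestamps are left unchanged."""
    return [(lon + rng.gauss(0.0, sigma), lat + rng.gauss(0.0, sigma), t)
            for lon, lat, t in traj]


def truncate(traj: List[Point], keep: int) -> List[Point]:
    """Keep only the first `keep` points and drop the rest."""
    return traj[:keep]


def subsample(traj: List[Point], drop_fraction: float, rng: random.Random) -> List[Point]:
    """Randomly remove a fraction of the points while preserving time order."""
    n_keep = len(traj) - int(round(drop_fraction * len(traj)))
    kept_indices = sorted(rng.sample(range(len(traj)), n_keep))
    return [traj[i] for i in kept_indices]
```

Each function returns a new list, so one original trajectory can be degraded in several ways and each version offered at its own price.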
Our work covers a wide range of novel settings for geo-marketplaces with real-world applications. While there can be other settings, especially when other types of data are included such as health information, daily activities or disease symptoms, these are important steps towards practical, privacy-aware geo-marketplaces that can efficiently support various types of geospatial data, allow flexible pricing strategies, and protect the location privacy of data owners. Future work can make the current designs more robust, secure, and efficient, or can consider other marketplace settings as well.

1.4 Thesis Outline

In Chapter 2, we introduce some background and related work for geo-marketplaces discussed in this thesis. We then discuss the differentially private data point release with free query in Chapter 3. Next, Chapter 4 studies encrypted data point release with fixed-price query. Chapter 5 discusses noisy data point release with variable-price query. Next, Chapter 6 examines degraded trajectory release with variable-price query, with a focus on quantifying the intrinsic value of information of trajectories from the owner's perspective. Finally, Chapter 7 summarizes the thesis and presents some future directions.

Chapter 2 Background and Related Work

This chapter presents some background on geo-marketplaces and terminology used in this thesis. We then discuss previous work related to the general structure shared by all geo-marketplace settings presented in this thesis. More details of previous work related to each setting are discussed in the corresponding chapter.

2.1 Background

We consider a general structure of geo-marketplaces where data owners (or sellers) can offer their geospatial data to interested buyers through a central marketplace. Figure 1.1 illustrates the general marketplace structure.
The owner can be anyone having access and the right to use the data, e.g., one whose phone recorded this data, or a data collector who aggregates data from individuals. Without loss of generality and to ease the discussion, from here on, we assume the owner is the individual whose device recorded the data. The owner can also be called a user or a seller. The owner can have different types of geospatial data such as a single data point (a check-in or location measurement), a geo-tagged data object, or a trajectory.

A data point $\mathbf{x}$ (bold symbol) is a tuple $\mathbf{x} = \langle lon, lat, t, s \rangle$ where $\mathbf{x}.lon$ and $\mathbf{x}.lat$ are the longitude and latitude, $\mathbf{x}.t$ is the timestamp, and $\mathbf{x}.s$ is the accuracy or uncertainty of the location measurement. The accuracy information may or may not be available.

A geo-tagged data object is a tuple $\langle \mathbf{x}, bulk \rangle$ where location data point $\mathbf{x}$ is the geo-tag and $bulk$ is the bulk data of the data object such as an image or video data. The geo-tag is often recorded as a part of the metadata and can be extracted from the bulk data itself.

A trajectory $S$ is a sequence of data points $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|S|}\}$. Data points in $S$ are ordered by their timestamps, i.e., $\forall \mathbf{x}_i, \mathbf{x}_j \in S,\ 1 \le i \le j \le |S|:\ \mathbf{x}_i.t \le \mathbf{x}_j.t$.

Buyers are those interested in purchasing geospatial data of owners in the marketplace for various applications such as to find popularity of a location to guide task assignment in car sharing, to obtain evidence of a car crash for a court case, to determine if there are enough people within a region for a potential restaurant opening, or to train machine learning models. The buyer is assumed to be honest-but-curious, which means they follow the protocol of the marketplace but try to infer other potentially sensitive information about data owners from available data. The specific role and application of buyers for different marketplace settings are described in detail in each corresponding chapter.
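The data point, geo-tagged data object, and trajectory definitions above can be written down as simple record types. The following is a minimal sketch under stated assumptions: the class names and the use of Python dataclasses are illustrative choices, not part of the thesis, and the accuracy field is optional because the text notes it may not be available.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DataPoint:
    lon: float                   # longitude
    lat: float                   # latitude
    t: float                     # timestamp
    s: Optional[float] = None    # accuracy/uncertainty; may be unavailable


@dataclass(frozen=True)
class GeoTaggedObject:
    geotag: DataPoint            # the location data point acting as geo-tag
    bulk: bytes                  # bulk data, e.g., image or video bytes


@dataclass
class Trajectory:
    points: List[DataPoint] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the ordering constraint: points sorted by timestamp.
        self.points = sorted(self.points, key=lambda p: p.t)
```

Sorting in `__post_init__` is one way to guarantee the timestamp ordering required of a trajectory regardless of the order in which points are supplied.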
The central marketplace facilitates transactions between owners and buyers. We assume a central marketplace where geospatial data can be gathered, processed, and advertised to potential buyers. The marketplace is assumed to be trusted, although the degree of trust can vary depending on the specific setting. The main privacy concern of the owner is from dishonest buyers. The marketplace can employ different privacy protection mechanisms and/or pricing strategies for different types of geospatial data. The role of the central marketplace is discussed in detail in each chapter.

2.2 Related Work

We provide a brief summary of related work on geo-marketplaces and location privacy. More details of previous work related to each geo-marketplace setting are provided in the corresponding section.

2.2.1 Geo-Marketplaces

The concept of marketplaces for personal data has been proposed before, for example, by Adar and Huberman in [3]. In recent years, marketplaces for data have become popular, with many existing businesses working on different types of data and business models. Schomm et al. [100, 109] conducted surveys of existing data marketplaces on the web. They analyzed dozens of data vendors including marketplaces, web crawlers, raw data vendors and others. Data marketplaces have also been an active line of research, especially for the Internet of Things [80, 93].

Marketplaces focusing on geospatial data were first discussed in [61]. We call these marketplaces geo-marketplaces. There have been several works on different aspects of geo-marketplaces such as surveys on how individuals value their location data [24, 110], how the value of a location data point can be quantified from the buyer's perspective [5, 6], privacy loss for location data streams [135], or data models [98].
This thesis presents different settings for geo-marketplaces, discusses important problems related to privacy, utility and pricing in those settings, and proposes novel approaches to solve those problems and enable flexible and efficient geo-marketplaces.

2.2.2 Location Privacy

We focus on the computational aspect of location privacy. In this regard, location privacy has largely been studied in the context of location-based services [59, 65, 131, 107, 126, 53, 91, 51], participatory sensing [124, 62, 54, 26, 19, 55, 42, 41, 4] and spatial crowdsourcing [117, 103, 77, 118, 128]. Most studies use the model of spatial k-anonymity [114], where the location of a user is hidden among k other users [49, 81]. However, there are known attacks on k-anonymity, e.g., when all k users are at the same location. Moreover, such techniques assume a centralized architecture with a trusted third party, which is a single point of attack. Consequently, a technique based on cryptographic tools such as private information retrieval has been proposed that does not rely on a trusted third party to anonymize locations [45]. Recent studies on location privacy have focused on leveraging differential privacy [32] and geo-indistinguishability [7] to protect the privacy of users [117, 129]. Krumm [69] also surveyed a variety of computational location privacy schemes. Primault et al. [95] provided a more detailed survey on computational location privacy.

Chapter 3 Differentially Private Data Point Release with Free Query

In this chapter, we start exploring privacy-aware geo-marketplaces by considering a more traditional setting where some statistics derived from location data of individuals are released for free but with strong privacy protection. This chapter presents a method to release location entropy (LE), which is a popular metric for measuring the popularity of various locations (e.g., points-of-interest).
Unlike other metrics computed from only the number of (unique) visits to a location, namely frequency, LE also captures the diversity of the users' visits, and is thus more accurate than other metrics. This chapter discusses the problem of perturbing LE for a set of locations according to differential privacy (DP). This problem is challenging because removing a single user from the dataset will impact multiple records of the database; i.e., all the visits made by that user to various locations. Towards this end, we first derive non-trivial, tight bounds for both local and global sensitivity of LE, and show that to satisfy $\varepsilon$-DP, a large amount of noise must be introduced, rendering the published results useless. Hence, we present a thresholding technique to limit the number of users' visits, which significantly reduces the perturbation error but introduces an approximation error. To achieve better utility, we extend the technique by adopting two weaker notions of privacy: smooth sensitivity (slightly weaker) and crowd-blending (strictly weaker). We present extensive experiments on synthetic and real-world datasets, which show that our proposed techniques preserve original data distribution without compromising location privacy.

3.1 Introduction

One example of using location data is to measure the popularity of a location, which can be used in many application domains such as public health, criminology, urban planning, policy, and social studies. One accepted metric to measure the popularity of a location is location entropy (or LE for short). LE captures both the frequency of visits (how many times each user visited a location) as well as the diversity of visits (how many unique users visited a location) without looking at the functionality of that location; e.g., is it a private home or a coffee shop?
Hence, LE has been shown to quantify the popularity of a location better than the number of unique visits or the number of check-ins to the location [22]. For example, [22] shows that LE predicts friendship from location trails more accurately than simpler models based only on the number of visits. LE is also used to improve online task assignment in spatial crowdsourcing [63, 119] by giving priority to workers situated in less popular locations because there may be no available worker visiting those locations in the future.

Obviously, LE can be computed from raw location data collected by various industries; however, the raw data cannot be published due to serious location privacy implications [45, 27, 117]. Hence, in this work, we propose an approach based on differential privacy (DP) [32] to publish LE for a set of locations without compromising users' raw location data. DP has emerged as the de facto standard with strong protection guarantees for publishing aggregate data. It has been adopted by major industries for various tasks without compromising individual privacy, e.g., data analytics with Microsoft [15], discovering users' usage patterns with Apple (https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/), or crowdsourcing statistics from end-user client software [35] and training of deep neural networks [2] with Google. DP ensures that an adversary is not able to reliably learn from the published sanitized data whether or not a particular individual is present in the original data, regardless of the adversary's prior knowledge. It is sufficient to achieve $\varepsilon$-DP (where $\varepsilon$ is the privacy loss) by adding Laplace noise with mean zero and scale proportional to the sensitivity of the query (LE in this study) [32]. The sensitivity of LE is intuitively the maximum amount that one individual can impact the value of LE. The higher the sensitivity, the more noise must be injected to guarantee $\varepsilon$-DP.
Even though DP has been used before to compute Shannon Entropy [9] (the formulation adopted in LE), the main challenge in differentially private publication of LE is that adding (or dropping) a single user from the dataset would impact multiple entries of the database, resulting in a high sensitivity of LE. To illustrate, consider a user that has contributed many visits to a single location; adding or removing this user would significantly change the value of LE for that location. Alternatively, a user may contribute visits to multiple locations and hence impact the entropy of all those visited locations. Another unique challenge in publishing LE (vs. simply computing the Shannon Entropy) is due to the presence of skewness and sparseness in real-world location datasets where the majority of locations have small numbers of visits.

Towards this end, we first compute a non-trivial tight bound for the global sensitivity of LE. Given the bound, a sufficient amount of noise is introduced to guarantee $\varepsilon$-DP. We refer to this algorithm as BASELINE. However, the injected noise linearly increases with the maximum number of locations visited by a user (denoted by $M$) and monotonically increases with the maximum number of visits a user contributes to a location (denoted by $C$), and such an excessive amount of noise often renders the published results useless. Accordingly, we propose a technique, termed LIMIT, to limit user activity by thresholding $M$ and $C$, which significantly reduces the perturbation error. Nevertheless, limiting an individual's activity entails an approximation error in calculating LE. These two conflicting factors require the derivation of appropriate values for $M$ and $C$ to obtain satisfactory results. We empirically find such optimal values.

Furthermore, to achieve a better utility, we extend LIMIT by adopting two weaker notions of privacy: smooth sensitivity [86] (slightly weaker) and crowd-blending [44] (strictly weaker).
We denote the techniques as LIMIT-SS and LIMIT-CB, respectively. LIMIT-SS provides a slightly weaker privacy guarantee, i.e., $(\varepsilon, \delta)$-differential privacy, by using local sensitivity with a much smaller noise magnitude. We propose an efficient algorithm to compute the local sensitivity of a particular location that depends on $C$ and the number of users visiting the location (represented by $n$) such that the local sensitivity of all locations can be precomputed, regardless of the dataset. Thus far, we publish entropy for all locations; however, the ratio of noise to the true value of LE (noise-to-true-entropy ratio) is often excessively high when the number of users visiting a location $n$ is small (i.e., the entropy of a location is bounded by $\log n$). For example, given a location visited by only two users with an equal number of visits (LE is $\log 2$), removing one user from the database drops the entropy of the location to zero. To further reduce the noise-to-true-entropy ratio, LIMIT-CB aims to publish the entropy of locations with at least $k$ users ($n \ge k$) and suppress the other locations. By thresholding $n$, the global sensitivity of LE significantly drops, implying much less noise. We prove that LIMIT-CB satisfies $(k, \varepsilon)$-crowd-blending privacy.

We conduct an extensive set of experiments on both synthetic and real-world datasets. We first show that the truncation technique (LIMIT) reduces the global sensitivity of LE by two orders of magnitude, thus greatly enhancing the utility of the perturbed results. We also demonstrate that LIMIT preserves the original data distribution after adding noise. Thereafter, we show the superiority of LIMIT-SS and LIMIT-CB over LIMIT in terms of achieving higher utility (measured by KL-divergence and mean squared error metrics). Particularly, LIMIT-CB performs best on sparse datasets while LIMIT-SS is recommended over LIMIT-CB on dense datasets.
We also provide insights on the effects of various parameters ($\varepsilon$, $C$, $M$, $k$) on the effectiveness and utility of our proposed algorithms. Based on the insights, we provide a set of guidelines for choosing appropriate algorithms and parameters.

The remainder of this chapter is organized as follows. In Section 3.2, we define the problem of publishing LE according to differential privacy. Section 3.3 presents the preliminaries. Section 3.5 introduces the baseline solution and our thresholding technique. Section 3.6 presents our utility enhancements by adopting weaker notions of privacy. Experimental results are presented in Section 3.7, and conclusions in Section 3.8.

3.2 Problem Definition

In this section we present the notations and the formal definition of the problem.

Each location $l$ is represented by a point in two-dimensional space, with latitude $-90 \le l_{lat} \le 90$ and longitude $-180 \le l_{lon} \le 180$ (see footnote 2), and a unique identifier $l$. Hereafter, $l$ refers to both the location and its unique identifier.

For a given location $l$, let $O_l$ be the set of visits to that location. Thus, $c_l = |O_l|$ is the total number of visits to $l$. Also, let $U_l$ be the set of distinct users that visited $l$, and $O_{l,u}$ be the set of visits that user $u$ has made to the location $l$. Thus, $c_{l,u} = |O_{l,u}|$ denotes the number of visits of user $u$ to location $l$. The probability that a random draw from $O_l$ belongs to $O_{l,u}$ is $p_{l,u} = \frac{c_{l,u}}{c_l}$, which is the fraction of total visits to $l$ that belongs to user $u$. The location entropy for $l$ is computed from Shannon entropy [102] as follows:

$$H(l) = H(p_{l,u_1}, p_{l,u_2}, \ldots, p_{l,u_{|U_l|}}) = -\sum_{u \in U_l} p_{l,u} \log p_{l,u} \tag{3.1}$$

In our study the natural logarithm is used. A location has a higher entropy when the visits are distributed more evenly among visiting users, and vice versa. Our goal is to publish the location entropy of all locations $L = \{l_1, l_2, \ldots, l_{|L|}\}$, where each location is visited by a set of users $U = \{u_1, u_2, \ldots, u_{|U|}\}$, while preserving the location privacy of users.
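As a concrete illustration of Equation (3.1), the following minimal sketch computes the (non-private) location entropy of every location from a list of visits. The input format, a sequence of `(user_id, location_id)` pairs with one pair per visit, and the function name are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def location_entropy(visits: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute H(l) for each location from (user_id, location_id) visit pairs.

    Uses the natural logarithm, as in Equation (3.1).
    """
    # counts[l][u] plays the role of c_{l,u}
    counts: Dict[str, Counter] = {}
    for user, loc in visits:
        counts.setdefault(loc, Counter())[user] += 1

    entropy: Dict[str, float] = {}
    for loc, per_user in counts.items():
        c_l = sum(per_user.values())  # total number of visits to l
        entropy[loc] = -sum((c / c_l) * math.log(c / c_l)
                            for c in per_user.values())
    return entropy
```

For instance, a location visited once each by two users has entropy $\log 2$, while a location visited only by a single user (however many times) has entropy zero, which matches the intuition that evenly spread visits give higher entropy.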
Table 3.1 summarizes the notations used in this chapter.

Footnote 2: $l_{lat}$ and $l_{lon}$ are real numbers with ten digits after the decimal point.

Table 3.1: Summary of notations.

Symbol                 Description
$l$, $L$, $|L|$        a location, the set of all locations and its cardinality
$H(l)$                 location entropy of location $l$
$\hat{H}(l)$           noisy location entropy of location $l$
$\Delta H_l$           sensitivity of location entropy for location $l$
$\Delta H$             sensitivity of location entropy for all locations
$O_l$                  the set of visits to location $l$
$u$, $U$, $|U|$        a user, the set of all users and its cardinality
$U_l$                  the set of distinct users who visit $l$
$O_{l,u}$              the set of visits that user $u$ has made to location $l$
$c_l$                  the total number of visits to $l$
$c_{l,u}$              the number of visits that user $u$ has made to location $l$
$C$                    maximum number of visits of a user to a location
$M$                    maximum number of locations visited by a user
$p_{l,u}$              the fraction of total visits to $l$ that belongs to user $u$

3.3 Preliminaries

In this section, we present Shannon entropy properties and the differential privacy notion that will be used throughout this chapter.

3.3.1 Shannon Entropy

Shannon [102] introduces entropy as a measure of the uncertainty in a random variable with a probability distribution $U = (p_1, p_2, \ldots, p_{|U|})$:

$$H(U) = -\sum_i p_i \log p_i \tag{3.2}$$

where $\sum_i p_i = 1$. $H(U)$ is maximal if all the outcomes are equally likely:

$$H(U) \le H\left(\frac{1}{|U|}, \ldots, \frac{1}{|U|}\right) = \log |U| \tag{3.3}$$

Additivity Property of Entropy: Let $U_1$ and $U_2$ be non-overlapping partitions of a database $U$ including users who contribute visits to a location $l$, and let $\phi_1$ and $\phi_2$ be the probabilities that a particular visit belongs to partition $U_1$ and $U_2$, respectively.
Shannon discovered that using a logarithmic function preserves the additivity property of entropy:

$$H(U) = \phi_1 H(U_1) + \phi_2 H(U_2) + H(\phi_1, \phi_2)$$

Subsequently, adding a new person $u$ into $U$ changes its entropy to:

$$H(U^+) = \frac{c_l}{c_l + c_{l,u}} H(U) + H\left(\frac{c_{l,u}}{c_l + c_{l,u}}, \frac{c_l}{c_l + c_{l,u}}\right) \tag{3.4}$$

where $U^+ = U \cup \{u\}$, $c_l$ is the total number of visits to $l$, and $c_{l,u}$ is the number of visits to $l$ that is contributed by user $u$. Equation (3.4) can be derived from the additivity property above if we consider that $U^+$ includes two non-overlapping partitions $u$ and $U$ with associated probabilities $\frac{c_{l,u}}{c_l + c_{l,u}}$ and $\frac{c_l}{c_l + c_{l,u}}$. We note that the entropy of a single user is zero, i.e., $H(u) = 0$.

Similarly, removing a person $u$ from $U$ changes its entropy as follows:

$$H(U^-) = \frac{c_l}{c_l - c_{l,u}} \left( H(U) - H\left(\frac{c_{l,u}}{c_l}, \frac{c_l - c_{l,u}}{c_l}\right) \right) \tag{3.5}$$

where $U^- = U \setminus \{u\}$.

3.3.2 Differential Privacy

Differential privacy (DP) [32] has emerged as the de facto standard in data privacy, thanks to its strong protection guarantees rooted in statistical analysis. DP is a semantic model which provides protection against realistic adversaries with background information. Releasing data according to DP ensures that an adversary's chance of inferring any information about an individual from the sanitized data will not substantially increase, regardless of the adversary's prior knowledge. DP ensures that the adversary does not know whether an individual is present or not in the original data. DP is formally defined as follows.

Definition 1 ($\varepsilon$-indistinguishability). [33] Consider that a database produces a set of query results $\hat{D}$ on the set of queries $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, and let $\varepsilon > 0$ be an arbitrarily small real constant.
Then, transcript $U$ produced by a randomized algorithm $A$ satisfies $\varepsilon$-indistinguishability if for every pair of sibling datasets $D_1$, $D_2$ that differ in only one record, it holds that

$$\ln \frac{\Pr[Q(D_1) = U]}{\Pr[Q(D_2) = U]} \le \varepsilon$$

In other words, an attacker cannot reliably learn whether the transcript was obtained by answering the query set $Q$ on dataset $D_1$ or $D_2$. Parameter $\varepsilon$ is called the privacy budget, and specifies the amount of protection required, with smaller values corresponding to stricter privacy protection. To achieve $\varepsilon$-indistinguishability, DP injects noise into each query result, and the amount of noise required is proportional to the sensitivity of the query set $Q$, formally defined as:

Definition 2 ($L_1$-Sensitivity). [33] Given any arbitrary sibling datasets $D_1$ and $D_2$, the sensitivity of query set $Q$ is the maximum change in their query results:

$$\sigma(Q) = \max_{D_1, D_2} \|Q(D_1) - Q(D_2)\|_1$$

An essential result from [33] shows that a sufficient condition to achieve DP with parameter $\varepsilon$ is to add to each query result randomly distributed Laplace noise with mean 0 and scale $\lambda = \sigma(Q)/\varepsilon$.

3.4 Related Work

3.4.1 Location Privacy

Location privacy has largely been studied in the context of location-based services, participatory sensing and spatial crowdsourcing. Most studies use the model of spatial k-anonymity [114], where the location of a user is hidden among k other users [49, 81]. However, there are known attacks on k-anonymity, e.g., when all k users are at the same location. Moreover, such techniques assume a centralized architecture with a trusted third party, which is a single point of attack. Consequently, a technique based on cryptographic tools such as private information retrieval has been proposed that does not rely on a trusted third party to anonymize locations [45]. Recent studies on location privacy have focused on leveraging differential privacy (DP) to protect the privacy of users [117, 129].
3.4.2 Location Entropy

Location entropy has been extensively used in various areas of research, including multi-agent systems [121], wireless sensor networks [125], geosocial networks [22, 18, 92], personalized web search [74], image retrieval [132] and spatial crowdsourcing [63, 119, 116]. The study that most closely relates to ours focuses on privacy-preserving location-based services in which location entropy is used as the measure of privacy or the attacker's uncertainty [131, 120]. In [131], a privacy model is proposed that discloses a location on behalf of a user only if the location has at least the same popularity (quantified by location entropy) as a public region specified by a user. In fact, locations with high entropy are more likely to be shared (checked-in) than places with low entropy [120]. However, directly using location entropy compromises the privacy of individuals. For example, an adversary can infer how many people visit a location from its entropy value, e.g., a low value means a small number of people visit the location, and if they are all in a small geographical area, their privacy is compromised. To the best of our knowledge, there is no study that uses differential privacy for publishing location entropy, even though LE has various applications and protecting it can be highly instrumental in preserving the privacy of individuals.

3.5 Private Publication of Location Entropy

In this section we present a baseline algorithm based on the global sensitivity of LE [33] and then introduce a thresholding technique to reduce the global sensitivity by limiting an individual's activity.

3.5.1 Global Sensitivity of Location Entropy and Baseline Algorithm

3.5.1.1 Global Sensitivity of Location Entropy

To achieve $\varepsilon$-differential privacy, we must add noise proportional to the global sensitivity (or sensitivity for short) of LE.
Thus, to minimize the amount of injected noise, we first propose a tight bound for the sensitivity of LE, denoted by $\Delta H$. $\Delta H$ represents the maximum change of LE across all locations when the data of one user is added (or removed) from the dataset. With the following theorem, the sensitivity bound is a function of the maximum number of visits a user contributes to a location, denoted by $C$ ($C \ge 1$).

Theorem 1. Global sensitivity of location entropy is

$$\Delta H = \max\{\log 2,\ \log C - \log(\log C) - 1\}$$

where $C$ is the maximum number of visits a user contributes to a location ($C \ge 1$).

Proof. We prove this theorem by first deriving a tight bound for the sensitivity of a particular location $l$ (visited by $n$ users), denoted by $\Delta H_l$ (Theorem 2). The bound is a function of $C$ and $n$. Thereafter, we generalize the bound to hold for all locations as follows. We take the derivative of the bound derived for $\Delta H_l$ with respect to variable $n$ and find the extremal point where the bound is maximized. The detailed proof can be found in Appendix A.1.

Theorem 2. Local sensitivity of a particular location $l$ with $n$ users is:

• $\log 2$ when $n = 1$

• $\log \frac{n+1}{n}$ when $C = 1$

• $\max\left\{ \log \frac{n-1}{n-1+C} + \frac{C}{n-1+C} \log C,\ \log \frac{n}{n+C} + \frac{C}{n+C} \log C,\ \log\left(1 + \frac{1}{\exp(H(C \setminus c_u))}\right) \right\}$ when $n > 1$, $C > 1$, where $C \ge 1$ is the maximum number of visits a user can contribute to a location and $H(C \setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log \frac{\log C}{C-1} + 1$.

Proof. We prove the theorem considering both cases: when a user is added to the database and when a user is removed from it. We first derive a proof for the adding case by using the additivity property of entropy from Equation 3.4. Similarly, the proof for the removing case can be derived from Equation 3.5. The detailed proofs can be found in Appendix A.2.

3.5.1.2 Baseline Algorithm

In this section we present a baseline algorithm that publishes location entropy for all locations (see Algorithm 1).
Since adding (or removing) a single user from the dataset would impact the entropy of all locations he visited, the change caused by adding (or removing) a user across all locations is bounded by $M_{max} \Delta H$, where $M_{max}$ is the maximum number of locations visited by a user. Thus, Line 5 adds Laplace noise with mean zero and scale $\lambda = \frac{M_{max} \Delta H}{\epsilon}$ to the actual value of location entropy $H(l)$. It has been proved [33] that this simple mechanism is sufficient to achieve differential privacy.

Algorithm 1 BASELINE ALGORITHM
Input: Privacy budget $\epsilon$, a set of locations $L = \{l_1, l_2, \ldots, l_{|L|}\}$, maximum number of visits of a user to a location $C_{max}$, maximum number of locations a user visits $M_{max}$.
Output: Noisy location entropy $\hat{H}(l)$ of each location $l \in L$
1: Compute sensitivity $\Delta H$ from Theorem 1 for $C = C_{max}$.
2: for $l \in L$ do
3:   Count #visits $c_{l,u}$ each user made to $l$ and compute $p_{l,u}$
4:   Compute $H(l) = -\sum_{u \in U_l} p_{l,u} \log p_{l,u}$
5:   Publish noisy LE: $\hat{H}(l) = H(l) + Lap\left(\frac{M_{max} \Delta H}{\epsilon}\right)$
6: end for

3.5.2 Reducing the Global Sensitivity of LE

3.5.2.1 Limit Algorithm

Limitation of the Baseline Algorithm. Algorithm 1 provides privacy; however, the added noise is excessively high, rendering the results useless. To illustrate, Figure 3.1 shows the bounds of the global sensitivity (Theorem 1) when $C$ varies. The figure shows that the bound monotonically increases when $C$ grows. Therefore, the noise introduced by Algorithm 1 increases as $C$ and $M$ increase. In practice, $C$ and $M$ can be large because a user may have visited either many locations or a single location many times, resulting in large sensitivity. Furthermore, Figure 3.2 depicts different values of noise magnitude (in log scale) used in our various algorithms by varying the number of users $n$ visiting a location with $\epsilon = 5$, $C_{max} = 1000$, $M_{max} = 100$, $C = 20$, $M = 5$, $\delta = 10^{-8}$, $k = 25$. The graph shows that the noise magnitude of the baseline is too high to be useful (see Table 3.2).
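To make the size of the baseline noise concrete, the following minimal sketch evaluates the bound of Theorem 1 and the resulting Laplace scale $M \Delta H / \epsilon$ for the parameter values quoted above. It is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, natural logarithms are assumed (consistent with the value range of Figure 3.1), and $C = 1$ is treated as a special case because $\log(\log C)$ is undefined there.

```python
import math


def global_sensitivity(c: int) -> float:
    """Bound of Theorem 1: max{log 2, log C - log(log C) - 1}.

    Assumption: natural logarithms. For C = 1 the second term is
    undefined (log log 1), so only log 2 is returned.
    """
    if c < 1:
        raise ValueError("C must be at least 1")
    if c == 1:
        return math.log(2)
    return max(math.log(2), math.log(c) - math.log(math.log(c)) - 1.0)


def laplace_scale(m: int, c: int, epsilon: float) -> float:
    """Scale of the Laplace noise, M * DeltaH / epsilon (Line 5 of Algorithm 1)."""
    return m * global_sensitivity(c) / epsilon


if __name__ == "__main__":
    # Parameter values quoted in the text for Figure 3.2.
    baseline = laplace_scale(m=100, c=1000, epsilon=5.0)
    capped = laplace_scale(m=5, c=20, epsilon=5.0)
    print(f"baseline scale (C_max=1000, M_max=100): {baseline:.2f}")
    print(f"scale with C=20, M=5: {capped:.2f}")
```

Under these assumptions the baseline scale is roughly 80, compared with roughly 0.9 for the capped setting, which is why capping user activity is attractive.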
Figure 3.1: Global sensitivity bound of location entropy when varying C.

Figure 3.2: Noise magnitude in natural log scale as a function of the number of users n (Baseline, Limit, Limit-SS, Limit-CB).

Improving Baseline by Limiting User Activity. To reduce the global sensitivity of LE, and inspired by [67], we propose a thresholding technique, named LIMIT, that limits an individual's activity by truncating $C$ and $M$. Our technique is based on the following two observations. First, Figure 3.3b shows the maximum number of visits a user contributes to a location in the Gowalla dataset that will be used in Section 3.7 for evaluation. Although most users have one and only one visit, the sensitivity of LE is determined by the worst-case scenario, i.e., the maximum number of visits³. Second, Figure 3.3a shows the number of locations visited by a user. The figure confirms that there are many users who contribute to more than ten locations.

³ This suggests that users tend not to check in at places that they visit the most, e.g., their homes, because if they did, the peak of the graph would not be at 1.

Figure 3.3: Visit statistics from the Gowalla dataset in New York. (a) A user may visit many locations. (b) Largest number of visits a user contributes to a location.

Since the introduced noise increases linearly with $M$ and monotonically with $C$, the noise can be reduced by capping them. First, to truncate $M$, we keep the first $M$ location visits of the users who visit more than $M$ locations and throw away the rest of the locations' visits. As a result, adding or removing a single user in the dataset affects at most $M$ locations. Second, we set the number of visits of the users who have contributed more than $C$ visits to a particular location to $C$. Figure 3.2 shows that the noise magnitude used in LIMIT drops by two orders of magnitude when compared with the baseline's.
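The two truncation steps described above can be sketched as follows. This is a minimal illustration rather than the original implementation: the input layout (a dict mapping each user to a dict of location to visit count) and the function name are assumptions, and "the first M locations" is interpreted here as the first M in the stored order.

```python
from typing import Dict


def truncate_visits(
    visits: Dict[str, Dict[str, int]], m: int, c: int
) -> Dict[str, Dict[str, int]]:
    """Cap each user's activity at M locations and C visits per location."""
    truncated: Dict[str, Dict[str, int]] = {}
    for user, per_location in visits.items():
        # Keep only the first M locations (dicts preserve insertion order).
        kept = list(per_location.items())[:m]
        # Cap the number of visits to each kept location at C.
        truncated[user] = {loc: min(count, c) for loc, count in kept}
    return truncated


if __name__ == "__main__":
    raw = {
        "u1": {"cafe": 40, "park": 2, "gym": 7},
        "u2": {"cafe": 1},
    }
    # u1 loses "gym" (third location) and has "cafe" capped at 5 visits.
    print(truncate_visits(raw, m=2, c=5))
```

After truncation, removing one user can change the entropy of at most M locations, each by at most the bound of Theorem 1 evaluated at C.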
At a high level, LIMIT (Algorithm 2) works as follows. Line 2 limits user activity across locations, while Line 7 limits user activity to a location. The impact of Line 2 is the introduction of approximation error on the published data. This is because the number of users visiting some locations may be reduced, which alters their actual LE values. Subsequently, some locations may be thrown away without being published. Furthermore, Line 7 also alters the value of location entropy, but by trimming the number of visits of a user to a location. The actual LE value of location $l$ (after thresholding $M$ and $C$) is computed in Line 8. Consequently, the noisy LE is published in Line 9, where $Lap\left(\frac{M \Delta H}{\epsilon}\right)$ denotes a random variable drawn independently from the Laplace distribution with mean zero and scale parameter $\frac{M \Delta H}{\epsilon}$.

Algorithm 2 LIMIT ALGORITHM
Input: Privacy budget $\epsilon$, a set of locations $L = \{l_1, l_2, \ldots, l_{|L|}\}$, maximum threshold on the number of visits of a user to a location $C$, maximum threshold on the number of locations a user visits $M$
Output: Noisy location entropy of each location $\hat{H}(l)$
1: for each user $u$ in $U$ do
2:   Truncate $M$: keep the first $M$ locations' visits of the users who visit more than $M$ locations
3: end for
4: Compute sensitivity $\Delta H$ from Theorem 1.
5: for each location $l$ in $L$ do
6:   Count #visits each user made to $l$: $c_{l,u}$ and compute $p_{l,u}$
7:   Threshold $C$: $\bar{c}_{l,u} = \min(C, c_{l,u})$, then compute $\bar{p}_{l,u}$
8:   Compute $\bar{H}(l) = -\sum_{u \in U_l} \bar{p}_{l,u} \log \bar{p}_{l,u}$
9:   Publish noisy LE: $\hat{H}(l) = \bar{H}(l) + Lap\left(\frac{M \Delta H}{\epsilon}\right)$
10: end for

The performance of Algorithm 2 depends on how we set $C$ and $M$. There is a trade-off in the choice of values for $C$ and $M$. Small values of $C$ and $M$ introduce small perturbation error but large approximation error, and vice versa. Hence, in Section 3.7, we empirically find the values of $M$ and $C$ that strike a balance between noise and approximation errors.
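Lines 5 to 9 of Algorithm 2 can be sketched as below. This is a hedged illustration, not the original implementation: it assumes the per-user truncation to M locations (Lines 1 to 3) has already been applied, uses natural logarithms, and samples Laplace noise as the difference of two exponential variates. All function and parameter names are hypothetical.

```python
import math
import random
from typing import Dict, Optional


def global_sensitivity(c: int) -> float:
    """Theorem 1 bound (natural logs assumed; C = 1 handled separately)."""
    if c == 1:
        return math.log(2)
    return max(math.log(2), math.log(c) - math.log(math.log(c)) - 1.0)


def capped_entropy(counts: Dict[str, int], c: int) -> float:
    """Entropy of one location after capping each user's visits at C (Lines 7-8)."""
    capped = [min(c, v) for v in counts.values() if v > 0]
    total = sum(capped)
    if total == 0:
        return 0.0
    return -sum((v / total) * math.log(v / total) for v in capped)


def limit_publish(
    visits_by_location: Dict[str, Dict[str, int]],
    c: int,
    m: int,
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, float]:
    """Noisy entropy per location with Laplace scale M * DeltaH / epsilon (Line 9)."""
    rng = rng or random.Random()
    scale = m * global_sensitivity(c) / epsilon
    noisy = {}
    for location, counts in visits_by_location.items():
        # The difference of two unit exponentials is a Laplace(0, 1) variate.
        eta = rng.expovariate(1.0) - rng.expovariate(1.0)
        noisy[location] = capped_entropy(counts, c) + scale * eta
    return noisy


if __name__ == "__main__":
    data = {"cafe": {"u1": 40, "u2": 1, "u3": 3}, "park": {"u1": 2}}
    print(limit_publish(data, c=5, m=2, epsilon=1.0, rng=random.Random(7)))
```

Passing a seeded `random.Random` keeps runs repeatable for experiments; a production release would need a cryptographically sound noise source, which this sketch does not attempt.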
3.5.2.2 Privacy Guarantee of the Limit Algorithm

The following theorem shows that Algorithm 2 is differentially private.

Theorem 3. Algorithm 2 satisfies $\epsilon$-differential privacy.

Proof. For all locations, let $L_1$ be any subset of $L$. Let $T = \{t_1, t_2, \ldots, t_{|L_1|}\} \in Range(\mathcal{A})$ denote an arbitrary possible output. Then we need to prove the following:

$$\frac{\Pr[\mathcal{A}(O_{1(org)}, \ldots, O_{|L_1|(org)}) = T]}{\Pr[\mathcal{A}(O_{1(org)} \setminus O_{l,u(org)}, \ldots, O_{|L_1|(org)} \setminus O_{l,u(org)}) = T]} \le \exp(\epsilon)$$

The details of the proof and the notations used can be found in Appendix A.3.

3.6 Relaxation of Private Location Entropy

This section presents our utility enhancements by adopting two weaker notions of privacy: smooth sensitivity [86] (slightly weaker) and crowd-blending [44] (strictly weaker).

3.6.1 Relaxation with Smooth Sensitivity

We aim to extend LIMIT to publish location entropy with smooth sensitivity (or SS for short). We first present the notion of smooth sensitivity and the LIMIT-SS algorithm. We then show how to precompute the SS of location entropy.

3.6.1.1 LIMIT-SS Algorithm

Smooth sensitivity is a technique that allows one to calibrate the noise magnitude not only to the function one wants to release (i.e., location entropy), but also to the database itself. The idea is to use the local sensitivity bound of each location rather than the global sensitivity bound, resulting in smaller injected noise. However, simply adopting the local sensitivity to calibrate noise may leak information about the number of users visiting that location. Smooth sensitivity is stated as follows.

Let $x, y \in D^N$ denote two databases, where $N$ is the number of users. Let $l^x, l^y$ denote the location $l$ in database $x$ and $y$, respectively. Let $d(l^x, l^y)$ be the Hamming distance between $l^x$ and $l^y$, which is the number of users at location $l$ on which $x$ and $y$ differ; i.e., $d(l^x, l^y) = |\{i : l^x_i \ne l^y_i\}|$, where $l^x_i$ represents information contributed by one individual.
The local sensitivity of location $l^x$, denoted by $LS(l^x)$, is the maximum change of location entropy when a user is added or removed.

Definition 3 (Smooth sensitivity [86]). For $\beta > 0$, the $\beta$-smooth sensitivity of location entropy is:

$$SS_\beta(l^x) = \max_{l^y \in D^N} \left( LS(l^y) \cdot e^{-\beta d(l^x, l^y)} \right) = \max_{k = 0, 1, \ldots, N} e^{-k\beta} \left( \max_{y : d(l^x, l^y) = k} LS(l^y) \right)$$

Smooth sensitivity of LE of location $l^x$ can be interpreted as the maximum of $LS(l^x)$ and $LS(l^y)$, where the effect of $y$ at distance $k$ from $x$ is dropped by a factor of $e^{-k\beta}$. Thereafter, the smooth sensitivity of LE can be plugged into Line 9 of Algorithm 2, producing the LIMIT-SS algorithm.

Algorithm 3 LIMIT-SS ALGORITHM
Input: Privacy budget $\epsilon$, privacy parameter $\delta$, $L = \{l_1, l_2, \ldots, l_{|L|}\}$, $C$, $M$
Output: Noisy location entropy of each location $\hat{H}(l)$
1: for each user $u$ in $U$ do
2:   Truncate $M$: keep the first $M$ locations' visits of the users who visit more than $M$ locations
3: end for
4: Compute sensitivity $\Delta H$ from Theorem 1.
5: for each location $l$ in $L$ do
6:   Count #visits each user made to $l$: $c_{l,u}$ and compute $p_{l,u}$
7:   Threshold $C$: $\bar{c}_{l,u} = \min(C, c_{l,u})$, then compute $\bar{p}_{l,u}$
8:   Compute $\bar{H}(l) = -\sum_{u \in U_l} \bar{p}_{l,u} \log \bar{p}_{l,u}$
9:   Publish noisy LE: $\hat{H}(l) = \bar{H}(l) + \frac{M \cdot 2 SS_\beta(l)}{\epsilon} \cdot \eta$, where $\eta \sim Lap(1)$ and $\beta = \frac{\epsilon}{2 \ln(2/\delta)}$
10: end for

3.6.1.2 Privacy Guarantee of LIMIT-SS

The noise of LIMIT-SS is specific to a particular location, as opposed to that of the BASELINE and LIMIT algorithms. LIMIT-SS has a slightly weaker privacy guarantee. It satisfies $(\epsilon, \delta)$-differential privacy, where $\delta$ is a privacy parameter; $\delta = 0$ in the case of Definition 1. The choice of $\delta$ is generally left to the data releaser. Typically, $\delta < \frac{1}{\text{number of users}}$ (see [86] for details).

Theorem 4 (Calibrating noise to smooth sensitivity [86]). If $\beta \le \frac{\epsilon}{2 \ln(2/\delta)}$ and $\delta \in (0, 1)$, the algorithm $l \mapsto H(l) + \frac{2 SS_\beta(l)}{\epsilon} \cdot \eta$, where $\eta \sim Lap(1)$, is $(\epsilon, \delta)$-differentially private.

Theorem 5. LIMIT-SS is $(\epsilon, \delta)$-differentially private.

Proof.
Using Theorem 4, $\mathcal{A}_l$ satisfies $(0)$-differential privacy when $l \notin L_1 \cap L(u)$, and satisfies $\left(\frac{\epsilon}{M}, \frac{\delta}{M}\right)$-differential privacy when $l \in L_1 \cap L(u)$.

3.6.1.3 Precomputation of Smooth Sensitivity

This section shows that the smooth sensitivity of a location visited by $n$ users can be effectively precomputed. Figure 3.2 illustrates the precomputed local sensitivity for a fixed value of $C$.

Let $LS(C, n)$ and $SS(C, n)$ be the local sensitivity and the smooth sensitivity, respectively, of all locations visited by $n$ users. $LS(C, n)$ is defined in Theorem 2. Let $GS(C)$ be the global sensitivity of the location entropy given $C$, which is defined in Theorem 1. Algorithm 4 computes $SS(C, n)$. At a high level, the algorithm computes the effect of all locations at every possible distance $k$ from $n$, which is non-trivial. Thus, to speed up computations, we propose two stopping conditions based on the following observations. Let $n_x, n_y$ be the numbers of users who visited $l^x, l^y$, respectively. If $n_x > n_y$, Algorithm 4 stops when $e^{-k\beta} GS(C)$ is less than the current value of smooth sensitivity (Line 6). If $n_x < n_y$, given the fact that $LS(l^y)$ starts to decrease when $n_y > \frac{C}{\log C - 1} + 1$, and $e^{-k\beta}$ also decreases when $k$ increases, Algorithm 4 also terminates when $n_y > \frac{C}{\log C - 1} + 1$ (Line 8). In addition, the algorithm tolerates a small value of smooth sensitivity $\xi$. Thus, when $n$ is greater than $n_0$ such that $LS(C, n_0) < \xi$, the precomputation of $SS(C, n)$ is stopped and $SS(C, n)$ is considered as $\xi$ for all $n > n_0$ (Line 8).
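The precomputation described above can be sketched as follows, under simplifying assumptions that are not taken from the original work. The local sensitivity is passed in as a callable; the example uses only the $C = 1$ case of Theorem 2, $\log\frac{n+1}{n}$. The search over $k$ starts at $k = 0$, following Definition 3, and a single conservative stopping rule is used: stop once $e^{-k\beta} \cdot GS$ falls below the current maximum, which is safe whenever no local sensitivity exceeds the global bound. All names are hypothetical.

```python
import math
from typing import Callable, List


def ls_c1(n: int) -> float:
    """Local sensitivity for the C = 1 case of Theorem 2: log((n + 1) / n)."""
    return math.log((n + 1) / n)


def smooth_sensitivity_table(
    ls: Callable[[int], float],
    gs: float,
    n_max: int,
    epsilon: float,
    delta: float,
) -> List[float]:
    """Return [SS(1), ..., SS(n_max)] with beta = epsilon / (2 ln(2 / delta)).

    Assumption: ls(n) <= gs for every n >= 1, so the early stop is safe.
    """
    beta = epsilon / (2.0 * math.log(2.0 / delta))
    table = []
    for n in range(1, n_max + 1):
        ss = 0.0
        for k in range(0, n_max + 1):
            decay = math.exp(-k * beta)
            # Nothing at distance k or beyond can exceed decay * gs.
            if decay * gs < ss:
                break
            candidates = [ls(n + k)]
            if n - k >= 1:
                candidates.append(ls(n - k))
            ss = max(ss, decay * max(candidates))
        table.append(ss)
    return table


if __name__ == "__main__":
    table = smooth_sensitivity_table(ls_c1, math.log(2), 50, epsilon=5.0, delta=1e-8)
    for n in (1, 5, 25, 50):
        print(n, round(table[n - 1], 4), round(ls_c1(n), 4))
```

The printed pairs compare the smooth and local sensitivities at a few values of n; the smooth value is never below the local one and never above the global bound.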
Algorithm 4 PRECOMPUTE SMOOTH SENSITIVITY
Input: Privacy parameters $\epsilon$, $\delta$, $\xi$; $C$; maximum number of possible users $N$
Output: Precomputed smooth sensitivity of LE
1: Set $\beta = \frac{\epsilon}{2 \ln(2/\delta)}$
2: for $n = [1, \ldots, N]$ do
3:   $SS(C, n) = 0$
4:   for $k = [1, \ldots, N]$ do
5:     $SS(C, n) = \max\left(SS(C, n),\ e^{-k\beta} \max(LS(C, n-k), LS(C, n+k))\right)$
6:     Stop when $e^{-k\beta} GS(C, n-k) < SS(C, n)$ and $n + k > \frac{C}{\log C - 1} + 1$
7:   end for
8:   Stop when $n > \frac{C}{\log C - 1} + 1$ and $LS(C, n) < \xi$
9: end for

3.6.2 Relaxation with Crowd-Blending Privacy

3.6.2.1 LIMIT-CB Algorithm

Thus far, we publish entropy for all locations; however, the ratio of noise to the true value of LE (noise-to-true-entropy ratio) is often excessively high when the number of users visiting a location $n$ is small (i.e., Equation 3.3 shows that the entropy of a location is bounded by $\log(n)$). The large noise-to-true-entropy ratio would render the published results useless since the introduced noise outweighs the actual value of LE. This is an inherent issue with the sparsity of real-world datasets. For example, Figure 3.4 summarizes the number of users contributing visits to each location in the Gowalla dataset. The figure shows that most locations have check-ins from fewer than ten users. These locations have LE values of less than $\log(10)$, which are particularly prone to the noise-adding mechanism in differential privacy.

Figure 3.4: Sparsity of location visits (Gowalla, New York).

Figure 3.5: Global sensitivity bound when varying n (C = 10, C = 20).

Therefore, to reduce the noise-to-true-entropy ratio, we propose a small sensitivity bound of location entropy that depends on the minimum number of users visiting a location, denoted by $k$. Subsequently, we present Algorithm 5, which satisfies $(k, \epsilon)$-crowd-blending privacy [44]. We prove this in Section 3.6.2.2. The algorithm aims to publish the entropy of locations with at least $k$ users ($n \ge k$) and throw away the other locations. We refer to the algorithm as LIMIT-CB.
Lines 2-5 publish the entropy of each location according to $(k, \epsilon)$-crowd-blending privacy. That is, we publish the entropy of the locations with at least $k$ users and suppress the others. The following theorem shows that for the locations with at least $k$ users we have a tighter bound on $\Delta H$, which depends on $C$ and $k$. Figure 3.2 shows that the sensitivity used in LIMIT-CB is significantly smaller than LIMIT's sensitivity.

Theorem 6. Global sensitivity of location entropy for locations with at least $k$ users, $k \ge \frac{C}{\log C - 1} + 1$, where $C$ is the maximum number of visits a user contributes to a location, is the local sensitivity at $n = k$.

Proof. We prove the theorem by showing that local sensitivity decreases when the number of users $n \ge \frac{C}{\log C - 1} + 1$. Thus, when $n \ge \frac{C}{\log C - 1} + 1$, the global sensitivity equals the local sensitivity at the smallest value of $n$, i.e., $n = k$. The detailed proof can be found in Appendix A.2.

Algorithm 5 LIMIT-CB ALGORITHM
Input: All users $U$, privacy budget $\epsilon$; $C$, $M$, $k$
Output: Noisy location entropy of each location $\hat{H}(l)$
1: Compute global sensitivity $\Delta H$ based on Theorem 6.
2: for each location $l \in L$ do
3:   Count number of users who visit $l$, $n_l$
4:   If $n_l \ge k$, publish $\hat{H}(l)$ according to Algorithm 2 with budget $\epsilon$ using a tighter bound on $\Delta H$
5:   Otherwise, do not publish the data
6: end for

3.6.2.2 Privacy Guarantee of LIMIT-CB

Before proving the privacy guarantee of LIMIT-CB, we first present the notion of crowd-blending privacy, a strict relaxation of differential privacy [44]. A $k$-crowd-blending private sanitization of a database requires each individual in the database to blend with $k$ other individuals in the database. This concept is related to k-anonymity [114] since both are based on the notion of "blending in a crowd." However, unlike k-anonymity, which only restricts the published data, crowd-blending privacy imposes restrictions on the noise-adding mechanism. Crowd-blending privacy is defined as follows.

Definition 4 (Crowd-blending privacy).
An algorithm $A$ is $(k, \epsilon)$-crowd-blending private if for every database $D$ and every individual $t \in D$, either $t$ $\epsilon$-blends in a crowd of $k$ people in $D$, or $A(D) \approx_{\epsilon} A(D \setminus \{t\})$ (or both).

A result from [44] shows that differential privacy implies crowd-blending privacy.

Theorem 7 (DP → crowd-blending privacy [33]). Let $A$ be any $\epsilon$-differentially private algorithm. Then, $A$ is $(k, \epsilon)$-crowd-blending private for every integer $k \ge 1$.

The following theorem shows that Algorithm 5 is $(k, \epsilon)$-crowd-blending private.

Theorem 8. Algorithm 5 is $(k, \epsilon)$-crowd-blending private.

Proof. First, if there are at least $k$ people in a location, then individual $u$ $\epsilon$-blends with $k$ people in $U$. This is because Line 4 of the algorithm satisfies $\epsilon$-differential privacy, which implies $(k, \epsilon)$-crowd-blending privacy (Theorem 7). Otherwise, we have $A(D) \approx_{0} A(D \setminus \{t\})$ since $A$ suppresses each location with fewer than $k$ users.

3.7 Performance Evaluation

We conduct several experiments on real-world and synthetic datasets to compare the effectiveness and utility of our proposed algorithms. Below, we first discuss our experimental setup. Next, we present our experimental results.

3.7.1 Experimental Setup

3.7.1.1 Datasets

We conduct experiments on one real-world dataset (Gowalla) and two synthetic datasets (Sparse and Dense). The statistics of the datasets are shown in Table 3.2. Gowalla contains the check-in history of users in a location-based social network. For our experiments, we use the check-in data in an area covering the city of New York.

For synthetic data generation, in order to study the impact of the density of the dataset, we consider two cases: Sparse and Dense. Sparse contains 100,000 users while Dense has 10 million users. The Gowalla dataset is sparse as well. We add the Dense synthetic dataset to emulate the case of large industry players, such as Google, who have access to large and fine-grained user location data.
To generate visits, without loss of generality, the location with id $x \in \{1, 2, \ldots, 10{,}000\}$ has a probability $1/x$ of being visited by a user. This means that locations with smaller ids tend to have higher location entropy since more users would visit these locations. In the same fashion, the user with id $y \in \{1, 2, \ldots, 100{,}000\}$ (Sparse) is selected with probability $1/y$. This follows the real-world characteristic of location data where a small number of locations are very popular and many locations have a small number of visits.

Table 3.2: Statistics of the datasets.

                                    Sparse     Dense      Gowalla
# of locations                      10,000     10,000     14,058
# of users                          100K       10M        5,800
Max LE                              9.93       14.53      6.45
Min LE                              1.19       6.70       0.04
Avg. LE                             3.19       7.79       1.45
Variance of LE                      1.01       0.98       0.6
Max #locations per user             100        100        1407
Avg. #locations per user            19.28      19.28      13.5
Max #visits to a loc. per user      20,813     24,035     162
Avg. #visits to a loc. per user     2578.0     2575.8     7.2
Avg. #users per loc.                192.9      19,278     5.6

In all of our experiments, we use five values of privacy budget $\epsilon \in \{0.1, 0.5, 1, 5, 10\}$. We vary the maximum number of visits a user contributes to a location $C \in \{1, 2, \ldots, 5, \ldots, 50\}$ and the maximum number of locations a user visits $M \in \{1, 2, 5, 10, 20, 30\}$. We vary threshold $k \in \{10, 20, 30, 40, 50\}$. We also set $\xi = 10^{-3}$, $\delta = 10^{-8}$, and $\beta = \epsilon / (2 \ln(2/\delta))$. Default values are shown in boldface.

3.7.1.2 Metrics

We use KL-divergence as one measure of preserving the original data distribution after adding noise. Given two discrete probability distributions $P$ and $Q$, the KL-divergence of $Q$ from $P$ is defined as follows:

$$D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \qquad (3.6)$$

In this chapter, the normalized location entropy of location $l$ is treated as the probability that $l$ is chosen when a location is randomly selected from the set of all locations; $P$ and $Q$ are respectively the published and the actual LE of locations after normalization; i.e., normalized values must sum to unity.
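Equation 3.6 applied to normalized entropy values can be sketched as follows. This is an illustrative reading, not the original evaluation code: the function names are hypothetical, natural logarithms are assumed, and clipping non-positive values to a small floor before normalizing is an assumption introduced here, since Laplace noise can push a published entropy to zero or below and the text does not say how such values were handled.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float], floor: float = 1e-12) -> List[float]:
    """Scale values so they sum to one; values below `floor` are clipped (assumption)."""
    clipped = [max(v, floor) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped]


def kl_divergence(p_values: Sequence[float], q_values: Sequence[float]) -> float:
    """D_KL(P || Q) = sum_i P(i) log(P(i) / Q(i)) after normalization."""
    if len(p_values) != len(q_values):
        raise ValueError("P and Q must cover the same locations")
    p = normalize(p_values)
    q = normalize(q_values)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    actual = [3.2, 1.1, 0.4, 2.0]       # made-up entropy values for illustration
    published = [3.0, 1.5, 0.1, 2.2]    # made-up noisy values for illustration
    # P is the published LE and Q the actual LE, as in the text.
    print(kl_divergence(published, actual))
```

The divergence is zero when the published values are an exact rescaling of the actual ones and grows as the two normalized distributions drift apart.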
We also use mean squared error (MSE) over a set of locations $L$ as the metric of accuracy using Equation 3.7:

$$MSE = \frac{1}{|L|} \sum_{l \in L} \left( LE_a(l) - LE_n(l) \right)^2 \qquad (3.7)$$

where $LE_a(l)$ and $LE_n(l)$ are the actual and noisy entropy of the location $l$, respectively.

Since LIMIT-CB discards more locations than LIMIT and LIMIT-SS, we consider both cases: 1) KL-divergence and MSE metrics are computed on all locations $L$, where the entropy of the suppressed locations is set to zero (default case); 2) the metrics are computed on the subset of locations that LIMIT-CB publishes, termed Throwaway.

3.7.2 Experimental Results

We first evaluate our algorithms on the synthetic datasets.

3.7.2.1 Overall Evaluation of the Proposed Algorithms

We evaluate the performance of LIMIT from Section 3.5.2.1 and its variants (LIMIT-SS and LIMIT-CB). We do not include the results for BASELINE since the excessively high amount of injected noise renders the perturbed data useless. Figure 3.6 illustrates the distributions of noisy vs. actual LE on Dense and Sparse. The actual distributions of the dense (Figure 3.6a) and sparse (Figure 3.6e) datasets confirm our method of generating the synthetic datasets; locations with smaller ids have higher entropy, and the entropy of locations in Dense is higher than that in Sparse. We observe that LIMIT-SS generally performs best in preserving the original data distribution for Dense (Figure 3.6c), while LIMIT-CB performs best for Sparse (Figure 3.6h). Note that, as we show later, LIMIT-CB performs better than LIMIT-SS and LIMIT given a small budget $\epsilon$ (see Section 3.7.2.2).

Due to the truncation technique, some locations may be discarded. Thus, we report the percentage of perturbed locations, named published ratio.
The published ratio is computed as the number of perturbed locations divided by the total number of eligible locations. A location is eligible for publication if it contains check-ins from at least $K$ users ($K \ge 1$). Figure 3.7 shows the effect of $k$ on the published ratio of LIMIT-CB. Note that the published ratio of LIMIT and LIMIT-SS is the same as that of LIMIT-CB when $k = K$. The figure shows that the ratio is 100% with Dense, while that of Sparse is less than 10%. The reason is that with Dense, each location is visited by a large number of users on average (see Table 3.2); thus, limiting $M$ and $C$ would reduce the average number of users visiting a location but not to the point where the locations are suppressed. This result suggests that our truncation technique performs well on large datasets.

Figure 3.6: Comparison of the distributions of noisy vs. actual location entropy on the dense and sparse datasets (x-axis: location id; y-axis: entropy). (a) Actual (Dense). (b) LIMIT (Dense). (c) LIMIT-SS (Dense). (d) LIMIT-CB (Dense). (e) Actual (Sparse). (f) LIMIT (Sparse). (g) LIMIT-SS (Sparse). (h) LIMIT-CB (Sparse).

3.7.2.2 Privacy-Utility Trade-off (Varying $\epsilon$)

We compare the trade-off between privacy and utility by varying the privacy budget $\epsilon$. The utility is captured by the KL-divergence metric introduced in Section 3.7.1. We also use the MSE metric. Figure 3.8 illustrates the results.
As expected, when $\epsilon$ increases, less noise is injected, and the values of KL-divergence and MSE decrease. Interestingly though, KL-divergence and MSE saturate at $\epsilon = 5$, where reducing the privacy level (increasing $\epsilon$) only marginally increases utility. This can be explained by a significant approximation error in our thresholding technique that outweighs the impact of having smaller perturbation error. Note that the approximation errors are constant in this set of experiments since the parameters $C$, $M$ and $k$ are fixed.

Figure 3.7: Published ratio of LIMIT-CB when varying k (K = 20), for the Sparse and Dense datasets.

Another observation is that the observed errors are generally higher for Dense (Figures 3.8b vs. 3.8c), which is surprising, as differentially private algorithms often perform better on dense datasets. The reason is that limiting $M$ and $C$ has a larger impact on Dense, resulting in a large perturbation error. Furthermore, we observe that the improvements of LIMIT-SS and LIMIT-CB over LIMIT are more significant with small $\epsilon$. In other words, LIMIT-SS and LIMIT-CB would have more impact with a higher level of privacy protection. Note that these enhancements come at the cost of weaker privacy protection.

3.7.2.3 The Effect of M and C

We first evaluate the performance of our proposed techniques by varying threshold $M$. For brevity, we present the results only for MSE, as similar results have been observed for KL-divergence. Figure 3.9 indicates the trade-off between the approximation error and the perturbation error. Our thresholding technique decreases $M$ to reduce the perturbation error, but at the cost of increasing the approximation error. As a result, at a particular value of $M$, the technique balances the two types of errors and thus minimizes the total error.
For example, in Figure 3.9a, LIMIT performs best at $M = 5$, while LIMIT-SS and LIMIT-CB work best at $M \ge 30$. In Figure 3.9b, however, LIMIT-SS performs best at $M = 10$ and LIMIT-CB performs best at $M = 20$.

Figure 3.8: Varying $\epsilon$ (LIMIT, LIMIT-SS, LIMIT-CB). (a) Dense, KL-divergence. (b) Dense, mean squared error. (c) Sparse, mean squared error. (d) Sparse, Throwaway, mean squared error.

We then evaluate the performance of our techniques by varying threshold $C$. Figure 3.10 shows the results. For brevity, we only include KL-divergence results (the MSE metric shows similar trends). The graphs show that KL-divergence increases as $C$ grows. This observation suggests that $C$ should be set to a small number (less than 10). By comparing the effect of varying $M$ and $C$, we conclude that $M$ has more impact on the trade-off between the approximation error and the perturbation error.

Figure 3.9: Varying M (mean squared error; LIMIT, LIMIT-SS, LIMIT-CB). (a) Dense. (b) Sparse, Throwaway.

Figure 3.10: Varying C (KL-divergence; LIMIT, LIMIT-SS, LIMIT-CB). (a) Dense. (b) Sparse.

3.7.2.4 Results on the Gowalla Dataset

In this section we evaluate the performance of our algorithms on the Gowalla dataset. Figure 3.11 shows the distributions of noisy vs. actual location entropy. Note that we sort the locations based on their actual values of LE as depicted in Figure 3.11a.
As expected, due to the sparseness of Gowalla (see Table 3.2), the published values of LE in LIMIT and LIMIT-SS are scattered, while those in LIMIT-CB preserve the trend in the actual data but at the cost of throwing away more locations (Figure 3.11d). Furthermore, we conduct experiments varying the parameters (i.e., $\epsilon$, $C$, $M$, $k$) and observe trends similar to the Sparse dataset; nevertheless, for brevity, we only show the impact of varying $\epsilon$ and $M$ in Figure 3.12.

Figure 3.11: Comparison of the distributions of noisy vs. actual location entropy on Gowalla, M = 5 (x-axis: location id; y-axis: entropy). (a) Actual. (b) LIMIT. (c) LIMIT-SS. (d) LIMIT-CB.

Figure 3.12: Varying $\epsilon$ and M (Gowalla; mean squared error; LIMIT, LIMIT-SS, LIMIT-CB). (a) Vary $\epsilon$. (b) Vary M.

3.7.2.5 Recommendations for Data Releases

We summarize our observations and provide guidelines for choosing appropriate techniques and parameters. LIMIT-CB generally performs best on sparse datasets because it only focuses on publishing the locations with many visits. Alternatively, if the dataset is dense, LIMIT-SS is recommended over LIMIT-CB since there are sufficient locations with many visits. A dataset is dense if most locations (e.g., 90%) have at least $n_{CB}$ users, where $n_{CB}$ is the threshold for choosing LIMIT-CB. Particularly, given fixed parameters $C$, $\epsilon$, $\delta$, $k$, the value of $n_{CB}$ can be found by comparing the global sensitivity of LIMIT-CB and the precomputed smooth sensitivity.
In Figure 3.2, $n_{CB}$ is a particular value of $n$ where $SS(C, n_{CB})$ is smaller than the global sensitivity of LIMIT-CB. In other words, the noise magnitude required for LIMIT-SS is smaller than that for LIMIT-CB. Regarding the choice of parameters, to guarantee strong privacy protection, $\epsilon$ should be as small as possible as long as the measured utility metrics remain practical. Finally, the value of $C$ should be small ($\le 10$), while the value of $M$ can be tuned to achieve maximum utility.

3.8 Chapter Summary

We explored the setting of releasing data points with free query by introducing the problem of publishing the entropy of a set of locations according to differential privacy. A baseline algorithm was proposed based on the derived tight bound for the global sensitivity of the location entropy. We showed that the baseline solution requires an excessively high amount of noise to satisfy $\epsilon$-differential privacy, which renders the published results useless. A simple yet effective truncation technique was then proposed to reduce the sensitivity bound by two orders of magnitude, and this enabled publication of location entropy with reasonable utility. The utility was further enhanced by adopting smooth sensitivity and crowd-blending. We conducted extensive experiments and concluded that the proposed techniques are practical.

Chapter 4 Encrypted Data Point Release with Fixed-Price Query

In this chapter, we consider a different setting where not only location data points but also other geo-tagged objects such as images or videos can be advertised and purchased through a geo-marketplace for a fixed price. Such a marketplace presents significant challenges. First, if owners upload data with revealed geo-tags, they expose themselves to serious privacy risks. Second, owners must be accountable for advertised data, and must not be allowed to subsequently alter geo-tags.
Third, such a system may be vulnerable to intensive spam activities, where dishonest owners flood the system with fake advertisements. We propose a geo-marketplace that addresses all these concerns. We employ searchable encryption, digital commitments, and blockchain to protect the location privacy of owners while at the same time incorporating accountability and spam-resilience mechanisms. We implement a prototype with two alternative designs that obtain distinct trade-offs between trust assumptions and performance. Our experiments on real location data show that one can achieve the above design goals with practical performance and reasonable financial overhead.

4.1 Introduction

The mobile computing landscape is witnessing an unprecedented number of devices that can acquire geo-tagged data, e.g., mobile phones, wearable sensors, in-vehicle dashcams, and IoT sensors. These devices, owned by a diverse set of entities, can collect large amounts of data such as images, videos, movement parameters, or environmental measurements. The data may be useful to third-party entities interested in gathering information from a certain location. For example, journalists may want to gather images around an event of interest for their newspaper; law enforcement may seek images taken soon before or after a crime occurred; and city authorities may be interested in travel patterns during heavy traffic.

Currently, data collected by individuals are often discarded or archived, due to lack of storage space. Even when data are shared, owners are seldom rewarded for their contributions. An emerging trend is to create data marketplaces where owners advertise their data objects to potential buyers. We emphasize that marketplaces differ from crowdsourcing services such as Amazon Mechanical Turk¹. In crowdsourcing, data are owned by the service provider, and the user receives a small reward for a task, e.g., a few cents for classifying an image.
In contrast, with data marketplaces users own the data and advertise them to buyers. If an object is appealing (e.g., a photo purchased by a newspaper), the buyer may pay a higher price (in the order of tens of dollars or more), resulting in different cost and scalability considerations.

Geo-marketplaces, where entities trade geo-tagged data objects, raise unique concerns. Publishing geo-tags in the clear reveals owners' whereabouts, which may lead to serious privacy breaches such as leakage of one's health status or political orientation. In addition, one must also protect the interests of buyers, and ensure they receive data objects satisfying their spatial requirements. Owners must be held accountable for their advertised data and not be able to change the geo-tag of an object after its initial advertisement. This can prevent situations where owners change geo-tags to reflect ongoing trends in buyers' interest. For example, when a certain high-profile event occurs at a location, dishonest owners may attempt to change their geo-tags closer to that location in order to sell their images at higher prices. Furthermore, the system must provide strong disincentives to prevent spam behavior, where dishonest participants flood the system with fake advertisements.

Footnote 1: https://www.mturk.com/

We propose a geo-marketplace with three key features:

• Privacy. We adapt state-of-the-art searchable encryption (SE) techniques to protect locations, and we perform matching between buyer interests and advertised objects on encrypted geo-tags.

• Accountability. To hold owners accountable for their advertisements, we use cryptographic commitments and blockchain technology. We store a compact digital commitment on the blockchain to prevent owners from altering object geo-tags after publication.

• Spam-Resilience. To discourage spam, we employ a public blockchain, where writing to the ledger requires a transaction fee.
We control the cost such that legitimate users only pay negligible fees relative to the value of their objects, whereas dishonest users who flood the system with fake advertisements are strongly disincentivized.

In our design, the data owner generates a metadata item which includes the object's geo-tag. The bulk data (e.g., image or video) is either stored by the owner (e.g., on a flash drive), or encrypted with conventional encryption at a bulk storage service, such as Swarm [37] or the InterPlanetary File System [36]. The low-footprint geo-tag metadata is encrypted using SE. The owner then creates a digital commitment of the metadata and stores it on the blockchain. Commitments can be stored either individually, or batched together for better blockchain efficiency and cost.

Buyers search objects based on geo-tags by querying the encrypted metadata. They must first obtain a search token that allows them to identify encrypted objects that match their spatial range query. Since processing on encrypted data is computationally expensive, buyers who delegate the task to other services often need to pay for the search token and its processing. In our system model, different strategies are investigated to ensure that the performance and financial cost of the search process are practical. It is important to remark that the encrypted search reveals neither the exact whereabouts of the objects, nor the owner's identity. The buyer learns only pseudonymous owner identifiers for matching objects, e.g., a blockchain public key, through which the transaction can be anonymously completed. Once matching objects are identified, the owner and buyer enter a smart contract through the blockchain. As a result, the owner receives payment, and the buyer receives the actual data objects and the corresponding conventional decryption keys.

Achieving the three aforementioned objectives is challenging.
First, SE techniques incur significant overhead compared to search on plaintexts, especially with asymmetric encryption. Thus, carefully designing data and query encodings is essential to obtain efficient solutions that can be scaled to large datasets. Second, the cost of privacy and accountability should not be too high; otherwise, it may interfere with the financial operation of the marketplace, resulting in prohibitive costs. An acceptable financial cost should only account for a small percentage of the transaction value.

Our specific contributions include:

• We propose a novel architecture for a geo-marketplace that achieves privacy, accountability, and spam resilience by combining searchable encryption, digital commitments, and blockchain. To the best of our knowledge, this is the first work aiming to accomplish these objectives.

• We propose protocols for owner-buyer matching with both symmetric and asymmetric SE. These approaches offer an interesting trade-off between trust assumptions and performance, facilitating adoption in a wide range of scenarios.

• We develop optimization techniques to address the high computational cost of encrypted search. We also consider techniques to decrease the financial cost of blockchain operations by reducing the amount of on-chain storage.

• We perform an extensive experimental evaluation to measure system performance, in terms of computational overhead, storage, and financial cost incurred.

Sec. 4.2 provides background information on the different components of the system, followed by an overview of the system model and operations workflow in Sec. 4.4. We present technical details in Sec. 4.5 and experimental results in Sec. 4.6. We conclude with directions for future research in Sec. 4.7.

4.2 Background

4.2.1 Searchable Symmetric Encryption

Searchable Symmetric Encryption (SSE) allows a client to search and selectively retrieve her encrypted documents outsourced to a server.
SSE was first proposed in [108] and further refined in [48, 23]. The first efficient sub-linear SSE scheme that supports Boolean queries was proposed in [14]. Later on, [111] proposed a scheme that achieves forward security by protecting access patterns at the time of document addition. State-of-the-art SSE schemes are efficient, but at the expense of some leakage in the form of access patterns. In our system, we use the recently proposed HXT technique [73], which supports conjunctive keyword queries.

Let $d_0, \ldots, d_{n-1}$ be the client's documents and $I$ an inverted index that maps a keyword $w$ to the list of document identifiers containing $w$. We denote the list of document identifiers that contain $w$ as $I(w)$. An SSE scheme consists of the following four algorithms:

1. Setup is run by the client and takes as input security parameter $k$ and documents $d_0, \ldots, d_{n-1}$. It generates two secret keys $K_I$ and $K_D$. It parses all the documents and forms an inverted index $I$ that maps each keyword $w$ to the list of document identifiers $I(w)$ that contain $w$. The client encrypts this index using a special encryption algorithm specified by the particular SSE scheme and generates an encrypted inverted index using the key $K_I$. It also encrypts each document $d_i$, $\forall\, 0 \le i < n$, with conventional symmetric encryption (e.g., AES) using the key $K_D$ and assigns it a unique identifier that is independent of the document contents. It outputs the keys $K_I$ and $K_D$, which are stored locally at the client, and the encrypted index, which is sent to the server.

2. Token Generation, run by the client, takes as input secret key $K_I$ and a keyword $w$. Using secret key $K_I$, it creates a search token $tk_w$, which is sent to the server.

3. Search is run by the server and uses as input the token $tk_w$ and the encrypted index. It searches the encrypted index and retrieves the list of document identifiers that contain the keyword $w$, namely $I(w)$.
The server retrieves the encrypted documents $d^w_0, \ldots, d^w_{|I(w)|-1}$ using the identifiers in $I(w)$ and sends the documents to the client.

4. Decryption is run by the client and uses the secret key $K_D$ to decrypt the documents received from the server.

4.2.2 Hidden Vector Encryption

Hidden Vector Encryption (HVE) [11, 12] is an asymmetric searchable encryption technique supporting conjunctive equality, range and subset queries. Search on ciphertexts can be performed with respect to a number of index attributes. HVE represents an attribute as a bit vector (each element has value 0 or 1), and the search predicate as a pattern vector where each element can be 0, 1 or '*' (i.e., wildcard value). Let $l$ denote the HVE width, which is the bit length of the attribute, and consequently that of the search predicate. A predicate evaluates to True for a ciphertext $C$ if the attribute vector $I$ used to encrypt $C$ has the same values as the pattern vector of the predicate in all positions that are not '*' in the latter.

HVE is built on top of a symmetric bilinear map of composite order [10], which is a function $e : \mathbb{G} \times \mathbb{G} \to \mathbb{G}_T$ such that $\forall a, b \in \mathbb{G}$ and $\forall u, v \in \mathbb{Z}$ it holds that $e(a^u, b^v) = e(a, b)^{uv}$. $\mathbb{G}$ and $\mathbb{G}_T$ are cyclic multiplicative groups of composite order $n = p \cdot q$, where $p$ and $q$ are large primes of equal bit length. We emphasize that the application of function $e$, which is called a bilinear pairing, is expensive to compute, so the number of pairings must be minimized. We denote by $\mathbb{G}_p$, $\mathbb{G}_q$ the subgroups of $\mathbb{G}$ of orders $p$ and $q$, respectively. HVE consists of the following four algorithms:

1. Setup. The private/public key pair (SK/PK) is as follows:

$SK = (g_q \in \mathbb{G}_q,\ a \in \mathbb{Z}_p,\ \forall i \in [1..l] : u_i, h_i, w_i, g, v \in \mathbb{G}_p)$

$PK = (g_q,\ V = v R_v,\ A = e(g, v)^a,\ \forall i \in [1..l] : U_i = u_i R_{u,i},\ H_i = h_i R_{h,i},\ W_i = w_i R_{w,i})$

with random $R_{u,i}, R_{h,i}, R_{w,i} \in \mathbb{G}_q$, $\forall i \in [1..l]$, and $R_v \in \mathbb{G}_q$.

2. Encryption uses PK and takes as parameters index attribute $I$ and message $M \in \mathbb{G}_T$.
The following random elements are generated: $Z, Z_{i,1}, Z_{i,2} \in \mathbb{G}_q$ and $s \in \mathbb{Z}_n$. The ciphertext is:

$C = (C' = M A^s,\ C_0 = V^s Z,\ \forall i \in [1..l] : C_{i,1} = (U_i^{I_i} H_i)^s Z_{i,1},\ C_{i,2} = W_i^s Z_{i,2})$

3. Token Generation. Using SK, and given a search predicate encoded as pattern vector $I_*$, the TA generates a search token $TK$ as follows: let $J$ be the set of all indices $i$ where $I_*[i] \ne *$. The TA randomly generates $r_{i,1}$ and $r_{i,2} \in \mathbb{Z}_p$, $\forall i \in J$. Then

$TK = (I_*,\ K_0 = g^a \prod_{i \in J} (u_i^{I_*[i]} h_i)^{r_{i,1}} w_i^{r_{i,2}},\ \forall i \in [1..l] : K_{i,1} = v^{r_{i,1}},\ K_{i,2} = v^{r_{i,2}})$

4. Query is executed at the server, and evaluates whether the predicate represented by $TK$ holds for ciphertext $C$. The server attempts to determine the value of $M$ as

$M = C' \Big/ \Big( e(C_0, K_0) \Big/ \prod_{i \in J} e(C_{i,1}, K_{i,1})\, e(C_{i,2}, K_{i,2}) \Big)$    (4.1)

If the index $I$ on which $C$ was computed satisfies $TK$, the value of $M$ is returned; otherwise a nil value $\perp$ is obtained.

4.2.3 Vector Digital Commitments

Cryptographic commitments [90] allow a party $S$ to commit to a message $m$ by creating a commitment $CC$, such that $CC$ is binding (i.e., $S$ cannot change the message $m$) and hiding (i.e., $CC$ does not leak any information about $m$). In this work, we use vector commitments [16], which allow party $S$ to commit to an ordered sequence of messages $(m_0, \ldots, m_{q-1})$, such that it can later open the commitment for a specific message, e.g., to prove that $m_i$ is the $i$-th message in the sequence. Vector commitments are space-efficient because their size is independent of the number of committed values. A vector commitment scheme is defined by the following four algorithms:

1. KeyGen takes as input security parameter $k$ and size $q$ of the committed vector and outputs a public parameter $pp$.

2. Commit takes as input a sequence of $q$ messages $V = m_0, \ldots, m_{q-1}$ and the public parameter $pp$, and outputs a commitment string $CC$ and auxiliary information $aux$.

3.
Open takes as input a message $m \in \{m_0, \ldots, m_{q-1}\}$, a position $i$, and the auxiliary information $aux$, and is run by the committer to produce a proof $P_i$ that $m$ is the $i$-th message in the committed message vector $V$.

4. Verify takes as input the commitment string $CC$, a message $m$, a position $i$, and the proof $P_i$, to verify that $P_i$ is a valid proof that $CC$ was created for a sequence $m_0, \ldots, m_{q-1}$, where $m = m_i$.

4.2.4 Blockchain, Smart Contracts, and Bulk Storage

Blockchain was first introduced in [82] as a decentralized public ledger that records transactions among entities without a trusted party. A blockchain is a sequence of transaction blocks cryptographically linked through the hash value of the predecessor. A transaction typically moves cryptocurrency from one account to another. An account is defined as the public key of an entity, which provides pseudonymity. System nodes called miners compete to create new blocks by solving proof-of-work puzzles. The miner who finds the puzzle solution first is rewarded with cryptocurrency.

Some blockchain platforms (e.g., Ethereum) have the ability to execute smart contracts [115], which are sophisticated agreements among entities that utilize transactions on the blockchain. Smart contracts are expressed in a high-level programming language (e.g., Solidity) interpreted by a blockchain virtual machine.

One limitation when storing data on the blockchain is size. Due to the competitive nature of block creation, the growth rate of the blockchain is limited. Recently, decentralized storage systems have been proposed that interface with the blockchain and allow large amounts of storage (e.g., Swarm [37]). Such systems provide a distributed hash table (DHT) interface [112] to store and retrieve data. Participating peers receive incentives for the contributed storage.
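To make the Commit/Open/Verify interface of Section 4.2.3 more concrete, and to show why only a short string needs to go on-chain, the following is a minimal sketch that uses a salted Merkle tree as a stand-in. This is an assumption made purely for illustration: it is not the vector commitment scheme of [16] (its proofs grow logarithmically with the vector size, and no KeyGen step is needed), and the function names `commit`, `open_at`, and `verify` are hypothetical.

```python
import hashlib
import os


def _h(*parts):
    """SHA-256 over the concatenation of the given byte strings."""
    return hashlib.sha256(b"".join(parts)).digest()


def commit(messages):
    """Commit to an ordered list of byte strings.

    Assumes len(messages) is a power of two. Returns (CC, aux): CC is the
    32-byte Merkle root, aux holds the salts and all tree levels.
    """
    salts = [os.urandom(16) for _ in messages]  # random salts hide the leaves
    level = [_h(b"\x00", s, m) for s, m in zip(salts, messages)]
    levels = [level]
    while len(level) > 1:
        level = [_h(b"\x01", level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return level[0], (salts, levels)


def open_at(i, aux):
    """Produce a proof that position i holds the committed message."""
    salts, levels = aux
    path, idx = [], i
    for level in levels[:-1]:
        path.append(level[idx ^ 1])  # sibling hash at this level
        idx //= 2
    return salts[i], path


def verify(cc, message, i, proof):
    """Check that `message` is the i-th committed message under root cc."""
    salt, path = proof
    node, idx = _h(b"\x00", salt, message), i
    for sibling in path:
        if idx % 2 == 0:
            node = _h(b"\x01", node, sibling)
        else:
            node = _h(b"\x01", sibling, node)
        idx //= 2
    return node == cc
```

In this sketch an owner would publish only the 32-byte root, regardless of how many geo-tag records are batched together; a buyer who later receives an opening can check it against that root, and any altered geo-tag fails verification.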
4.3 Related Work

Blockchain has been adopted in many areas such as healthcare [58, 72], Internet of Things [38, 70], smart vehicles [75, 66], real-world asset trading [88], finance [57], and logistics [30]. ModelChain [72] is a decentralized framework for privacy-preserving healthcare predictive modeling based on a private blockchain. CreditCoin [75] uses the blockchain and threshold ring signatures to achieve anonymity for smart vehicles. The work in [58] proposed a blockchain-based location sharing scheme for telecare medical information systems. However, in their system, locations are encrypted using an order-preserving encryption scheme, which is known to incur significant leakage. Closer to our work, [66] uses a blockchain-based model to protect locations of smart vehicles; however, their privacy model relies on random identifiers and enlargement of reported areas, providing only ad-hoc protection, as opposed to our solution, which inherits the strong protection of encryption. In the context of data exchange and marketplaces for location data using blockchain, [138] proposed a system based on conventional encryption, which does not support search on ciphertexts. Fysical [78] is a blockchain-based marketplace where suppliers can sell plaintext or aggregated location data, which raises serious privacy issues.

There are some important lines of work orthogonal to our approach. One direction focuses on creating proofs of location using blockchain [13, 40, 76, 8]. The recently proposed Hawk [68] system is a blockchain model that provides transactional privacy such that private bids and financial data are hidden from public view. Zhang et al. [134] proposed the GEM$^2$-tree for authenticated range queries with off-chain storage. Such approaches can be integrated into our system to validate geo-tags, hide transactions, or validate results from our searchable encrypted indices.
4.4 System Model

The central component in our design is the blockchain, and its associated on-chain operations. On-chain storage is financially expensive, since write operations to the chain translate into transaction blocks added to the ledger. Our objective is to minimize the amount of on-chain storage. Only digital commitments and minimal addressing information are stored on-chain. For all other data structures, we employ bulk storage (Swarm).

Another challenging part of the system is matching owners' data to buyers' requests. This process involves search on encrypted location metadata, which is computationally expensive, especially in the case of asymmetric searchable encryption. Searchable ciphertexts also tend to be large in size compared to corresponding plaintexts, in order to support conjunctive queries and hide data patterns. In the case of SSE, metadata ciphertexts and associated indexes are also placed in bulk storage.

We present evaluation metrics in Section 4.4.1, followed by two alternative system designs. In Section 4.4.2 we present a solution based on SSE, which achieves sub-linear search performance thanks to the use of an encrypted index. However, this approach requires a trusted curator (TC), which holds the secret encryption key and has access to the plaintext locations of all object geo-tags. In Section 4.4.3, we propose an asymmetric encryption design, where each owner has the public key of a private/public key pair. Owners encrypt locations using Hidden Vector Encryption (HVE). There is still a need for a trusted authority (TA) that holds the private key and generates search tokens at runtime, but this entity does not have access to plaintext locations. While in principle it is possible for the TA to collude with buyers and issue numerous search tokens that may reveal all object locations, such an attack is more difficult to stage. We assume that the TA is non-colluding.
In addition, it is possible to use multiple TAs, so the amount of disclosure in the case of collusion is limited. The disadvantage of HVE is that it does not allow the construction of an index, so a linear search is required.

4.4.1 Evaluation Metrics

We consider computation time, storage size and financial cost as performance metrics. The latter is measured in Ethereum using the concept of gas. Each on-chain transaction requires spending a certain amount of gas to complete. The cost of one unit of gas is linked to the Ethereum price, which is market-driven. One unit of gas costs 1e-9 ether. At the time of writing, 1 ether = $133. Since on-chain operations are dominated by the cost of gas, which far exceeds the corresponding storage and computational overhead, we exclusively use gas to measure on-chain operations. For off-chain operations, we use computation time to evaluate the performance of: location encryption and index generation (indexes are used only for SSE), cryptographic commitment generation, search token generation and token-object matching. In terms of storage cost, we focus on ciphertext size and search token size. Commitment size is taken into account using gas, as it is stored on-chain.

4.4.2 Private Geo-marketplace with SSE

We employ HXT encryption [73], the state-of-the-art in conjunctive keyword search. HXT builds an index which allows sub-linear search. Fig. 4.1 shows the system architecture, with three types of entities: owners, buyers and a trusted curator (TC). The TC collects plaintext locations of all objects and builds an index with SSE-encrypted locations. In practice, the curator role can be fulfilled by an entity that already has an established relationship of trust with owners, e.g., a cell operator that already has access to customer locations. The TC has financial incentives to operate the service: it can charge a small fee for location indexing and search token generation. The system can have more than one curator, each serving a subset of owners. While this reduces the amount of location
The system can have more than one curator, each serving a subset of owners. While this reduces the amount of location 53 Figure 4.1: SSE-based System Workflow exposure to a single entity, it results in multiple indexes, which may reduce search efficiency. In the rest of the chapter, we assume a single curator. The TC initializes the index (Step 0 in Fig. 4.1), assigns it a unique index identifier (IID) and stores it in bulk storage. The IID is also stored on the blockchain, and later used by buyers to boot- strap the search. For example, the IID may include TC contact information such as URI. Each data owner is represented by a pseudonymous identifier, e.g., public key in the blockchain system. To advertise an object, the owner randomly generates a unique object identifier (OID) and computes a digital commitment that covers the geo-tag and OID (Step 1). The digital commitment is stored on the blockchain. Since the commitment is hiding, placing it on the blockchain will not disclose the object’s location. The binding property of commitments, combined with the unmalleable stor- age property of the blockchain, guarantees that the owner can be held accountable if it turns out that the advertised object is collected at another location than the advertised one. Next (Step 2), the owner submits the plaintext location along with the OID to the TC, and uploads the bulk data (encrypted with conventional AES) to bulk storage (Step 3) having the OID as key (recall that, the 54 bulk storage offers a DHT-like interface and stores key-value pairs). Then, the curator inserts the object in the encrypted HXT index (Step 4). We emphasize that location proofs are orthogonal to our approach, and existing solutions [13, 40, 76, 8] can be adopted in our system. Buyers search objects based on geo-tags. We assume buyers specify search predicates in the form of spatial range queries. 
The buyer locates the TC bootstrap information (Step 5), and contacts the TC with the search predicate in plaintext. The TC, who holds the master secret key of the SE instance, generates a search token to evaluate the spatial predicate. The token is sent to the buyer (Step 6). The curator may charge the buyer a fee for each token. Flexible pricing policies may be implemented by the curator: for instance, tokens that cover a larger area, or that cover denser areas where one is expected to find more objects, may be more expensive.

Next (Step 7), the actual search is performed using the HXT index. The index is stored in distributed fashion on bulk storage, so the search process can be completed by the buyer through repeated interaction with the DHT interface using the IID and index pointers as request keys. Alternatively, the buyer can employ another service that performs the search directly on the storage nodes. The details of this process are orthogonal to our approach, and we consider as performance metric the total computational cost incurred by the search, which is the same whether it is executed on the buyer's machine (e.g., in the case of an institutional buyer with significant resources), or on the Swarm nodes (e.g., in the case of a private buyer who performs the transaction using a mobile phone and pays an additional fee for the search).

When the search completes, the buyer learns the pseudo-identities of matching owners, and decides which data object(s) to purchase. The purchase is completed through a smart contract between the owner and the buyer (Step 8), following which the owner receives payment, and the buyer receives the AES key used to encrypt the object in Swarm. The buyer downloads the object (Step 9) and decrypts it locally, at which point the transaction is finalized.
If it turns out that the data does not satisfy the advertised geospatial attributes, the buyer can contest the transaction, and use the digital commitment to prove that the owner is dishonest. The payment is reversed, and additional punitive measures (e.g., reputation penalties) can be taken against the owner. The transactions on the blockchain are evidence of the purchase and provide accountability.

Figure 4.2: HVE-based System Workflow

4.4.3 Private Geo-marketplace with HVE

Fig. 4.2 illustrates the system workflow when using HVE encryption. The main difference compared to the SSE case is the absence of the curator. Instead, there is a Trusted Authority (TA), which holds the private (or master) key of the HVE instance and issues search tokens. Most of the steps remain the same, with a few exceptions. In Step 0, instead of building an index, the TA initializes a flat file that will contain all the encrypted object locations. In Step 2, the owners encrypt object locations by themselves, using the public HVE key. This considerably reduces disclosure compared to the SSE case. The rest of the workflow remains unchanged. However, Steps 4-8 in Fig. 4.2 are numbered one lower than their SSE counterparts, due to the absence of the index update step.

4.5 Technical Approach

Section 4.5.1 illustrates the search process using SSE, whereas Section 4.5.2 focuses on HVE. Section 4.5.3 discusses accountability and spam resilience.

4.5.1 Symmetric Encryption Search

SSE techniques [73, 111] support keyword (i.e., exact match) queries, and conjunctions thereof, over an arbitrary domain. Our objective is to support range queries on top of geospatial data. For simplicity, we consider a two-dimensional (2D) space (although our results can be easily generalized to 3D). SSE schemes assume a database of documents, where each document is associated with a set of keywords.
In our setting, each object is a document, and its keywords are derived based on the geo-tag of the object.

First, we discuss the case of data and range queries for a one-dimensional (1D) domain. Consider a domain $A$ of integers from $0$ to $L-1$, where $L$ is a power of 2 (i.e., the domain of $(\log L)$-bit integers). Even though spatial coordinates are real numbers, we can represent them using an integer-valued domain with good precision. We construct a full binary tree over domain $A$, where the value in each node represents a domain range. Each node can be uniquely identified using the path leading to it from the root node. Along that path, left branches are labeled with 0 and right branches with 1. A node identifier (id) is the unique string that concatenates all edge labels on the path from the root to that node. Figure 4.3 shows the resulting binary tree for a 3-bit domain $A = [0, \ldots, 7]$. For example, the node id of $N_{2,3}$ in Fig. 4.3 is "01".

We adopt a domain encoding called best range cover (BRC) [64]. Given a range $r$, BRC selects the minimal set of nodes that cover $r$. The ids of the nodes in this set represent the keywords associated with $r$. For example, the range $[2, 7]$ is minimally covered by nodes $N_{2,3}$ and $N_{4,7}$ (shown shaded in Fig. 4.3), with node ids (i.e., keywords) "01" and "1"; whereas the range $[2, 6]$ is minimally covered by nodes $N_{2,3}$, $N_{4,5}$ and $N_6$, with node ids "01", "10" and "110". The ranges covering a leaf node can be identified by traversing the tree upward from that leaf node. Given a data value, the associated keywords are represented by the node ids from the upward traversal that starts at the leaf node representing the value's binary representation. For instance, leaf node $N_3$ is covered by nodes $N_3$, $N_{2,3}$, $N_{0,3}$, and $N_{0,7}$ (encircled in Fig. 4.3), with node ids "011", "01", "0" and $\emptyset$ as keywords (the root node is encoded as $\emptyset$ since its path has length 0).
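The 1D encoding above can be sketched in a few lines. The following is an illustrative Python sketch rather than the thesis implementation; the function names `covering_ids` and `best_range_cover` are hypothetical, node ids are plain bit strings, and the root is represented by the empty string.

```python
def covering_ids(pos, L):
    """Ids of all nodes covering leaf `pos`, from the leaf up to the root.

    Assumes L is a power of two and 0 <= pos < L. The root id is "".
    """
    bits = L.bit_length() - 1           # log2(L)
    leaf = format(pos, "0{}b".format(bits))
    return [leaf[:k] for k in range(bits, -1, -1)]


def best_range_cover(lo, hi, L):
    """Minimal set of node ids whose ranges exactly tile [lo, hi]."""
    cover = []

    def visit(node_lo, node_hi, node_id):
        if hi < node_lo or node_hi < lo:
            return                       # node is disjoint from the query
        if lo <= node_lo and node_hi <= hi:
            cover.append(node_id)        # node fully inside: take it, stop
            return
        mid = (node_lo + node_hi) // 2
        visit(node_lo, mid, node_id + "0")       # left branch labeled 0
        visit(mid + 1, node_hi, node_id + "1")   # right branch labeled 1

    visit(0, L - 1, "")
    return cover
```

For the 3-bit domain of Fig. 4.3, `best_range_cover(2, 7, 8)` yields the ids "01" and "1", and `covering_ids(3, 8)` yields "011", "01", "0" and the empty root id, matching the examples in the text.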
For a 2D domain with dimensions $Ox$ and $Oy$, a separate binary tree is constructed for each dimension, and the node ids are prefixed with "x" and "y", respectively, to distinguish values in each coordinate. For example, the ids of nodes $N^x_{2,3}$ and $N^y_{2,3}$ on the trees of $Ox$ and $Oy$ are represented as "x01" and "y01", respectively. Each location is a single cell in the grid of size $L \times L$ covering the entire geospatial domain. For an object positioned at cell $(i, j)$ corresponding to leaf nodes $N^x_i$, $N^y_j$ in their respective domain trees, the union of node ids covering $N^x_i$, $N^y_j$ is used as the keyword set for the object. The id of the root is omitted since it appears in every object. The total number of keywords of each object is $2 \log L$. For instance, the keyword set of an object in cell $(3, 4)$ contains $2 \log 8 = 6$ nodes: $N^x_3$, $N^x_{2,3}$, $N^x_{0,3}$, $N^y_4$, $N^y_{4,5}$, $N^y_{4,7}$, with labels "x011", "x01", "x0", "y100", "y10", "y1".

A range query in the 2D domain is a cross join of node ids in each dimension. For example, the 2D range query defined by $x \in [2, 7]$ and $y \in [2, 6]$ is expressed as $(N^x_{2,3} \wedge N^y_{2,3}) \vee (N^x_{2,3} \wedge N^y_{4,5}) \vee \ldots \vee (N^x_{4,7} \wedge N^y_6)$, or in our specific encoding ("x01" $\wedge$ "y01") $\vee$ ("x01" $\wedge$ "y10") $\vee \ldots \vee$ ("x1" $\wedge$ "y110"). Each term in the expression is a conjunctive keyword query, which is directly supported by HXT. HXT query time depends on the number of documents containing the first keyword [73]. We sort the keywords in a query so that the node closest to the leaf level is in the first position, since that node covers a smaller range, hence query cost is decreased.

Given a geographical area of interest, the trusted curator (TC) divides the area into an $L \times L$ grid with an appropriate domain granularity $L$ (e.g., one can choose $L$ such that one unit corresponds to a distance of one meter). Then, using the range covering technique above, the TC builds an HXT-encrypted index for all objects.
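The 2D keyword sets and the cross-join query decomposition can be illustrated on plaintext keywords as follows. This is a sketch under stated assumptions: all function names are hypothetical, matching is done on plaintext sets purely to show the logic (in the actual system each conjunctive term would be evaluated by HXT over the encrypted index), and "closest to the leaf level" is approximated by putting the longer node id first.

```python
from itertools import product


def covering_ids(pos, L):
    """Ids of nodes covering leaf `pos`, leaf first, root omitted."""
    bits = L.bit_length() - 1
    leaf = format(pos, "0{}b".format(bits))
    return [leaf[:k] for k in range(bits, 0, -1)]


def best_range_cover(lo, hi, L):
    """Minimal set of node ids that exactly tile [lo, hi] in one dimension."""
    cover, stack = [], [(0, L - 1, "")]
    while stack:
        node_lo, node_hi, node_id = stack.pop()
        if hi < node_lo or node_hi < lo:
            continue
        if lo <= node_lo and node_hi <= hi:
            cover.append(node_id)
            continue
        mid = (node_lo + node_hi) // 2
        stack.append((mid + 1, node_hi, node_id + "1"))
        stack.append((node_lo, mid, node_id + "0"))
    return cover


def object_keywords(x, y, L):
    """Keyword set of an object in cell (x, y): 2*log2(L) prefixed ids."""
    return (["x" + w for w in covering_ids(x, L)]
            + ["y" + w for w in covering_ids(y, L)])


def query_terms(x_range, y_range, L):
    """Cross join of the per-dimension covers; deeper node listed first."""
    xs = ["x" + w for w in best_range_cover(x_range[0], x_range[1], L)]
    ys = ["y" + w for w in best_range_cover(y_range[0], y_range[1], L)]
    return [tuple(sorted(t, key=len, reverse=True)) for t in product(xs, ys)]


def matches(keywords, terms):
    """True if any conjunctive term is fully contained in the keyword set."""
    kw = set(keywords)
    return any(a in kw and b in kw for a, b in terms)
```

With `L = 8`, the query $x \in [2, 7]$, $y \in [2, 6]$ decomposes into six conjunctive terms, and the object in cell $(3, 4)$ matches through the term ("x01", "y10").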
When a buyer initiates the search process, a range query is performed as a series of conjunctive keyword queries of length two (due to the 2D domain). If any term evaluates to true for an object, the object matches the query.

Figure 4.3: Mapping 1D domain using best range cover

Algorithm 6 shows the pseudocode to obtain covering node ids for each object coordinate. Algorithm 7 shows how to generate the object-keyword database DDB, which is subsequently encrypted using HXT's Setup procedure (Section 4.2). Setup outputs master key mk, public parameters pub, and encrypted database EDB. Algorithm 8 shows the pseudocode for answering encrypted spatial range queries on top of EDB.

Algorithm 6 Get 1D Covering Nodes
Input: Domain length $L$; position $pos$ with $0 \le pos < L$
Output: Node ids covering position $pos$
  node_ids ← ∅
  current_id ← ""
  for i = log L − 1 down to 0 do
    append the i-th bit of pos to current_id
    node_ids ← node_ids ∪ {current_id}
  end for
  return node_ids

Limiting Query Size and Placement. With the proposed technique, a buyer can issue range queries of arbitrary size, shape, and placement. Arbitrary queries are decomposed into a set of queries that are precisely covered by a domain tree node, and a disjunctive expression is formed, where each term is a conjunctive HXT query. However, such flexibility can decrease performance,
However, such flexibility can decrease performance, 59 Algorithm 7 Convert Object Locations Input: Grid length L; Location database LDB where LDB[i]=(x i ;y i ),8i : 0 x i ;y i < L; Output: Document database DDB for all object id i2 LDB do x words Get 1D Covering Nodes of x i y words Get 1D Covering Nodes of y i for all w2 x words do DDB[i] = DDB[i][ concatenate( 0 x 0 ;w) end for for all w2 y words do DDB[i] = DDB[i][ concatenate( 0 y 0 ;w) end for end for return DDB Algorithm 8 EncryptedSpatialRangeQuery Input: Grid size L; EDB; range R; HXT parameters pub, mk Output: Matched object identifiers x words Get1DCover[R.bottom right x ;R.top left x ] y words Get1DCover[R.bottom right y ;R.top left y ] matches / 0 for all pairs kq=(w x ;w y )2 x words y words do matches matches[ HXTSearch(param, mk, kq, EDB) end for return matches 60 since there may be numerous sub-queries in the decomposition. In practice, query sizes are likely to be small compared to the data domain (e.g., no more than 1km 2 within a city). In addition, one can slightly restrict query placement, requiring that a range aligns precisely with a tree node. First, we consider limiting query size. Specifically, if query size is limited to a maximum of L=2 h max in each dimension, then one needs to consider only nodes up to a level h max of the domain tree. The larger h max is, the fewer node levels are considered. Nodes at higher levels (i.e., level < h max ) can be ignored when constructing the keyword set for each object, resulting in smaller ciphertexts and faster processing. Index creation time is also significantly boosted. Second, we restrict queries to areas that are precisely covered by one node of each tree. For example, for the 1D domain in Fig. 4.3, although both ranges [3;6] and [4;7] have size 4, range [3;6] is covered exactly by three nodes(N 3 ;N 4;5 ;N 6 ) while range[4;7] can be covered with a single node(N 4;7 ). 
With this alignment restriction, query decomposition is no longer necessary, and each range query can be encoded as a single conjunctive keyword pair.

4.5.2 Asymmetric Encryption Search

The symmetric encryption approach described in the previous section is quite efficient, but it requires a trusted curator that has access to all plaintext locations of advertised objects. This can lead to excessive disclosure, and in some cases, it may be unrealistic to assume that system users are willing to place so much trust in a centralized component. The HVE-based approach described next uses asymmetric encryption and allows object owners to encrypt locations by themselves. However, this comes at additional performance overhead.

Previous work that focused on location-based queries on top of HVE-encrypted data considered hierarchical or Gray encodings [46]. In this context, each object location is snapped to an $L \times L$ grid. An attribute vector is constructed for each object, which has width $l = 2 \log L$. Next, queries are expressed with respect to groups of neighbor cells with similar encoded values. This may lead to excessive computation time, given that the number of expensive bilinear pairing operations required to evaluate a single token is proportional to the HVE index vector width (as discussed in Section 4.2). Furthermore, a range query in [46] often requires more than a single token to evaluate, increasing computation time even more.

Another problem with using multiple tokens for a single query is excessive leakage. To improve performance, the work in [46] uses single-cell token aggregation and may end up with several tokens for each query predicate, corresponding to sub-ranges of the query. Based on the individual evaluation for each sub-query, an adversary may pinpoint the object's location to an area that is more compact than the actual query of the buyer, which may result in significant privacy leakage.
To address these issues, we propose a new approach of encoding range queries using HVE. Essentially, our approach first transforms range queries to keyword queries, using a domain mapping similar to the one used for SSE. Then, HVE is used to assemble a ciphertext that allows HVE evaluation of a conjunctive exact match query, one for each spatial dimension. Recall from Section 4.2 that the elements of an attribute vector can take values in $\Sigma = \mathbb{Z}_m \cup \{*\}$ for an integer $m < p, q$. Also, a conjunctive formula of length two formed with respect to node identifiers on the Ox and Oy axes represents a 2D range. For example, $[N^x_{2,3}, N^y_{4,5}]$ indicates range $[2,3] \times [4,5]$ in the 2D domain.

We utilize the domain mapping technique described in Section 4.5.1 to encode both the location of an object and a range query. Then, for each coordinate of an object's location, there are $\log L$ nodes in the upward path from the leaf node of the coordinate to the root. To capture all covering ranges on both coordinates, we would need to capture within each ciphertext $\log^2 L$ pairs of $(x, y)$ coordinates. Even for moderate domain representation granularities (e.g., $L = 65536$, $\log L = 16$), this would result in a significant storage overhead (e.g., 256 values for the $\log L = 16$ setting). In addition to the storage overhead, there is also increased processing time when performing queries, since all pairs are potential candidates for matching.

To prevent performance deterioration, in the case of HVE we choose to adopt a further query limitation compared to SSE: specifically, we consider only query ranges with square shape. These queries can still occur at each level of the domain tree, but the range spans in each dimension are equal. As a result, we only need to store $\log L$ pairs of encrypted coordinates in an HVE ciphertext. In fact, when combined with the maximum query size limit discussed for SSE, the overhead decreases to $\log L - h_{max} + 1$.
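As a quick check of these counts, take the setting mentioned in the text, $L = 65536$ (so $\log L = 16$), together with $h_{max} = 6$ as an assumed example value (it is one of the values used later in the experiments):

```latex
\begin{align*}
\text{all level pairs:} \quad & \log^2 L = 16^2 = 256 \\
\text{square ranges only:} \quad & \log L = 16 \\
\text{square ranges with size limit } h_{max} = 6: \quad & \log L - h_{max} + 1 = 16 - 6 + 1 = 11
\end{align*}
```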
Even with this additional constraint, buyers are still able to formulate useful queries. If one considers the families of all regular grids with granularity increasing in powers of two superimposed on top of the data domain, then our encoding still allows a query to express any possible cell within one of these grids. While clearly more restrictive compared to arbitrary queries, the approach achieves a good trade-off between flexibility and performance.

Fig. 4.4 exemplifies the modified encoding. For each object, the ids of nodes in the domain binary tree on one dimension are combined with the ids of nodes at the same level in the other dimension to form for each ciphertext $\log L$ vectors of length one. Each vector corresponds to a square area that is covered exactly by one node of each tree. Range queries are also restricted in the same fashion, i.e., a square area covered exactly by one node of each tree, to allow conjunctive exact match evaluation. Thus, both the location of an object and a range query are encoded as a single scalar value, which can be used directly in the HVE matching phase. In our implementation, the order of the node in a pre-order traversal of the tree, identified by the path from the root to the leaf, is used as the node identifier. Then, to ensure each area has a unique value, the final value is calculated as $N_x \cdot 2L + N_y$, where $N_x, N_y$ are the node ids of the object in the tree of Ox and Oy, respectively.

For instance, consider an object at location $(3,4)$ and a range query $[2,3] \times [4,5]$. As shown in Fig. 4.4, the location $(3,4)$ is covered by nodes $N_{0,3}, N_{2,3}, N_3$ in the tree of Ox and $N_{4,7}, N_{4,5}, N_4$ in the tree of Oy (the superscripts are omitted for simplicity). Hence, the ranges covering the object are represented as the combinations of a node in the tree of Ox and a node in the same level in the tree of Oy (e.g., $[N_{0,3} \wedge N_{4,7}], [N_{2,3} \wedge N_{4,5}], [N_3 \wedge N_4]$).
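The scalar encoding just described can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' code: L is assumed to be a power of two, the root carries pre-order id 0, the root level itself is skipped (following the example in the text), and all function names are hypothetical.

```python
def preorder_id(lo, hi, L):
    """Pre-order index of node [lo, hi] in the complete binary tree over [0, L-1]."""
    node_lo, node_hi, idx = 0, L - 1, 0
    while (node_lo, node_hi) != (lo, hi):
        if node_lo == node_hi:
            raise ValueError("interval is not a node of the domain tree")
        mid = (node_lo + node_hi) // 2
        if hi <= mid:
            # Left child is visited immediately after its parent.
            node_hi = mid
            idx += 1
        else:
            # Right child comes after the parent and the whole left subtree,
            # which holds 2 * leaves - 1 nodes.
            left_leaves = mid - node_lo + 1
            idx += 1 + (2 * left_leaves - 1)
            node_lo = mid + 1
    return idx


def path_nodes(v, L):
    """Nodes covering leaf v, from the level below the root down to the leaf."""
    nodes, lo, hi = [], 0, L - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if v <= mid:
            hi = mid
        else:
            lo = mid + 1
        nodes.append((lo, hi))
    return nodes


def encode_location(x, y, L):
    """One scalar per tree level: N_x * 2L + N_y for same-level node pairs."""
    return [preorder_id(*nx, L) * 2 * L + preorder_id(*ny, L)
            for nx, ny in zip(path_nodes(x, L), path_nodes(y, L))]


def encode_query(x_node, y_node, L):
    """Scalar for a square query covered by one node of each tree."""
    return preorder_id(*x_node, L) * 2 * L + preorder_id(*y_node, L)


values = encode_location(3, 4, 8)
token = encode_query((2, 3), (4, 5), 8)
print(values, token, token in values)
```

Because node ids in a tree with L leaves range over 0 to 2L-2, multiplying $N_x$ by 2L leaves room for $N_y$ without collisions, which is why each square area receives a unique value.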
Then, these combinations are transformed into length-two vectors using pre-order traversal of the elements, which are $[1,8]$, $[5,9]$, $[7,10]$. The final calculated values for each case are $[24]$, $[89]$ and $[122]$, respectively. The same encoding is also applied to the range query, resulting in value $[89]$.

Compared to the approach in [46], which results in HVE vectors of length $l = 2\log L$, we always generate length-one vectors, where each vector is constructed from nodes of the same level in each tree. In addition, following the query size limitation, the size of the queries is at most $L/2^{h_{max}}$ in each dimension. At runtime, when a buyer issues a query, the resulting token generated by the TA must be compared with all vectors of the ciphertext, and there is exactly one match if the object is within the query range, and no matches if the object falls outside the range. This results in a maximum of $\log L - h_{max} + 1$ evaluations.

Figure 4.4: Data and query encoding with HVE

With a simple modification, we can reduce this overhead to a single evaluation, by also including the domain tree level within the query. This way, the matching procedure only needs to evaluate the token against the ciphertext at the same level as the one specified in the query, reducing computational overhead of matching by a factor of $\log L$. Storage requirements for ciphertexts do not change, since there could be a potential query received for any level of the domain tree. We refer to this search variation as SingleLevel.

4.5.3 Owner Accountability and Spam Resilience

Two major desiderata of our proposed system, accountability and spam resilience, are achieved by storing for each advertised object a digital commitment on the blockchain. Since write operations to the chain are expensive, and the amount of information that can be stored is small, it is crucial to reduce the size of commitment strings, in order to reduce transaction costs.
We employ vector commitments [16], which allow the commitment of a sequence of values in a compact way (the length of the commitment string is independent of the number of committed values). This way, an owner can submit at the same time commitments for a batch of objects and pay the on-chain write price for just one. The length of a commitment is equal in size to that of an RSA encryption modulus for a similar amount of security: in our implementation, we use 1024-bit commitments.

To advertise one or more objects, the owner creates a vector digital commitment where each component corresponds to a location. Locations are typically hashed, and then the commitment is created by performing a modulo exponentiation operation with the hashed value in a composite order group. The commitment vector is published on the blockchain. Due to the difficulty of extracting logarithms in the group, an adversary is not able to recover the committed value, so location privacy for the owner is achieved. On the other hand, the owner cannot later change the value of the location without being detected, due to the binding property of commitments. After a transaction is completed, if the buyer finds that the actual location of the data object differs from the advertised one, the on-chain commitment is sufficient to prove the owner's dishonesty and reverse the payment (additional penalties can be imposed by the marketplace).

Spam resilience is also achieved as a result of using the blockchain. Due to the non-negligible cost of writing to the blockchain, it is not economically viable for a dishonest party to advertise a large number of objects. Various policies can be put in place to control the trade-off between the cost per transaction incurred by legitimate users and the resilience to spam.
For instance, one can enforce a limit on the count of elements in vector commitments, which in effect determines the number of objects that can be advertised with a fixed cost (in the experimental evaluation of Section 4.6 we measure the cost of writing a commitment to the chain to be on the order of USD $0.02). In addition, the system can enforce a policy that mandates a deposit for each commitment. The deposit can be refunded back to the owner after a transaction is completed, or after a pre-defined time threshold. For example, if the cost is $0.02 to submit a commitment for 20 objects (which translates to $1 for 1000 objects), the system can enforce an additional deposit of $1 per commitment (i.e., $50 for 1000 objects) refundable after one day. This policy can be easily implemented through a smart contract.

Another policy that can be easily implemented as a smart contract is to enforce a hard limit, e.g., 50 commitments per owner per day. While the deposit requirement policy focuses on the financial aspect, such a thresholding policy directly restricts the number of commitments. Finally, when SSE is used, one can also enforce an anti-spam mechanism at the TC, by limiting the number of objects that are encrypted for each owner. Since the TC will not encrypt more objects than the limit, the index size will be kept under control, and search performance will not be affected.

4.6 Experimental Evaluation

4.6.1 Experiment Setup.

We evaluate the proposed approaches using the SNAP project [106] location dataset, containing check-ins of users in the Gowalla geo-social network. We assume that owner objects' geo-tags coincide with check-in locations. We select check-ins in the Los Angeles area, spanning latitude range $[33.699675, 34.342324]$ North and longitude range $[118.684687, 118.144458]$ West (an area of 3500 km$^2$ with a total of 110,312 check-ins).
We randomly select from this set four object datasets $D_1$, $D_2$, $D_5$ and $D_{10}$ having cardinality 10000, 20000, 50000 and 100000, respectively, such that $D_1 \subset D_2 \subset D_5 \subset D_{10}$. The area is partitioned into an $L \times L$ grid with granularity $G = \log L = 10$, 12, 14 and 16 (ranging from 2300 m$^2$ down to less than 1 m$^2$ per cell). Object locations were converted to keywords for HXT, and to length-two vectors for HVE. Search requests of buyers were randomly generated by choosing an anchor location from the dataset, then constructing a range around it with three sizes: 400 m $\times$ 550 m, 800 m $\times$ 1100 m, and 1600 m $\times$ 2200 m, ranging from 1% to 3.5% of data domain side length.

We implemented Python prototypes of the proposed approaches. For HXT, our implementation employed 1024-bit key length for pairing groups, while for HVE we used the instantiation from [12] and varied key length as 768, 1024, 1536 and 2048 bits. All experiments were run on an Intel Core i7 3.60GHz CPU machine with 4 cores and 16GB RAM, running Ubuntu 18.04. We used a single core for all experiments, except the HVE parallel processing test for which we used all four available cores. For blockchain tests, we built a private Ethereum network using Go Ethereum version 1.8.20-stable.
Figure 4.5: HXT index generation performance. (a) Index build time (s) and (b) index size (MB), each plotted against the number of objects (10k to 100k) for G = 12, 16 and $h_{max}$ = 0, 6.

Figure 4.6: HXT query performance. Query time (s) against the number of objects (10k to 100k) for G = 12, 16 and query sizes 400x550, 800x1100 and 1600x2200, with (a) arbitrary query placement and (b) restricted query placement.

4.6.2 SSE Approach Evaluation.

First, we measure the HXT index build time at the trusted curator (TC) and the index size (Fig. 4.5). Each graph line corresponds to a combination of granularity (G) and maximum tree levels $h_{max}$. The index build time grows linearly with the number of objects. A finer granularity and a lower value of $h_{max}$ increase the build time, since they generate a larger number of keywords. For the 50k cardinality the index creation time never exceeds 20 minutes, whereas in the worst case it takes 40 minutes for the 100k case. For moderate granularity and height settings, build time is below 10 minutes. The index size varies between 1 and 5.5 GB.

Figure 4.7: Analysis of query restriction effect on performance. (a) Average number of keyword queries against domain granularity G for the three query sizes; (b) average size of $DB(w_1)$, i.e., the number of objects containing the first keyword, against G for arbitrary and restricted placement.

Next, we focus on query execution. Fig. 4.6a shows the average query time for arbitrarily placed queries. The performance overhead is considerably higher for the queries with larger span and is less influenced by granularity. In the worst case, a query takes 12 sec, and if we exclude the largest query range, less than 6 sec for 100k objects. Fig. 4.6b shows the results when restricting query placement. Clearly, there is a significant gain in performance, resulting in a query time reduction between 2.5 and 8 times. The query time is always below 4 sec. To better understand the performance gain due to query restrictions, we measure the count of individual conjunctive queries resulting from the decomposition of arbitrary ranges (Fig. 4.7a). A single range may be decomposed into as many as 120 conjunctive HXT queries, and the number of such queries increases with G. As a result, there is a much higher number of documents in the database containing the first keywords ($DB(w_1)$). Fig. 4.7b shows the cardinality of $DB(w_1)$, which increases significantly with the span of the query range.

Figure 4.8: HVE encryption time and ciphertext size per object. (a) Ciphertext generation time (s) and (b) ciphertext size (KB), each plotted against key size (768 to 2048 bits) for G = 12, 16 and $h_{max}$ = 0, 6.

4.6.3 HVE Approach Evaluation

First, we focus on ciphertext generation. With HVE, owners encrypt data by themselves, hence security is improved since locations are not shared with a trusted entity. However, the owner's device may have less computational power, so we must ensure the client overhead is low. Fig. 4.8 shows the HVE ciphertext generation time and size. Current security guidelines specify 1024-bit protection as sufficient for individual data; in this setting, encryption time is usually below 5 sec.
Even for higher security requirements, encryption never exceeds 40 sec. Ciphertext size is under 30 KB.

Key size (bits)        768     1024    1536    2048
Generation time (s)    0.019   0.036   0.085   0.165
Size (bytes)           402     534     786     1050
Table 4.1: HVE token generation time and size

Figure 4.9: HVE matching time per ciphertext. (a) Match time (s) against key length (768 to 2048 bits) for G = 12, 16 and $h_{max}$ = 0, 6, plus SingleLevel; (b) match time (s) against domain granularity G for key sizes 1024 and 2048, with and without SingleLevel.

Table 4.1 shows the token generation overhead at the trusted authority (TA). Token generation time is short, below 0.2 sec in the worst case. Token size is also negligible (about 1 KB at most). The results justify our claim that there is a strong business case for TAs (e.g., cell operators) to participate in the system: the overhead is small, and no significant infrastructure investment is necessary to support such a service.

By far the most significant performance concern with HVE is the query time, due to the use of expensive bilinear pairing operations. Fig. 4.9 shows query time (i.e., HVE match) per token/ciphertext pair. Query time for 1024-bit security, G = 12 (approximately 10 meters per grid cell side) and query size limited to $h_{max} = 6$ is 370 msec, and it reaches close to 6 sec for higher security requirements (i.e., key size = 2048), larger G = 16, and no query size limit (i.e., $h_{max} = 0$). With the SingleLevel optimization proposed at the end of Section 4.5.2, the query time is considerably reduced, to under 52 msec. This is still significant though, considering that the marketplace may have millions of objects.

HVE is also expensive due to the absence of an index: every ciphertext is matched against the token. However, the search is embarrassingly parallel.
One can distribute the ciphertexts over many compute nodes and achieve near-linear speedup. Since token size is so small, it can easily be broadcast at query time to a large number of nodes. In fact, each Swarm node can perform the matching itself, and charge the buyer for the compute cycles. Table 4.2 shows parallel execution time when running HVE search for 10000 objects on four cores, key size 1024 bits, G = 16, and $h_{max} = 7$, with the SingleLevel optimization. The speedup obtained is close to linear (3.1x on four cores). Assuming a similar speedup, using 100 cores one could execute the query for a database of 10,000 objects in 10 seconds. Using a conservative cost estimation from cloud providers, this translates to a financial cost per query of roughly $0.03, easily absorbed in the cost of a single marketplace transaction.

#processes        1      2      3      4
Query time (s)    525    332    221    170
Speedup           1      1.6    2.4    3.1
Table 4.2: Parallel execution speedup for HVE matching

4.6.4 Financial Cost Evaluation

Our aim is to demonstrate the financial viability of the proposed privacy and accountability scheme for the marketplace, by showing it represents only a small percentage of transaction cost. Financial cost includes the cost of writing to blockchain, and the cost of storage (index and metadata). For brevity, we omit commitment computation time, which we measured to be always under 0.2 sec; also, financial cost of outsourced search was quantified in the HVE evaluation section. For blockchain operations, we measure both the gas and USD amount (at the rate of 1 ether = $133).

On-chain operation                        Gas       Cost ($)
Owner registration                        42150     0.02
Owner sets commitment public params       327590    0.10
TA/TC submits SSE/HVE index/file info     177160    0.06
Table 4.3: On-chain cost of one-time, setup operations

Table 4.3 shows the one-time set-up cost when joining the marketplace.
This includes owner registration, owner setting up public parameters for digital commitments, and TA/TC submitting bootstrap information about SSE index or HVE flat file. We use 1024-bit key size for digital commitments; the message space of each commitment is 64 bits (32 bits for each coordinate). The total set-up cost for each owner is about $0.12, and for each TA in the HVE-based marketplace is $0.06.

On-chain operation            Gas       Cost ($)
Owner submits commitment      83092     0.02
Buyer makes an offer          297478    0.08
Owner withdraws payment       40649     0.01
Table 4.4: On-chain cost breakdown for every transaction

Table 4.4 shows the cost of on-chain operations for an end-to-end purchase transaction between owner and buyer: an owner submits commitments for locations of her objects; a buyer makes an offer to purchase objects; and optionally, the buyer withdraws his payment in case of discovering a fraudulent advertisement, although this last step rarely occurs. The total cost for a purchase is approximately $0.11. This is a relatively low cost: considering an average price of $10 per purchase, this represents a 1.1% fee.

4.7 Chapter Summary

In this chapter, we proposed a blockchain-based privacy-preserving, accountable and spam-resilient marketplace for geospatial data which allows owners and buyers to be matched using only encrypted location information. To the best of our knowledge, this is the first approach to achieve these important desiderata.

Chapter 5
Noisy Data Point Release with Variable-Price Query

In this chapter, we consider another setting of geo-marketplaces where sellers can degrade their data points by adding some noise and then sell their noisy data, potentially at a price corresponding to the amount of noise added. A buyer would prefer to minimize data costs, but may have to spend more to get the necessary level of accuracy. We call this interplay between privacy, utility, and price spatial privacy pricing.
We formalize the issues mathematically with an example problem of a buyer deciding whether or not to open a restaurant by purchasing location data to determine if the potential number of customers is sufficient to open. The problem is expressed as a sequential decision making problem, where the buyer first makes a series of decisions about which data to buy and concludes with a decision about opening the restaurant or not. We present two algorithms to solve this problem, including experiments that show they perform better than baselines.

5.1 Introduction

To illustrate the interplay between location data utility, privacy and value (i.e., pricing), this chapter considers a running example scenario: a buyer is interested in checking if the number of people inside a target region R is large enough to, say, open a restaurant in R (utility). Determining whether the population is sufficiently large in a given area is also important for a number of decision making applications such as identifying pandemic hotspots, opening COVID-19 test centers, expanding public transportation, opening new community services (e.g., youth centers) and representative government.

Figure 5.1: The true location $\mathbf{x}$ (black dot) can be sold as one of the noisy data points $z_1, z_2$ (white dots). The standard deviation $\sigma_1$ of the noise of $z_1$ is smaller than that of $z_2$. Hence, $z_1$ is more expensive than $z_2$.

Suppose the geo-marketplace has locations of many users, where each user's location can be sold at different levels of accuracy corresponding to her privacy vs. price preferences. There are different approaches to capture privacy or inaccuracy. For our purpose, and to simplify the discussion, we assume privacy or inaccuracy is captured as a data point with a Gaussian noise level represented by the standard deviation of the noise distribution.
(This can be simply extended to use more sophisticated location privacy mechanisms such as Geo-Indistinguishability [7].) In Figure 5.1, for example, a user's true location at $\mathbf{x}$ can be sold as noisy points with mean and standard deviation $(z_1, \sigma_1)$ for $2 or $(z_2, \sigma_2)$ for $1.

The buyer can purchase data from multiple users and/or multiple times from one user at different levels of accuracy. After making a decision of "open" the restaurant or "cancel" (i.e., not open), the buyer would receive a net profit equal to the corresponding revenue of the decision minus the cost of purchasing data. The buyer's objective is to maximize this profit.

Maximizing the profit is challenging for the buyer for several reasons. First, the locations of users are uncertain, and the buyer can only reduce this uncertainty by purchasing more data. Second, the purchasing actions are irrevocable (i.e., the buyer cannot ask for reimbursement after already buying the data), so the purchasing action may not be optimal in hindsight. And third, although the problem can be modelled as a partially observable Markov decision process (POMDP), with locations as state and observations and possibly prices as actions, the large number of users and the continuous nature of the domain spaces render the standard POMDP solutions impractical due to the explosion of the number of states.

Some seemingly obvious solutions are not always effective. For example, buying the most accurate data from all users may exceed the payoff for making the correct open/cancel decision, resulting in a negative profit. Alternatively, spending a fixed, prespecified amount on purchasing raises the issue of predetermining that fixed amount. Therefore, there is a need for an adaptive approach to this problem. Other studies have attempted to find adaptive solutions to similar problems.
However, they either considered a simpler version of this problem [105] or had different objectives [50, 47, 87, 39], thus resulting in inapplicable solutions. Finally, the approaches for location privacy protection such as spatial cloaking [123], differential privacy [34] or Geo-Indistinguishability [7] are relevant but orthogonal to our work, as we discuss in Section 5.2.2. To the best of our knowledge, we are the first to consider the interplay between privacy, utility and price in a data marketplace, particularly in geo-marketplaces, with the focus on the profit of buyers.

We develop two adaptive algorithms to help buyers optimize the buying actions to obtain necessary data for a decision while striving to reduce the data acquisition cost, called the spatial information probing (SIP) algorithm and the SIP algorithm with terminals (SIP-T). Our algorithms take into account the uncertainty in the data, the irrevocability of the collection process and the large number of users' possible locations. Both algorithms start by buying data at a lower price (dubbed probing) in order to gather information, then continue to buy at higher prices the data points that have high potential to give high profit. These algorithms use the expected incremental profit (EIP), which intuitively is the expected increment of the expected profit when purchasing a data point at a price, to choose the next data point and the next price at which to buy. SIP-T enhances SIP by taking into account the distance of purchased data from the target region and focusing more on distinguishing whether a data point is inside or outside the target region.

Specifically, our contributions are:
• Proposing the problem of balancing the benefits for users and buyers in data marketplaces in terms of privacy, utility and price. In the spatial data marketplace, we call this problem spatial privacy pricing.
• Presenting a specific application in which buyers optimize their purchasing actions to obtain necessary data for a decision while trying to reduce the data acquisition cost.
• Developing adaptive algorithms which take into account multiple obstacles such as uncertainty of the data, the irrevocability of the collection process and the large range of possible locations of users. Our algorithms adaptively buy different data points at different prices based on the EIP and the geometry of the purchased data.
• Extensive experiments comparing our algorithms to baselines over different settings for users' data, buyers' decisions and algorithmic parameters.

5.2 Problem Setting

In this section, we formalize the problem of the buyer deciding to open or cancel a restaurant in a target region R in order to demonstrate different aspects of the spatial privacy pricing problem: privacy, utility and value (i.e., pricing). We start by defining the notion of privacy valuation of users. We then introduce privacy pricing, which serves as the fundamental mechanism to balance the benefits for users and buyers. Then, we present the decision problem of the buyer, which involves purchasing data from users using the privacy pricing mechanism. Finally, we formalize our problem setting.

5.2.1 Users' Privacy Valuation

In a data marketplace, the privacy concern or valuation of users can be quantified with three components: their general privacy concern, their concern for a specific data point and their concern about a specific buyer. We present a simple model that captures these three aspects in a straightforward way.

For general privacy valuation, we assume each user $u_i$ has an overall privacy level $\rho_i \ge 0$ which reflects the user's own valuation of their privacy and is independent between users. This reflects Westin's series of privacy surveys, where he categorized privacy concerns of people as high, moderate and low [71].
Each data point $\mathbf{x}_{i,j}$ of user $u_i$ can have its own sensitivity $\nu_{i,j} \ge 0$, which is independent of $\rho_i$ and independent between data points. Sensitivity $\nu_{i,j}$ reflects how sensitive $u_i$ feels about $\mathbf{x}_{i,j}$, e.g., a gas station vs. a hospital.

For different buyers, users might also have a different level of trust $\zeta_b$, e.g., an unknown developer may deserve less trust than the Centers for Disease Control and Prevention. In this work, since we start with a single buyer with a specific query, we consider $\zeta_b = 1$. Subsequently, the total privacy valuation $\lambda_{i,j,b}$ of user $u_i$ for their data $\mathbf{x}_{i,j}$ for a buyer $b$ would be a function of $\rho_i$, $\nu_{i,j}$ and $\zeta_b$:

$$\lambda_{i,j,b} = \rho_i \, \nu_{i,j} \, \zeta_b \qquad (5.1)$$

When the user has only one data point, we simply use $\lambda_i = \lambda_{i,j,b}$. For simplicity, we use $\lambda_i$ in the remaining discussion. A user with a higher privacy valuation would expect to receive higher value for their location information. Next we discuss how $\lambda_i$ affects the price of the users' location data.

5.2.2 Privacy Pricing

One popular approach to protect users' privacy when releasing data is by adding noise, either directly to the data or to some components of the data release process. In this work, Gaussian noise is added with the magnitude represented by the standard deviation, which depends on the price paid. The noise magnitude or standard deviation is considered as the noise level.

When a data point $\mathbf{x}_i$ is traded at a price $q_i$, the noise magnitude $\sigma_i$ is given as a function of $q_i$ and $\lambda_i$ as follows

σ_i = { λ_i q_i k_σ : 0 < q_i < λ_i ;  0 : λ_i < q_i }   (5.2)

where $k_\sigma$ is a scaling factor to make the resulting $\sigma_i$ match with the real-world values of the noise magnitude, discussed in Section 5.5.4. This is one possible model that captures the essence of privacy and price. A higher price $q_i$ reduces $\sigma_i$, while a higher privacy valuation $\lambda_i$ increases $\sigma_i$. The privacy valuation $\lambda_i$ can be viewed as the price the user $u_i$ wants in order to sell their unperturbed data.
Although our users' pricing and privacy models need real-world evaluation, they capture the essence of our model and permit the full development and testing of our proposed algorithms. The functioning of the algorithms is unaffected by the particular pricing and privacy models. For example, other privacy preserving frameworks, such as spatial cloaking [123], differential privacy [34] or Geo-Indistinguishability [7], could also be used instead of the Gaussian noise. In that case, the parameters in those frameworks, e.g., $\epsilon$ in differential privacy, can be derived from a function $\epsilon = f_1(q_i)$ of the price $q_i$ with a similar or more complicated pricing model. The noise level $\sigma_i$ then can be derived from a function $\sigma_i = f_2(\epsilon)$ of those parameters. Eventually the noise level $\sigma_i$ can still be considered as a function $\sigma_i = f(q_i) = f_2(f_1(q_i))$ of the price.

The work on specific privacy-preserving techniques or pricing models is orthogonal to our work. Consequently, in this presentation, we assume the privacy valuation of users and the pricing function are fixed, and we focus on the problem of the buyers maximizing their total reward (i.e., profit) with the strategy to make the final open or cancel decision described in the next section.

5.2.3 The Buyer's Profit Maximization Problem

In order to finally decide to open or cancel (i.e., not open) a restaurant in the target region R, the buyer can take a sequence of actions $a = \{a_1, a_2, \ldots, a_T\}$, where taking an action $a$ would give the buyer a reward or profit $r(a)$. The goal of the buyer is to maximize the expected total profit of the chosen actions

$$\max \mathbb{E}[r(a)] = \max \mathbb{E}\left[\sum_{t=1}^{T} r(a_t)\right] \qquad (5.3)$$

where the expectation is over the (possibly) noisy data from users and the randomness of the process of choosing actions.

There are two types of actions that $a_t$ can represent: open/cancel actions $a_o$ and buying actions $a_b$. The buyer can take multiple buying actions to gather information before ultimately deciding to take an open/cancel action. Therefore, the last action $a_T$ must be an open/cancel action $a_o$ and the other actions $a_{1:T-1} = \{a_1, a_2, \ldots, a_{T-1}\}$ must be buying actions $a_b$.

An open/cancel action $a_o$ can take one of the two possible values in the set $A_o = \{\text{open}, \text{cancel}\}$, which mean open or not open a restaurant, respectively. The profit of an open/cancel action $a_o$ is

$$r(a_o) = \begin{cases} \beta n - \gamma & \text{if } a_o = \text{open} \\ 0 & \text{if } a_o = \text{cancel} \end{cases} \qquad (5.4)$$

where $\beta$ is the profit per user (or the gross margin), $n$ is the number of users in the region R that may visit the restaurant and $\gamma$ is some fixed cost for opening the restaurant, for example, the monthly rent or the cost of operation. Naturally, more users lead to a higher profit.

This form of the profit function in Equation 5.4 can capture other variations of the buyer's profit in our problem. For example, the buyer may decide to open only if $r(\text{open}) > r_0$; then $\gamma$ can be set as $\gamma \leftarrow \gamma + r_0$. In another example, when the buyer decides to cancel, one may consider the buyer losing a fraction $\beta/k$ of the profit for each user inside R as some opportunity cost. This variation can be captured by setting $\beta \leftarrow (1 + \frac{1}{k})\beta$.

Each buying action $a_b$ requires the buyer to purchase a data point $\mathbf{x}_i$ at some price $q > 0$. For clarity, we denote this buying action as $a_b(i, q)$. The set $A_b$ of all possible buying actions includes all data points at all possible prices. The profit for this action $a_b(i, q)$ is the negative of the price the buyer needs to pay, which means

$$r(a_b(i, q)) = -q \qquad (5.5)$$

Thus $-r(a_{1:T-1})$ can be considered as the cost of buying data.
In addition to $r(a_b(i, q))$, after taking $a_b(i, q)$ the buyer also receives a noisy data point $z_i$ whose coordinates are the true coordinates of $\mathbf{x}_i$ perturbed by independent Gaussian noise with $\sigma_i$ derived from the price $q$ using Equation 5.2, which means

$$z_i = \mathbf{x}_i + \eta, \quad \eta \sim \mathcal{N}(0, \sigma_i^2 I) \tag{5.6}$$

where $I$ is the $2 \times 2$ identity matrix.

In this work, the number of actions $T$ the buyer can take is not restricted. In addition, the buyer focuses on their business problem and is honest in the use of data. The buyer is also not restricted by any budget for purchasing data because an excessive cost of purchasing data will eventually decrease the net profit.

5.2.4 Formal Definition

For simplicity, we consider a snapshot in time of users’ locations and assume each user has only one data point $\mathbf{x}_i$ at that time with sensitivity $\nu_i$. (Note that this same point could be sold multiple times at different levels of accuracy.) The setting would be similar for the case of multiple data points, with minor changes. With all the quantities defined, we can formalize the problem setting as follows:

Given a snapshot in time of the locations of $N$ users, where each user $u_i$ is at location $\mathbf{x}_i \in \mathbb{R}^2$, has privacy valuation $\lambda_i$ and is willing to trade $\mathbf{x}_i$ at different prices $q_i$, the buyer’s objective is to maximize the expected total profit of an action sequence $a = \{a_1, a_2, \ldots, a_T\}$ where $a_T \in A_o$ and $a_i \in A_b \; \forall i = 1, \ldots, T-1$:

$$\max \mathbb{E}[r(a)] = \max \mathbb{E}\left[\sum_{t=1}^{T} r(a_t)\right] \tag{5.7}$$

5.3 Related Work

The concept of marketplaces for geosocial data was proposed in [61]. In [6], the authors investigated the value of spatial information to guide the purchases of the buyer. In [83], a geo-marketplace was proposed where location data are protected using searchable encryption. However, their setting only considers buying data points at full price, while in our setting one data point can be sold multiple times at different levels of accuracy for different prices.
Singla [105] also studied the problem of maximizing the buyer’s profit when information is available at a price, but there the information can only be purchased at full price. This setting can be considered a simpler form of our problem and is used as a baseline in our experiments. In [50], Gupta et al. extend it with the capability of purchasing the same data multiple times. However, they focus on maximizing the total profit, while the objective in our problem is to make a binary decision. In addition, although their method provides some theoretical guarantees on performance, it relies on the grades of states of Markov decision processes, which are infeasible to compute in our problem.

A partially observable Markov decision process (POMDP) is a powerful framework for modelling sequential decision-making processes under uncertainty with the goal of maximizing total reward. POMDPs have been an active research area and considerable progress has been made on solving them [113, 101, 20]. While our problem setting can be considered as a POMDP with continuous state, observation and action spaces, our problem also includes a large number of users. Since the number of possible states in this POMDP increases exponentially with the number of users, standard POMDP algorithms are computationally infeasible for our problem.

There are also several studies on selling privacy [47, 87, 39]. However, they focus on the accuracy of the inferred number of users compared to the true number, while our focus is on the sufficiency of that number for making a binary decision which may incur some cost itself. In their setting, data points are also purchased only once, compared to possibly multiple times in our setting.

Many privacy-preserving techniques have been proposed for location privacy, such as spatial cloaking [123], differential privacy [34, 130] or Geo-Indistinguishability [7].
These techniques can be applied as the privacy protection mechanism for users’ locations in our framework, and a pricing model can be derived based on the parameters of these mechanisms, as discussed in Section 5.2.2. Therefore, while relevant, these techniques are orthogonal to our work.

5.4 Methodology

In this section, we first describe the strategy the buyer would employ to make the open/cancel decision. Given such an open/cancel strategy, we define the expected incremental profit (EIP), which serves as the basis for our spatial information probing techniques, the SIP and SIP-T algorithms.

5.4.1 The Buyer’s Strategy to Make Open/Cancel Decision

Recall that each buying action $a_b(i, q)$ gives the buyer a noisy data point $z_i$. The noisy data obtained from the buying actions $a_{1:T-1}$ can help the buyer to make the open/cancel action $a_T$ as follows. Assume that after taking actions $a_{1:T-1}$, the buyer owns a set of noisy data $\mathcal{Z}$. If there is more than one version of $z_i$ from user $i$, the buyer keeps only the most accurate. Then, from the buyer’s perspective, given $z_i$, the true location $\mathbf{x}_i$ has the distribution

$$\mathbf{x}_i \sim \mathcal{N}(z_i, \sigma_i^2 I) \tag{5.8}$$

The probability $p_i$ of the user $u_i$ being inside $R$ is

$$p_i = \int_R P(\mathbf{x}_i)\, d\mathbf{x}_i \tag{5.9}$$

Considering each $p_i$ as the success probability of an independent Bernoulli trial, the probability distribution $P(n \mid \mathcal{Z})$ of the number of users inside region $R$ is a Poisson binomial distribution [127] with mean

$$\mu_n = \sum_i p_i \tag{5.10}$$

Given this distribution, the current expected profits for the open/cancel actions are $\mathbb{E}[r(\text{open})] = \beta\mu_n - \gamma$ and $\mathbb{E}[r(\text{cancel})] = 0$. If the buyer decides to make an open/cancel decision at time $T$, i.e. $a_T$ is an open/cancel action $a_o$, the best action $a_T$ would be

$$a_T = \begin{cases} \text{open} & \text{if } \beta\mu_n - \gamma > 0 \\ \text{cancel} & \text{otherwise} \end{cases} \tag{5.11}$$

Assume that the buyer starts with an empty set of data, which gives $\mu_n = 0$. Then, $\mathbb{E}[r(\text{open})] = -\gamma$. Adding an additional $p_i$ to $\mu_n$ in Equation 5.10 increases $\mu_n$.
Thus, by buying more data, the estimate of $\mu_n$ increases, which increases $\mathbb{E}[r(\text{open})]$. Therefore, the buyer would want to buy data to increase $\mathbb{E}[r(\text{open})]$ while also trying to decrease the cost $-\sum_{t=1}^{T-1} r(a_t)$. The buyer would stop buying when they can be confident of making the open decision, i.e. $\beta\mu_n - \gamma > 0$, or when buying more data would not, in expectation, give more benefit. Thus, $\beta\mu_n - \gamma > 0$ can be considered as the open trigger for the buyer. We call $\beta\mu_n - \gamma > 0$ the opening condition.

5.4.2 The Expected Incremental Profit (EIP)

With the strategy to make the open/cancel decision established, we introduce the expected incremental profit $\mathrm{EIP}(\mathcal{Z}, a)$ of taking a buying action $a$ with the current set $\mathcal{Z}$ of purchased noisy data. $\mathrm{EIP}(\mathcal{Z}, a)$ then serves as the criterion for the buyer to choose a buying action in our algorithms.

To develop $\mathrm{EIP}(\mathcal{Z}, a)$, we first calculate the expected profit if the buyer takes $a_o = \text{open}$ immediately given the current noisy data $\mathcal{Z}$:

$$\mathbb{E}[r(\text{open}) \mid \mathcal{Z}] = \beta \sum_i p_i - \gamma \tag{5.12}$$

If the buyer takes a buying action $a = a_b(i, q)$ and obtains a new noisy data point $z'_i$ from the same user with the new probability $p'_i$ of $\mathbf{x}_i \in R$, replacing the current noisy data $z_i$ in $\mathcal{Z}$ with $z'_i$, the expected profit of open is

$$\mathbb{E}[r(\text{open}) \mid \mathcal{Z}, z'_i] = \beta\left(p'_i + \sum_{j \neq i} p_j\right) - \gamma - q \tag{5.13}$$

and the incremental profit is:

$$\mathrm{IP}(\mathcal{Z}, a, z'_i) = \mathbb{E}[r(\text{open}) \mid \mathcal{Z}, z'_i] - \mathbb{E}[r(\text{open}) \mid \mathcal{Z}] = \beta(p'_i - p_i) - q$$

If $z_i$ does not exist in $\mathcal{Z}$ (i.e. $\mathbf{x}_i$ was never purchased before), $p_i = 0$.
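The decision rule of Section 5.4.1 and the incremental profit defined above can be sketched in a few lines. The sketch assumes that $R$ is an axis-aligned rectangle, in which case the integral in Equation 5.9 factorizes into a product of one-dimensional normal CDF differences; that assumption, the function names and the `(x_min, y_min, x_max, y_max)` region encoding are illustrative choices and are not taken from the original text.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """CDF of a one-dimensional normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def prob_inside(z, sigma, region):
    """p_i of Equation 5.9 for x ~ N(z, sigma^2 I) and a rectangular region.

    region = (x_min, y_min, x_max, y_max) is an assumed encoding of R.
    """
    x_min, y_min, x_max, y_max = region
    p_x = normal_cdf(x_max, z[0], sigma) - normal_cdf(x_min, z[0], sigma)
    p_y = normal_cdf(y_max, z[1], sigma) - normal_cdf(y_min, z[1], sigma)
    return p_x * p_y


def expected_users(noisy_points, region):
    """mu_n of Equation 5.10; noisy_points is a list of (z, sigma) pairs."""
    return sum(prob_inside(z, sigma, region) for z, sigma in noisy_points)


def best_final_action(mu_n, beta, gamma):
    """Open/cancel rule of Equation 5.11."""
    return "open" if beta * mu_n - gamma > 0 else "cancel"


def incremental_profit(p_old, p_new, beta, q):
    """IP of replacing z_i (probability p_old) by z'_i (probability p_new)."""
    return beta * (p_new - p_old) - q
```

For a non-rectangular region, `prob_inside` would need numerical integration instead; the other three functions do not depend on the shape of $R$.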
Since $p'_i$ is unknown before we actually perform the buying action $a = a_b(i, q)$, we calculate the expected incremental profit for taking the buying action $a = a_b(i, q)$ given the current noisy data $\mathcal{Z}$ as follows:

$$\mathrm{EIP}(\mathcal{Z}, a) = \mathbb{E}_{z'_i}[\mathrm{IP}(\mathcal{Z}, a, z'_i)] \tag{5.14}$$
$$= \mathbb{E}_{z'_i}[\beta(p'_i - p_i) - q] \tag{5.15}$$
$$= \beta\,\mathbb{E}_{z'_i}[p'_i \mid z_i, \sigma_i, q] - \beta p_i - q \tag{5.16}$$
$$= \beta \int_{\mathbb{R}^2} P(z'_i \mid z_i, \sigma_i, q)\, p'_i(z'_i)\, dz'_i - \beta p_i - q \tag{5.17}$$

where $\mathbb{R}$ is the real domain and

$$p'_i(z'_i) = \int_R P(\mathbf{x}_i)\, d\mathbf{x}_i \tag{5.18}$$
$$\mathbf{x}_i \sim \mathcal{N}(z'_i, \sigma_i'^2 I) \tag{5.19}$$

The distribution $P(z'_i \mid z_i, \sigma_i, q)$ of the new noisy point $z'_i$ if we take action $a = a_b(i, q)$ combines two distributions: the distribution $\mathbf{x}_i \mid z_i, \sigma_i \sim \mathcal{N}(z_i, \sigma_i^2 I)$ of the true location $\mathbf{x}_i$ given the current noisy data $z_i$, and the distribution $z'_i \mid \mathbf{x}_i, q \sim \mathcal{N}(\mathbf{x}_i, \sigma_i'^2 I)$ of the next noisy data $z'_i$ generated from the true location $\mathbf{x}_i$ by adding zero-mean Gaussian noise, where $\sigma'_i$ is the noise magnitude for $\mathbf{x}_i$ at the price $q$. Convolving the two Gaussians gives

$$z'_i \mid z_i, \sigma_i, q \sim \mathcal{N}\left(z_i, (\sigma_i^2 + \sigma_i'^2) I\right) \tag{5.20}$$

The buyer can then choose the next buying action as the action that maximizes the expected incremental profit:

$$a^*(\mathcal{Z}) = \operatorname*{argmax}_a \mathrm{EIP}(\mathcal{Z}, a) \tag{5.21}$$

With $\mathrm{EIP}(\mathcal{Z}, a)$, we develop two algorithms to help the buyer maximize their profit while maintaining a low purchasing cost.

5.4.3 The Spatial Information Probing Algorithms

We present two greedy algorithms that are based on the probing (or information gathering) technique. The general idea is to utilize the continuity of the price to obtain noisy data at low cost first, in order to quickly eliminate uninteresting data before paying a higher price for more accurate data. The two algorithms are called spatial information probing (SIP) and SIP with terminals (SIP-T).

5.4.3.1 The SIP Algorithm

The pseudo-code for SIP is shown in Algorithm 9. SIP has two phases: the first phase is a pure exploration phase and the second phase is an exploration-exploitation phase.
In the first phase, the buyer buys all available data points in a large bounding box around the target region $R$ at a small starting price $q_0$ (lines [2-9]). Then, in the second phase (lines [10-30]), the buyer repeatedly calculates EIPs (lines [10-13]), takes a buying action based on the highest potential EIP (lines [15, 27]) and uses the newly purchased noisy data, if any, to guide the next action (lines [19, 29]). After each purchase, the buyer also checks the opening condition described in Section 5.4.1. The buyer stops buying if the best EIP value is not positive (line 16), which means that, in expectation, buying more data would not increase the profit.

Because $\mathrm{EIP}(\mathcal{Z}, a)$ only depends on $z_i$ and not on any other points, we write $\mathrm{EIP}(z_i, a) = \mathrm{EIP}(\mathcal{Z}, a)$, as $a_b(i, q)$ clearly specifies $z_i$. Also, to reduce computation time, whenever the buyer calculates EIPs, the buyer keeps track of the best next buying action $a^*$ for each noisy data point $z_i$ and its corresponding value of EIP (lines 12 and 29). This best next action is calculated using Algorithm 10.

Algorithm 10 chooses the best next buying action for a noisy data point $z_i$ given the most recent purchase price $q$ and a price increment factor $h$. Even though one could try to calculate $\mathrm{EIP}(z_i, a)$ for all possible prices in $[q, \lambda_i]$, doing so would be computationally impractical. In addition, the next price which gives the highest $\mathrm{EIP}(z_i, a)$ may be too close to $q$ to be informative enough (i.e. to cause a significant change in $p_i$) to make good progress. To overcome these obstacles, Algorithm 10 operates in multiple rounds and increases the potential price in each round by a factor $h$ until the potential price is at least $\lambda_i$. Besides reducing computation, the factor $h$ allows the next purchase for $\mathbf{x}_i$ to be significantly more informative than the current $z_i$.
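The price search described above can be sketched as follows. Two things in this sketch are assumptions rather than part of the original text. First, $R$ is taken to be an axis-aligned rectangle; under Equations 5.17 to 5.20 the expected new probability $\mathbb{E}[p'_i]$ then has a closed form, namely the mass inside $R$ of a Gaussian centered at $z_i$ with variance $\sigma_i^2 + 2\sigma_i'^2$ per dimension (a derivation made here for illustration). Second, the mapping from price to noise level (Equation 5.2) is passed in as a caller-supplied function `sigma_of_price`, and all function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x, mean, std):
    """CDF of a one-dimensional normal distribution."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


def mass_in_rect(center, std, region):
    """Mass of N(center, std^2 I) inside region = (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = region
    p_x = normal_cdf(x_max, center[0], std) - normal_cdf(x_min, center[0], std)
    p_y = normal_cdf(y_max, center[1], std) - normal_cdf(y_min, center[1], std)
    return p_x * p_y


def eip(z, sigma, sigma_new, q, region, beta):
    """Expected incremental profit (Equation 5.17) for a rectangular region."""
    p_old = mass_in_rect(z, sigma, region)
    # z' ~ N(z, (sigma^2 + sigma_new^2) I) and x | z' ~ N(z', sigma_new^2 I),
    # so E[p'] is the mass of N(z, (sigma^2 + 2 sigma_new^2) I) inside R.
    expected_p_new = mass_in_rect(z, sqrt(sigma**2 + 2.0 * sigma_new**2), region)
    return beta * (expected_p_new - p_old) - q


def best_next_action(z, sigma, q, lam, h, region, beta, sigma_of_price):
    """Try prices q*h, q*h^2, ... capped at lam; keep the one with highest EIP."""
    if h <= 1.0:
        raise ValueError("price increment factor h must be greater than 1")
    best_price, best_value = None, float("-inf")
    while q < lam:
        q = min(q * h, lam)
        value = eip(z, sigma, sigma_of_price(q), q, region, beta)
        if value > best_value:
            best_price, best_value = q, value
    return best_price, best_value
```

Only a logarithmic number of candidate prices between $q$ and $\lambda_i$ is evaluated, which reflects the trade-off the text describes between computation and the informativeness of the next purchase.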
Algorithm 9 The SIP Algorithm
Input: Users $\mathcal{U}$; region $R$; $\beta$; $\gamma$; starting price $q_0$; price increment factor $h$
Output: open or cancel
1: $\mathcal{Z} \leftarrow \emptyset$; $V \leftarrow 0$; $\mathcal{E} \leftarrow \{\}$
2: for $\mathbf{x}_i \in \mathcal{U}$ do
3:     $z_i \leftarrow$ Buy $\mathbf{x}_i$ at price $q_0$
4:     $\mathcal{Z} \leftarrow \mathcal{Z} \cup z_i$; $p_i \leftarrow P(\mathbf{x}_i \in R \mid z_i)$
5:     $V \leftarrow V + \beta p_i$
6:     if $V - \gamma > 0$ then
7:         return open
8:     end if
9: end for
10: for $z_i \in \mathcal{Z}$ do
11:     $a^* \leftarrow \operatorname{argmax}_a \mathrm{EIP}(z_i, a)$ (Algorithm 10)
12:     $\mathcal{E}[a^*] \leftarrow \mathrm{EIP}(z_i, a^*)$
13: end for
14: while true do
15:     $a_b(i, q) \leftarrow \operatorname{argmax}_a \mathcal{E}[a]$
16:     if $\mathcal{E}[a_b(i, q)] \le 0$ then
17:         return cancel
18:     end if
19:     Delete $\mathcal{E}[a_b(i, q)]$
20:     $z'_i \leftarrow$ Buy $\mathbf{x}_i$ at price $q$
21:     $\mathcal{Z} \leftarrow (\mathcal{Z} \setminus z_i) \cup z'_i$
22:     $p_i \leftarrow P(\mathbf{x}_i \in R \mid z_i)$
23:     $p'_i \leftarrow P(\mathbf{x}_i \in R \mid z'_i)$
24:     $V \leftarrow V - \beta p_i + \beta p'_i$
25:     if $V - \gamma > 0$ then
26:         return open
27:     end if
28:     $a^* \leftarrow \operatorname{argmax}_a \mathrm{EIP}(z_i, a)$ (Algorithm 10)
29:     $\mathcal{E}[a^*] \leftarrow \mathrm{EIP}(z_i, a^*)$
30: end while

Algorithm 10 Find The Best Next Buying Action
Input: $z$; $\lambda_i$; current price $q$; price increment factor $h$
Output: Best next buying action $a^*$
1: $v \leftarrow -\infty$
2: while $q < \lambda_i$ do
3:     $q \leftarrow \min(q \cdot h, \lambda_i)$
4:     if $v < \mathrm{EIP}(z, a_b(i, q))$ then
5:         $a^* \leftarrow a_b(i, q)$
6:         $v \leftarrow \mathrm{EIP}(z, a^*)$
7:     end if
8: end while
9: return $a^*$

5.4.3.2 The SIP-T Algorithm

One issue with SIP is that, for an $\mathbf{x}_i$ that is actually inside $R$ but close to its edges, the first low-price purchases may give unexpectedly low values for the probability of $\mathbf{x}_i$ being inside $R$, making the increment in the expected profit smaller than the next price. This issue may cause the highest value of $\mathrm{EIP}(z_i, a)$ to be negative; thus, SIP may stop buying earlier than expected.

We propose the SIP-T algorithm to address this issue. SIP-T defines a terminal belief (or terminal) $\tau_i$ for each data point $\mathbf{x}_i$ and only stops buying when either all data points are in their terminals or the opening condition is satisfied. The terminal $\tau_i$ specifies that the buyer can be certain about whether $\mathbf{x}_i$ is inside $R$ or not. The terminating condition determines if $z_i$ is at $\tau_i$. Although a threshold for $p_i$ could be used for the terminating condition (e.g. $p_i < 0.05$), the high-magnitude noise when purchasing data at low prices makes this approach ineffective. Instead, SIP-T uses the standard deviation $\sigma_i$ of the noise to check the terminating condition. More specifically, a noisy data point $z_i$ with noise standard deviation $\sigma_i$ is at $\tau_i$ if, in each dimension, $z_i$ is either (1) inside $R$ with a distance to each edge of at least $k\sigma_i$ or (2) outside $R$ with a distance to the closest edge of at least $k\sigma_i$. We arbitrarily choose $k = 2$.

With the terminal condition defined as above, SIP-T is SIP with two changes. The first change is that, in line 16, it returns cancel when $\mathcal{E} = \emptyset$ instead of when $\mathcal{E}[a_b(i, q)] \le 0$. The second change is that, after buying a data point and updating the expected profit in line 27, it checks the terminating condition, and if the condition holds, it immediately continues to the next purchase round (line 14) instead of calculating the next best buying action for that data point.

5.5 Experimental Evaluation

We experimentally evaluate our probing algorithms on a real-world dataset with various settings for users’ data, the buyer’s decisions and algorithmic parameters.

5.5.1 Datasets

We experiment on the Gowalla dataset from the SNAP project (footnote 2) with 6,442,890 check-ins of 196,591 users over the period of February 2009 to October 2010. Each check-in includes a user id, check-in time, latitude and longitude.

We collect check-ins within a Los Angeles boundary defined by a bounding box from the southwest corner at (-118.684687, 33.699675) to the northeast corner at (-118.144458, 34.342324) in degrees of longitude and latitude. We then keep only one random check-in per user, which results in a subset of 5,827 check-ins. The radian (lat, long) coordinates are then converted to a locally planar Cartesian coordinate system with the mid-point of the latitude/longitude bounding box as the reference point.
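The text does not state which projection is used for the conversion to local planar coordinates. A minimal sketch, assuming a simple equirectangular approximation around the reference point and a mean Earth radius of 6,371 km (both assumptions of this example, as are the function and constant names), is:

```python
from math import cos, radians

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def to_local_xy(lat_deg, lon_deg, ref_lat_deg, ref_lon_deg):
    """Equirectangular approximation: degrees to meters east/north of the reference."""
    x = EARTH_RADIUS_M * radians(lon_deg - ref_lon_deg) * cos(radians(ref_lat_deg))
    y = EARTH_RADIUS_M * radians(lat_deg - ref_lat_deg)
    return x, y


# Bounding box corners from the text, written here as (latitude, longitude).
SOUTHWEST = (33.699675, -118.684687)
NORTHEAST = (34.342324, -118.144458)
REFERENCE = (
    (SOUTHWEST[0] + NORTHEAST[0]) / 2.0,
    (SOUTHWEST[1] + NORTHEAST[1]) / 2.0,
)

if __name__ == "__main__":
    print(to_local_xy(*NORTHEAST, *REFERENCE))
    print(to_local_xy(*SOUTHWEST, *REFERENCE))
```

At the scale of this bounding box the approximation places the corners within roughly a kilometer of the rounded local coordinates reported in the text; a different projection would shift them slightly.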
The bounding box in local Euclidean coordinates $(x, y)$ extends from the southwest corner at (-25,000, -35,000) to the northeast corner at (25,000, 35,000) in meters. Figure 5.2 shows the check-ins in the Los Angeles area within the bounding box.

Footnote 2: https://snap.stanford.edu/data/loc-gowalla.html

Figure 5.2: Check-ins of Gowalla users in Los Angeles converted to local Euclidean coordinates, 1 check-in per user.

5.5.2 Evaluation Metrics

To avoid any bias towards a particular position of the target region $R$, we evaluate the performance of each algorithm over a collection of regions $\mathcal{R}$. Our collection of regions is a grid over the bounding box. Each cell of the grid is considered as a target region for making an open/cancel decision. The evaluation metrics include the Average Realized Profit (ARP), the Median Realized Profit (MRP) and the Recall, as explained in the following.

The ARP and MRP metrics. For each target region $R \in \mathcal{R}$, the net profit of the buyer’s action sequence $a_R$ is $r(a_R)$, calculated using Equation 5.7. ARP and MRP are then defined as the mean and median values of $\{r(a_R) \mid R \in \mathcal{R}\}$. We use MRP in addition to ARP because there are often a few popular regions that have many users and become outliers in terms of ARP, since opening a restaurant there can yield extremely high profit compared to other regions. MRP is more robust to the effect of outliers and, as discussed later, can better reflect the cost spent by each algorithm.

Recall. Given the ground truth data, each target region $R$ is considered a positive region if the opening condition holds for $R$ and a negative region otherwise. Similarly, the value open or cancel of the open/cancel action $a_o^R$ is considered a positive or negative decision, respectively. Consequently, recall can be calculated for the set $\{a_o^R \mid R \in \mathcal{R}\}$ as the ratio of the positive regions the algorithm can find.

5.5.3 Baselines

We compare our algorithms to several baselines.
The first baseline is the Oracle algorithm, which simply knows the most accurate location data of all users. Oracle is used as a benchmark to show the maximum value of each metric that any algorithm can achieve, even though reaching this maximum is unlikely, since an algorithm usually needs to purchase data in order to make an open/cancel decision and the price of purchasing data may surpass the profit.

The second baseline is the utility maximization algorithm in [105], referred to here as the PoI algorithm. With PoI, each point $\mathbf{x}_i$ is assigned a value (called a “grade”), which is the extra expected profit $v_e$ such that the expected profit the buyer may earn by buying $\mathbf{x}_i$ at a certain price to obtain new noisy data is the same as the expected profit from stopping buying $\mathbf{x}_i$ and receiving $v_e$. The grade $v_e$ is calculated based on the price of $\mathbf{x}_i$, the profit per user $\beta$ and a uniform probability of $\mathbf{x}_i$ being inside the region $R$. PoI then ranks data points based on their grades and repeatedly buys data points at full price until the maximum grade is non-positive or the opening condition holds.

The other baseline is the Fixed Maximum Cost (FMC) algorithm, which spends a fixed amount to buy all data points at the same price. As mentioned, the challenge with FMC is how to predict the fixed amount. We used 0.1%, 1% and 2% of the fixed cost $\gamma$ to derive three variations, called FMC-0.1, FMC-1 and FMC-2, respectively.

5.5.4 Parameter Setup

The gross margin per user $\beta$ is set to $\beta \in \{50, 100, 150, 200\}$ in US dollars. The values are chosen based on the annual gross margin of some fast food chains such as Starbucks, McDonald’s or Del Taco, divided by their total number of customers, which can be found in their annual reports. The bold value indicates the default value used in the experiments where this parameter is fixed.
Instead of fixing values for the fixed cost $\gamma$, which is difficult to know in the real world, we consider the ratio $n_0 = \gamma/\beta$, which can be seen as the minimum number of users that the buyer would need in order to open, because if $n > n_0$, then $\beta n > \gamma$, which means the opening condition holds. We call this ratio the minimum user threshold and set it to $n_0 \in \{200, 300, 400, 500, 600\}$. These values are derived from [43], where the authors showed that the average number of restaurants was 2.3 (SD, 1.8) per 1,000 residents in Los Angeles County, yielding about 435 residents per restaurant. When $\beta$ is fixed at 100, these values yield $\gamma \in \{20{,}000, 30{,}000, 40{,}000, 50{,}000, 60{,}000\}$.

For users’ privacy valuation, we simulate the privacy level and sensitivity values from two uniform distributions $\rho_i \sim \mathrm{Unif}(0, d)$ and $\nu_{i,j} \sim \mathrm{Unif}(0, d)$, then multiply them to get $\lambda_i = \rho_i \nu_{i,j}$. These distributions mean that each user has a general privacy level from 0 to $d$, and their own data points have sensitivity levels from 0 to $d$. The scale $d$ is set to $d \in \{1, 2, 3, 4, 5\}$.

For the size of the regions $R$, we experimented with different $L \times L$ region sizes with $L \in \{2{,}500, 5{,}000, 7{,}500, 10{,}000\}$ meters, which creates from 35 to 560 regions in the test grid. These values are derived from the study [79].

For the parameters of SIP and SIP-T, the starting price $q_0$ is set to a very small dollar value, $q_0 \in \{0.0001, 0.001, 0.002, 0.003, 0.004\}$, and the price increment factor $h$ is set to $h \in \{1.5, 2, 3.5, 5\}$ to reflect small and large increment factors. The scaling factor $k_s$ is fixed at $k_s = 20$, which renders the standard deviation of the noise to be $\{20, 40, 80, \ldots\}$ for the ratio $\frac{\lambda_i}{q} = \{1, 0.5, 0.25, \ldots\}$. These values are close to the noise magnitude of real-world location noise such as GPS, Wi-Fi or cell towers. Finally, the experiments for each region are executed on a single core of an Intel Core i9-9980XE CPU.
Figure 5.3: The effect of the minimum user threshold $n_0$. Panels (a) ARP and (b) MRP plot each metric against $n_0$ for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall:

(c) Recall
n_0        200    300    400    500    600
FMC-0.1    0.17   0      0      0      0
FMC-1      0.83   1      0.67   1      1
FMC-2      0.83   1      0.67   1      1
PoI        0.17   0.25   0      0      0
SIP        0.83   0.5    0.33   0.5    1
SIP-T      0.83   1      0.67   1      1

5.5.5 Experimental Results

5.5.6 The effect of profit model parameters

5.5.6.1 The minimum customer threshold $n_0$

We first show in Figure 5.3 the effect of the minimum customer threshold $n_0$. SIP and SIP-T outperform the other algorithms in most cases.

The average realized profit (ARP) decreases when $n_0$ increases, since there are fewer regions that have enough users to open a restaurant (Figure 5.3a). However, while the ARP of the other algorithms decreases sharply, SIP and SIP-T maintain their performance compared to Oracle. Note that Oracle is assumed to know the accurate data of all users. The actual cost to obtain such accurate data is about 1.8 million USD. Besides, although the ARP of FMC-1 is sometimes similar to that of SIP and SIP-T, we emphasize that FMC comes with no principle for deciding how much to spend. Therefore, in the scenarios where it does well, it simply got lucky. For example, FMC can easily suffer from underspending (such as FMC-0.1) or overspending (such as FMC-2).

The median realized profit (MRP) further demonstrates the superiority of SIP and SIP-T compared to the other algorithms (Figure 5.3b). Since the decision for the majority of the regions in the grid should be cancel, Oracle has a zero MRP and the other algorithms have negative MRPs, which are the median amounts they spent on purchasing data. The MRP values of SIP and SIP-T are stable and several times higher than those of the other algorithms, except for FMC-0.1, which was underspending.
While often spending much less than the other algorithms, SIP and SIP-T can still make correct open decisions for the regions where the buyer should decide to open. This is shown in Figure 5.3c, where the recall of SIP and SIP-T is comparable to that of FMC-1 and FMC-2, which spent significantly more.

The reason for the superior performance of SIP and SIP-T is their highly adaptive nature: they adapt to the number of users in the target region and to the position of each data point relative to the target region. Figure 5.4 illustrates the amount SIP spent for each region against the true number of users (Figure 5.4a) and for each data point (Figure 5.4b). The amount spent tends to be higher for regions whose true number of users is closer to the minimum user threshold and higher for data points closer to the edges of a target region. This is because the true state of being inside or outside a region is harder to identify for users closer to the edge, thus requiring more accurate data, which costs more. Although PoI dynamically decides which data point to buy, it does not perform well compared to SIP and SIP-T because it buys data at full price.

Figure 5.4: Illustrations for the amount spent by SIP. Panel (a) shows the amount spent per region (cost against the true number of users, with the minimum user threshold marked); panel (b) shows the amount spent per data point.

While SIP and SIP-T have comparable ARP, SIP has a higher MRP but a lower recall than SIP-T. This is because SIP-T continues to buy a data point at higher accuracy even though the EIP of the data point can be negative. Hence SIP-T spends more than SIP but has more accurate location information. The general trade-off is that spending more money buys more accurate data, which helps make more accurate open/cancel decisions.
However, spending excessively may decrease the final profit, because the high cost of purchasing data may surpass the profit. SIP and SIP-T are two alternatives that balance this trade-off with different foci: the expected profit per data point vs. distinguishing whether a data point is inside or outside the target region.

Figure 5.5: The effect of the gross margin per user $\beta$. Panels (a) ARP and (b) MRP plot each metric against $\beta$ for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall:

(c) Recall
β          50     100    150    200
FMC-0.1    0      0      0.33   0.33
FMC-1      0.67   0.67   0.67   0.67
FMC-2      0.67   0.67   0.67   1
PoI        0      0      0.33   0.33
SIP        0.33   0.33   0.33   0.33
SIP-T      0.67   0.67   0.67   0.67

5.5.6.2 The profit per user $\beta$

The effect of the gross margin (or gross profit) per user $\beta$ is shown in Figure 5.5. As $\beta$ increases, the buyer gains more per user and thus achieves a higher ARP. However, as the minimum user threshold $n_0$ is fixed, when $\beta$ increases, $\gamma$ also increases, which increases the amount the FMC-based algorithms spend. That is why MRP decreases for the FMC-based algorithms. With a higher value of $\beta$, the EIP of a data point at a given price also increases. An increasing EIP leads to a higher chance that such data would be purchased at a higher price. This explains the slight decrease in the MRP of SIP. Since SIP-T does not rely on EIP to stop buying a data point, its MRP does not change. The recall of SIP and SIP-T remains comparable to that of FMC-1 and FMC-2 while they spend significantly less.

The results for the minimum customer threshold $n_0$ and the profit per user $\beta$ show that SIP and SIP-T give consistently high results for different profit model parameters.

5.5.7 The effect of query parameters and user’s data

5.5.7.1 The target region $L$

Figure 5.6 shows the effect of the size $L$ of the target region.
Because the difference between the values of ARP and MRP for different values of the size $L$ is large, ARP is shown as the ratio to the ARP of Oracle and MRP is shown on a logarithmic scale. With a larger region (i.e. a higher value of $L$), the buyer can gain more users, resulting in a higher profit in general. When the region size is very large (e.g. $L = 10{,}000$), all algorithms achieve comparable ARP. However, when $L$ becomes smaller, SIP and SIP-T maintain good performance while the other methods suffer. Also, when $L$ gets smaller, with a uniform probability, the probability of a data point being inside a region also becomes smaller. This makes PoI exclude many data points from buying, so that it may eventually buy no data points at all and have zero recall. SIP and SIP-T, again, maintain good ARP and recall while spending significantly less.

Figure 5.6: The effect of the size $L$ of the target region. Panel (a) plots the ratio of ARP to the ARP of Oracle and panel (b) plots MRP on a logarithmic scale, both against the region size $L \times L$ (m × m) for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall:

(c) Recall
L          2500   5000   7500   10000
FMC-0.1    0      0      0.2    0.8
FMC-1      1      0.67   1      1
FMC-2      1      0.67   1      1
PoI        0      0      0.2    0.4
SIP        1      0.33   0.6    0.6
SIP-T      1      0.67   1      1

5.5.7.2 The scale $d$ of users’ privacy distributions

Figure 5.7 shows the effect of the scale $d$ of the users’ privacy distributions. In all metrics, performance tends to decrease when the scale increases, because the FMC-based algorithms obtain noisier data for the same price, and SIP and SIP-T need to spend more to obtain the same level of accuracy for a data point. For PoI, a lower value of $d$ makes it more likely to buy a data point, since the expected profit gain is higher; hence, it spends more to buy data, resulting in a lower MRP, and vice versa.
5.5.8 The effect of algorithmic parameters

5.5.8.1 The starting price $q_0$

Figure 5.8 shows the effect of the starting price $q_0$. Because it is not preferable to spend a large amount during the pure exploration phase, the starting price is set to small values. While SIP-T uses the starting price, it eventually aims to distinguish whether a data point is inside or outside the target region. Therefore, as long as the starting price is relatively small, changing the starting price does not have a significant effect on the performance of SIP-T. For SIP, however, there is a small change in ARP, since with a higher starting price it can obtain more accurate data to make more accurate open/cancel decisions. The higher starting price also decreases its MRP, which reflects the total cost of buying data. On the other hand, a very small value of $q_0$ may result in data points that are too noisy at the beginning and can negatively affect the performance of SIP. Since SIP-T is more stable with respect to changes of $q_0$, it is preferable if one is willing to spend more on buying data.

5.5.8.2 The price increment factor $h$

Figure 5.9 shows the effect of the price increment factor $h$. For both SIP and SIP-T, a higher value of $h$ means that they would skip calculating the EIP of a potential price more often.
This can decrease ARP, but probably for different reasons: SIP may incorrectly stop buying a data point, resulting in a higher MRP because it might spend less; on the other hand, SIP-T may purchase data points at too high a price, resulting in a lower MRP because it might spend more.

With a smaller value of $h$, SIP may spend more on uninformative purchases, because the standard deviation of the next purchase may not be different enough from that of the current purchase, thus increasing the total amount spent (i.e. a lower MRP). While both SIP and SIP-T spend more, SIP-T performs better than SIP with a small value of the factor $h$, because it can make more accurate open/cancel decisions.

A smaller value of $h$ also results in a longer execution time for SIP and SIP-T, since there are more potential prices for which they need to calculate EIPs.

Figure 5.7: The effect of the scale $d$ of the privacy distributions. Panels (a) ARP and (b) MRP plot each metric against $d$ for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall:

(c) Recall
d          1      2      3      4      5
FMC-0.1    0.67   0.33   0      0      0
FMC-1      1      0.67   0.67   0.67   0.33
FMC-2      1      1      0.67   0.67   0.67
PoI        0.67   0.33   0      0      0
SIP        0.67   0.67   0.33   0.33   0.33
SIP-T      1      0.67   0.67   0.67   0.67

Figure 5.8: The effect of the starting price $q_0$. Panels (a) ARP and (b) MRP plot each metric against $q_0$ for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall, with columns in the order of the $q_0$ values on the axes of panels (a) and (b):

(c) Recall
q_0        0.0001  0.001  0.002  0.003  0.004
FMC-0.1    0       0      0      0      0
FMC-1      0.67    0.67   0.67   0.67   0.67
FMC-2      0.67    0.67   0.67   0.67   0.67
PoI        0       0      0      0      0
SIP        0.33    0.33   0.33   0.67   0.67
SIP-T      0.67    0.67   0.67   0.67   0.67

Table 5.1: The average execution time (in seconds) of SIP and SIP-T for different values of the price increment factor $h$

h          1.5    2      3.5    5
SIP        265    124    54     38
SIP-T      163    100    59     48
Table 5.1 shows the average execution time of SIP and SIP-T for different values of $h$.

Figure 5.9: The effect of the price increment factor $h$. Panels (a) ARP and (b) MRP plot each metric against $h$ for Oracle, SIP, SIP-T, FMC-0.1, FMC-1, FMC-2 and PoI. Panel (c) reports recall:

(c) Recall
h          1.5    2      3.5    5
FMC-0.1    0      0      0      0
FMC-1      0.67   0.67   0.67   0.67
FMC-2      0.67   0.67   0.67   0.67
PoI        0      0      0      0
SIP        0.33   0.33   0.33   0
SIP-T      0.67   0.67   0.67   0.67

5.6 Discussions and Future Work

To ease discussion, several simplifying assumptions were made for the buyer’s profit maximization problem. For example, the privacy valuation was assumed to be known to the users and the parameters of the buyer’s profit model were assumed to be known; only a single snapshot of the users’ locations and only one buyer with one query were considered. Even with these simplifications, the buyer’s profit maximization problem remains challenging. For example, it can be seen as a particularly hard instance of a POMDP. We also emphasize that the buyer’s profit maximization problem is only one specific problem used to concretely demonstrate the broader problem: the spatial privacy pricing problem. Other aspects of the spatial privacy pricing problem should also be explored.

5.7 Chapter Summary

In a geo-marketplace, users can charge a price for their location data, and buyers can try to optimize their utility by making intelligent buying decisions. In this chapter, with spatial privacy pricing, we introduce the element of privacy, where a user can charge different prices depending on how much a particular data point would reveal. We illustrate this interplay between privacy, utility and price with an example scenario of a buyer who is considering opening a restaurant. The restaurant will only be profitable if there are enough people nearby.
With this example, we formalize the privacy and pricing considerations of the seller as well as the buying and utility considerations of the buyer. The buyer’s reasoning is captured as an incremental expectation maximization algorithm that accounts, in a principled way, for the points’ uncertainty, prices and the anticipated profit. Our formulation results in a new geospatial problem for optimizing a buyer’s decision-making process. Using our formula for “expected incremental profit”, we introduced two related algorithms for specifying which location points a buyer should buy at which prices. The algorithms look for the next best point to buy. Compared with five baseline algorithms, our SIP and SIP-T algorithms are able to better adapt to the locations and prices of user data.

To the best of our knowledge, this is the first research that considers the privacy, utility and price of location data in a unified framework. This is an important step for creating a real geo-marketplace.

Chapter 6
Degraded Trajectory Release with Variable-Price Query

In this chapter, we consider another geo-marketplace setting where sellers’ trajectories can be degraded and released or sold at different prices [85]. It is challenging to find an appropriate pricing mechanism allowing comparison between trajectories with widely different characteristics. Therefore, in this chapter, we focus on a more generic problem of quantifying the value of information of a trajectory in a principled way that allows comparison between trajectories with widely different characteristics. Then the price can be derived from the value of information.

Value of information (VOI) of a trajectory may change depending on the specific application. However, in a variety of applications, knowing the intrinsic VOI of a trajectory is important to guide other subsequent tasks or decisions. This work aims to find a principled framework to quantify the intrinsic VOI of trajectories from the owner’s perspective.
This is a challenging problem because an appropriate framework needs to take into account various characteristics of the trajectory, prior knowledge, and different types of trajectory degradation. We propose a framework based on information gain (IG) as a principled approach to solve this problem. Our IG framework transforms a trajectory with discrete-time measurements to a canonical representation, i.e., continuous in time with continuous mean and variance estimates, and then quantifies the reduction of uncertainty about the locations of the owner over a period of time as the VOI of the trajectory. Qualitative and extensive quantitative evaluations show that the IG framework is capable of effectively capturing important characteristics contributing to the VOI of trajectories.

6.1 Introduction

The availability of mobile devices with location-tracking capability has enabled individuals to generate a great amount of location data from various types of signals, e.g., GPS, Wi-Fi, or cell service. A trajectory, which is a sequence of location measurements, is an important type of location data that contains valuable information about the owner’s locations and movements. For example, a trajectory may indicate the owner’s moving behavior [133], e.g., a trip to home, office, or favorite shops and the time spent there; it may help identify irregularities such as vacation days or moving house; it is important for traffic, speed, and route inference [136].

Each trajectory gives information of some kind. The value of the information (VOI) varies depending on the use and user of the data. For example, the same trajectory may have one value for the owner and another for an enterprise, and that same trajectory may also have different values for different enterprises, such as a public or a private one, and especially for an ad-targeting company. However, a trajectory also contains an intrinsic, formulaic VOI, since it contains quantified locations of an individual over space and time.
For example, it is reasonable to assume that a home-to-office trajectory with 1000 measurements from beginning to end of the trip tends to have more information than another home-to-office trajectory with only 2 measurements, one at home and one at the office. Our goal is to find a principled approach to quantify the intrinsic VOI of trajectories from the owner’s perspective. To the best of our knowledge, this is the first attempt at such a quantification.

Given a trajectory, there are natural questions about selling it [61], sharing it, using it to train machine learning models, storing it, or examining it more closely. Quantifying its VOI helps measure how valuable, revealing, informative [122], distinct, and surprising [136] it is. A trajectory’s VOI can be a critical piece of metadata that indicates its intrinsic value for a variety of tasks, both stand-alone and as part of a collection.

This problem presents several interesting challenges:
• First, a trajectory has many characteristics contributing to its VOI, and effectively capturing their complex relationships is non-trivial. Examples of these characteristics are the number of measurements, temporal duration, and how measurements distribute spatially and temporally.
• Second, a trajectory can be degraded for different purposes, e.g., measurements can be perturbed by adding random noise or completely removed to enhance the owner’s privacy before releasing or selling. It is challenging to capture the effect of different types of degradation on the VOI of the trajectory.
• Third, the owner may assume different prior knowledge about a trajectory, which may change its VOI from the owner’s perspective. For example, if the owner has never released any trajectory data before, their trajectory would give more information about their locations than if they had already released a perturbed version of the trajectory.
Some straightforward methods, e.g., using a trajectory’s characteristics such as size or duration, fail to capture other characteristics or prior knowledge. Previous work based on Spatial Privacy Pricing [84] can be adapted to sum the values of degraded measurements of the trajectory. However, it also fails to capture some characteristics (e.g., how measurements distribute over space) and prior knowledge, because it ignores the mutual information of the constituent points. The notion of correctness [104] also appears promising. For this method, based on a degraded version Z of a trajectory S and prior knowledge, a probabilistic prediction is made for each measurement. Then the correctness of the prediction, indicating how close it is to the actual measurement, is aggregated to derive the VOI of Z. However, this method is not applicable when $Z = S$, because correctness is not available when there is no actual measurement with which to compare the prediction.

The core idea behind our approach is that instead of computing the VOI of a trajectory from its discrete location measurements over time, we view VOI as how much the trajectory data helps reconstruct locations of the owner. More specifically, the VOI corresponds to how much the trajectory data helps reduce the uncertainty of estimating its owner’s locations in continuous time. We use information gain (IG) to quantify that reduction. Even though IG is a known measure [96], it is not obvious how it should be used to compare discrete-time trajectories with widely different characteristics: long and short, dense and sparse, clean and noisy. Our innovation here is that we transform each trajectory to a canonical representation, i.e., continuous in time with continuous estimates of mean and variance, which is necessary for comparing widely different trajectories and comparing new data against prior data.
We realize this transformation by employing a reconstruction method that can produce continuous-time probabilistic predictions along the trajectory.

Consequently, we propose an IG framework as a principled way to quantify the intrinsic VOI of a trajectory. The main idea is to quantify the reduction of uncertainty about the owner’s continuous-time locations, comparing a new trajectory to a previously released degraded version or to prior information. Thus the IG framework utilizes a reconstruction method to produce continuous probabilistic predictions over time, and then it calculates the reduction of uncertainty from those predictions compared to a reconstruction from prior knowledge. The uncertainty is measured by (differential) entropy. Our IG framework accepts any reasonable probabilistic reconstruction method. A Gaussian process is used as the reconstruction method in this paper due to its popularity and flexibility, but it is not necessarily the only choice.

Specifically, our contributions are:
• Propose the problem of quantifying intrinsic VOI of a trajectory from the owner’s perspective
• Define characteristics that should be captured by an appropriate quantification method
• Show how alternate methods fail, even when they appear reasonable initially
• Introduce a method to transform each trajectory to a canonical representation, allowing us to examine the VOI with a continuous location reconstruction and to compare trajectories with widely different characteristics
• Develop an IG framework over transformed trajectories as a principled approach capable of effectively capturing various trajectory characteristics, prior knowledge, and degradation
• Evaluate the proposed framework both qualitatively and quantitatively with extensive experiments on a large, real-world trajectory dataset

By providing a standard, comprehensive method for assessing the VOI of a trajectory, we enable a deeper understanding of trajectories and their utility for a wide
variety of purposes.

6.2 Problem Setting

This section introduces the problem of quantifying the intrinsic VOI of a trajectory from the owner’s perspective. We focus on the most basic form of a trajectory, which is a sequence of location measurements. Therefore, incorporating other information, such as census data or points of interest, is beyond the scope of this work. The owner can be anyone who has access to and the right to use the data, e.g., the person whose phone recorded this trajectory, or a data collector who aggregates location data from individuals. Without loss of generality and to ease the discussion, from here on, we assume the owner is the individual whose device recorded the trajectory.

A trajectory S is a sequence of (potentially noisy) location measurements $S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|S|}\}$. Each measurement or data point $\mathbf{x}$ (bold symbol) is a tuple $\mathbf{x} = \langle lon, lat, t, \sigma \rangle$ where $\mathbf{x}.lon$ and $\mathbf{x}.lat$ are the longitude and latitude, $\mathbf{x}.t$ is the timestamp, and $\mathbf{x}.\sigma$ is the accuracy or uncertainty of the measurement. It is reasonable to assume Gaussian noise for location measurements [29] such as GPS points, so $\mathbf{x}.\sigma$ can be considered as the standard deviation of independent Gaussian noise of longitude and latitude. Measurements in S are ordered by their timestamps, i.e., $\forall \mathbf{x}_i, \mathbf{x}_j \in S,\ 1 \le i \le j \le |S|:\ \mathbf{x}_i.t \le \mathbf{x}_j.t$.

The owner can assume the potential recipient has some prior knowledge $\Omega$ about the location $\mathbf{x}_i$, e.g., gathered from public data or from data previously released by the owner. This prior knowledge is represented as a prior distribution $P(\mathbf{x}_i \mid \Omega)$ and should be respected by the proposed methods. Intuitively, prior knowledge about a point would tend to reduce the VOI of the point.
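The trajectory definition above can be made concrete with a small data structure. The following is only a minimal sketch: the class name `Measurement`, the field name `sigma` (for the accuracy of a measurement), and the helper functions are illustrative choices that do not come from the dissertation, and the sample coordinates are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Measurement:
    """One location measurement <lon, lat, t, sigma> (names are illustrative)."""
    lon: float    # longitude
    lat: float    # latitude
    t: float      # timestamp, e.g., seconds since some reference time
    sigma: float  # standard deviation of the Gaussian measurement noise


def make_trajectory(points: List[Measurement]) -> List[Measurement]:
    """Return the measurements ordered by timestamp, as the definition of S requires."""
    if not points:
        raise ValueError("a trajectory needs at least one measurement")
    return sorted(points, key=lambda x: x.t)


def is_time_ordered(S: List[Measurement]) -> bool:
    """Check that x_i.t <= x_j.t whenever i <= j."""
    return all(a.t <= b.t for a, b in zip(S, S[1:]))


# Hypothetical two-point trajectory, given out of order on purpose.
S = make_trajectory([
    Measurement(lon=-118.285, lat=34.022, t=60.0, sigma=5.0),
    Measurement(lon=-118.286, lat=34.021, t=0.0, sigma=5.0),
])
```

Sorting on construction is one possible design choice; a stricter variant could instead reject unordered input.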
In addition to releasing raw trajectories, owners can also offer their data at different quality levels, potentially enhancing their privacy while reducing the trajectory’s VOI, e.g., a trajectory can be degraded by adding noise or subsampling. More details about degradation are discussed in Section 6.2.3. The methods to quantify the VOI also need to reflect this quality degradation.

Our goal is to find a principled approach to quantify the VOI of a trajectory from the owner’s perspective. It should capture different characteristics of the raw trajectory, the prior knowledge, and the effect of various degradation processes. Next, we describe the characteristics that an appropriate method should aim to capture. This is an extensive but not exhaustive list of desirable characteristics, thus we expect future work will study more characteristics. We will then illustrate how several baseline methods fail to capture these characteristics and how our new proposed framework succeeds.

6.2.1 Trajectory Characteristics

From the raw trajectory S, certain characteristics can be derived that should contribute to its VOI. These include size, duration, spatial distribution, temporal distribution, and measurement uncertainty.

Size. The size $|S|$ is the number of measurements of S. A larger size often means there is more data to learn about the owner’s locations and movements. Therefore, it is reasonable to assume that a trajectory with a larger size tends to have more information.

Duration. The duration of a trajectory S is the time difference between the first and last measurements of S, i.e., $Duration(S) = \mathbf{x}_{|S|}.t - \mathbf{x}_1.t$. Similar to the size, a trajectory with a longer duration often means it may have more information about the owner’s locations than one with a shorter duration.

Spatial distribution. The spatial distribution of S indicates how its measurements are distributed over space.
For example, one trajectory may have all measurements at one place, while another trajectory may have measurements at multiple places (e.g., solely at home vs. at home, then at a coffee shop, then at the office). A trajectory visiting more places tends to give more information about locations of the owner. The spatial distribution of S can be measured using spatial entropy $H_S(S)$ over a grid of cells, say 10 m × 10 m, covering the area of interest. $H_S(S)$ is calculated by creating a histogram of the number of measurements of S belonging to each cell, converting this histogram to probabilities, and then computing the Shannon entropy for the probabilities.

Temporal distribution. Similarly, the temporal distribution of a trajectory S indicates how its measurements are distributed over time. For example, one trajectory may have many measurements in the first few minutes and then a long gap before the next one. Another trajectory may be distributed more uniformly over time, e.g., one measurement every 30 seconds. Intuitively, when measurements distribute more evenly over a longer period of time, they tend to give more information. The temporal distribution of S can be measured using the temporal entropy $H_T(S)$ calculated during $[\mathbf{x}_1.t, \mathbf{x}_{|S|}.t]$ by computing a histogram of the number of measurements of S belonging to each temporal bin, say every 1 minute, from $\mathbf{x}_1.t$ to $\mathbf{x}_{|S|}.t$, converting the histogram to probabilities, and then computing the Shannon entropy for the probabilities.

Measurement uncertainty. Measurements may have their own uncertainty depending on how they were taken, e.g., an uncertainty of several hundred meters with cell towers [17] or 1 to 5 meters for GPS-enabled smartphones [94]. Since Gaussian noise is a reasonable assumption for GPS [29], measurement uncertainty is defined as the standard deviations of independent Gaussian noise of longitude and latitude.
In general, less accurate measurements (i.e., larger $\sigma$) tend to give less information about locations of the owner.

6.2.2 Prior Knowledge

As mentioned, the owner can assume there exists some available prior knowledge about the location $\mathbf{x}_i$, e.g., from public data or from previous releases from this owner. This knowledge is expressed as a prior distribution $P(\mathbf{x}_i \mid \Omega)$ for each timestamp of interest. In general, better prior knowledge often means the trajectory gives less information compared to the case with poorer prior knowledge.

6.2.3 Trajectory Degradation

The owner can offer their trajectories at lower quality. The process of lowering the quality of a trajectory is called degradation. The reason for the owner to produce degraded trajectories is that a recipient can still benefit from data at a certain quality depending on their specific applications [84]. For example, estimating a neighborhood-level origin-destination matrix may not need extremely accurate measurements, thus the recipient with this application can potentially spend less to purchase lower quality trajectories. In general, a lower quality trajectory would give less information than a higher quality one. A degraded version of a trajectory S is denoted as Z.

In this work, we consider three types of degradation: perturbation, truncation, and subsampling. Other types of degradations and other variations of these degradations are beyond the scope of this paper and considered as part of future work.

Perturbation. The perturbation process degrades a trajectory S by adding independent random Gaussian noise with standard deviation $\sigma_z$ to each $\mathbf{x}_i \in S$. Formally, a perturbed trajectory Z of S consists of measurements $\mathbf{z}_i$ such that

$\mathbf{z}_i.lon = \mathbf{x}_i.lon + \eta_1$    (6.1)
$\mathbf{z}_i.lat = \mathbf{x}_i.lat + \eta_2$    (6.2)
$\mathbf{z}_i.t = \mathbf{x}_i.t$    (6.3)
$\mathbf{z}_i.\sigma = \sqrt{(\mathbf{x}_i.\sigma)^2 + \sigma_z^2}$    (6.4)
$\eta_1, \eta_2 \sim \mathcal{N}(0, \sigma_z^2)$    (6.5)

Noise magnitude $\mathbf{z}_i.\sigma$ is called the total noise of Z.

Truncation.
The truncation process degrades a trajectory S by truncating S and only keeping a fraction $\alpha_t \in [0, 1]$ of the first measurements of S. For example, if $|S| = 100$, an $\alpha_t = 0.2$ means that the temporally first 20% of measurements are kept, which are $\{\mathbf{x}_1, \ldots, \mathbf{x}_{20}\}$. The fraction $\alpha_t$ is called the truncation ratio. The retained measurements are kept in their raw form. Also, there is at least one measurement retained.

Subsampling. Similarly, the subsampling process degrades a trajectory S by uniformly subsampling S with probability $\alpha_s \in [0, 1]$, called the subsampling ratio. More specifically, a measurement $\mathbf{x}_i \in S$ is retained with probability $\alpha_s$; otherwise, $\mathbf{x}_i$ is discarded. Hence, in expectation, a fraction $\alpha_s$ of measurements of S are retained. The retained measurements are also kept in their raw form and at least one measurement is retained.

Table 6.1: Potential methods to quantify the intrinsic VOI of a trajectory and the characteristics they can or cannot capture. Our proposed method based on information gain can capture almost all of the desirable characteristics.
Columns (characteristics): Size, Duration, Spatial Distribution, Temporal Distribution, Measurement Uncertainty, Prior Knowledge, Degradation
Rows (methods): Fixed Value, Size-based, Duration-based, Travel Distance, Entropy-based (*), SPP-based [84], Correctness-based [104] (*), Information Gain

6.3 Baselines

This section discusses baseline techniques to quantify the VOI of a trajectory. Each baseline is described along with the desirable characteristics from Section 6.2 that it can capture. Table 6.1 summarizes which characteristics the baselines and our proposed method can faithfully represent. Our proposed method, based on information gain and shown in the last row of Table 6.1, is described in detail in Section 6.4. A cross (✗), exclamation mark (!), or checkmark (✓) indicates that the method in that row cannot, can partially, or can fully capture the characteristic in that column, respectively.
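The three degradation processes of Section 6.2.3 (perturbation, truncation, and subsampling) can be sketched in a few lines. This is an illustrative sketch only, under several assumptions not stated in the text: coordinates and noise are taken to share one unit (e.g., projected meters), truncation rounds the number of kept measurements to the nearest integer, and when subsampling would discard everything, one randomly chosen measurement is kept. The names `Measurement`, `perturb`, `truncate`, and `subsample` are hypothetical.

```python
import math
import random
from typing import List, NamedTuple


class Measurement(NamedTuple):
    lon: float
    lat: float
    t: float
    sigma: float  # standard deviation of the measurement noise


def perturb(S: List[Measurement], sigma_z: float, rng: random.Random) -> List[Measurement]:
    """Equations 6.1-6.5: add independent N(0, sigma_z^2) noise to lon and lat."""
    return [
        Measurement(
            lon=x.lon + rng.gauss(0.0, sigma_z),
            lat=x.lat + rng.gauss(0.0, sigma_z),
            t=x.t,
            sigma=math.sqrt(x.sigma ** 2 + sigma_z ** 2),  # total noise
        )
        for x in S
    ]


def truncate(S: List[Measurement], alpha_t: float) -> List[Measurement]:
    """Keep the temporally first fraction alpha_t of S, and at least one measurement."""
    keep = max(1, round(alpha_t * len(S)))  # rounding rule is an assumption
    return list(S[:keep])


def subsample(S: List[Measurement], alpha_s: float, rng: random.Random) -> List[Measurement]:
    """Retain each measurement independently with probability alpha_s."""
    kept = [x for x in S if rng.random() < alpha_s]
    if not kept:
        kept = [rng.choice(S)]  # assumed fallback so that one measurement remains
    return kept
```

Passing an explicit `random.Random` instance keeps a degraded release reproducible, which may be convenient when the same perturbed version has to be regenerated later.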
The following analysis shows how the baseline methods, while appearing initially reasonable, fail to represent some important characteristics of the VOI. For each characteristic, a method is first evaluated qualitatively. In some cases, it is obvious that a method can or cannot capture a characteristic, e.g., a size-based method can capture the size but not prior knowledge. In other cases, it may not be as clear. In those cases, the method is said to be capable of capturing the characteristic if there is a strong correlation between the output of the method and the characteristic. The correlation is examined using Spearman’s rank correlation coefficient $\rho$, which quantifies strictly monotonic relationships between two variables and is relatively robust against outliers [99]. An absolute magnitude $|\rho|$ in $[0.4, 0.7)$ and $[0.7, 1]$ indicates moderate and strong correlation, respectively. The cutoff points are based on previous work [99]. The values of $\rho$ are calculated from a large real-world trajectory dataset described in Section 6.5.1.

6.3.1 Fixed Value Method

One potential method is to set the same value for all trajectories. While this method is straightforward and, in fact, was used in some surveys about values of location data from a seller’s perspective [24, 110], this method ignores the fact that each trajectory may contain a different VOI. For example, a 40-mile long commute from home to work can be very different from a short trip to a nearby store. Hence, this method does not capture any of the desirable characteristics.

6.3.2 Size-based Method

The size-based method uses the size $|S|$ of a trajectory S to quantify the VOI of S. It is reasonable to say that a trajectory with more measurements tends to have more information. While this method may distinguish trajectories with many or few measurements, the size alone would fail to fully capture the VOI of a trajectory, because the size $|S|$ depends heavily on the sampling rate and the duration.
For example, the same trip from home to office if sampled every 1 second would have a size 5 times larger than if sampled every 5 seconds, while having roughly similar information about the locations of the person along the trip.

Consequently, the size-based method successfully captures the size characteristic of a trajectory, but not other aspects. For the duration characteristic, because it only depends on the first and last measurements of the trajectory, the size cannot fully capture it, e.g., a trajectory with two measurements can be arbitrarily short or long. However, the size can partially capture the duration if the sampling rate is relatively similar among trajectories. The correlation coefficient $\rho$ between size and duration of trajectories in our dataset is 0.86, which indicates a strong correlation between them.

The size $|S|$ also captures neither the spatial nor the temporal distribution of S, because it does not indicate how the measurements are distributed. The same number of measurements can occur at nearly the same place/time if the sampling rate is high, or at different places/times if the sampling rate is low. It is also clear that $|S|$ indicates neither measurement uncertainty nor prior knowledge. In fact, any method using solely trajectory characteristics would fail to capture prior knowledge because prior knowledge is not considered in that method. The size can capture some degradation such as truncation, but not other degradation such as perturbation.

6.3.3 Duration-based Method

Another method is to use the duration $Duration(S)$ of S, i.e., $Duration(S) = \mathbf{x}_{|S|}.t - \mathbf{x}_1.t$, to quantify its VOI, as it is reasonable to assume that a temporally longer trajectory tends to have more information. However, using only duration would fail to fully capture the VOI since it ignores all information between the start and end points.
Similar to the size-based method explained before, this method can fully capture the duration but can only partially capture the size. The duration also does not represent how the measurements distribute over space and time, and is unable to take into account measurement uncertainty and prior knowledge. Duration can capture the effect of truncation but not that of perturbation or subsampling, thus failing to fully capture the effects of degradation in general.

6.3.4 Travel Distance Method

This method computes the travel distance $TravelDistance(S)$ of S by summing the distances between each pair of consecutive measurements, i.e.,

$TravelDistance(S) = \sum_{i=1}^{|S|-1} d(\mathbf{x}_i, \mathbf{x}_{i+1})$    (6.6)

where $d(\mathbf{x}_i, \mathbf{x}_{i+1})$ is a distance function between two measurements, e.g., Euclidean distance. It is reasonable to assume that a trajectory with longer distance tends to have more information than a shorter one.

With the same mode of transportation (e.g., with bikes or cars), travel distance likely reflects the duration. Thus, in general, travel distance can partially capture duration and the size of S. With a longer distance, the trajectory tends to go through more places. Therefore, even though travel distance cannot fully capture the spatial distribution, it can still partially capture it. In our dataset, a strong correlation of 0.89 between the travel distance and spatial entropy indicates that travel distance can be a good representation for the spatial distribution. Similarly, for a trajectory with measurements that are not too far apart in time (e.g., not hours apart), travel distance can capture some part of the temporal distribution. In our dataset, the correlation coefficient between travel distance and temporal entropy is 0.78.

However, using travel distance would fail to take into account any prior knowledge. It is also unclear how the measurement uncertainty can be captured using travel distance.
6.3.5 Entropy-based Method

The spatial and temporal entropy $H_S(S)$ and $H_T(S)$ of S, computed from the Shannon entropy of the probabilities converted from the histogram of the number of measurements of S belonging to each grid cell or temporal bin, can be good candidates to quantify the VOI given entropy’s extensive applications in information theory. A higher spatial/temporal entropy likely indicates that the trajectory gives more information about a person’s locations over space/time.

Since $H_S(S)$ and $H_T(S)$ are used to measure the spatial and temporal distribution of S, respectively, a method combining both $H_S(S)$ and $H_T(S)$ can capture these two characteristics. However, it is unclear how they should be combined, which is why there are asterisks for the entropy-based row in Table 6.1. While $H_S(S)$ and $H_T(S)$ do not fully capture the size and duration (e.g., a trajectory with only two points but far apart from each other can have a long duration but low entropy values), when S has a relatively high sampling rate, $H_T(S)$ would be highly correlated with the size and duration. In our dataset, the correlation coefficients between $H_T(S)$ and size and duration are 0.89 and 0.97, respectively, which indicate a very strong correlation. So, an entropy-based method can partially capture these characteristics.

However, since entropy reflects the uncertainty, when the measurements have higher uncertainty (i.e., larger $\mathbf{x}.\sigma$), both $H_S(S)$ and $H_T(S)$ tend to increase. This is the opposite of what one might expect, because a more uncertain measurement means less information. Thus, entropy captures neither measurement uncertainty nor degradation such as perturbation. Computing $H_S(S)$ and $H_T(S)$ also ignores all prior knowledge. Another issue is that the entropy can change significantly when arbitrary parameters for computing entropy (i.e., size of grid cells or length of time bins) change.
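The histogram-based spatial and temporal entropies discussed in this section can be sketched as below. This is an illustrative sketch under stated assumptions: points are planar (x, y) coordinates in meters, the grid is anchored at the origin, temporal bins start at the first timestamp, and the natural logarithm is used (the text does not fix the base). The function names and default parameters (10 m cells, 60 s bins) follow the examples in the text but are otherwise hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def shannon_entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def spatial_entropy(points: Sequence[Tuple[float, float]], cell_size: float = 10.0) -> float:
    """H_S: entropy of the histogram of measurements over square grid cells."""
    cells = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )
    return shannon_entropy(cells.values())


def temporal_entropy(timestamps: Sequence[float], bin_seconds: float = 60.0) -> float:
    """H_T: entropy of the histogram of measurements over fixed-length time bins."""
    t0 = min(timestamps)
    bins = Counter(math.floor((t - t0) / bin_seconds) for t in timestamps)
    return shannon_entropy(bins.values())
```

Changing `cell_size` or `bin_seconds` changes the histogram and therefore the entropy, which is the parameter sensitivity noted above.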
6.3.6 Spatial Privacy Pricing-based Method

This method is based on the previous work on Spatial Privacy Pricing [84] (SPP) where each measurement has the same value defined by the owner, and the value can be reduced when the measurement is perturbed by noise. The VOI of the trajectory is then calculated by summing up the values of each individual measurement.

With the ability to change the value based on the noise in the measurements, this method can capture the measurement uncertainty and represent degradation. Summing the individual values means this method is similar to the size-based method, but is also sensitive to the perturbation parameters. Therefore, this method has similar characteristics to the size-based method, which means fully capturing size, partially capturing duration, and being unable to capture spatial and temporal distributions and prior knowledge.

6.3.7 Correctness-based Method

This method is based on the correctness of reconstructing the raw measurements from available data [104]. Roughly speaking, when calculating the VOI of a degraded version Z of S, the owner can assume the recipient is attempting to reconstruct each measurement $\mathbf{x}_i \in S$ based on Z. The correctness is the expected error between the actual points $\mathbf{x}_i$ and the probabilistically reconstructed points $\hat{\mathbf{x}}_i$. If the expected error is lower, then it would be reasonable to assume a higher VOI of Z.

While this concept was proposed for a different problem setting, some techniques can be used to adapt it to our problem. In fact, our proposed framework, discussed later, also has a reconstruction step that can capture prior knowledge and degradation. Thus a correctness-based method can capture the prior knowledge and degradation. The asterisks in the correctness-based row in Table 6.1 indicate that a correctness-based method needs some modifications to its problem setting and techniques to fit our problem.
The main drawback of using correctness is that it requires some ground-truth measurements of S to be available to evaluate the correctness of the reconstructed trajectory made from the degraded version Z and prior knowledge. So, it cannot measure the VOI of the full, raw trajectory S, because no ground-truth measurement exists to evaluate correctness in that case. Consequently, since the full, raw trajectory cannot be evaluated, some characteristics are often not fully captured (e.g., subsampling changes the size and truncation changes the duration), and this method can only partially capture characteristics of a trajectory.

Another issue is that this method relies on the correctness of discrete-time predictions, and it is unclear how the correctness of each prediction should be aggregated to obtain the correctness of the whole trajectory. For example, the correctness derived from an 80% subsampled version is evaluated on each measurement of the remaining 20% of the data, while the correctness derived from a 60% subsampled version is evaluated on the remaining 40%. It is unclear how the correctness should be modified to reasonably quantify both, and how the correctness of each prediction should be aggregated.

6.3.8 Other Potential Quantities

There are other potential quantities contributing to the VOI of a trajectory. However, these are not considered as baselines because they are either orthogonal (i.e., they can be used in conjunction with other methods) or complicated (i.e., finding these quantities requires techniques beyond the scope of this work). Examples are time period, subjective sensitivity, and visits.

Time period. The time period in which a trajectory S was taken, e.g., weekdays or weekend, can be a factor contributing to its VOI. For example, a person may often have commute-related trajectories during weekdays but more leisure-related trajectories on weekends.
While time period does not represent actual locations, it can be used in conjunction with other methods, e.g., an owner can have different values for trajectories during weekdays compared to weekends because of privacy concerns.

Subjective sensitivity. Each person may have their own sensitivity for different types of locations, which may lead to different sensitivity for different trajectories. For example, one may feel their workplace is more sensitive than their favorite coffee shop, thus having a higher sensitivity for the trajectory from home to work than the one to the coffee shop. While this information can contribute to the VOI of a trajectory, incorporating it requires additional information about the location measurements, which is not the focus of this work. When this information is available, it can be used in conjunction with the proposed methods in this work to better quantify the VOI tailored to the owner’s subjective reasoning.

Visits. This method uses the number of visits in a trajectory to quantify the VOI. Its main drawback is how to define a visit. It is often not feasible for the owner to manually define all visits for all of their trajectories. On the other hand, complicated techniques to automatically find visits [133] are often used to segment a long sequence of measurements into trajectories, which is beyond the scope of this work. These techniques also often require additional information and/or a complex set of parameters where a small change of some parameters may result in a significantly different number of visits, which is not desirable.

6.4 The Information Gain Framework

This section describes the proposed framework to quantify the VOI of a trajectory. The framework is based on the notion of information gain (IG), which quantifies the reduction of uncertainty when new information is available. Adopting IG for trajectories is not straightforward, and we enable it by transforming each trajectory to a canonical representation.
This transformation is done by employing a reconstruction method that can produce continuous-time probabilistic predictions along the trajectory. The Gaussian process (GP) is used in this work as the reconstruction method and is discussed in Section 6.4.2, but we emphasize that our IG framework accepts any reasonable method of producing probabilistic location inferences.

6.4.1 Trajectory Information Gain

We define the information gain $IG_T(Z; W)$ of a degraded version Z of a trajectory S as the total reduction of uncertainty about the locations of the owner over a time period T compared to the uncertainty from the prior knowledge W. For a typical human trajectory, there are potentially several reasonable choices for T. In this work, T is defined as the entire day covering the trajectory, because a typical trajectory would not extend beyond a day. We propose $IG_T(Z; W)$ to quantify the intrinsic VOI of S and/or its degraded version Z. This section describes how $IG_T(Z; W)$ is derived. In Section 6.5, we will explain how $IG_T(Z; W)$ satisfies almost all of the criteria in Table 6.1, making it a better choice than the previous methods we described.

Recall that the owner can assume that the recipient obtained some prior knowledge W, e.g., from public data or from a previous, noisier release of the same trajectory S. From W, the owner can assume that the recipient can derive a prior probability distribution $P(\mathbf{x} \mid W)$ for the location of the owner at a specific time $\mathbf{x}.t$.

The owner can then assume that after receiving Z, the recipient can use a model to reconstruct (or predict/interpolate) the locations of the owner in continuous time. This assumption is made because, without knowledge of the recipient’s intent, the owner should act conservatively and assume the recipient will exploit the new data fully, such as with a maximally accurate reconstruction. Such an inference can be represented as a posterior distribution $P(\mathbf{x} \mid Z, W)$.
Subsequently, the information gain $IG_t(Z; W)$ at timestamp $t = \mathbf{x}.t$ indicates the reduction of uncertainty from $P(\mathbf{x} \mid W)$ to $P(\mathbf{x} \mid Z, W)$. The uncertainty is measured by differential entropy [21] (or continuous entropy), because both the prior and posterior distributions are likely continuous distributions in space. The differential entropy $h(X)$ of a random variable X with probability density function P whose support is a set $\mathcal{X}$ is defined as

$$h(X) = -\int_{\mathcal{X}} P(x) \log P(x) \, dx \tag{6.7}$$

Several popular probability distributions have a closed-form expression for their differential entropy, e.g., if X is a Gaussian random variable with distribution $\mathcal{N}(\mu, \sigma^2)$, its differential entropy is

$$h(X) = \frac{1}{2} \log\left(2 \pi e \sigma^2\right) \tag{6.8}$$

The entropy values of the latitude and longitude of $\mathbf{x}$ are calculated separately using Equation 6.7 and summed to get $h(\mathbf{x} \mid W)$ and $h(\mathbf{x} \mid Z, W)$. Thus the IG at time t, $IG_t(Z; W)$, can be calculated as

$$IG_t(Z; W) = h(\mathbf{x} \mid W) - h(\mathbf{x} \mid Z, W) \tag{6.9}$$

Figure 6.1a illustrates $IG_t(Z; W)$, where the black dot shows the prediction mean and the blue area shows the uncertainty as a circle whose radius is twice the standard deviation of the predicted distribution $P(\mathbf{x} \mid W)$. If Z is available, the uncertainty is reduced to the smaller blue circle on the right representing $P(\mathbf{x} \mid Z, W)$. $IG_t(Z; W)$ quantifies the reduction and is illustrated as the red ring.

Figure 6.1: Illustrations of information gain at a single timestamp and over a time period. (a) $IG_t(Z; W)$ illustration. (b) $IG_T(Z; W)$ illustration, plotting location against time with the prior, the posterior, and the measurements. The reduced uncertainty is shown as the red area minus the blue area.

Note that $\mathbf{x}$ does not need to be in any trajectory, because the probabilistic reconstruction is continuous in time. For example, $\mathbf{x}$ can be in between two measurements $\mathbf{x}_i, \mathbf{x}_{i+1} \in S$; thus, knowing $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$ would help reduce the uncertainty of where $\mathbf{x}$ can be.
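As a small numerical illustration of Equations 6.8 and 6.9, the sketch below computes the information gain at one timestamp when both the prior and the posterior are Gaussian. It is only a sketch under stated assumptions: the function names are hypothetical, the two spatial dimensions are assumed to be independent with the same standard deviation, and logarithms are taken in base 2 (the base reported in Section 6.5.2), so the result is in bits.

```python
import math


def gaussian_entropy_bits(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2) in bits (Equation 6.8, log base 2)."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * sigma ** 2)


def ig_at_timestamp(prior_sigma: float, posterior_sigma: float, dims: int = 2) -> float:
    """Information gain at one timestamp (Equation 6.9).

    Assumption for illustration: every dimension has the same prior and
    posterior standard deviation, so the per-dimension entropies are simply
    multiplied by the number of dimensions instead of being summed one by one.
    """
    h_prior = dims * gaussian_entropy_bits(prior_sigma)
    h_posterior = dims * gaussian_entropy_bits(posterior_sigma)
    return h_prior - h_posterior


if __name__ == "__main__":
    # Halving the standard deviation gains about one bit per dimension.
    print(ig_at_timestamp(prior_sigma=10.0, posterior_sigma=5.0, dims=1))
    # A very wide prior narrowed down to GPS-level uncertainty in two dimensions.
    print(ig_at_timestamp(prior_sigma=7500.0, posterior_sigma=3.0))
```

Because the entropy of a Gaussian depends only on its variance, the gain here reduces to the logarithm of the ratio of the prior to the posterior standard deviation, summed over dimensions.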
Subsequently, the information gain $IG_T(Z; W)$ over a time period T can be calculated by integrating $IG_t(Z; W)$ over all times $t \in T$, i.e.,

$$IG_T(Z; W) = \int_{t \in T} IG_t(Z; W) \, dt \tag{6.10}$$

Figure 6.1b illustrates $IG_T(Z; W)$ computed over a time period $T = [0, 10]$ with prior $P(\mathbf{x} \mid W) = \mathcal{N}(0, 100)$ for each timestamp. Over this period, the red line shows the prediction mean of the prior distributions, and the red area shows the prior uncertainty. When Z, shown as two black crosses, is available, the uncertainty is reduced to the blue area. The amount of this reduction is shown as the red area minus the blue area and is quantified using $IG_T(Z; W)$.

6.4.2 Reconstruction Method

To compute $IG_T(Z; W)$, a probabilistic reconstruction method is needed to reconstruct the locations of the owner given W and/or Z. A Gaussian process (GP) is used in this work as the reconstruction method because of its flexibility to incorporate different types of information. However, we emphasize that reconstruction is not the focus of this work. The IG framework accepts any reasonable probabilistic reconstruction method. Thus, we only briefly discuss important aspects of the GP. More details about GPs can be found in [97].

For a scalar function $f(t)$, a GP implies that any subset of points sampled from $f(t)$ is distributed according to a multidimensional Gaussian. Two independent GPs are created for longitude and latitude prediction. The inputs to a GP are pairs $\langle t, f(t) \rangle$, where $f(t)$ is the longitude or latitude of the measurement at time t. The output for a set of timestamps $\mathcal{T}$ is a prediction of $f(t)$ for each $t \in \mathcal{T}$. A GP depends on a scalar covariance kernel/function $k(t, t')$ defining how much a measurement at time t correlates with a measurement at time $t'$. In general, the correlation decreases to zero as $|t - t'|$ gets larger. A good kernel can help incorporate different types of information into the model, especially when kernels can be combined.
In our implementation, two common kernels are summed together: the main kernel is a Matérn kernel that captures the relationship between measurements as well as the prior knowledge, and a white kernel captures measurement uncertainty. The formula of the Matérn kernel used in this work is

$$k_{\text{Mat\'ern}}(t, t') = \sigma_f^2 \left(1 + \frac{\sqrt{3}}{l} |t - t'|\right) \exp\left(-\frac{\sqrt{3}}{l} |t - t'|\right) \tag{6.11}$$

Setting $\sigma_f$ to the standard deviation of the prior distribution helps capture the prior knowledge, so that when the model becomes more uncertain, the standard deviation of the prediction gradually approaches this value. The length scale $l$ is trained from data. The formula of a white kernel is

$$k_{\text{white}}(t, t') = \begin{cases} \sigma_m^2 & \text{if } t = t' \\ 0 & \text{otherwise} \end{cases} \tag{6.12}$$

Setting $\sigma_m$ to $\mathbf{x}.\sigma$ or $z.\sigma$ helps capture the measurement uncertainty. The final kernel is

$$k(t, t') = k_{\text{Mat\'ern}}(t, t') + k_{\text{white}}(t, t') \tag{6.13}$$

Finally, a GP also depends on a mean function $m(t)$, which defines the expected mean values of the measurements. In our adaptation, the mean function $m(t)$ of a GP is set to the regression line obtained by running a linear regression on the data in W.

6.5 Evaluations of the IG framework

This section provides an evaluation of how the IG framework can capture each characteristic from Table 6.1, both qualitatively and quantitatively. The quantitative evaluation is performed on the Geolife dataset, which is a large, real-world trajectory dataset.

6.5.1 Dataset

The Geolife dataset [137] is used for experiments to quantitatively justify the claims in this chapter. This is a trajectory dataset collected by 182 people carrying GPS loggers and GPS-phones in the Beijing area from April 2007 to August 2012. A trajectory is represented by a sequence of measurements containing the latitude, longitude, and timestamp, with a variety of sampling rates.
With this large, real-world dataset, covering a large span in both space and time, different types of devices, and a variety of sampling rates, we expect it to be representative of the observations in other real-world datasets.

Measurements in the Geolife dataset are filtered to retain only those within the Beijing area, with longitude from 116.20 to 116.55 degrees and latitude from 39.80 to 40.06 degrees. This area covers almost all measurements and, in total, more than 16 million measurements were retained from all 182 people. The latitude/longitude coordinates in each measurement are converted to local Euclidean coordinates $(x, y)$ in meters with the reference origin $(lon_0, lat_0)$ arbitrarily chosen as the center of the aforementioned area. In local Euclidean coordinates, the area has a lower left coordinate of $(-15000, -15000)$ and an upper right coordinate of $(15000, 15000)$.

Measurements of each individual are separated into trajectories by iterating through the ordered measurements and creating a new trajectory whenever the time gap between the current measurement and the next measurement is more than $\tau$ seconds. The maximum time gap $\tau$ is arbitrarily chosen as 300 seconds (5 minutes). The actual value of $\tau$ does not significantly change the experimental results. We also emphasize that finding trajectories is not the focus of this work. Thus, more sophisticated methods to find trajectories from measurements can also be used as an alternative to this segmentation approach. The final dataset has 45,831 trajectories.

6.5.2 Experiment Setup

The experiments were conducted on the aforementioned dataset with various sets of parameters. The results, such as the correlation coefficient or regression line, are computed from all trajectories. Several outliers are removed for visual presentation purposes but are still included in all computations. The default size of grid cells/temporal bins to calculate spatial/temporal entropy is 10 meters/one minute.
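The time-gap segmentation described in Section 6.5.1 can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the `Measurement` tuple layout and the function name `split_by_time_gap` are assumptions made for this example, and the input is assumed to be already sorted by timestamp for a single individual.

```python
from typing import List, Tuple

# Assumed layout for illustration: (timestamp in seconds, x in meters, y in meters).
Measurement = Tuple[float, float, float]


def split_by_time_gap(
    measurements: List[Measurement], max_gap: float = 300.0
) -> List[List[Measurement]]:
    """Split time-ordered measurements into trajectories.

    A new trajectory starts whenever the gap between consecutive
    measurements is strictly greater than max_gap seconds.
    """
    trajectories: List[List[Measurement]] = []
    current: List[Measurement] = []
    for m in measurements:
        if current and m[0] - current[-1][0] > max_gap:
            trajectories.append(current)
            current = []
        current.append(m)
    if current:
        trajectories.append(current)
    return trajectories


if __name__ == "__main__":
    # Made-up timestamps; the 350-second gap between 350 and 700 forces a split.
    points = [(t, 0.0, 0.0) for t in (0.0, 100.0, 350.0, 700.0, 800.0)]
    for trajectory in split_by_time_gap(points):
        print([m[0] for m in trajectory])
```

A gap of exactly 300 seconds does not start a new trajectory here, which follows the "more than" wording of the rule.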
The raw measurements do not include uncertainty. However, because they were recorded by GPS-equipped devices and the noise for GPS-equipped smartphones is about 3 meters [94], each measurement is assumed to have $\mathbf{x}_i.\sigma = 3$. For perturbation degradation, the noise is added such that the total noise (i.e., $z_i.\sigma$) is in $\{3, 10, 100, 200, 300, 400\}$ meters. The ratios for truncation and subsampling degradation (i.e., $\alpha_t$ and $\alpha_s$) are $\{0.8, 0.6, 0.4, 0.2, 0.05\}$, which mean $\{80\%, 60\%, 40\%, 20\%, 5\%\}$ of the raw trajectory. The subsampling is implemented so that the subsampled data with a higher ratio is a superset of the subsampled data with a smaller ratio.

Figure 6.2: An illustration of how IG can capture trajectory characteristics. The panels, from left to right, show the effect of measurement uncertainty, duration, size, temporal distribution, and spatial distribution of a trajectory. Each panel plots location against time with the prior, the posterior, and the measurements: (a) small size, large uncertainty, IG = 9.2; (b) small size, small uncertainty, IG = 16.9; (c) small size, long duration, IG = 29.1; (d) large size, clustered around the first point, IG = 34.1; (e) same size as before, scattered, IG = 45.1; (f) same size and times as before, more locations, IG = 46.8.

A measurement is assumed to be inside the Beijing area. This choice is reasonable for this dataset. However, it is not crucial to the framework, where any reasonable prior knowledge can be used (e.g., measurements can be anywhere on Earth) and would only change the scale of IG.
This prior knowledge is represented by a Gaussian distribution with extremely high variance where the mean is at the center of the area (i.e., coordinates $(0, 0)$) and the standard deviation is $\sigma_0 = 7500$ meters (i.e., the distance from the center to an edge of the area is twice this standard deviation). This is an uninformative prior and is called the Gaussian prior, i.e., for each timestamp, the location is assumed to be $\mathcal{N}(0, \sigma_0^2)$ for each dimension.

It is also natural to consider any previously released data as prior knowledge when computing the VOI of the data the owner still retains. So we also consider previous releases as priors. This also demonstrates an advantage of the IG framework, where any previous release can be naturally considered when computing the VOI of the owner’s remaining data. To illustrate these cases, depending on the nature of a degradation, different informative priors based on previous releases are considered. For perturbation, two additional priors are the 400m noise and 300m noise priors, illustrating the cases where the owner released their trajectories at those noise levels before. For example, with the 400m noise prior, when computing the IG for a degraded version $Z_{100}$ with 100m noise of a trajectory S, the owner may consider the prior knowledge to be the degraded version $Z_{400}$ with 400m noise of S. Thus, there are three priors for perturbation in total: the uninformative Gaussian prior and two informative priors, 400m noise and 300m noise. Similarly, there are three priors for truncation/subsampling: the uninformative Gaussian prior and two informative priors, 5% trajectory and 20% trajectory, illustrating the cases when the owner released trajectories with a 0.05 or 0.2 truncation/subsampling ratio before.

The GP is trained on the prior W (e.g., the 300m noise prior), except when the prior is the uninformative Gaussian prior, in which case it is trained on the new degraded version Z.
The degraded version Z and prior W are combined and then provided to the GP. For perturbation, the measurements of Z and W at each timestamp are combined using inverse-variance weighting [52]. For truncation and subsampling, because the higher-ratio degraded version is always a superset of a lower-ratio degraded version (e.g., $Z_{5\%} \subset Z_{20\%} \subset Z_{40\%}$), the combined data is Z. This is a reasonable approach for the GP. There can be other methods to better combine a prior and new data, especially if another prediction model is used instead of the GP; however, this is not the focus of this work. While we expect similar trends, especially from models similar to the GP such as a Kalman filter [60] or a particle filter [31], absolute values from other models can be slightly different.

For the GP kernels, $\sigma_0 = 7500$ is provided to the Matérn kernel. The total noise $z.\sigma$ is provided to the white kernel. The length scale $l$ can take values in $[0.01, 10]$ and is trained separately for each degraded version. IG over the time period is computed using numerical integration with the trapezoid rule. Finally, logarithms are base 2.

6.5.3 IG Capturing Trajectory Characteristics

For qualitative evaluation, Figure 6.2 illustrates the IG for different cases in one spatial dimension. Starting from left to right with a trajectory having two noisy measurements close to each other in Figure 6.2a, the IG increases as expected when the measurements are less noisy, as in Figure 6.2b. When the duration increases, illustrated by two measurements being farther apart in Figure 6.2c, the IG also increases because these two measurements can also help reduce uncertainty for the longer time in between them. When more measurements are available, as in Figure 6.2d, these points help reduce more uncertainty around them, thus increasing the IG further.
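The one-dimensional cases above can be imitated with a small numerical sketch that combines the kernels of Equations 6.11 to 6.13 with the trapezoid rule for Equation 6.10. It is not the implementation used for the experiments and does not reproduce the IG values printed in Figure 6.2. The assumptions are: a zero mean function, a prior standard deviation of 10 (matching the $\mathcal{N}(0, 100)$ prior of Figure 6.1b), a fixed length scale of 1.0 instead of a trained one, the posterior of the noise-free latent function, and hypothetical function names throughout.

```python
import numpy as np


def matern32(t1, t2, sigma_f, length):
    """Matern kernel of Equation 6.11 between two 1-D arrays of times."""
    d = np.abs(t1[:, None] - t2[None, :])
    a = np.sqrt(3.0) * d / length
    return sigma_f ** 2 * (1.0 + a) * np.exp(-a)


def posterior_std(t_obs, sigma_m, t_query, sigma_f, length):
    """Posterior standard deviation of the latent GP at the query times."""
    # Equation 6.13: the white kernel adds sigma_m^2 on the diagonal.
    K = matern32(t_obs, t_obs, sigma_f, length) + sigma_m ** 2 * np.eye(len(t_obs))
    K_s = matern32(t_query, t_obs, sigma_f, length)
    solved = np.linalg.solve(K, K_s.T)  # shape: (n_obs, n_query)
    var = sigma_f ** 2 - np.sum(K_s * solved.T, axis=1)
    return np.sqrt(np.clip(var, 1e-12, None))


def information_gain(t_obs, sigma_m, sigma_f=10.0, length=1.0,
                     t_start=0.0, t_end=10.0, n=1001):
    """Approximate Equation 6.10 in one dimension, in bits."""
    t = np.linspace(t_start, t_end, n)
    post = posterior_std(np.asarray(t_obs, dtype=float), sigma_m, t, sigma_f, length)
    # For Gaussians, h(prior) - h(posterior) = log2(sigma_prior / sigma_posterior).
    ig_t = np.log2(sigma_f / post)
    dt = t[1] - t[0]
    # Trapezoid rule written out by hand to avoid NumPy version differences.
    return float(np.sum(0.5 * (ig_t[1:] + ig_t[:-1]) * dt))


if __name__ == "__main__":
    print("two close, noisy  :", information_gain([4.5, 5.5], sigma_m=5.0))
    print("two close, precise:", information_gain([4.5, 5.5], sigma_m=1.0))
    print("two far apart     :", information_gain([2.0, 8.0], sigma_m=1.0))
    print("five clustered    :", information_gain([2.0, 2.1, 2.2, 2.3, 2.4], sigma_m=1.0))
    print("five scattered    :", information_gain([1.0, 3.0, 5.0, 7.0, 9.0], sigma_m=1.0))
```

With fixed hyperparameters, the posterior variance of a GP depends only on the measurement times and noise, not on the measured locations. This sketch can therefore only mimic the uncertainty, duration, size, and temporal-distribution effects; any dependence on the locations themselves would have to come from fitting the length scale to the data, as the setup in Section 6.5.2 does.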
The IG keeps increasing when the measurements are temporally distributed more evenly, as shown in Figure 6.2e, because, for most of the duration along the trajectory, there are temporally nearby measurements, thus helping reduce uncertainty. Finally, in Figure 6.2f, the IG also slightly increases when measurements appear in between the start and end locations (shown on the vertical axis), thus making the movement less abrupt. However, if the locations jump unpredictably, the uncertainty tends to increase compared to, e.g., when the person stays at the same place, thus decreasing the IG for that case. Therefore, the IG can only partially capture the spatial distribution, but it can capture the size, duration, temporal distribution, and measurement uncertainty, as reflected in Table 6.1.

Figure 6.3: Histogram of IG and size, duration, spatial, temporal entropy, and their correlation coefficients. Each panel plots IG against one characteristic with a regression line, with the number of trajectories shown on a log scale: (a) size, r = 0.85; (b) duration, r = 0.97; (c) temporal entropy (1-minute time bins), r = 0.96; (d) spatial entropy (10 m x 10 m cells), r = 0.68.

For quantitative evaluation using the Geolife data, the correlation coefficients of the IG with size, duration, and temporal entropy are 0.85, 0.97, and 0.96, respectively, which show very strong correlation. The correlation coefficient of the IG with spatial entropy is 0.68, which is also close to a strong correlation.
Figure 6.3 shows the histograms of the IG and these characteristics on a log scale of the number of trajectories, along with a linear regression line indicating a clear trend between the IG and these characteristics. The regression uses Huber loss [56] to mitigate the effect of outliers. The quantitative evaluation shown in Figure 6.4 (explained in detail in the next section) also shows that the IG can capture measurement uncertainty, because when a trajectory has large measurement uncertainty (i.e., is less informative), the IG also tends to decrease.

6.5.4 IG Capturing Prior Knowledge and Degradation

The impact of prior knowledge and degradation can be effectively captured by the IG framework. In general, when a trajectory is degraded more (e.g., larger total noise), the IG tends to decrease, matching the intuition that a lower quality trajectory gives less information. Similar trends show in both the absolute value $IG(Z; W)$ and the ratio $IG(Z; W) / IG(S; W)$ of the IG compared to the raw trajectory S.

Figure 6.4: Absolute IG and percentage IG change compared to original trajectories for different values of total noise and different types of prior knowledge. (a) Absolute IG (thousands) against total noise (3 to 400 meters). (b) Percentage IG change vs. the full trajectory against total noise (10 to 400 meters). Both panels show the Gaussian prior, the 400m noise prior, and the 300m noise prior.

We first discuss the results for perturbation. Figure 6.4 shows the IG value $IG(Z; W)$ and the IG ratio $IG(Z; W) / IG(S; W)$ in percentage with different perturbations and prior knowledge. As discussed in Section 6.5.2, the total noise ranges from 3 to 400 meters, and the prior knowledge scenarios are the Gaussian, 400m noise, and 300m noise priors. Each box plot in Figure 6.4a (or 6.4b) shows the IG value (or the IG ratio) of Z with the total noise shown on the x axis given particular prior knowledge.
For example, the red, dotted box plot at total noise 100 in Figure 6.4a shows the IG values of a degraded version $Z_{100}$ of S with total noise 100 meters, given that the prior knowledge is another previously released degraded version $Z_{400}$ of S with total noise 400 meters. It is clear that when the total noise becomes larger, i.e., more degradation, the IG value and the IG ratio tend to decrease.

With a more informative prior, the absolute IG value tends to be smaller. For example, the 400m noise prior is a more informative prior than the Gaussian prior, thus the absolute IG value of the same Z with the 400m noise prior is smaller than with the Gaussian prior. This also supports intuition: if one already has good information about the location of a person, getting more data does not increase such information as much as when one only has little information. This is an important property that does not exist in other baselines.

Figure 6.5: Absolute IG and percentage IG change compared to original trajectories for different truncation ratios and types of prior knowledge. (a) Absolute IG (thousands) against truncation ratio (100% to 5%). (b) Percentage IG change vs. the full trajectory against truncation ratio (80% to 5%). Both panels show the Gaussian prior, the 5% trajectory prior, and the 20% trajectory prior.

Figure 6.6: Absolute IG and percentage IG change compared to original trajectories for different subsampling ratios and types of prior knowledge. (a) Absolute IG (thousands) against subsampling ratio (100% to 5%). (b) Percentage IG change vs. the full trajectory against subsampling ratio (80% to 5%). Both panels show the Gaussian prior, the 5% trajectory prior, and the 20% trajectory prior.

Truncation and subsampling degradation also show similar observations as those from perturbation.
Figures 6.5 and 6.6 show the IG value and the percentage of IG change for truncation and subsampling with different prior knowledge, respectively. The truncation/subsampling ratios range from 0.8 to 0.05, which means 80% down to 5% of the original trajectory is retained. Because of the 0.05 (or 1/20) ratio, these figures show the results from almost 40,000 trajectories that have at least 20 measurements. As discussed in Section 6.5.2, the prior knowledge scenarios are the Gaussian prior, the 5% trajectory prior, and the 20% trajectory prior. The IG also tends to decrease when the trajectory is degraded more (i.e., lower ratio) and when the prior gets better (e.g., 5% vs. 20% priors).

The decrease in IG with truncation is much stronger than with subsampling. The reason is that knowing a measurement not only reduces the uncertainty at the time of the measurement but also for the time around that measurement. Thus, when measurements are more evenly distributed, they help reduce more uncertainty, which means a higher IG, as illustrated in Section 6.5.3. With the same ratio, which means the same number of measurements are retained, a truncation is similar to the case when measurements are close to each other (Figure 6.2d), while (uniform) subsampling is similar to the case when measurements are more evenly distributed (Figure 6.2e). This property also highlights the difference and advantage of IG for quantifying the VOI of a trajectory compared to other baselines.

Another observation is that, by effectively capturing the impact of degradation, one can find the equivalence classes of different types of degradation for the same trajectory. For instance, given S, which truncation ratio produces the same VOI as perturbing S to a total noise of 300 meters? Figure 6.7 shows an example of such VOI equivalence for the first trajectory of the first owner in our dataset.
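One way to look up such an equivalence is linear interpolation over measured (degradation level, percentage IG) pairs. The sketch below is only an illustration: the function name `level_for_ig` is hypothetical, the IG percentages are made-up numbers and not results from this work, and each curve is assumed to be monotonic so that the lookup is well defined.

```python
import numpy as np


def level_for_ig(levels, ig_percent, target):
    """Interpolate the degradation level whose IG percentage equals target.

    Assumes ig_percent is monotonic in levels. Targets outside the
    measured range are clamped to the nearest end by np.interp.
    """
    levels = np.asarray(levels, dtype=float)
    ig_percent = np.asarray(ig_percent, dtype=float)
    order = np.argsort(ig_percent)  # np.interp needs increasing x values
    return float(np.interp(target, ig_percent[order], levels[order]))


if __name__ == "__main__":
    # Made-up curves for one hypothetical trajectory.
    truncation_ratio = [0.05, 0.2, 0.4, 0.6, 0.8, 1.0]
    truncation_ig = [10.0, 30.0, 50.0, 65.0, 82.0, 100.0]
    noise_meters = [3.0, 10.0, 100.0, 200.0, 300.0, 400.0]
    noise_ig = [100.0, 90.0, 75.0, 62.0, 55.0, 50.0]

    # Percentage IG at 300 m total noise, then the truncation ratio matching it.
    ig_at_300 = float(np.interp(300.0, noise_meters, noise_ig))
    print(level_for_ig(truncation_ratio, truncation_ig, ig_at_300))
```

The same helper can be used in the other direction, e.g., to find the total noise that matches a given truncation ratio, by passing the perturbation curve instead.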
The intersection between the (interpolated) truncation and perturbation lines indicates that truncating this trajectory at 65% gives roughly similar VOI to perturbing it to a total noise of 130 meters.

Figure 6.7: An example of VOI equivalence classes of different types of degradation for a trajectory. The plot shows the percentage IG change against the subsampling/truncation ratio (1% to 100%) and against the total noise (0 to 400 meters) for perturbation, truncation, and subsampling.

6.6 Related Work

Quantifying the value of location data has been an active line of research. There have been several surveys of how individuals value their location data [24, 110]. Aly et al. [5] showed how the value of a location data point can be quantified from the buyer’s perspective in a geo-marketplace. Also in a geo-marketplace context, Nguyen et al. [84] proposed a framework allowing sellers to offer a location data point at different qualities for different prices. However, previous work focused on the monetary value of location data. To the best of our knowledge, this is the first work that attempts to quantify the intrinsic VOI of a trajectory in its most basic form, which is a sequence of location measurements.

There is also extensive work on quantifying location privacy, which also attempted to reconstruct the locations of individuals. Krumm [69] surveyed a variety of computational location privacy schemes and emphasized the importance of finding a single quantifier for location privacy. Shokri et al. [104] proposed correctness as the metric for quantifying location privacy. Location privacy and the VOI of a trajectory can be related, e.g., a trajectory with high VOI may reveal more about a person than one with lower VOI, thus leaking privacy.
However, location privacy does not exactly correspond to the intrinsic VOI of a trajectory: the former is more concerned with semantic locations and is often more subjective, while the latter is more concerned with the location coordinates over time. Correctness and subjective sensitivity, which are related to location privacy, were also discussed in detail in Sections 6.3.7 and 6.3.8.

6.7 Chapter Summary

The intrinsic VOI of trajectories plays an important role in many applications. We proposed the IG framework and qualitatively and quantitatively demonstrated its capability to effectively capture important characteristics of raw trajectories, prior knowledge, and various types of degradation. This shows that the IG is an appropriate framework to quantify the intrinsic VOI of trajectories. There are several potential directions for future work, such as incorporating other features (e.g., points of interest) or supporting other degradation types (e.g., histograms). The IG framework allows the VOI of trajectories to be quantified appropriately for comparison between trajectories with widely different characteristics and degradation, thus allowing reasonable pricing strategies to be derived from the VOI.

Chapter 7

Conclusion and Future Work

This thesis shows that geo-marketplaces can provide a feasible alternative to the current practice of location data collection. Geo-marketplaces allow data owners to control their data and get rewarded for their contribution, especially with regard to their location privacy concerns. We studied different settings of geo-marketplaces, which cover a wide range of data types, data operations, and pricing strategies.

We first explored the setting of releasing data points with free query by introducing the problem of publishing the entropy of a set of locations according to differential privacy. Several methods were proposed and evaluated on both synthetic and real-world datasets.
With those extensive experiments, we concluded that the proposed techniques are practical. Future directions for this setting may consider improving the utility of the published location entropy, increasing the privacy guarantee with a smaller privacy budget, or publishing other statistics from location data.

Next, we considered a different setting where not only location data points but also other geo-tagged objects such as images or videos can be advertised and purchased through a geo-marketplace for a fixed price. We proposed a blockchain-based privacy-preserving, accountable, and spam-resilient marketplace for geospatial data which allows owners and buyers to be matched using only encrypted location information. In future work, we can investigate architectures that eliminate the need for trusted entities or reduce the amount of information that is made available to such entities. In addition, we can explore alternative data encodings and encrypted processing techniques to further reduce system overhead. We can also investigate how our results can be extended to other types of data, not only geospatial attributes.

Next, we considered another setting of geo-marketplaces where sellers can degrade their data points by adding some noise and then sell their noisy data, potentially at a price corresponding to the amount of noise added. We illustrated the interplay between privacy, utility, and price with an example scenario of a buyer who is considering opening a restaurant. Our formulation results in a new geospatial problem for optimizing a buyer’s decision-making process. Using our formula for “expected incremental profit”, we introduced two related algorithms for specifying which location points a buyer should buy at which prices. Future work should look at other scenarios that may illustrate new nuances to the central problem.
In addition, while the proposed algorithms, SIP and SIP-T, consistently outperformed the baselines, other algorithms could also be explored to further optimize the buyer’s decisions. We should also explore other aspects of the spatial privacy pricing problem. For example, users’ valuation of their data may change when they can observe selling and buying actions in a data marketplace, so we would explore the dynamics of users’ valuation given the interactions in the marketplace. Other privacy-preserving mechanisms or other ways of ensuring privacy beyond adding noise (e.g., encryption) can also be employed. We would also further study the role of the marketplace and other types of buyer queries.

Then, we considered another geo-marketplace setting where sellers’ trajectories can be degraded and released or sold at different prices. This brought us to a more generic problem of quantifying the value of information of a trajectory in a principled way that allows comparison between trajectories with widely different characteristics. We proposed a framework based on information gain (IG) as a principled approach to solve this problem. Qualitative and extensive quantitative evaluations show that the IG framework is capable of effectively capturing important characteristics contributing to the VOI of trajectories. Future directions may include incorporating other characteristics of trajectories, especially when trajectories may be captured with other types of data such as images or videos, identifying the best reconstruction method for trajectory data, or incorporating other subjective knowledge for the value of information of trajectories.

Other future directions for this thesis include considering other geo-marketplace settings, such as encrypted trajectories or publishing trajectories with differential privacy and deriving prices accordingly. Other data types can also be considered, such as aerial trajectories or aerial videos.
Different pricing strategies or the applications of buyers, such as machine learning tasks, can also be promising future directions.

References

[1] 112th Congress. Location privacy protection act of 2012, 2012.
[2] Martín Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. arXiv:1607.00133, 2016.
[3] Eytan Adar and Bernardo A Huberman. A market for secrets. First Monday, 2001.
[4] Berker Agir, Thanasis G Papaioannou, Rammohan Narendula, Karl Aberer, and Jean-Pierre Hubaux. User-side adaptive protection of location privacy in participatory sensing. GeoInformatica, 18(1):165–191, 2014.
[5] Heba Aly, John Krumm, Gireeja Ranade, and Eric Horvitz. On the value of spatiotemporal information: Principles and scenarios. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 179–188, 2018.
[6] Heba Aly, John Krumm, Gireeja Ranade, and Eric Horvitz. To buy or not to buy: Computing value of spatiotemporal information. ACM TSAS, 5(4):1–25, 2019.
[7] Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. Geo-indistinguishability: Differential privacy for location-based systems. In the 2013 ACM SIGSAC CCS’13, pages 901–914, New York, NY, USA, 2013. ACM.
[8] Arie Trouw, Markus Levin, and Scott Scheper. The xy oracle network: The proof-of-origin based cryptographic location network. https://docs.xyo.network/XYO-White-Paper.pdf.
[9] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128–138. ACM, 2005.
[10] Dan Boneh, Eu-Jin Goh, and Kobbi Nissim. Evaluating 2-dnf formulas on ciphertexts. In TCC’05, pages 325–341, Berlin, Heidelberg, 2005. Springer-Verlag.
[11] Dan Boneh, Amit Sahai, and Brent Waters. Fully collusion resistant traitor tracing with short ciphertexts and private keys. In Proc. of Intl. Conf.
on The Theory and Applications of Cryptographic Techniques, pages 573–592, 2006.
[12] Dan Boneh and Brent Waters. Conjunctive, subset, and range queries on encrypted data. In TCC’07, pages 535–554, Berlin, Heidelberg, 2007. Springer-Verlag.
[13] Giacomo Brambilla, Michele Amoretti, and Francesco Zanichelli. Using block chain for peer-to-peer proof-of-location. CoRR, abs/1607.00174, 2016.
[14] David Cash, Stanislaw Jarecki, Charanjit Jutla, Hugo Krawczyk, Marcel-Cătălin Roşu, and Michael Steiner. Highly-scalable searchable symmetric encryption with support for boolean queries. In CRYPTO’13, pages 353–373. Springer, 2013.
[15] Pablo Samuel Castro, Daqing Zhang, and Shijian Li. Urban traffic modelling and prediction using large scale taxi gps traces. In International Conference on Pervasive Computing, pages 57–72. Springer, 2012.
[16] Dario Catalano and Dario Fiore. Vector commitments and their applications. In Public-Key Cryptography–PKC 2013, pages 55–72. Springer, 2013.
[17] Mike Y Chen, Timothy Sohn, Dmitri Chmelev, Dirk Haehnel, Jeffrey Hightower, Jeff Hughes, Anthony LaMarca, Fred Potter, Ian Smith, and Alex Varshavsky. Practical metropolitan-scale positioning for gsm phones. In International Conference on Ubiquitous Computing, pages 225–242. Springer, 2006.
[18] Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friendship and mobility: user movement in location-based social networks. In SIGKDD, pages 1082–1090. ACM, 2011.
[19] Delphine Christin, Andreas Reinhardt, Salil S Kanhere, and Matthias Hollick. A survey on privacy in mobile participatory sensing applications. Journal of systems and software, 84(11):1928–1946, 2011.
[20] Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization (LION 2011), pages 433–445. Springer, 2011.
[21] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[22] Justin Cranshaw, Eran Toch, Jason Hong, Aniket Kittur, and Norman Sadeh. Bridging the gap between physical location and online social networks. In UbiComp. ACM, 2010.
[23] Reza Curtmola, Juan Garay, Seny Kamara, and Rafail Ostrovsky. Searchable symmetric encryption: improved definitions and efficient constructions. Journal of Computer Security, 19(5):895–934, 2011.
[24] Dan Cvrcek, Marek Kumpost, Vashek Matyas, and George Danezis. A study on the value of location privacy. In Proceedings of the 5th ACM workshop on Privacy in electronic society, pages 109–118, 2006.
[25] Yanan Da, Ritesh Ahuja, Li Xiong, and Cyrus Shahabi. React: Real-time contact tracing and risk monitoring via privacy-enhanced mobile tracking. In 2021 IEEE 37th International Conference on Data Engineering (ICDE), pages 2729–2732. IEEE, 2021.
[26] Emiliano De Cristofaro and Claudio Soriente. Participatory privacy: Enabling privacy in participatory sensing. IEEE network, 27(1):32–36, 2013.
[27] Yves-Alexandre de Montjoye, César A. Hidalgo, Michel Verleysen, and Vincent D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 2013.
[28] Subhankar Dhar and Upkar Varshney. Challenges and business models for mobile location-based services and advertising. Communications of the ACM, 54(5):121–128, 2011.
[29] Frank Van Diggelen. Gnss accuracy: Lies, damn lies, and statistics. GPS world, 18(1):26–33, 2007.
[30] Mario Dobrovnik, David Herold, Elmar Fürst, and Sebastian Kummer. Blockchain for and in logistics: What to adopt and where to start. Logistics, 2(3):18, 2018.
[31] Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer, 2001.
[32] Cynthia Dwork. Differential privacy. In Automata, languages and programming, pages 1–12. Springer, 2006.
[33] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith.
Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284. Springer, 2006.
[34] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2014.
[35] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In SIGSAC, pages 1054–1067. ACM, 2014.
[36] Ethersphere. Interplanet file system. https://ipfs.io/.
[37] Ethersphere. Swarm. https://swarm-guide.readthedocs.io/en/latest/introduction.html.
[38] M. A. Ferrag, M. Derdour, M. Mukherjee, A. Derhab, L. Maglaras, and H. Janicke. Blockchain technologies for the internet of things: Research issues and challenges. IEEE Internet of Things Journal, pages 1–1, 2018.
[39] Lisa K Fleischer and Yu-Han Lyu. Approximately optimal auctions for selling privacy when costs are correlated with data. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 568–585, 2012.
[40] Foamspace Corp. Foam white paper. Available online at https://foam.space/publicAssets/FOAM Whitepaper.pdf.
[41] Sheng Gao, Jianfeng Ma, Weisong Shi, and Guoxing Zhan. Ltppm: a location and trajectory privacy protection mechanism in participatory sensing. Wireless Communications and Mobile Computing, 15(1):155–169, 2015.
[42] Sheng Gao, Jianfeng Ma, Weisong Shi, Guoxing Zhan, and Cong Sun. Trpf: A trajectory privacy-preserving framework for participatory sensing. IEEE Transactions on Information Forensics and Security, 8(6):874–887, 2013.
[43] Lauren N Gase, Gabrielle Green, Christine Montes, and Tony Kuo. Understanding the density and distribution of restaurants in los angeles county to inform local public health practice. Preventing chronic disease, 16, 2019.
[44] Johannes Gehrke, Michael Hay, Edward Lui, and Rafael Pass. Crowd-blending privacy. In Advances in Cryptology, pages 479–496. Springer, 2012.
[45] Gabriel Ghinita, Panos Kalnis, Ali Khoshgozaran, Cyrus Shahabi, and Kian-Lee Tan. Private queries in location based services: anonymizers are not necessary. In SIGMOD, pages 121–132. ACM, 2008.
[46] Gabriel Ghinita and Razvan Rughinis. An efficient privacy-preserving system for monitoring mobile users: Making searchable encryption practical. In CODASPY ’14, pages 321–332, New York, NY, USA, 2014. ACM.
[47] Arpita Ghosh and Aaron Roth. Selling privacy at auction. Games and Economic Behavior, 91:334–346, 2015.
[48] Eu-Jin Goh. Secure indexes. IACR Cryptology ePrint Archive, page 216, 2003.
[49] Marco Gruteser and Dirk Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In MobiSys, pages 31–42. ACM, 2003.
[50] Anupam Gupta, Haotian Jiang, Ziv Scully, and Sahil Singla. The markovian price of information. In IPCO’19, pages 233–246. Springer, 2019.
[51] Ruchika Gupta and Udai Pratap Rao. Achieving location privacy through cast in location based services. Journal of Communications and Networks, 19(3):239–249, 2017.
[52] Joachim Hartung, Guido Knapp, and Bimal K Sinha. Statistical meta-analysis with applications, volume 738. John Wiley & Sons, 2011.
[53] Haibo Hu, Jianliang Xu, Qian Chen, and Ziwei Yang. Authenticating location-based services without compromising location privacy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 301–312, 2012.
[54] Kuan Lun Huang, Salil S Kanhere, and Wen Hu. Towards privacy-sensitive participatory sensing. In 2009 IEEE International Conference on Pervasive Computing and Communications, pages 1–6. IEEE, 2009.
[55] Kuan Lun Huang, Salil S Kanhere, and Wen Hu. Preserving privacy in participatory sensing systems. Computer Communications, 33(11):1266–1280, 2010.
[56] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
[57] Hissu Hyvärinen, Marten Risius, and Gustav Friis.
A blockchain-based approach towards overcoming financial fraud in public sector services. Business & Information Systems Engineering, 59(6):441–456, 2017.
[58] Yaxian Ji, Junwei Zhang, Jianfeng Ma, Chao Yang, and Xin Yao. Bmpls: Blockchain-based multi-level privacy-preserving location sharing scheme for telecare medical information systems. Journal of Medical Systems, 42(8):147, Jun 2018.
[59] Hongbo Jiang, Jie Li, Ping Zhao, Fanzi Zeng, Zhu Xiao, and Arun Iyengar. Location privacy-preserving mechanisms in location-based services: A comprehensive survey. ACM Computing Surveys (CSUR), 54(1):1–36, 2021.
[60] Rudolph Emil Kalman et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
[61] Yaron Kanza and Hanan Samet. An online marketplace for geosocial data. In Proceedings of the 23rd SIGSPATIAL’15, pages 1–4, 2015.
[62] Leyla Kazemi and Cyrus Shahabi. A privacy-aware framework for participatory sensing. ACM Sigkdd Explorations Newsletter, 13(1):43–51, 2011.
[63] Leyla Kazemi and Cyrus Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In SIGSPATIAL 2012, pages 189–198. ACM, 2012.
[64] Aggelos Kiayias, Stavros Papadopoulos, Nikos Triandopoulos, and Thomas Zacharias. Delegatable pseudorandom functions and applications. In CCS’13, pages 669–684. ACM, 2013.
[65] Hidetoshi Kido, Yutaka Yanagisawa, and Tetsuji Satoh. Protection of location privacy using dummies for location-based services. In 21st International conference on data engineering workshops (ICDEW’05), pages 1248–1248. IEEE, 2005.
[66] Fabian Knirsch, Andreas Unterweger, and Dominik Engel. Privacy-preserving blockchain-based electric vehicle charging with dynamic tariff decisions. Computer Science - Research and Development, 33(1):71–79, Feb 2018.
[67] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In WWW, pages 171–180. ACM, 2009.
[68] Ahmed Kosba, Andrew Miller, Elaine Shi, Zikai Wen, and Charalampos Papamanthou. Hawk: The blockchain model of cryptography and privacy-preserving smart contracts. In 2016 IEEE symposium on security and privacy (SP), pages 839–858. IEEE, 2016.
[69] John Krumm. A survey of computational location privacy. Personal and Ubiquitous Computing, 13(6):391–399, 2009.
[70] Nallapaneni Manoj Kumar and Pradeep Kumar Mallick. Blockchain technology for security issues and challenges in iot. Procedia Computer Science, 132:1815–1823, 2018.
[71] Ponnurangam Kumaraguru and Lorrie Faith Cranor. Privacy indexes: A survey of westin’s studies. Technical Report CMU-ISRI-5-138, Carnegie Mellon University, 2005.
[72] Tsung-Ting Kuo and Lucila Ohno-Machado. Modelchain: Decentralized privacy-preserving healthcare predictive modeling framework on private blockchain networks. arXiv preprint arXiv:1802.01746, 2018.
[73] Shangqi Lai, Sikhar Patranabis, Amin Sakzad, Joseph K. Liu, Debdeep Mukhopadhyay, Ron Steinfeld, Shi-Feng Sun, Dongxi Liu, and Cong Zuo. Result pattern hiding searchable encryption for conjunctive queries. In CCS ’18, pages 745–762, New York, NY, USA, 2018. ACM.
[74] Kenneth Wai-Ting Leung, Dik Lun Lee, and Wang-Chien Lee. Personalized web search with location preferences. In ICDE, pages 701–712. IEEE, 2010.
[75] L. Li, J. Liu, L. Cheng, S. Qiu, W. Wang, X. Zhang, and Z. Zhang. Creditcoin: A privacy-preserving blockchain-based incentive announcement network for communications of smart vehicles. IEEE Transactions on Intelligent Transportation Systems, 19(7):2204–2220, July 2018.
[76] Srdjan Capkun Lionel Wolberger, Allon Mason. Platin, proof of location blockchain. https://platin.io/assets/whitepaper/Platin Whitepaper 2.2.2.pdf.
[77] Bozhong Liu, Ling Chen, Xingquan Zhu, Ying Zhang, Chengqi Zhang, and Weidong Qiu. Protecting location privacy in spatial crowdsourcing using encrypted data. Advances in Database Technology-EDBT, 2017.
[78] Fysical Technologies Pte. Ltd.
Fysical: A decentralized location data market. https://view.attach.io/SJm3DCJPG.
[79] John C Melaniphy. The restaurant location guidebook: a comprehensive guide to selecting restaurant & quick service food locations. International Real Estate Location Institute, 2007.
[80] Krešimir Mišura and Mario Žagar. Data marketplace for internet of things. In 2016 International Conference on Smart Systems and Technologies (SST), pages 255–260. IEEE, 2016.
[81] Mohamed F Mokbel, Chi-Yin Chow, and Walid G Aref. The new Casper: query processing for location services without compromising privacy. In VLDB, pages 763–774, 2006.
[82] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system. Bitcoin.org, 2008.
[83] Kien Nguyen, Gabriel Ghinita, Muhammad Naveed, and Cyrus Shahabi. A privacy-preserving, accountable and spam-resilient geo-marketplace. In ACM SIGSPATIAL’19, pages 299–308, Chicago, IL, USA, 2019.
[84] Kien Nguyen, John Krumm, and Cyrus Shahabi. Spatial privacy pricing: The interplay between privacy, utility and price in geo-marketplaces. In Proceedings of the 28th International Conference on Advances in Geographic Information Systems, pages 263–272, 2020.
[85] Kien Nguyen, John Krumm, and Cyrus Shahabi. Quantifying intrinsic value of information of trajectory from the owner’s perspective. In To appear in proceedings of the 29th International Conference on Advances in Geographic Information Systems, 2021.
[86] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75–84. ACM, 2007.
[87] Kobbi Nissim, Salil Vadhan, and David Xiao. Redrawing the boundaries on purchasing data from privacy-sensitive individuals. In ITCS’14, pages 411–422, 2014.
[88] Benedikt Notheisen, Jacob Benjamin Cholewa, and Arun Prasad Shanmugam. Trading real-world assets on blockchain. Business & Information Systems Engineering, 59(6):425–440, 2017.
[89] Gang Pan, Guande Qi, Zhaohui Wu, Daqing Zhang, and Shijian Li. Land-use classification using taxi gps traces. IEEE Transactions on Intelligent Transportation Systems, 14(1):113–123, 2012.
[90] Torben Pryds Pedersen. Non-interactive and information-theoretic secure verifiable secret sharing. In Joan Feigenbaum, editor, Advances in Cryptology – CRYPTO ’91, pages 129–140, Berlin, Heidelberg, 1992. Springer Berlin Heidelberg.
[91] Tao Peng, Qin Liu, and Guojun Wang. Enhanced location privacy preserving scheme in location-based services. IEEE Systems Journal, 11(1):219–230, 2014.
[92] Huy Pham, Cyrus Shahabi, and Yan Liu. Inferring social strength from spatiotemporal data. ACM Trans. Database Syst., 41(1):7:1–7:47, March 2016.
[93] Luiz Guilherme Pitta and Markus Endler. Market design for iot data and services the emergent 21th century commodities. In 2018 IEEE Symposium on Computers and Communications (ISCC), pages 00410–00415. IEEE, 2018.
[94] NCO PNT. Gps accuracy, 2021.
[95] Vincent Primault, Antoine Boutet, Sonia Ben Mokhtar, and Lionel Brunie. The long road to computational location privacy: A survey. IEEE Communications Surveys & Tutorials, 21(3):2772–2793, 2018.
[96] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1), 1986.
[97] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
[98] Mahmoud Sakr. A data model and algorithms for a spatial data marketplace. International Journal of Geographical Information Science, 32(11):2140–2168, 2018.
[99] Patrick Schober, Christa Boer, and Lothar A Schwarte. Correlation coefficients: appropriate use and interpretation. Anesthesia & Analgesia, 2018.
[100] Fabian Schomm, Florian Stahl, and Gottfried Vossen. Marketplaces for data: an initial survey. ACM SIGMOD Record, 42(1):15–26, 2013.
[101] Konstantin M Seiler, Hanna Kurniawati, and Surya PN Singh. An online and approximate solver for pomdps with continuous action space.
In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 2290–2297. IEEE, 2015.
[102] Claude Elwood Shannon and Warren Weaver. A mathematical theory of communication, 1948.
[103] Yao Shen, Liusheng Huang, Lu Li, Xiaorong Lu, Shaowei Wang, and Wei Yang. Towards preserving worker location privacy in spatial crowdsourcing. In 2015 IEEE global communications conference (GLOBECOM), pages 1–6. IEEE, 2015.
[104] Reza Shokri, George Theodorakopoulos, Jean-Yves Le Boudec, and Jean-Pierre Hubaux. Quantifying location privacy. In 2011 IEEE symposium on security and privacy, pages 247–262. IEEE, 2011.
[105] Sahil Singla. The price of information in combinatorial optimization. In 29th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2523–2532. SIAM, 2018.
[106] SNAP. Snap gowalla dataset. https://snap.stanford.edu/data/loc-Gowalla.html [January 11, 2019].
[107] Agusti Solanas and Antoni Martínez-Ballesté. A ttp-free protocol for location privacy in location-based services. Computer Communications, 31(6):1181–1191, 2008.
[108] Dawn Xiaoding Song, David Wagner, and Adrian Perrig. Practical techniques for searches on encrypted data. In 2000 IEEE Symposium on Security and Privacy, pages 44–55. IEEE, 2000.
[109] Florian Stahl, Fabian Schomm, and Gottfried Vossen. The data marketplace survey revisited. Technical report, ERCIS Working Paper, 2014.
[110] Jacopo Staiano, Nuria Oliver, Bruno Lepri, Rodrigo de Oliveira, Michele Caraviello, and Nicu Sebe. Money walks: a human-centric study on the economics of personal mobile data. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 583–594, 2014.
[111] Emil Stefanov, Charalampos Papamanthou, and Elaine Shi. Practical dynamic searchable encryption with small leakage. In NDSS, volume 71, pages 72–75, 2014.
[112] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan.
Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM ’01, pages 149–160, New York, NY, USA, 2001. ACM.
[113] Zachary N Sunberg and Mykel J Kochenderfer. Online algorithms for pomdps with continuous state, action, and observation spaces. In ICAPS’18, 2018.
[114] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 2002.
[115] Nick Szabo. Smart contracts: building blocks for digital markets. EXTROPY: The Journal of Transhumanist Thought, (16), 1996.
[116] Hien To, Liyue Fan, Luan Tran, and Cyrus Shahabi. Real-time task assignment in hyper-local spatial crowdsourcing under budget constraints. In PerCom. IEEE, 2016.
[117] Hien To, Gabriel Ghinita, and Cyrus Shahabi. A framework for protecting worker location privacy in spatial crowdsourcing. VLDB, 7(10):919–930, 2014.
[118] Hien To and Cyrus Shahabi. Location privacy in spatial crowdsourcing. In Handbook of mobile data privacy, pages 167–194. Springer, 2018.
[119] Hien To, Cyrus Shahabi, and Leyla Kazemi. A server-assigned spatial crowdsourcing framework. TSAS, 1(1):2, 2015.
[120] Eran Toch, Justin Cranshaw, Paul Hankes Drielsma, Janice Y Tsai, Patrick Gage Kelley, James Springfield, Lorrie Cranor, Jason Hong, and Norman Sadeh. Empirical models of privacy in location sharing. In UbiComp, pages 129–138. ACM, 2010.
[121] H Van Dyke Parunak and Sven Brueckner. Entropy and self-organization in multi-agent systems. In AAMAS, pages 124–130. ACM, 2001.
[122] Daksh Varshneya and G Srinivasaraghavan. Human trajectory prediction using spatially aware deep attention models. arXiv preprint 1705.09436, 2017.
[123] Idalides J Vergara-Laurens and Miguel A Labrador. Preserving privacy while reducing power consumption and information loss in lbs and participatory sensing applications. In 2011 IEEE GLOBECOM Workshops, pages 1247–1252. IEEE, 2011.
[124] Khuong Vu, Rong Zheng, and Jie Gao.
Efficient algorithms for k-anonymous location privacy in participatory sensing. In 2012 Proceedings IEEE INFOCOM, pages 2399–2407. IEEE, 2012.
[125] Hanbiao Wang, Kung Yao, Greg Pottie, and Deborah Estrin. Entropy-based sensor selection heuristic for target localization. In IPSN, pages 36–45. ACM, 2004.
[126] Yu Wang, Dingbang Xu, Xiao He, Chao Zhang, Fan Li, and Bin Xu. L2p2: Location-aware location privacy protection for location-based services. In 2012 Proceedings IEEE INFOCOM, pages 1996–2004. IEEE, 2012.
[127] Yuan H Wang. On the number of successes in independent trials. Statistica Sinica, pages 295–312, 1993.
[128] Jianhao Wei, Yaping Lin, Xin Yao, and Jin Zhang. Differential privacy-based location protection in spatial crowdsourcing. IEEE Transactions on Services Computing, 2019.
[129] Yonghui Xiao and Li Xiong. Protecting locations with differential privacy under temporal correlations. In CCS, pages 1298–1309. ACM, 2015.
[130] Yonghui Xiao, Li Xiong, Si Zhang, and Yang Cao. Loclok: Location cloaking with differential privacy via hidden markov model. VLDB Endowment, 10(12):1901–1904, 2017.
[131] Toby Xu and Ying Cai. Feeling-based location privacy protection for location-based services. In CCS, pages 348–357. ACM, 2009.
[132] Keiji Yanai, Hidetoshi Kawakubo, and Bingyu Qiu. A visual analysis of the relationship between word concepts and geographical locations. In CIVR, page 13. ACM, 2009.
[133] Mingxuan Yue, Yaguang Li, Haoze Yang, Ritesh Ahuja, Yao-Yi Chiang, and Cyrus Shahabi. Detect: Deep trajectory clustering for mobility-behavior analysis. In 2019 IEEE International Conference on Big Data (Big Data), pages 988–997. IEEE, 2019.
[134] Ce Zhang, Cheng Xu, Jianliang Xu, Yuzhe Tang, and Byron Choi. Gem2-tree: A gas-efficient structure for authenticated range queries in blockchain. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 2018.
[135] Shuyuan Zheng, Yang Cao, and Masatoshi Yoshikawa.
Money cannot buy everything: Trading mobile data with controllable privacy loss. In 2020 21st IEEE International Conference on Mobile Data Management (MDM), pages 29–38. IEEE, 2020.
[136] Yu Zheng. Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology (TIST), 6(3):1–41, 2015.
[137] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. Mining interesting locations and travel sequences from gps trajectories. In WWW’09, pages 791–800, 2009.
[138] Guy Zyskind, Oz Nathan, et al. Decentralizing privacy: Using blockchain to protect personal data. In 2015 Security and Privacy Workshops, pages 180–184. IEEE, 2015.

Appendices

A Proof of Theorems

A.1 Proof of Theorem 1

We prove the bound on global sensitivity for two cases, $c = 1$ and $c > 1$. When $c = 1$, the maximum change of location entropy is clearly $\log 2$. When $c > 1$, $\log\left(1 + \frac{1}{\exp(H(\mathcal{C} \setminus c_u))}\right)$ decreases as $n$ increases. Thus, it is maximized when $n = 1$ and the maximum change is $\log 2$. We also have:

$$\left(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\right)'_n = \frac{1}{n-1} - \frac{1}{n-1+C} - \frac{C\log C}{(n-1+C)^2}$$

$$= \frac{(n-1+C)^2 - (n-1)(n-1+C) - (n-1)C\log C}{(n-1)(n-1+C)^2}$$

$$= \frac{C^2 + (n-1)C - (n-1)C\log C}{(n-1)(n-1+C)^2}$$

Then,

$$\left(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\right)'_n \ge 0 \iff C^2 + (n-1)C - (n-1)C\log C \ge 0 \iff C \ge (n-1)(\log C - 1)$$

If $\log C - 1 > 0$, that is, $C$ is greater than the base of the logarithm (which is $e$ in this work), then $\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C$ is maximized at $n = \frac{C}{\log C - 1} + 1$, and:

$$\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C = \log\frac{\frac{C}{\log C - 1}}{\frac{C}{\log C - 1} + C} + \frac{C}{\frac{C}{\log C - 1} + C}\log C$$

$$= \log\frac{C}{C + C\log C - C} + \frac{C(\log C - 1)\log C}{C\log C}$$

$$= \log\frac{1}{\log C} + \log C - 1$$

$$= \log C - \log(\log C) - 1$$

If $\log C - 1 < 0$, $\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C$ always increases with $n$. We have:

$$\lim_{n\to\infty}\left(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C\right) = 0 \;\Rightarrow\; \log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C < 0$$

Similarly, we can prove that:

$$\max_n\left(\log\frac{n}{n+C} + \frac{C}{n+C}\log C\right) = \log C - \log(\log C) - 1$$

A.2 Proof of Theorem 2

In this section, we derive the bound for the sensitivity of location entropy of a particular location $l$ when the data of a user is removed from the dataset.
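For convenience, let $n$ be the number of users visiting $l$, $n = |U_l|$, and $c_u = |O_{l,u}|$. Let $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ be the set of numbers of visits of all users to location $l$. Let $S = |O_l| = \sum_u c_u$. Let $S_u = S - c_u$ be the sum of numbers of visits of all users to location $l$ after removing user $u$. From Equation 3.1, we have:

$$H(l) = H(O_l) = H(\mathcal{C}) = -\sum_u \frac{c_u}{S}\log\frac{c_u}{S}$$

By removing a user $u$, the entropy of location $l$ becomes:

$$H(\mathcal{C}\setminus c_u) = -\sum_{u' \neq u} \frac{c_{u'}}{S_u}\log\frac{c_{u'}}{S_u}$$

From Equation 3.4, we have:

$$H(\mathcal{C}) = \frac{S_u}{S}H(\mathcal{C}\setminus c_u) + H\left(\frac{S_u}{S}, \frac{c_u}{S}\right)$$

Subsequently, the change of location entropy when a user $u$ is removed is:

$$\Delta H_u = H(\mathcal{C}\setminus c_u) - H(\mathcal{C}) = \frac{c_u}{S}H(\mathcal{C}\setminus c_u) - H\left(\frac{S_u}{S}, \frac{c_u}{S}\right) \qquad \text{(A.1)}$$

Taking the derivative of $\Delta H_u$ w.r.t. $c_u$:

$$(\Delta H_u)'_{c_u} = \frac{S - c_u}{S^2}H(\mathcal{C}\setminus c_u) + \left(\frac{S_u}{S}\log\frac{S_u}{S} + \frac{c_u}{S}\log\frac{c_u}{S}\right)'_{c_u}$$

$$= \frac{S_u}{S^2}H(\mathcal{C}\setminus c_u) - \frac{S_u}{S^2}\log\frac{S_u}{S} - \frac{S_u}{S}\cdot\frac{S_u/S^2}{S_u/S} + \frac{S_u}{S^2}\log\frac{c_u}{S} + \frac{c_u}{S}\cdot\frac{S_u/S^2}{c_u/S}$$

$$= \frac{S_u}{S^2}H(\mathcal{C}\setminus c_u) - \frac{S_u}{S^2}\log\frac{S_u}{S} - \frac{S_u}{S^2} + \frac{S_u}{S^2}\log\frac{c_u}{S} + \frac{S_u}{S^2}$$

$$= \frac{S_u}{S^2}\left(H(\mathcal{C}\setminus c_u) - \log\frac{S_u}{c_u}\right)$$

We have:

$$(\Delta H_u)'_{c_u} = 0 \iff H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c_u}$$

Therefore, $\Delta H_u$ decreases when $c_u \le c_u^*$, and increases when $c_u^* < c_u$, where $c_u^*$ is the value such that $H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c_u^*}$. In addition, because $H(\mathcal{C}\setminus c_u) \le \log(n-1)$ and $n - 1 < S_u$ when $C > 1$,

$\Rightarrow \Delta H_u < 0$ when $c_u = 1$

$\Rightarrow |\Delta H_u|$ is maximized when $c_u = c_u^*$ or $c_u = C$.

Case 1: $c_u = C$. If $\Delta H_u < 0$, $|\Delta H_u|$ is maximized when $c_u = c_u^*$. If $\Delta H_u > 0$, we have:

$$0 \le \frac{c_u}{S}H(\mathcal{C}\setminus c_u) - H\left(\frac{S_u}{S}, \frac{c_u}{S}\right) \le \frac{C}{S}\log(n-1) - H\left(\frac{S_u}{S}, \frac{C}{S}\right)$$

Taking the derivative of the right side w.r.t. $S_u$:

$$\left(\frac{C}{S}\log(n-1) - H\left(\frac{S_u}{S}, \frac{C}{S}\right)\right)'_{S_u} = \left(\frac{C}{S_u + C}\log(n-1) - H\left(\frac{S_u}{S_u + C}, \frac{C}{S_u + C}\right)\right)'_{S_u}$$

$$= \left(\frac{C}{S_u + C}\log(n-1) + \frac{S_u}{S_u + C}\log\frac{S_u}{S_u + C} + \frac{C}{S_u + C}\log\frac{C}{S_u + C}\right)'_{S_u}$$

$$= -\frac{C}{(S_u + C)^2}\log(n-1) + \frac{C}{(S_u + C)^2}\log\frac{S_u}{S_u + C} + \frac{C}{(S_u + C)^2} - \frac{C}{(S_u + C)^2}\log\frac{C}{S_u + C} - \frac{C}{(S_u + C)^2}$$

$$= \frac{C}{(S_u + C)^2}\left(-\log(n-1) + \log\frac{S_u}{C}\right)$$

$$= \frac{C}{(S_u + C)^2}\log\frac{S_u}{(n-1)C} \le 0$$

$\Rightarrow \Delta H_u$ is maximized when $S_u$ is minimized.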
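$\Rightarrow$ When $c_u = C$ and $\Delta H_u > 0$:

$$\Delta H_u \le \frac{C}{n-1+C}\log(n-1) - H\left(\frac{n-1}{n-1+C}, \frac{C}{n-1+C}\right)$$

$$= \frac{C}{n-1+C}\log(n-1) + \frac{n-1}{n-1+C}\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log\frac{C}{n-1+C}$$

$$= \log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C$$

Case 2: $c_u = c_u^*$:

$$\Delta H_u = \frac{c_u}{S}\log\frac{S_u}{c_u} - H\left(\frac{S_u}{S}, \frac{c_u}{S}\right)$$

$$= \frac{c_u}{S}\log\frac{S_u}{c_u} + \frac{S_u}{S}\log\frac{S_u}{S} + \frac{c_u}{S}\log\frac{c_u}{S}$$

$$= \frac{c_u}{S}\log\frac{S_u}{S} + \frac{S_u}{S}\log\frac{S_u}{S} = \log\frac{S_u}{S} = -\log\frac{S}{S_u} = -\log\frac{S_u + c_u}{S_u}$$

$$= -\log\left(1 + \frac{c_u}{S_u}\right) = -\log\left(1 + \frac{1}{S_u / c_u}\right) = -\log\left(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\right)$$

$\Rightarrow |\Delta H_u|$ is maximized when $H(\mathcal{C}\setminus c_u)$ is minimized.

Lemma 1. Given a set of $n$ numbers, $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$, $1 \le c_i \le C$, the entropy $H(\mathcal{C})$ is minimized when $c_i = 1$ or $c_i = C$, for all $i = 1, \ldots, n$.

Proof. When the value of a number $c_u$ is changed and the others are fixed, from Equation A.1:

$$H(\mathcal{C})'_{c_u} = -(\Delta H_u)'_{c_u} = \frac{S_u}{S^2}\left(\log\frac{S_u}{c_u} - H(\mathcal{C}\setminus c_u)\right)$$

Therefore, $H(\mathcal{C})$ increases when $c_u \le c_u^*$, and decreases when $c_u^* < c_u$, where $c_u^*$ is the value such that $H(\mathcal{C}\setminus c_u) = \log\frac{S_u}{c_u^*}$.

$\Rightarrow H(\mathcal{C})$ is minimized when $c_u = 1$ or $c_u = C$.

Lemma 2 (Minimum Entropy). Given a set of $n$ numbers, $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$, $1 \le c_i \le C$, the minimum entropy is $H(\mathcal{C}) = \log n - \frac{\log C}{C-1} + \log\frac{\log C}{C-1} + 1$.

Proof. Using Lemma 1, the entropy $H(\mathcal{C})$ is minimized when $\mathcal{C} = \{\underbrace{1, \ldots, 1}_{n-k \text{ times}}, \underbrace{C, \ldots, C}_{k \text{ times}}\}$. Let $S = \sum_i c_i = n - k + kC$. We have:

$$H(\mathcal{C}) = -\frac{n-k}{S}\log\frac{1}{S} - \frac{kC}{S}\log\frac{C}{S}$$

$$= \frac{n-k}{S}\log S + \frac{kC}{S}\log S - \frac{kC}{S}\log C$$

$$= \log S - \frac{kC}{S}\log C$$

We have $S'_k = C - 1$. Taking the derivative of $H(\mathcal{C})$ w.r.t. $k$, we have:

$$(H(\mathcal{C}))'_k = \frac{C-1}{S} - \frac{S - k(C-1)}{S^2}C\log C$$

$$= \frac{1}{S^2}\left(S(C-1) - (S - kC + k)C\log C\right)$$

$$= \frac{1}{S^2}\left((n - k + kC)(C-1) - (n - k + kC - kC + k)C\log C\right)$$

$$= \frac{1}{S^2}\left(nC - n - kC + k + kC^2 - kC - nC\log C\right)$$

$$(H(\mathcal{C}))'_k = 0 \iff k = \frac{n(C\log C - C + 1)}{(C-1)^2}$$

When $k = 0$ or $k = n$, $H(\mathcal{C})$ is maximized, which means $H(\mathcal{C})$ is minimized when $k = \frac{n(C\log C - C + 1)}{(C-1)^2}$. Since $k$ must be an integer, $k$ would be chosen as $\lfloor k \rfloor$ or $\lfloor k \rfloor + 1$, whichever yields the smaller value of $H(\mathcal{C})$.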
$$S = n - k + kC = n + k(C-1) = n + \frac{n(C\log C - C + 1)}{C-1} = \frac{nC - n + nC\log C - nC + n}{C-1} = \frac{nC\log C}{C-1}$$

$$\frac{kC}{S}\log C = \frac{n(C\log C - C + 1)}{(C-1)^2}\cdot C\cdot\frac{C-1}{nC\log C}\cdot\log C = \frac{C\log C - C + 1}{C-1}$$

$$H(\mathcal{C}) = \log\frac{nC\log C}{C-1} - \frac{C\log C - C + 1}{C-1}$$

$$= \log n + \log\frac{C\log C}{C-1} - \frac{C\log C}{C-1} + 1$$

$$= \log n - \frac{\log C}{C-1} + \log\frac{\log C}{C-1} + 1$$

Using Lemma 2, when $c_u = c_u^*$:

$$|\Delta H_u| \le \log\left(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\right)$$

where $H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\frac{\log C}{C-1} + 1$.

Thus, the maximum change of location entropy when a user is removed equals:
• $\log\frac{n}{n-1}$ when $C = 1$
• $\max\left(\log\frac{n-1}{n-1+C} + \frac{C}{n-1+C}\log C,\; \log\left(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\right)\right)$, where $H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\frac{\log C}{C-1} + 1$, when $n > 1$, $c > 1$.

Similarly, the maximum change of location entropy when a user is added equals:
• $\log\frac{n+1}{n}$ when $C = 1$
• $\max\left(\log\frac{n}{n+C} + \frac{C}{n+C}\log C,\; \log\left(1 + \frac{1}{\exp(H(\mathcal{C}\setminus c_u))}\right)\right)$, where $H(\mathcal{C}\setminus c_u) = \log(n-1) - \frac{\log C}{C-1} + \log\frac{\log C}{C-1} + 1$, when $n > 1$, $c > 1$.

Thus, we have the proof for Theorem 2.

A.3 Proof of Theorem 3

In this section, we prove Theorem 3, that is, that Algorithm 2 satisfies $\epsilon$-differential privacy. We prove the theorem for the case when a user is removed from the database. The case when a user is added is similar.

Let $l$ be an arbitrary location and $\mathcal{A}_l : L \to \mathbb{R}$ be Algorithm 2 when only location $l$ is considered for perturbation. Let $O_{l(\mathrm{org})}$ be the original set of observations at location $l$; $O_{l,u(\mathrm{org})}$ be the original set of observations of user $u$ at location $l$; $O_{l,u}$ be the set of observations after limiting $c_u$ to $C$ and limiting each user to at most $M$ locations; $O_l$ be the set of all $O_{l,u}$ for all users $u$ that visit location $l$; and $\mathcal{C}$ be the set of numbers of visits. Let $O_l \setminus O_{l,u}$ be the set of observations at location $l$ when a user $u$ is removed from the dataset. Let $b = \frac{M\Delta H}{\epsilon}$.
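If $O_{l,u(\mathrm{org})} = \emptyset$:

$$\frac{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})}) = t_l]}{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})} \setminus O_{l,u(\mathrm{org})}) = t_l]} = 1$$

If $O_{l,u(\mathrm{org})} \neq \emptyset$:

$$\frac{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})}) = t_l]}{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})} \setminus O_{l,u(\mathrm{org})}) = t_l]} = \frac{\Pr[\mathcal{A}_l(O_l) = t_l]}{\Pr[\mathcal{A}_l(O_l \setminus O_{l,u}) = t_l]} = \frac{\Pr[H(\mathcal{C}) + \mathrm{Lap}(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + \mathrm{Lap}(b) = t_l]}$$

If $H(\mathcal{C}) \le H(\mathcal{C}\setminus c_{l,u})$:

$$\frac{\Pr[H(\mathcal{C}) + \mathrm{Lap}(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + \mathrm{Lap}(b) = t_l]} = \frac{\Pr[H(\mathcal{C}) + \mathrm{Lap}(b) = t_l]}{\Pr[H(\mathcal{C}) + \Delta H(l) + \mathrm{Lap}(b) = t_l]} = \frac{\Pr[\mathrm{Lap}(b) = t_l - H(\mathcal{C})]}{\Pr[\mathrm{Lap}(b) = t_l - H(\mathcal{C}) - \Delta H(l)]}$$

$$= \exp\left(\frac{1}{b}\left(-|t_l - H(\mathcal{C})| + |t_l - H(\mathcal{C}) - \Delta H(l)|\right)\right) \le \exp\left(\frac{|\Delta H(l)|}{b}\right) \le \exp\left(\frac{\Delta H}{b}\right) = \exp\left(\frac{\epsilon}{M}\right)$$

If $H(\mathcal{C}) > H(\mathcal{C}\setminus c_{l,u})$, similarly:

$$\frac{\Pr[H(\mathcal{C}) + \mathrm{Lap}(b) = t_l]}{\Pr[H(\mathcal{C}\setminus c_{l,u}) + \mathrm{Lap}(b) = t_l]} \le \exp\left(\frac{\epsilon}{M}\right)$$

Similarly, we can prove that:

$$\frac{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})} \setminus O_{l,u(\mathrm{org})}) = t_l]}{\Pr[\mathcal{A}_l(O_{l(\mathrm{org})}) = t_l]} \le \exp\left(\frac{\epsilon}{M}\right)$$

Therefore, $\mathcal{A}_l$ satisfies 0-differential privacy when $O_{l,u(\mathrm{org})} = \emptyset$, and satisfies $\frac{\epsilon}{M}$-differential privacy when $O_{l,u(\mathrm{org})} \neq \emptyset$.

For all locations, let $\mathcal{A} : L \to \mathbb{R}^{|L|}$ be Algorithm 2. Let $L_1$ be any subset of $L$. Let $T = \{t_1, t_2, \ldots, t_{|L_1|}\} \in \mathrm{Range}(\mathcal{A})$ denote an arbitrary possible output. Thus, $\mathcal{A}$ is the composition of all $\mathcal{A}_l$, $l \in L$. Let $L(u)$ be the set of all locations that $u$ visits, $|L(u)| \le M$. Applying composition theorems [34] for all $\mathcal{A}_l$, where $\mathcal{A}_l$ satisfies 0-differential privacy when $l \notin L_1 \cap L(u)$ and satisfies $\frac{\epsilon}{M}$-differential privacy when $l \in L_1 \cap L(u)$, we have that $\mathcal{A}$ satisfies $\epsilon$-differential privacy.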
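The perturbation analyzed in A.3 can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the thesis's Algorithm 2: it assumes that the global sensitivity $\Delta H$ is the bound suggested by the two cases in A.1, read here as $\max(\log 2, \log C - \log(\log C) - 1)$ with natural logarithms, and it only clamps per-user visit counts to $C$; the cap of $M$ locations per user is assumed to have been applied upstream. All function and parameter names are hypothetical.

```python
import math
import random


def location_entropy(visits):
    """H(C) = -sum_u (c_u / S) * log(c_u / S) over per-user visit counts."""
    total = sum(visits)
    return -sum((c / total) * math.log(c / total) for c in visits if c > 0)


def global_sensitivity(max_visits):
    """Assumed reading of the A.1 bound: max(log 2, log C - log(log C) - 1)."""
    if max_visits <= 1:
        return math.log(2)
    log_c = math.log(max_visits)
    return max(math.log(2), log_c - math.log(log_c) - 1)


def laplace_noise(scale, rng):
    """Sample Lap(scale): the difference of two Exp(1) draws is Laplace(0, 1)."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_location_entropy(visits, max_visits, max_locations, epsilon, rng):
    """Clamp each c_u to C, then add Laplace noise with b = M * DeltaH / epsilon."""
    clamped = [min(c, max_visits) for c in visits]
    scale = max_locations * global_sensitivity(max_visits) / epsilon
    return location_entropy(clamped) + laplace_noise(scale, rng)
```

Calling `noisy_location_entropy([3, 1, 250, 7], max_visits=100, max_locations=5, epsilon=1.0, rng=random.Random(0))` returns one noisy entropy value for a single location. Passing the random generator explicitly keeps runs reproducible, which is convenient for experiments but would not be appropriate for a real privacy-preserving release.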
Abstract
The advance of modern mobile communication devices has enabled people to easily create, consume, and share information about every aspect of their lives from almost anywhere at any time. The location-tracking capability of these devices allows individuals to generate real-time geospatial data, which plays an important role in many different applications. However, in current practice, individuals typically provide their geospatial data for free in exchange for using services. The massive amount of collected geospatial data is then used to fund service development through other means, such as targeted advertising or even selling the data to other businesses.

However, information about the locations of individuals can have serious privacy implications: by linking locations with other data sources, important and potentially sensitive information about individuals can be revealed, such as their home, health conditions, and religious or political preferences. Without proper privacy protection, malicious adversaries can execute a wide range of physical and virtual attacks, such as physical surveillance or stalking. While there is some regulation of location data collection and sharing, current practice is far from ideal for individuals to safely share their location information with service providers.

An emerging alternative to current practice is to allow individuals to offer their location data through data marketplaces. We call these marketplaces geo-marketplaces. Geo-marketplaces raise a number of interesting issues about data ownership, utility, pricing, and privacy. In this thesis, we focus on the interplay between utility, privacy, and pricing of geospatial data in various settings of geo-marketplaces. More specifically, two important aspects of geo-marketplaces are at the center of interest: location privacy and pricing for various types of location data. Location privacy is essential for geo-marketplaces due to the sensitivity of location data. Pricing is crucial to make geo-marketplaces viable, especially when geospatial data can come in many different forms. Thus, geo-marketplaces require efficient algorithms for selling different types of geospatial data with alternative pricing strategies while protecting sellers' location privacy.

This thesis aims to enable geo-marketplaces that meet those requirements by investigating different settings of data types and pricing strategies in a geo-marketplace, along with privacy considerations. These settings include (a) differentially private data point release with free query, where some quantities derived from individuals' data points can be released for free with strong privacy protection; (b) encrypted data point release with fixed-price query, where location data points or geo-tagged data objects are sold for a fixed price with their locations advertised in encrypted space; (c) noisy data point release with variable-price query, where a location data point can be sold at different prices depending on how much noise is added; and (d) degraded trajectory release with variable-price query, where a trajectory can be released or sold at different prices depending on how it is degraded. In each setting, we design the marketplace and develop principled methods to price data, protect the privacy of owners, and maintain the utility of data for buyers in order to enable a viable geo-marketplace. In all settings, our proposed designs and methods are evaluated by extensive experiments on large real-world datasets to show their practicality.

With a wide range of novel settings and practical applications of geo-marketplaces developed in this thesis, we make important steps toward the realization and adoption of geo-marketplaces. This will enable new ways for individuals to interact with Internet and mobile services, take control of the collection, usage, and sharing of their geospatial data, and receive appropriate reward for their contribution; and for service providers and other organizations to leverage individuals' valuable data while respecting their privacy.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Differentially private learned models for location services
Location privacy in spatial crowdsourcing
Privacy in location-based applications: going beyond K-anonymity, cloaking and anonymizers
Mechanisms for co-location privacy
Location-based spatial queries in mobile environments
Practice-inspired trust models and mechanisms for differential privacy
Responsible AI in spatio-temporal data processing
Efficient crowd-based visual learning for edge devices
Ensuring query integrity for spatial data in the cloud
Enabling spatial-visual search for geospatial image databases
Efficient indexing and querying of geo-tagged mobile videos
Generalized optimal location planning
Modeling intermittently connected vehicular networks
Enabling query answering in a trustworthy privacy-aware spatial crowdsourcing
Inferring mobility behaviors from trajectory datasets
GeoCrowd: a spatial crowdsourcing system implementation
Deriving real-world social strength and spatial influence from spatiotemporal data
A function approximation view of database operations for efficient, accurate, privacy-preserving & robust query answering with theoretical guarantees
Optimizing privacy-utility trade-offs in AI-enabled network applications
Security and privacy in information processing
Asset Metadata
Creator
Nguyen, Kien Duy (author)
Core Title
Privacy-aware geo-marketplaces
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2021-12
Publication Date
09/13/2021
Defense Date
08/16/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
blockchain, data release, differential privacy, Gaussian process, geo-marketplace, location data, location entropy, location privacy, OAI-PMH Harvest, privacy, searchable encryption, spatial privacy pricing, trajectories, value of information
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Shahabi, Cyrus (committee chair), Krishnamachari, Bhaskar (committee member), Krumm, John (committee member), Kuhn, Peter (committee member)
Creator Email
duykienvp@gmail.com, kien.nguyen@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15909835
Unique identifier
UC15909835
Legacy Identifier
etd-NguyenKien-10057
Document Type
Dissertation
Rights
Nguyen, Kien Duy
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu