Location privacy in spatial crowdsourcing
LOCATION PRIVACY IN SPATIAL CROWDSOURCING

by Hien To

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2018

Copyright 2018 Hien To

Dedication

I would like to dedicate my thesis to my beloved parents, Thuc To and Lan Do, for nurturing me with affection and love, and for being my first teachers.

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Cyrus Shahabi. He has been very supportive throughout my PhD studies and was always available to discuss my research projects. He helped me to identify the thesis topic and guided me through countless challenges: how to identify a legitimate research problem, how to validate the problem to make sure that it has real impact, how to propose a baseline solution and iteratively improve it, and how to derive analytical and/or empirical results. Prof. Shahabi always emphasizes the importance of conducting systematic experiments on real-world datasets. He showed me how to present my research results in a paper as well as to other people, and encouraged me to go to conferences to connect with other researchers. He taught me how to review papers and gave me opportunities to participate in professional services, such as reviewing papers/journals and organizing conferences/workshops. Without his persistent guidance, I would not have accomplished this thesis.

Special thanks go to my many mentors: Prof. Gabriel Ghinita, for advising me during my first two years working on data privacy; Prof. Liyue Fan, for supporting me on the different research topics that I proposed; and Prof. Li Xiong, who gave me much valuable feedback at different stages of my PhD career. A special mention goes to Dr. Seon Ho Kim, Dr. Farnoush Banaei-Kashan and Dr. Luciano Nocera.
It was fantastic having the opportunity to work with them on IMSC projects, including iCampus, iWatch, and MediaQ. In particular, five years working with Dr. Kim and the MediaQ team was truly a blast. I am also grateful for the support of the InfoLabers at USC: Ritesh Ahuja, Mingxuan Yue, Dimitris Stripelis, Chrysovalantis Anastasiou, Kien Nguyen, Minh Nguyen, Luan Tran, Rose Yu, Giorgos Constantinou, Ying Lu, Abdullah Alfarrarjeh, Dingxiong Deng, Mohammad Asghari, Afsin Akdogan, Huy Pham, Bei Pan, Houtan Shirani-Mehr, Ali Khodaei, Leyla Kazemi, Ugur Demiryurek and Ling Hu. It was a blessing to share the lab with all of you. I would also like to thank the following USC staff: Daisy Tang, Lizsl A. De Leon, Tracy Charles and Melinda Ballengee, for their unfailing support and assistance during my time at USC. I would like to thank the VEF staff, Dr. Lynne McNamara, Dr. Phuong Nguyen, Ms. Sandarshi Gunawardena, Ms. Sandy Dang, and many others, who gave me the fellowship to go to the US and pursue my dream. Last but not least, I am grateful for the support of my bride Minh Nguyen, my two sisters, Huong To and Huyen To, and my other family members and friends, who have supported me throughout my life. I could not be more thankful! Thanks for all your encouragement!

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Motivation
  1.2 Thesis Statement
  1.3 Privacy-Preserving Task Assignment
2 Background in Spatial Crowdsourcing
  2.1 Taxonomy of Spatial Crowdsourcing
  2.2 Comparison to Other Fields of Study
3 Survey of Related Work
  3.1 Spatial Crowdsourcing
  3.2 Privacy Threats
    3.2.1 Threat Model
    3.2.2 Case Study of TaskRabbit
  3.3 Privacy Countermeasures
    3.3.1 Location Privacy
    3.3.2 Location Privacy in Spatial Crowdsourcing
4 Privacy-Preserving Task Assignment
  4.1 Preliminaries
    4.1.1 Task Assignment: The Focus of SC
    4.1.2 Mathematically Rigorous Definitions of Privacy
    4.1.3 Private Spatial Decompositions (PSD)
  4.2 Protecting Locations of Workers Using Trusted Third Party
    4.2.1 Privacy Framework
    4.2.2 Protection for Static Workers’ Locations
    4.2.3 Protection for Dynamic Workers’ Locations
    4.2.4 Performance Evaluation
    4.2.5 PrivGeoCrowd: A Tool for Tuning Parameters
  4.3 Protecting Locations of Workers and Tasks Without Trusted Third Party
    4.3.1 The SCGuard Privacy Framework
    4.3.2 Online Task Assignment
    4.3.3 Performance Evaluation
    4.3.4 Extensions and Open Problems
5 Conclusion
Reference List

List of Tables

1.1 Three stages of the privacy-aware framework.
3.1 Spatial crowdsourcing studies.
3.2 Attacks on SC users.
3.3 Three tasks requested by requester Alice.
3.4 Overview of problem focuses (Re: reporting, Ta: tasking); privacy techniques used (Ps: pseudonym, Cl: cloaking, Pt: perturbation, Ex: exchange-based, En: encryption-based); threats (W: worker, T: requester, S: server); trusted third party (TTP); optimization type (ST: single task, MT: multiple tasks). x and (x) represent primary and secondary aspects, respectively.
4.1 Summary of notations
4.2 Granularity m 2 and average count per cell n̄ (N 0 = 100)
4.3 Dataset characteristics
4.4 The average relative increase in percentage of different measurements when varying ε ∈ {0.1, 0.4, 0.7, 1}
4.5 Performance comparison with pull-mode benchmark.
4.6 Summary of notations.

List of Figures

1.1 The tasking phase of spatial crowdsourcing.
2.1 The focus of this thesis, spatial crowdsourcing, is shown in grey.
3.1 Threat models in spatial crowdsourcing.
3.2 Screenshots of TaskRabbit web application from worker Bob.
4.1 Privacy framework for spatial crowdsourcing
4.2 A snapshot of adaptive grid (ε = 0.5, α = 0.5)
4.3 Examples of partial cell selection.
4.4 Workflow for dynamic worker PSD computation
4.5 Two-level grid for dynamic worker PSD
4.6 The effect of f on probability p h
4.7 Comparison of GR construction heuristics by varying ε.
4.8 Average ASR over ε ∈ {0.1, 0.4, 0.7, 1}, varying k 2
4.9 Comparison of compactness-based heuristics by varying ε.
4.10 Overhead of privacy (G-GP-Hybrid) compared to non-private algorithm.
4.11 Performance of geocast algorithm (G-GP-Hybrid) when varying Acceptance Rate (Ye.-Linear).
4.12 Performance of geocast algorithm (G-GP-Hybrid) when varying number of workers required to complete a task (Ye.-Linear).
4.13 Performance overhead for multiple-snapshot worker PSD when varying privacy budget ε.
4.14 Varying budget split f.
4.15 AG Structure visualization for Gowalla dataset.
4.16 Varying number of timestamps T.
4.17 PrivGeoCrowd main GUI integrates several component module panels
4.18 Architecture of PrivGeoCrowd
4.19 The effect of customized granularity on GR (SF data)
4.20 The effect of small budget (ε = 0.1) (SF data)
4.21 The effect of worker density on GR size (Yelp data)
4.22 The effect of different heuristics on GR geometry
4.23 SCGuard: privacy-aware framework for spatial crowdsourcing.
4.24 Example of online tasking with three known workers. Each task arrives one-by-one in the order of t1 → t2 → t3. Figure 1.1 shows the exact layout of the workers and the tasks.
4.25 Estimation of the pdf of d(w,t) for each stage.
4.26 Distributions of d and Pr(d≤R w |d 0 ).
4.27 Accuracy of the baseline algorithm.
4.28 Pruning during U2U.
4.29 Comparison of the variants of the algorithms by varying privacy guarantee r.
4.30 Comparison of the algorithms by varying ε.
4.31 Performance of Probabilistic-Model by decreasing U2U threshold (α).
4.32 Performance of Probabilistic-Model by increasing U2E threshold (β).

Abstract

Spatial crowdsourcing (SC) is a new platform that engages individuals in collecting and analyzing environmental, social and other spatiotemporal information. With SC, requesters outsource their spatiotemporal tasks (tasks associated with location and time) to a set of workers, who will perform the tasks by physically traveling to the tasks’ locations.
However, current solutions require the workers, who in many cases are simply volunteering for a cause, to disclose their locations to untrustworthy entities. Revealing an individual’s location data to other entities may deter people from contributing to SC applications, thus rendering location privacy a critical obstacle to the growth of SC applications. This thesis aims to help SC companies (e.g., Uber, TaskRabbit) protect the location privacy of the users (both workers and requesters) participating in their SC platforms. To this end, we identify privacy threats toward both kinds of users during the two main phases of SC, tasking and reporting. Tasking (aka task assignment) is the process of identifying which tasks should be assigned to which workers, and is handled by an SC-server. The latter phase is reporting, in which workers travel to the tasks’ locations, complete the tasks and upload their reports to the server. Countermeasures for the reporting phase have been well studied; hence, we focus on privacy protection during the tasking phase. In order to protect privacy during tasking, the server must assign tasks to workers without having access to the raw locations of the users. We propose privacy-aware frameworks for task assignment in SC as follows.

We first focus on a tasking scenario where the tasks are public and the protection focus is on the workers’ locations. The first framework relies on a trusted third party (TTP) to sanitize the raw location data of the workers. We propose a mechanism based on differential privacy (a mathematically rigorous definition of privacy) and geocasting that achieves effective SC services while offering privacy guarantees to workers. We address scenarios with both static and dynamic (i.e., moving) datasets of workers. Next, we focus on a more realistic scenario where the location privacy of both workers and requesters’ tasks needs to be protected.
We introduce another framework that does not require the TTP; instead, the location data is perturbed locally. We propose a protocol based on geo-indistinguishability (a notion of location privacy based on differential privacy) and reachability that achieves effective SC services while offering privacy guarantees to the users. In both scenarios, we investigate analytical/empirical models and task assignment strategies that balance multiple crucial aspects of SC functionality, such as task completion rate, worker travel distance and system overhead.

Chapter 1 Introduction

1.1 Motivation

Smartphones have recently surpassed the PC as the device of choice for accessing the Web, and mobile phone subscriptions have grown from 5.9 billion worldwide at the end of 2011 [124] to 6.9 billion in 2014, which is 95.5% of the world population. There has also been a significant increase in mobile phone bandwidth: from 2.5G (up to 384 Kbps) to 3G (up to 14.7 Mbps) and recently 4G (up to 100 Mbps) [148]. The increase in computational and communication performance of mobile devices, coupled with the advances in sensor technology (e.g., video cameras, GPS, motion sensors), has led to an exponential growth in data collection and sharing by smartphones. An individual with a mobile phone can nowadays act as a multi-modal sensor, collecting and sharing various types of high-fidelity spatiotemporal data instantaneously (e.g., picture, video, audio, location, time, speed, direction, acceleration). Exploiting the mobility of such a large volume of potential users¹, a new mechanism for efficient and scalable data collection has emerged, named spatial crowdsourcing (SC) [95]. With spatial crowdsourcing, the goal is to outsource a set of spatial tasks (i.e., tasks related to a location) to a set of workers who perform the spatial tasks by physically traveling to those locations.
Spatial crowdsourcing has applications in numerous domains such as environmental sensing (iRain [131]), smart cities (Waze and TaskRabbit), journalism and crisis response (MediaQ [101]).

Effective assignment of tasks to workers is an important phase in spatial crowdsourcing: tasks should be completed in a timely fashion, and workers should not incur significant travel cost [95, 96, 40, 167]. Hence, the tasking phase requires workers to reveal their locations and requesters to disclose their tasks’ locations to potentially untrustworthy entities (e.g., the server). However, disclosing individual locations has serious privacy implications. First, leaked locations are often collected and shared without user consent [73, 84, 117], leading to a breach of sensitive information, including an individual’s health (e.g., presence in a cancer treatment center), alternative lifestyles, and political and religious preferences (e.g., presence in a church). Second, knowing user locations, an adversary can stage a broad spectrum of attacks such as physical surveillance, stalking and identity theft [149], leading to real-world consequences. These include discovering patterns of one-night stands from the trajectories of Uber riders [136], monitoring the locations of riders in real time for entertainment [71], tracking a journalist and finding her personal information [192], or stalking Waze users by generating fake events like accidents [187]. Thus, ensuring location privacy is an essential aspect of spatial crowdsourcing, because mobile users may not agree to engage in spatial crowdsourcing if their privacy is violated.

¹ We use “user” when referring to both worker and requester throughout the thesis.

1.2 Thesis Statement

In this thesis, we identify privacy as the major impediment to the success of any spatial crowdsourcing system.
Thus, we propose a server-assigned spatial crowdsourcing framework that enables the participation of individuals (both workers and requesters) without compromising their location privacy. In particular, our goal is to use rigorous privacy techniques in the task assignment phase of spatial crowdsourcing to hide the locations of workers and tasks from untrustworthy entities (e.g., the SC-server) with low cost, low overhead and without compromising the performance of the SC system. Location privacy of the individual is protected with differential privacy, which is a mathematically rigorous definition of privacy. The cost is measured in terms of the travel distance between a task’s location and an assigned worker, while overhead refers to the computation and communication overhead required to match a task request to the assigned workers. The performance is measured by the number of assigned (or completed) tasks.

1.3 Privacy-Preserving Task Assignment

Protecting Locations of Workers: We first assume a scenario with public tasks and private workers, where the goal is to perform task assignment while at the same time protecting the location privacy of the workers. This is challenging, as protecting privacy either introduces uncertainty with respect to worker whereabouts, decreasing assignment quality, or creates significant communication overhead.

One may argue that simply replacing workers’ identities with fake ones (i.e., pseudonymity) would achieve privacy. However, we argue that hiding users’ identities without hiding their locations does not provide privacy. This is because a user’s location information can be tracked through several stationary connection points (e.g., cell towers). The user’s location trace can be easily associated with a particular home or office, which reveals the user’s identity. This has been referred to as an inference attack [103].
That study collected GPS data from volunteer users and found their home locations with a median error of 60 meters. A study in [60] showed that there is a close correlation between people’s identities and their mobility. Furthermore, a large-scale study reported in an article in Nature [38] showed that 4 locations uniquely identify 95% of individuals. Thus, hiding workers’ locations is much more challenging than hiding their identities, since the workers’ locations are needed for effective task assignment.

Location privacy has been studied in the context of location-based services. Proposed solutions [62, 126, 86, 16] typically focus on location-based queries, such as finding points of interest near a user’s location without disclosing the actual coordinates. However, in SC, the worker location is no longer part of the query, but instead the result of a spatial query around the task location, i.e., a search for workers in the proximity of the task. In addition, while some studies consider queries on private locations [204, 33], they assume that the data owner entity and the querying entity trust each other, with protection being offered only against intermediate service provider entities. This scenario does not apply to SC, as there is no explicit trust relationship between requesters and workers.

We propose a framework for protecting the privacy of worker locations, whereby the SC-server only has access to data sanitized according to differential privacy (DP) [44]. In practice, there may be many SC-servers run by diverse organizations that do not have established trust relationships with the workers. On the other hand, every worker subscribes to a cellular service provider (CSP) that provides Internet connectivity. The CSP already has access to the worker locations (e.g., through cell tower triangulation), but as opposed to the SC-server, the CSP signs a contract with its subscribers, stipulating the terms and conditions of location disclosure.
The CSP collects user locations and releases them to third-party SC-servers in noisy form, according to DP. However, using DP introduces three difficult challenges, as follows.

First, the SC-server must match workers to tasks using noisy data, which requires complex strategies to ensure effective task assignment. To create sanitized data releases at the CSP, we adopt the Private Spatial Decomposition (PSD) approach [35]. A PSD is a sanitized spatial index, where each index node contains a noisy count of the workers rooted at that node. To ensure that task assignment has a high success rate, we introduce an analytical model that determines with high probability a PSD partition around the task location that includes sufficient workers to complete the task.

Second, the DP protection model requires fake entries to be created in the PSD. Thus, the SC-server cannot directly contact workers (even if pseudonyms are used), as establishing a connection to an entity would allow the SC-server to learn whether an entry is real or not, and breach privacy. To address this challenge, we propose the use of geocasting [130] as a means to deliver task requests to workers. Once a PSD partition is identified by the analytical model outlined above, the task request is geocast to all workers in the partition. Geocast introduces overhead that needs to be carefully considered in the framework design.

Third, protecting worker locations across multiple timestamps is notoriously difficult [46]. As workers move, new snapshots of sanitized worker locations must be disclosed to maintain task assignment effectiveness. However, access to sequential releases gives an adversary more powerful attack opportunities. To counter such threats, differential privacy requires more noise injection, which in the worst case may reach amounts that are proportional to the length of the released location history (i.e., the number of disclosed snapshots).
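To make the noise growth concrete, here is a minimal Python sketch. It is not the mechanism of this thesis: it uses a flat list of grid cells instead of an adaptive PSD, assumes count queries with sensitivity 1, and assumes that a total budget is split evenly across snapshots under sequential composition. All function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_grid_counts(worker_cells, num_cells, epsilon, rng):
    """Return one noisy worker count per grid cell.

    Assumption: each worker falls in exactly one cell, so adding or
    removing a worker changes a single count by 1 (sensitivity 1).
    """
    counts = [0] * num_cells
    for cell in worker_cells:
        counts[cell] += 1
    scale = 1.0 / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


def per_snapshot_scale(total_epsilon, num_snapshots):
    """Laplace scale per snapshot when the total budget is split evenly."""
    return num_snapshots / total_epsilon


if __name__ == "__main__":
    rng = random.Random(0)
    # Four workers in a hypothetical 4-cell grid, released with epsilon = 0.5.
    print(noisy_grid_counts([0, 0, 1, 3], 4, 0.5, rng))
    # Same total budget, more snapshots: the noise scale grows linearly.
    print(per_snapshot_scale(1.0, 1), per_snapshot_scale(1.0, 10))
```

With a total budget of 1.0, a single snapshot uses Laplace scale 1.0, whereas ten snapshots use scale 10.0 each, which is the linear growth in the number of disclosed snapshots described above.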
Clearly, such large noise would render the data useless, since SC is likely to be a continuously offered service in practice. We study custom-designed mechanisms for differentially private release of worker locations across consecutive timestamps that preserve location accuracy, such that the task assignment remains highly effective.

Our contributions are:
1) We identify the specific challenges of location privacy in SC, and we propose a framework that achieves differentially private protection guarantees.
2) We propose an analytical model that measures the probability of task completion with uncertain worker locations, and devise a strategy that finds appropriate PSD partitions to ensure a high success rate of task assignment.
3) We introduce a geocast mechanism for task request dissemination that is necessary to overcome the restrictions imposed by DP, and we factor the geocast system overhead into the PSD partition search strategy.
4) We extend our solution to the more challenging scenario of multiple-snapshot releases of worker datasets. We investigate techniques for careful privacy budget allocation across consecutive releases, and we employ post-processing techniques based on Kalman filters to reduce the inaccuracy introduced by noise addition.
5) We conduct extensive experiments on real-world datasets which show that the proposed framework is able to protect workers’ location privacy without significantly affecting the effectiveness and efficiency of the SC system.

Protecting Locations of Both Workers and Tasks: Several studies (e.g., [95, 179]) focus on effective tasking by maximizing the number of assigned tasks while minimizing workers’ travel distances, for which they require workers to reveal their locations and requesters to disclose their tasks’ locations to the server.
We argue that to enable effective tasking, the server does not have to know the exact locations of the workers and the tasks, because a task can be matched to a nearby worker as long as their proximity is known. However, once the worker agrees to complete the task, he must travel to the task’s location, perform it, and report the result to the server. At this phase, referred to as reporting, the disclosure of the task’s location to the assigned worker and vice versa is usually unavoidable. Thus, privacy during the reporting phase is less critical and beyond the scope of this thesis; instead, we focus on privacy protection during the tasking phase.

Privacy-preserving task assignment in SC has been an active area of research in recent years. Existing studies have two major drawbacks. First, they generally focus only on protecting the location privacy of workers [170, 58, 151] and assume that task locations are public. However, task locations should be protected during tasking since they can be sensitive. For example, task locations can indirectly reveal requesters’ locations, since requesters often post tasks in the proximity of their own locations [173]. Second, existing studies often assume a trusted entity to sanitize the location data [93, 140, 170]. This assumption does not hold in all applications of SC, as there is no explicit trust relationship between any two parties (e.g., requester and worker). Hence, we assume a broader privacy setting where all SC parties could be curious but not malicious, and aim to protect the location privacy of both workers and tasks during the tasking phase without relying on any trusted entity.

To obscure the locations of the workers and the tasks from potentially untrustworthy entities (e.g., servers), we can apply location privacy mechanisms, such as cloaking [62], perturbation [16, 198], private information retrieval [56], and secure multi-party computation [61].
Among them, we adopt a recent perturbation technique, named geo-indistinguishability [16], for two reasons: it is a mathematically rigorous definition of location privacy, and it suits our privacy setting in that the mechanism for achieving geo-indistinguishability can be performed in real time by the smartphones of workers and requesters, without the need for any trusted anonymization entity.

The challenge is to accurately estimate the worker-task pair reachability given only perturbed (or noisy) locations of the workers and the tasks. Due to the location uncertainty, a nearby task may not be assigned to a worker because the perturbed location of the task is farther away. This may reduce the number of completed tasks. On the other hand, a task can be assigned to a worker whose location is far away because the perturbed location of the worker is close to the task. This may increase the worker travel distance.

More concretely, we first assume the online tasking strategy, which has been shown to be scalable and effective for tasking in SC [17, 177, 176]. Within online tasking, the set of workers is known up front, and each task, upon arrival, needs to be immediately matched to an available worker (i.e., one at a time); once assigned, the worker becomes unavailable for assignment. The main objective is to assign as many tasks to the workers as possible during a given time interval.

Consider the example of three online tasks in Figure 1.1. The tasks arrive one-by-one in the order t1 → t2 → t3. Every worker corresponds to a specific geographical region (aka spatial region) represented by a circle, where any enclosed task is considered reachable from (and thus can be assigned to) the worker. The optimal assignment in this example is to match t1 to w2, t2 to w3 and t3 to w1. However, if t2 is assigned first to w1, w1 becomes unavailable and therefore t3 remains unmatched, resulting in a local optimum.
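The example can be replayed with a short Python sketch. Because the exact layout of Figure 1.1 is not available in this text, the reachability sets below are hypothetical, chosen only to be consistent with the description above; `assign_online` and its `prefer` argument are illustrative names, not part of the thesis.

```python
def assign_online(arrivals, reachable, prefer):
    """Greedy online matching.

    Each arriving task takes the first still-available worker, in the
    fixed `prefer` order, from which the task is reachable.
    """
    available = set(prefer)
    assignment = {}
    for task in arrivals:
        for worker in prefer:
            if worker in available and worker in reachable[task]:
                assignment[task] = worker
                available.remove(worker)
                break
    return assignment


# Hypothetical reachability: t3 can only be served by w1.
reachable = {"t1": {"w2"}, "t2": {"w1", "w3"}, "t3": {"w1"}}
arrivals = ["t1", "t2", "t3"]

# t2 grabs w1, so t3 stays unmatched (two tasks assigned).
print(assign_online(arrivals, reachable, ["w1", "w2", "w3"]))
# → {'t1': 'w2', 't2': 'w1'}

# A different worker order leaves w1 free for t3 (three tasks assigned).
print(assign_online(arrivals, reachable, ["w3", "w2", "w1"]))
# → {'t1': 'w2', 't2': 'w3', 't3': 'w1'}
```

The only difference between the two runs is the order in which workers are preferred, which is why the choice of that order matters for online assignment.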
It is known that the optimal algorithm for this problem (in terms of maximizing the number of assigned tasks²) is Ranking [92], which selects a worker that is reachable to a task based on a random rank. However, Ranking may no longer generate the optimal result when it only has access to the perturbed locations of workers and tasks, since the reachability between workers and tasks cannot be exactly determined.

[Figure 1.1: The tasking phase of spatial crowdsourcing. Recoverable labels: Server; 0. Spatial region (Workers); 1. Task location (Requesters); 2. Assignments.]

² There could be different optimality criteria such as minimizing worker travel distance.

To address this challenge of location uncertainty, we partition the online tasking setting into a three-stage privacy-aware framework, dubbed SCGuard (see Table 1.1). SCGuard involves different parties at each stage of the task assignment to ensure effective tasking. This is achieved by revealing users’ locations gradually, only if needed, as the tasking proceeds from one stage to the next. In the first stage, the server recommends a small set of nearby candidate workers for a given task (without knowing the exact locations of either party); the server then forwards the perturbed locations of these workers to the task’s requester. On receipt of these perturbed locations, in the second stage, the requester identifies the most likely reachable worker and sends the task location to this worker. On receiving the exact location of the task, in the final stage, the selected worker accepts the task if it is enclosed within his spatial region; otherwise, he rejects the task. Hence, it is possible for a candidate worker to learn the exact location of the task even when the task is not reachable from the worker; we quantify this disclosure later. The last two stages may repeat until either the task is assigned or no candidate worker is left.
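The three stages can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SCGuard protocol itself: locations are planar points perturbed with planar Laplace noise (uniform angle, radius drawn from a Gamma distribution with shape 2 and scale 1/ε), and both the server and the requester rank workers by plain Euclidean distance on the data they can see, standing in for the reachability models developed later. All names are hypothetical.

```python
import math
import random


def perturb(point, epsilon, rng):
    """Planar Laplace noise: uniform angle, radius ~ Gamma(shape=2, scale=1/epsilon)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.gammavariate(2.0, 1.0 / epsilon)
    return (point[0] + radius * math.cos(theta),
            point[1] + radius * math.sin(theta))


def dist(a, b):
    """Euclidean distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def three_stage_assign(task_loc, noisy_task, workers, noisy_workers, num_candidates):
    """Return (assigned worker or None, workers that saw the exact task location).

    workers:       id -> (exact location, reach radius), known only to each worker
    noisy_workers: id -> perturbed location, known to server and requester
    """
    # Stage 1 (server): only perturbed locations; shortlist the nearest workers.
    candidates = sorted(
        noisy_workers, key=lambda w: dist(noisy_workers[w], noisy_task)
    )[:num_candidates]
    # Stage 2 (requester): exact task location, perturbed worker locations.
    ranked = sorted(candidates, key=lambda w: dist(noisy_workers[w], task_loc))
    disclosed = []
    for w in ranked:
        # Stage 3 (worker): receives the exact task location, checks own region.
        disclosed.append(w)
        exact_loc, reach_radius = workers[w]
        if dist(exact_loc, task_loc) <= reach_radius:
            return w, disclosed
    return None, disclosed


if __name__ == "__main__":
    rng = random.Random(7)
    workers = {"w1": ((0.0, 0.0), 2.0), "w2": ((5.0, 5.0), 2.0)}
    noisy_workers = {w: perturb(loc, 1.0, rng) for w, (loc, _) in workers.items()}
    task = (1.0, 0.5)
    print(three_stage_assign(task, perturb(task, 1.0, rng), workers, noisy_workers, 2))
```

The second return value lists every worker who learned the exact task location, including any who rejected the task, which corresponds to the disclosure mentioned above.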
Stage | By | Input | Output
1 | SC-Server | Uncertain task location, uncertain workers’ locations | A set of candidate workers for a given task
2 | Requester | The uncertain candidate workers recommended by the server | The worker of the highest rank
3 | Worker | Exact task location | Accept the task or not
Table 1.1: Three stages of the privacy-aware framework.

A key component in each stage of the above framework is to determine whether a task is reachable from a worker. We first devise a baseline solution by treating the perturbed locations of tasks and workers as their actual locations. We call this baseline the “oblivious” technique, as it is oblivious to the fact that the locations are perturbed and not real. Obviously, the utility of this approach is very low. To improve the utility of the oblivious approach, we propose analytical and empirical models to quantify the probability of reachability between a task and a worker in each stage of SCGuard. Thereafter, we introduce a probability-based solution that improves on the baseline in several metrics, including a higher number of assigned tasks, smaller worker travel cost and lower disclosure of location information.

The contributions of this work [175] are as follows.
1) We propose SCGuard, a privacy-aware framework that enables workers and requesters to participate in SC without compromising their location privacy. To the best of our knowledge, this is the first work designed to protect the privacy of both parties in SC without assuming any trusted entity.
2) We propose analytical and empirical models to quantify the worker-task pair reachability in every stage of SCGuard, based on which the probabilistic tasking algorithm is introduced.
3) We conduct an extensive set of experiments on a real-world dataset, named T-Drive [208], showing three main results. First, the analytical model performs as well as the empirical model without relying on precomputation on past or synthetic data.
Second, our probabilistic tasking algorithm is superior to the baseline in all key metrics, including a higher number of assigned tasks (×3), smaller worker travel cost (2/3) and lower disclosure of task location (/100), with only a slight increase in system overhead (20%). Third, SCGuard is able to protect location privacy of both workers and requesters without significantly compromising the key metrics of the SC system.

Chapter 2 Background in Spatial Crowdsourcing

2.1 Taxonomy of Spatial Crowdsourcing

Spatial crowdsourcing opens up a new mechanism for spatial tasks (i.e., tasks related to a location) to be performed by humans. Consequently, we formally define spatial crowdsourcing as the process of crowdsourcing a set of spatial tasks to a set of human workers where performing an outsourced task is only possible if the workers are physically at the location of the task, termed spatial task. In this section, we present a taxonomy for crowdsourcing (Figure 2.1). First, we classify spatial crowdsourcing based on worker’s motivation. Next, we define two modes of task publishing in spatial crowdsourcing. Finally, we classify the workers into two groups based on whether or not they have constraints.

[Figure 2.1: The focus of this thesis, spatial crowdsourcing, is shown in grey. Node labels: Crowdsourcing; Reward-based, Self-incentivised; Worker Selected, Server Assigned; Worker Spatial Constrained, Server Spatial Constrained.]

Worker’s Motivation: A major challenge in any crowdsourcing system is how to motivate people to participate. Four levels of worker motivation can be found in [143], including pay, altruism, fun and implicit. To simplify, crowdsourcing can be classified based on the motivation of the workers into two classes: reward-based and self-incentivised (Figure 2.1).
With reward-based spatial crowdsourcing, every spatial task has a price (assigned by a requester) and workers receive a certain reward for every spatial task they perform correctly. Examples of this class include [8, 9]. With self-incentivised spatial crowdsourcing, workers volunteer to perform the tasks, or have incentives other than receiving a reward, such as documenting an event or promoting their cultural, political or religious views. An example of this class is [6], in which more than 5000 users voluntarily install traffic software onto their phones and report traffic information. Our work focuses on self-incentivised spatial crowdsourcing.

Task Publishing Modes: Next, we define two task publishing modes in spatial crowdsourcing, pull and push. With the pull mode, the SC-server publishes the spatial tasks and online workers can choose any spatial task in their vicinity without the need to coordinate with the server. One advantage of the pull mode is that the workers do not need to reveal their locations to the SC-server, since they can choose any arbitrary task in their vicinity autonomously. However, one drawback of this mode is that the server does not have any control over the allocation of spatial tasks. This may result in some spatial tasks never being assigned, while others are assigned redundantly. Another drawback of the pull mode is that workers choose tasks based on their own objectives (e.g., choosing the k closest spatial tasks to minimize their travel cost), which may not result in a globally optimal assignment. An example of the pull mode is [13], where the workers browse for available spatial tasks and pick the ones in their neighborhood.

With the push mode, the SC-server does not publish the spatial tasks to the workers. Instead, any online worker sends his location to the SC-server. After receiving the locations of all online workers, the SC-server assigns to every worker his nearby tasks.
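The difference between the two modes can be seen on a toy instance. The sketch below is illustrative only: the worker and task names are made up, and the greedy closest-pair rule used for the push mode is an assumption for the example, not the assignment algorithm of any cited system.

```python
import math


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def pull_mode(workers, tasks):
    """Pull: every worker independently picks the task closest to himself."""
    return {
        w: min(tasks, key=lambda t: dist(loc, tasks[t]))
        for w, loc in workers.items()
    }


def push_mode(workers, tasks):
    """Push: the server sees all locations and greedily matches the closest
    worker-task pairs, using each worker and each task at most once."""
    pairs = sorted(
        (dist(wl, tl), w, t)
        for w, wl in workers.items()
        for t, tl in tasks.items()
    )
    assignment, used_tasks = {}, set()
    for _, w, t in pairs:
        if w not in assignment and t not in used_tasks:
            assignment[w] = t
            used_tasks.add(t)
    return assignment


# Hypothetical instance: two workers, two tasks on a line.
workers = {"w1": (0.0, 0.0), "w2": (1.0, 0.0)}
tasks = {"t1": (0.5, 0.0), "t2": (3.0, 0.0)}
```

On this instance both workers pull `t1` and `t2` is never performed, whereas the push-mode server covers both tasks. The price, as the text goes on to note, is that the server must know where every worker is.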
The advantage of the push mode is that, unlike the pull mode, the SC-server has the big picture and can therefore assign to every worker his nearby tasks while maximizing the overall task assignment. Examples of this mode of spatial crowdsourcing include [95] and [93]. In [93], a framework for small campaigns is proposed, where workers are assigned to their nearby sensing tasks. However, the drawback is that the workers should report their locations to the server for every assignment, which can pose a privacy threat. Recently in [170], a framework was proposed to sanitize workers’ locations according to differential privacy while still using the SC-server as a broker to assign tasks to workers. A real-world example of the push mode is Uber (footnote 1), a mobile app that connects passengers with drivers of vehicles for ridesharing. Our focus in this thesis is on this mode of spatial crowdsourcing.

Worker’s Constraints: Finally, in the case of the push mode, we divide the workers into two groups based on whether or not they have constraints. With workers without constraints, the server has full flexibility on how tasks should be assigned to the workers. This means that workers only send their locations to the server, and the server assigns every spatial task to its nearby worker [93]. With workers with constraints, the server needs to satisfy the constraints while assigning the tasks. An example of a spatial constraint is that every worker only accepts spatial tasks in a spatial region (i.e., his working region).

2.2 Comparison to Other Fields of Study

Crowdsourcing: Crowdsourcing has recently been attracting extensive attention in the research community. A recent survey in this area can be found in [102]. With a growing recognition of crowdsourcing, many crowdsourcing services including oDesk [3], MTurk [2] and CrowdFlower [7] have emerged, which allow requesters to issue tasks that workers can perform for a certain reward.
Crowdsourcing has been largely adopted in a wide range of applications. Examples of such applications include but are not limited to image search [202], natural language annotations [153], video and image annotations [29], [156] and [194], search relevance [12] and [24], social games [185] and [65] and graph search [134]. Moreover, the database community has utilized crowdsourcing in database design, query processing [53], [115], [135], [39] and [219] and data analytics [112] and [188]. In [53], a relational query processing system is proposed, which uses crowdsourcing to answer queries that cannot otherwise be answered. As part of the crowdsourced database systems, human-powered versions of the fundamental operators, such as sort and join [115] and filter [135], were developed. Recently in [112], a system was developed to improve the accuracy of data analytics jobs by exploiting crowdsourcing techniques.

Footnote 1: https://www.uber.com/

Spatial Crowdsourcing: Despite all the studies on crowdsourcing, spatial crowdsourcing only recently received some attention [170, 40, 96, 37, 95, 13]. In [13], a crowdsourcing platform is proposed, which utilizes location as a parameter to distribute tasks among workers. In [95], a spatial crowdsourcing platform whose goal is to maximize the number of assigned tasks is proposed. Since the workers cannot always be trusted, another work aims to tackle the issue of trust by having tasks performed redundantly by multiple workers [96]. In [37], the problem of complex spatial tasks (i.e., each task comprises a set of spatial sub-tasks) is introduced, in which the assignment of the complex task requires performing all of its sub-tasks. Meanwhile, the problem of scheduling tasks for a worker that maximizes the number of performed tasks is proposed in [40]. In [181], an online spatial task assignment problem is suggested to maximize the number of successful assignments.
Recently in [170], the authors introduced the problem of protecting worker location privacy in spatial crowdsourcing. This study proposes a framework that achieves differentially private protection guarantees. The solutions for this problem are quite complex, and require tuning multiple parameters to obtain satisfactory results. Thus, the same authors propose PrivGeoCrowd [171], an interactive visualization and tuning toolbox for privacy-preserving spatial crowdsourcing. PrivGeoCrowd helps system designers investigate the effect of parameters such as privacy budget and allocation strategy, task-assignment heuristics, and dataset density on the effectiveness of private task matching. At the same time, privacy-preserving task assignment using cloaked locations is proposed in [140]. Moreover, the problem of crowdsourcing location-based queries over Twitter has been studied, which employs a location-based service (e.g., Foursquare) to find the appropriate people to answer a given query [25]. Even though this work focuses on location-based queries, it does not assign to users any spatial task, for which the user should go to that location and perform the corresponding task. Instead, it chooses users based on their historical Foursquare check-ins.

In [129], spatiotemporal dynamics in mobile task markets, such as [8, 9], were studied. The well-known concept of participatory sensing could be deemed as one class of spatial crowdsourcing, in which workers form a campaign to perform sensing tasks. Examples of participatory sensing campaigns include [6, 93, 36, 79] and [125]. However, the major drawback of all the existing work on participatory sensing is that they focus on a single campaign and try to address the challenges specific to that campaign. Another drawback of most existing studies on participatory sensing (e.g., [93]) is that they are designed for small campaigns, with a small number of participants, and are not scalable to large spatial crowdsourcing applications.
Finally, while most existing work on participatory sensing systems focuses on a particular application, our work can be used for any type of spatial crowdsourcing system.

Another class of spatial crowdsourcing is known as volunteered geographic information (or VGI), whose goal is to create geographic information provided voluntarily by individuals. Examples for this class include Google Map Maker [5], OpenStreetMap [1] and WikiMapia [4]. These projects allow the users to generate their own geographic content, and add it to a pre-built map. For example, a user can add the features of a location, or the events that occurred at that location. However, the major difference between VGI and spatial crowdsourcing is that in VGI, users voluntarily participate by randomly contributing data, whereas in spatial crowdsourcing, a set of spatial tasks are queried by the requesters, and workers are required to perform those tasks. Moreover, with most VGI projects ([5] and [4]), users are not required to physically go to a particular location in order to generate data with respect to that location. Finally, as the name suggests, VGI falls into the class of self-incentivised crowdsourcing.

Matching Problems: One can consider the task assignment problem in spatial crowdsourcing as the online bipartite matching problem ([92], [87], [98], [88] and [120]). The online bipartite matching problem is the most relevant variation of spatial crowdsourcing, as it captures the dynamism of tasks arriving at different times. However, in online bipartite matching one of the item sets is given in advance and items from the other set arrive (usually one at a time), while in spatial crowdsourcing both sets, i.e., workers and tasks, can come and go without our knowledge. Thus, to some extent, online matching can be considered as a special case of our task assignment where the worker set (or task set) is fixed. In addition, with online matching, the cost/weight of any match is known in advance.
However, with spatial crowdsourcing, the cost for a worker to perform a task mainly corresponds to the time it takes for him to travel to the location of the task. As a result, the cost of a task is not a fixed value but depends on the worker’s prior location. Hence, the sequence in which the tasks are performed impacts the cost. That is, with spatial crowdsourcing, the cost of the execution of a set of tasks is the distance of the shortest path that starts from the worker’s current location and goes through the locations of all the assigned tasks. On the other hand, with online matching [88] the overall cost for one worker would be the sum of the distances between the worker and each assigned task. Finally, the performance of an online algorithm is often evaluated based on its competitive ratio, i.e., the ratio between its performance and the offline algorithm’s performance. The online algorithm is competitive if its competitive ratio is bounded under any circumstance; this is not the goal of MSA, which focuses on the average performance.

Some recent studies in spatial matching [195] and [206] do focus on efficiency and use the spatial features of the objects for more efficient assignment. Spatial matching is a one-to-one (or in some cases one-to-many) assignment between objects of two sets where the goal is to optimize over some aggregate function (e.g., sum, max) of the distance between matched objects. For example, the objective in [184] is to pair up 2N points in the plane into N pairs such that the sum of the Euclidean distances between the paired points is minimized. In [Wong et al. 2007], given a set of customers and a set of service providers with limited capacities, the goal is to assign the maximum number of customers to their nearest providers, among all the providers whose capacities have not been exhausted in serving other closer customers.
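To make concrete the two cost models contrasted earlier in this discussion (sum of worker-to-task distances in online matching versus shortest path through all assigned tasks in spatial crowdsourcing), the following sketch computes both for one worker. It is a brute-force illustration with hypothetical coordinates and Euclidean distance assumed; it enumerates all visiting orders and is only practical for a handful of tasks.

```python
import math
from itertools import permutations


def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def sum_of_distances(worker, tasks):
    """Online-matching style cost: each task is charged its own distance
    from the worker's starting location, independent of visiting order."""
    return sum(dist(worker, t) for t in tasks)


def shortest_path_cost(worker, tasks):
    """Spatial-crowdsourcing style cost: length of the shortest open path that
    starts at the worker and visits every task (no return to the start)."""
    best = math.inf
    for order in permutations(tasks):
        cost, current = 0.0, worker
        for t in order:
            cost += dist(current, t)
            current = t
        best = min(best, cost)
    return best


# Hypothetical instance: a worker at the origin and two tasks.
worker = (0.0, 0.0)
tasks = [(3.0, 0.0), (3.0, 4.0)]
matching_cost = sum_of_distances(worker, tasks)
path_cost = shortest_path_cost(worker, tasks)
```

The two quantities differ as soon as a worker holds more than one task, and the path cost depends on the order of visits, which is why the cost of a match cannot be treated as known in advance.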
The spatial matching studies above assume that global knowledge about the locations of all objects exists a priori, and the challenge comes from the complexity of spatial matching. However, spatial crowdsourcing differs due to the dynamism of tasks and workers (i.e., tasks and workers come and go without our knowledge); thus the challenge is to perform the task assignment at a given instance of time with the goal of global optimization across all times. Moreover, the fact that workers need to travel to task locations causes the landscape of the problem to change constantly. This adds another layer of dynamism to spatial crowdsourcing that renders it a unique problem.

Expertise matching is the concept of assigning queries to experts, which has gained extensive interest from various research fields for its wide range of applications such as product-reviewer alignment, product-endorser matching, and paper-reviewer assignment [123], [160] and [70]. In [162], a framework for expertise matching with various constraints was introduced, which is capable of rendering the optimal solution. However, none of these studies address the problem of expertise in spatial matching. Our objective is different from these studies as we address the problem of maximum expertise matches, while they try to find the best individual match based on some models/methods.

Vehicle Routing Problems: Modeling the assignment cost as the shortest path visiting the location of multiple tasks brings another class of problems to attention. In this context the assignment problem in spatial crowdsourcing becomes similar to the Traveling Salesman Problem (TSP) [104] and the Vehicle Routing Problem (VRP) [178]. The goal of VRP is to minimize the cost of delivering goods located at a central depot to customers who have placed orders for such goods with a fleet of vehicles. The online versions of both TSP and VRP have been studied to some extent, where new locations to visit are revealed incrementally.
Since there is only one salesman in the standard version of TSP, here we focus on VRP. Different variations of VRP have been studied, yet there are differences between task assignment in spatial crowdsourcing and these variations. With VRP, all workers start from the same depot, whereas in spatial crowdsourcing each worker can have a different starting location. Moreover, with VRP we have a fixed number of workers, whereas in spatial crowdsourcing the same type of dynamism for tasks can apply to the workers. That is, workers can be added (removed) to (from) the system at any time.

Chapter 3 Survey of Related Work

3.1 Spatial Crowdsourcing

Spatial Crowdsourcing (SC) has recently been attracting attention in both the research communities (e.g., [95, 40, 170, 167, 169]) and industry (e.g., TaskRabbit, Gigwalk [129]). Various aspects of spatial crowdsourcing are summarized in Table 3.1. In the following, we distinguish SC from related fields, including crowdsourcing, participatory sensing, and volunteered geographic information [174].

3.2 Privacy Threats

There have been known attacks on SC applications, such as location-based attacks during tasking in the push mode [93] and collusion attacks during reporting in the pull mode [187] (see Table 3.2).
Although most studies have focused on only one of these two major threats, privacy risks to SC users may also occur in the other scenarios: reporting in the push mode and tasking in the pull mode. In this section we present a threat model which characterizes the full spectrum of privacy threats to workers and requesters during both tasking and reporting phases with either push or pull mode. Next, we illustrate the privacy risks on TaskRabbit.

Problem Focus | Studies
Task assignment | [95, 37, 139, 210, 67, 51, 181], [174, 207, 200, 196, 182, 50], [64, 113, 211, 177, 176, 74, 31, 18, 54, 214, 183, 20, 161, 114, 183, 167, 154, 217]
Task scheduling | [40, 145, 27, 42, 107, 66, 28, 52, 52, 127, 146, 191, 158, 41]
Privacy | [170, 140, 58, 213, 75, 169, 151, 220, 132, 159, 110, 108, 109]
Trust | [96, 32, 49, 23, 155, 190, 218, 91, 15, 48, 216, 111]
Scalability | [11]
Incentives | [105, 14, 203, 81, 69, 128, 129, 164, 144, 165, 150, 82, 47, 80, 63, 168, 106, 122, 121, 89, 90, 215]
Applications | [30, 101, 172, 166]
Table 3.1: Spatial crowdsourcing studies.

Mode | Tasking | Reporting
Push | [93] | [152]
Pull | [Sec. 3.2.2] | [187, 152], [Sec. 3.2.2]
Table 3.2: Attacks on SC users.

3.2.1 Threat Model

As the privacy threats vary according to the modes of task publishing, we discuss possible threats associated with each mode.

[Figure 3.1: Threat models in spatial crowdsourcing. (a) Push mode; (b) Pull mode. Panel labels: Adversary (Server / SC-server, Worker, Requester); Tasking; Reporting; Assignment links; Reporting links; nodes W and R.]

Privacy Threats with the Push Mode: With the push mode, the server takes as input the perturbed locations of both workers and tasks to perform effective task assignment; hence, there is a serious privacy threat from the server, which might become a single point of attack. Figure 3.1a depicts the threat model for the push mode of spatial crowdsourcing.
W and R denote workers and requesters, respectively. The dotted circles surrounding them denote that they are protected from a malicious entity shown in the first column of the first row in a dashed shaded box. After the tasking and reporting phases, the links between W and R represent the established connections during each phase. We refer to these links as the assignment link and reporting link. The dashed links indicate connections that are oblivious to the corresponding malicious entity.

The first row means that locations of workers and tasks are protected from the server at all times. The role of the server is to create the assignment links between the workers and the requesters so that they can establish a direct communication channel among themselves. Each worker-requester pair cooperatively decides whether to accept the assignment from the server. If yes, they send a consent message to the server, confirming that the worker will perform the requester’s tasks. This agreement is illustrated by the first reporting link in Figure 3.1a. We argue that to preserve location privacy during both tasking and reporting phases, task locations need to be protected from the server. Otherwise, the completion of a task reveals that some workers must have visited the task’s location.

In restrictive privacy settings, workers and requesters can also be malicious to each other. Hence, to ensure minimum disclosure among them, only workers who aim to perform the tasks should know the tasks’ locations (see the second row in Figure 3.1a). Likewise, a requester should only know the workers’ locations once her tasks are matched to and then performed by those workers (see the third row in Figure 3.1a). We emphasize the minimum disclosure of location information for both workers and tasks. The reason for this is twofold. First, the server knows only the assignment links between workers and tasks.
Due to such links, the assigned workers (or tasks) may infer that there exist nearby tasks (or workers). These disclosures are unavoidable in the push mode of SC. Second, the disclosure of workers’ locations to their corresponding requester is inevitable at the reporting phase per definition of SC. It is worth mentioning that this threat model is restrictive; hence, weaker variants exist. For example, most existing studies in the push mode assume that workers are trusted [93, 140, 75] and task locations are public [186, 170, 58, 212, 151].

Privacy Threats with the Pull Mode: With the pull mode, despite the fact that workers do not need to send their locations to the server, the locations can still be learned during both tasking and reporting phases. As long as a worker connects to the server to either request some tasks or report results, he may reveal to the server patterns of where and when the connections were made and what kind of tasks he wants to perform. Consequently, in [152], the authors show that linking multiple requests or reports of the worker may allow an adversary to trace him, since the worker’s location information can be tracked through several stationary connection points (e.g., cell towers). In addition, the worker’s location trace can be inferred by both the server and requesters, since he must be in the vicinity of the tasks in order to perform them.

Figure 3.1b depicts the proposed threat model for the pull mode. To preserve privacy and identity of the workers from the server, both assignment links and reporting links should be secure during tasking and reporting phases, respectively. This is because if the connections are discovered by the server, which already knows the locations of tasks, the server learns the locations of workers, since they must have visited the locations of the performed tasks.
Hence, the workers must request tasks without revealing their identity to the server; once the tasks are performed, the workers must also disassociate their connections with the performed tasks while uploading task content to the server. Similar to the push mode, both workers and requesters themselves can be hostile to one another. Thus, the privacy threats from workers and requesters (rows 2 and 3 in Figure 3.1b) are similar to those in the push mode (rows 2 and 3 in Figure 3.1a), except for the difference in the assignment links of the two second rows. The reason for this is that the requester is oblivious to the requests between the worker and the server during tasking.

3.2.2 Case Study of TaskRabbit

We show that an adversary can perform harmful attacks on a typical SC application without much effort. TaskRabbit is a pull-based (footnote 1) online and mobile marketplace that matches workers with requesters, allowing requesters to find immediate help with everyday tasks including, but not limited to, cleaning, moving, and delivery. In the following we discuss the aforementioned threats to TaskRabbit users. Note that the following attacks on TaskRabbit.com were conducted in October 2014; the website has been updated since then.

Footnote 1: We present the privacy threats to a pull-based SC system only; however, some of these privacy threats also occur in push-based SC such as iRain.

We first show the breach of task location during tasking. We signed up for a worker account and searched for delivery tasks in Los Angeles; 2381 spatial tasks were found. We obtained various information about a particular task by clicking on it, such as description, price, task status and cloaked locations.
Although each location is cloaked in a circle with a radius of half a km (footnote 2; Figure 3.2a) to protect task locations from workers, the actual drop-off and pick-up locations were mentioned in the task description, i.e., “Please pick up a box of mini-muffins from (S) promptly at 8 am on Tues, 9/4, and drive them straight to me at (D).” It is also worth noting that task requests often contain sensitive information, such as health status of the requesters. An example of a sensitive task is one with title “super easy task deliver a bag to the doorstep of a sick friend.” Nonetheless, these privacy risks are due to the disclosure of task content, which is beyond the scope of this study.

We then show the leak of worker location during tasking and reporting. To gain a competitive advantage, a worker may wish to not disclose locations of his visits to other workers and requesters. The task status (Figure 3.2c) implies that the worker, referred to as Bob, was at the pick-up and drop-off locations of the task during the one-hour period between his assigned time and his completed time. The risk of precisely inferring Bob’s locations is even higher for time-sensitive tasks such as delivery and help at home, which require him to meet requesters in person at a specific place and time. This inference attack shows that TaskRabbit does not guarantee the privacy protection for the pull mode in Section 3.2.1, which says that Bob’s locations are private to the server and only requesters who have their tasks performed by Bob should know his locations. In addition, one can also see much more information about Bob, including his previously performed tasks (Figure 3.2d) and all reviews from the requesters who hired him.
These associations between Bob and his performed tasks indicate that the assignment links and reporting links are known to the server.

[Figure 3.2: Screenshots of TaskRabbit web application from worker Bob. (a) Task locations; (b) Task price; (c) Task status; (d) Performed tasks.]

Footnote 2: We obtained this information via JavaScript code.

Task description | Corresponding JavaScript
Quick post-party dishwashing clean up needed | "radius" : "0.5", "geo_center" : {"lat" : "33.xxxxxx", "lng" : "-118.xxxxxx"}
Take down light Christmas decorations | "radius" : "0.5", "geo_center" : {"lat" : "33.xxxxxx", "lng" : "-118.xxxxxx"}
Put up 20 yard sale signs in Mid-Wilshire area | "radius" : "0.5", "geo_center" : {"lat" : "33.xxxxxx", "lng" : "-118.xxxxxx"}
Table 3.3: Three tasks requested by requester Alice.

Among Bob’s requesters, we randomly picked one named Alice. We further show that her home location can be learned by tracking her task requests. We searched for household tasks that Alice requested in the past; three of them are shown in Table 3.3. We replaced the six digits after the decimal point of "geo_center" by 'x' to protect the privacy of the requester. These tasks were in the proximity of each other and likely situated at her home. Our hypothesis is that the tasks’ locations were randomly cloaked such that the cloaking regions covered the actual location of the tasks. Using triangulation, the actual location must then lie in the area where the cloaking regions overlap. We validated our hypothesis by confirming that the location of another task, whose location was known, is within the overlapped region. This attack suggests that the more task requests are posted, the more accurately their locations can be learned. This simple attack is against the threat model, which states that the locations of Alice’s tasks should only be revealed to the workers who performed her tasks.
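The overlap argument behind this attack can be sketched numerically. The example below is a hedged illustration, not the procedure used in the case study: it assumes cloaking circles of radius 0.5 km whose centres are already projected to planar kilometre coordinates, uses made-up centres, and approximates the intersection area by counting grid cells.

```python
import math


def in_all_circles(p, centers, radius):
    """True if point p lies inside every cloaking circle."""
    return all(math.hypot(p[0] - c[0], p[1] - c[1]) <= radius for c in centers)


def overlap_area(centers, radius, step=0.01):
    """Approximate area (km^2) of the intersection of the cloaking circles by
    counting grid cells whose centre point lies inside every circle."""
    xmin = max(c[0] for c in centers) - radius
    xmax = min(c[0] for c in centers) + radius
    ymin = max(c[1] for c in centers) - radius
    ymax = min(c[1] for c in centers) + radius
    nx = max(0, int((xmax - xmin) / step))
    ny = max(0, int((ymax - ymin) / step))
    hits = 0
    for i in range(nx):
        for j in range(ny):
            p = (xmin + (i + 0.5) * step, ymin + (j + 0.5) * step)
            if in_all_circles(p, centers, radius):
                hits += 1
    return hits * step * step


# Hypothetical cloaked centres (km) of three tasks posted from the same home
# at (0, 0); each centre is within 0.5 km of the true location.
centers = [(0.3, 0.0), (-0.2, 0.25), (-0.1, -0.3)]
areas = [overlap_area(centers[:n], 0.5) for n in (1, 2, 3)]
```

With these made-up centres the candidate region shrinks each time a circle is added while still containing the true location, which mirrors the observation that more posted requests allow a more accurate estimate.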
3.3 Privacy Countermeasures

3.3.1 Location Privacy

Location privacy was first studied within the model of spatial k-anonymity [62, 99, 126, 55, 201, 43, 133], where the location of a user is hidden among k other users. Other studies focused on space transformation to preserve location privacy [97, 205]. Such techniques assume a centralized architecture with a trusted third party, which is a single point of attack. More recent work removes this assumption and provides cryptographic-strength protection [56]. While location privacy has largely been studied in the context of location-based services, only a few works have studied privacy in participatory sensing (PS) [94, 77, 78, 36]. The focus of [94] is to privately assign a set of spatial tasks to each worker, while other works [77, 78] focus on preserving privacy in a PS campaign during the data contribution (i.e., how participants upload the collected data to the server without revealing their identities). Closest to our work is the approach from [36], in which a privacy-preserving framework in pull mode is proposed, and the participants collect data in an opportunistic manner without the need to coordinate with the server. However, as discussed in Section 2.1, the pull mode yields poor results in practice. Furthermore, the privacy model in [36] does not provide formal privacy guarantees. In our work, we focus on the push mode in the context of differential privacy, the de facto standard for data publication.

3.3.2 Location Privacy in Spatial Crowdsourcing

In this section we survey some state-of-the-art approaches addressing the privacy issues in spatial crowdsourcing. We first categorize the studies into two groups: tasking in the push mode and reporting in the pull mode. Subsequently, each subgroup is further classified according to the applied techniques. Within each subgroup we identify one key paper shown in boldface to be presented in depth, while follow-up studies are briefly discussed.
An overview of these studies is presented in Table 3.4. The table shows that the studies solely focus on location privacy of workers and assume that the locations and content of tasks are public. Moreover, the server is regarded as a primary threat in all studies, while some consider workers and requesters as secondary adversaries. We also notice that the most recent studies focus on the push mode, which requires privacy protection during tasking. This problem is considerably more challenging when compared to the problem of privacy-preserving reporting in the pull mode.

Protection in the Pull Mode: Privacy protection in the pull mode has been studied in the context of participatory sensing. In this section we highlight recent studies that often focus on the reporting phase of the pull mode. They use either pseudonymity [152] or exchange-based techniques [22, 209]. The pseudonymity method disassociates the connections between one’s uploaded data and his/her identity, while the latter exchanges workers’ crowdsourced data and location information before uploading them to a server so that the server is uncertain about locations of individual workers.

Table 3.4: Overview of problem focuses (Re: reporting, Ta: tasking); privacy techniques used (Ps: pseudonym, Cl: cloaking, Pt: perturbation, Ex: exchange-based, En: encryption-based); threats (W: worker, R: requester, S: server); trusted third party (TTP); optimization type (ST: single task, MT: multiple tasks). x and (x) represent primary and secondary aspects, respectively.
Paper | Phase (Re, Ta) | Techniques (Ps, Cl, Pt, En, Ex) | Protection (W, T) | Threats (W, R, S) | TTP (Yes, No)
[152] x x x (x) (x) x N/A N/A x x
[22] x (x) x x N/A (x) N/A x x
[209] x x x N/A N/A x x
[93] x (x) x x (x) x x
[186] x x (x) x (x) (x) x x
[159] x x x x x
[170] x x x (x) (x) x x
[58] x x x (x) (x) x x
[213] x x x (x) (x) x x
[169] x x x (x) (x) x x
[140] x x x (x) x x
[75] x x x (x) x x
[151] x x x (x) (x) x x
[110] x x x x x
[108] x x x x x x
[109] x x x x x x

Protection in the Push Mode: While preserving privacy during reporting in the pull mode has been largely studied in the context of participatory sensing (a recent survey can be found in [34]), recent SC studies focus on the more challenging phase of tasking. These studies generally assume the push mode. We emphasize that focusing on the tasking step in the push mode is the correct approach, given that SC workers have to physically travel to the task location. The completion of a task discloses the fact that some worker must have been at that location, and this is unavoidable in SC. Focusing on tasking also makes sense from a disclosure volume standpoint. During the assignment, all workers are candidates for participation; therefore, locations of all workers are exposed, absent a privacy-preserving mechanism. Nevertheless, after task request dissemination, only a few workers will participate in task completion, and only if they give their explicit consent (see the threat model for the push mode in Section 3.2.1).

Location Privacy Threats: Protecting location privacy in SC has attracted much interest. There have been known attacks on SC applications, such as location-based attacks during tasking in the push mode [93] and collusion attacks during reporting in the pull mode [187]. A recent survey [173] shows that locations of workers and requesters’ tasks can be learned during both tasking and reporting phases.
Hence, various techniques have been proposed to protect location privacy of workers during tasking, including cloaking (hiding the accurate location in a cloaked region) [140], perturbation (distorting the actual location information by adding artificial noise) [170, 58, 212, 169, 189, 83] and encryption [110, 137]. We compare these studies as follows.

Privacy-Preserving Task Assignment: A recent survey in this area can be found in [138]. Particularly, in [140, 76], given the cloaking regions of a set of workers, the objective of the server is to match a set of spatial tasks to the workers such that task assignment is maximized while satisfying the travel budget constraint of each worker. This study protects only workers' locations. Also, the privacy guarantee is weaker than differential privacy, as cloaking is sensitive to the adversary's prior knowledge. Simultaneously, a differentially private framework is proposed to protect the privacy of workers' locations [170]. In [170], the workers do not trust the server but send their locations to a trusted third party (i.e., cell service provider). The third party sanitizes the location data according to differential privacy and releases the so-called private spatial decomposition (PSD). The server performs task assignment using the statistics of the workers' location data provided by the PSD, while the workers refine the assignment using their exact locations. An extension of [170] that protects workers' locations across multiple timestamps is introduced in [169]. This study investigates privacy budget allocation techniques across consecutive releases and employs post-processing techniques to reduce the inaccuracy introduced by the addition of noise. Our recent work [175] improves [170, 169] in two main aspects. First, the proposed privacy model is broader than that in the prior work: workers and requesters may not trust each other and the server.
Second, this work does not require any trusted third party to sanitize the location data; instead, individual users locally perform location sanitization on their smartphones.

Thereafter, the authors in [59, 213] adopt the privacy model used in our work, i.e., they assume a trusted CSP and differentially private location sanitization. Similar to our study published in [170, 169], the authors in [59] develop analytical models and task allocation strategies that balance privacy, utility, and system overhead. The CSP can integrate mobile server location and reputation information. The work in [213] studies reward-based spatial crowdsourcing that enables task assignment with optimized reward allocation. The approach in [57] introduces a framework for protecting the privacy of workers using location obfuscation techniques. This study assumes a trusted proxy to aggregate exact context (e.g., location) information from workers and output sanitized contexts to the SC-server. The authors develop an optimization model for task selection that maximizes the total expected utility of the server given constraints on privacy and efficiency. However, only worker statistics are protected under differential privacy, whereas locations are obfuscated using cloaking techniques. Location cloaking is vulnerable to background knowledge attacks, and does not provide sufficient protection strength. In contrast, our work provides the strong semantic protection guarantees of differential privacy. Finally, the work in [106] addresses both location privacy and incentives in mobile sensing systems by proposing two credit-based privacy-aware incentive schemes. One of them assumes an online trusted third party, similar to our CSP-centered approach.

Geo-I has been recently used in task allocation to protect the privacy of workers' locations [189]. However, similar to the prior work [140, 170, 169], this study assumes that task locations are known to the server.
This can be a privacy threat to the tasks' requesters because the platform can infer requesters' locations from the locations of their tasks [173]. In contrast, our framework aims to protect locations of both workers and tasks. We also emphasize the online setting rather than the offline variant as in [189], in which the server must wait for a number of tasks to arrive prior to task assignment. In [83], Jin et al. propose a differentially private framework for selecting participants in spectrum-sensing tasks. Unlike ours, this study addresses a very different problem, in which the sensing locations are predetermined and publicly known.

A few recent studies use encryption-based approaches [110, 137]. In [110], the locations of workers and tasks are protected by homomorphic encryption (HE). The platform performs task assignment based on the worker-task distances computed from the encrypted data. The workers receive the encrypted location of the assigned task and decrypt it to obtain the task location. While this approach guarantees exact assignment, its computational overhead is high. Unlike our study, [110] can afford to use HE because it assumes batch assignment (the server waits for multiple tasks to arrive and assigns them to workers in a batch). The experimental results in [110] show that a batch assignment takes from 10 minutes to two hours, which is not feasible for our online setting.

Chapter 4

Privacy-Preserving Task Assignment

4.1 Preliminaries

4.1.1 Task Assignment: The Focus of SC

Crowdsourcing is the process of outsourcing tasks to a distributed group of individuals. The difference between crowdsourcing and ordinary outsourcing is that a task or problem is outsourced to an undefined public rather than a specific body, such as paid employees.
Spatial crowdsourcing (SC) [95] is a type of online crowdsourcing enabled by mobile devices, where performing an outsourced task is only possible if the worker is physically at the location of the task. There are two modes of tasking based on how workers are matched to tasks: push and pull.

With the pull mode, the server publicly publishes the spatial tasks (exact geographical coordinates of the tasks may not be published; instead, their cloaked locations or representative names are provided), and online workers autonomously choose tasks in their vicinity without coordinating with the server. One advantage of the pull mode is that the workers do not need to reveal their locations to the server. However, one drawback of this mode is that the server does not have any control over the allocation of spatial tasks; this may result in some spatial tasks never being assigned, while others are assigned redundantly. Another drawback of the pull mode is that workers choose tasks based on their own objectives (e.g., choosing the k closest spatial tasks to minimize their travel cost), which may not result in a globally optimal assignment. An example of the pull mode is TaskRabbit.

With the push mode, requesters post tasks that include locations, while online workers send their locations to the server, which assigns tasks to nearby workers. The advantage of this mode is that, unlike the pull mode, the server has the big picture and can assign to every worker his nearby tasks while maximizing the overall task assignment. However, the drawback is that locations of both tasks and workers should be sent to the server for effective assignment, which can pose privacy threats. An example of the push mode is Uber.

With the pull mode, the main focus of privacy protection is during the reporting phase, which has been well studied in the context of participatory sensing (PS), e.g., [152, 93, 186, 22, 209].
With PS, the goal is to exploit the ability of mobile users to collect and share data using their sensor-equipped phones for a given campaign. Most studies on PS focus on small campaigns with a limited number of workers; hence, they do not have issues of task assignment. However, with SC, the focus is on devising a scalable, generic and multipurpose crowdsourcing framework, similar to Amazon Mechanical Turk, but spatial, where multiple campaigns can be handled simultaneously. Therefore, most SC studies, including ours, assume the push mode and emphasize privacy protection during the tasking phase.

We emphasize that the benefit of offering privacy protection during the tasking phase lies in the volume of disclosure. This is because even though all workers may appear eligible for participation, the tasking process based on distance can strictly limit the breadth of task dissemination. At the reporting phase, the disclosure of a task's location to its assigned worker and vice versa is inevitable per definition of SC. Once the worker visits the task's location, his location is known to the requester. This disclosure makes sense because we ensure the privacy of the individual location but not the entire location trace of the worker.

4.1.2 Mathematically Rigorous Definitions of Privacy

Differential Privacy

Differential Privacy (DP) [44] has emerged as the de-facto standard in data privacy, thanks to its strong protection guarantees rooted in statistical analysis. DP is a semantic model which provides protection against realistic adversaries with background information. Releasing data according to differential privacy ensures that an adversary's chance of inferring any information about an individual from the sanitized data will not substantially increase, regardless of the adversary's prior knowledge. DP ensures that an adversary cannot tell whether an individual is present or not in the original data. DP is formally defined as follows.

Definition 1 ($\epsilon$-indistinguishability).
[45] Consider that a database produces a set of query results $\hat{D}$ on the set of queries $Q = \{q_1, q_2, \ldots, q_{|Q|}\}$, and let $\epsilon > 0$ be an arbitrarily small real constant. Then, transcript $U$ produced by a randomized algorithm $A$ satisfies $\epsilon$-indistinguishability if for every pair of sibling datasets $D_1$, $D_2$ that differ in only one record, it holds that

$$\ln \frac{\Pr[Q(D_1) = U]}{\Pr[Q(D_2) = U]} \le \epsilon$$

In other words, an attacker cannot learn whether the transcript was obtained by answering the query set $Q$ on dataset $D_1$ or $D_2$. Parameter $\epsilon$ is called the privacy budget, and specifies the amount of protection required, with smaller values corresponding to stricter privacy protection. To achieve $\epsilon$-indistinguishability, DP injects noise into each query result, and the amount of noise required is proportional to the sensitivity of the query set $Q$, formally defined as:

Definition 2 ($L_1$-Sensitivity). [45] Given any arbitrary sibling datasets $D_1$ and $D_2$, the sensitivity of query set $Q$ is the maximum change in the query results of $D_1$ and $D_2$:

$$\sigma(Q) = \max_{D_1, D_2} \|Q(D_1) - Q(D_2)\|_1$$

There are multiple ways to achieve DP. One approach is adding random noise to each query result to preserve privacy, such that an adversary that attempts to attack the privacy of some individual $w$ will not be able to distinguish from the set of query results whether a record representing $w$ is present or not in the database [45]. An essential result from [45] shows that a sufficient condition to achieve differential privacy with parameter $\epsilon$ is to add to each query result randomly distributed Laplace noise with mean 0 and scale $\lambda = \sigma(Q)/\epsilon$.

Typically, the interaction with a dataset consists of a series of analyses $A_i$, each required to satisfy $\epsilon_i$-differential privacy. Then, the privacy level of the resulting analysis is computed as follows:

Theorem 1 (Sequential Composition [118]). Let $A_i$ be a set of analyses such that each provides $\epsilon_i$-DP. Then, running in sequence all analyses $A_i$ provides $(\sum_i \epsilon_i)$-DP.
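As a rough illustration of the Laplace mechanism and of sequential composition as stated above, the following minimal Python sketch adds Laplace noise with scale $\sigma(Q)/\epsilon$ to a list of query answers and sums the budgets of analyses run in sequence. The function names (`laplace_noise`, `laplace_mechanism`, `sequential_budget`) are hypothetical and chosen only for this example; the sampler relies on the standard fact that the difference of two independent exponential variables with the same rate follows a zero-mean Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with rate 1/scale
    is Laplace-distributed with mean 0 and the given scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_answers, sensitivity, epsilon, rng):
    """Return noisy answers using noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale, rng) for a in true_answers]


def sequential_budget(epsilons):
    """Total budget consumed by analyses run in sequence (Theorem 1)."""
    return sum(epsilons)


rng = random.Random(0)
noisy = laplace_mechanism([10.0, 20.0, 30.0], sensitivity=2.0, epsilon=0.5, rng=rng)
total = sequential_budget([0.1, 0.2])
```

A smaller $\epsilon$ yields a larger noise scale, which is the sense in which smaller budgets correspond to stricter protection.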
Theorem 2 (Parallel Composition [118]). If $D_i$ are disjoint subsets of the original database, and $A_i$ is a set of analyses each providing $\epsilon_i$-DP, then applying each analysis $A_i$ on partition $D_i$ provides $\max(\epsilon_i)$-DP.

Geo-Indistinguishability

Geo-indistinguishability (Geo-I) [16] is a notion of location privacy based on differential privacy (DP). Similar to DP, Geo-I is a semantic model which provides protection against adversaries with background information. A mechanism provides $\epsilon$-geo-indistinguishability if any two locations at distance at most $r$ produce observations with "similar" distributions bounded by $\epsilon$. We refer to this privacy guarantee as $(\epsilon, r)$-Geo-I. The parameter $\epsilon$ is the level of privacy, which specifies the amount of protection required, with smaller values corresponding to stricter privacy protection. The parameter $r$ is the radius of concern within which privacy is guaranteed. This means that an adversary cannot distinguish locations which are at most $r$ distance away. Geo-I is formally defined as follows.

Definition 3 ($(\epsilon, r)$-Geo-indistinguishability [16]). Let $X$ be a set of exact locations. A mechanism $A$ satisfies $(\epsilon, r)$-geo-indistinguishability iff for all $x, x' \in X$ such that $d(x, x') \le r$:

$$d_P(A(x), A(x')) \le \epsilon\, d(x, x')$$

Here $d(x, x')$ is the Euclidean distance between $x$ and $x'$, while $d_P(A(x), A(x'))$ is the multiplicative distance between the two distributions produced by $x$ and $x'$, correspondingly. Note that we use the constrained version of Geo-I (i.e., $d(x, x') \le r$), which forces the corresponding distributions to be at most $\epsilon r$ distance from each other.

To preserve privacy, random noise is injected into each location such that, by observing a perturbed location, an adversary cannot infer the true location among all locations within radius $r$. Particularly, one way to achieve Geo-I is to generate a random point $z$ (from actual point $x \in X$) according to the planar Laplace distribution.
This function ensures that the probability of reporting a point in a certain (infinitesimal) area around $z$, when the actual locations are $x$ and $x'$, differs at most by a multiplicative factor $e^{-\epsilon\, d(x, x')}$. Hence, the pdf of the noise-adding mechanism is called the planar Laplacian centered at $x$:

$$D_\epsilon(x)(z) = \frac{\epsilon^2}{2\pi} e^{-\epsilon\, d(x, z)} \qquad (4.1)$$

where $\frac{\epsilon^2}{2\pi}$ is a normalization factor.

4.1.3 Private Spatial Decompositions (PSD)

Spatial crowdsourcing often requires spatial aggregates of users for optimization. For example, the number of workers within a certain region is used to maximize task assignment [170, 174]. With privacy, the problem is to publish a synopsis of two-dimensional datasets using differential privacy. The challenge is to enable accurate answers to range count queries given a privacy budget.

Existing methods often construct a hierarchy of partitions, or lay a one- or two-level equi-width grid over the data domain. Xiao et al. [199] proposed imposing a fixed equal-width grid over the base data. The method then partitions the data using a kd-tree based on noisy counts in the grid. A heuristic was proposed to determine non-uniform nodes that will be split to minimize the non-uniformity error. In [197], Xiao et al. applied a wavelet transformation over the dataset before adding noise to it, a method named Privlet. The study in [35] introduced the concept of private spatial decomposition (PSD) to release spatial datasets in a DP-compliant manner. PSD-based techniques have been shown to outperform both the cell-based methods in [199] and the Privlet method [197].

A PSD is a spatial index transformed according to DP, where each index node is obtained by releasing a noisy count of the data points enclosed by that node's extent. Various index types such as grids, quadtrees or kd-trees [147] can be used as a basis for PSD. The accuracy of PSD is greatly influenced by the type of PSD structure and its parameters (e.g., height, fan-out).
With space-based partitioning PSD, the split position for a node does not depend on the spatial distribution of data points. This category includes flat structures such as grids and quadtrees [147]. The privacy budget needs to be consumed only when counting the data points in each index node. Typically, all nodes at the same index level have non-overlapping extents, which yields a constant and low sensitivity of 2 per level (i.e., changing a single location in the data may affect at most two partitions in a level). The budget is best distributed across levels according to the geometric allocation [35], where leaf nodes receive more budget than higher levels. The sequential composition theorem applies across nodes on the same root-to-leaf path, whereas parallel composition applies to disjoint paths in the hierarchy.

PSD structures such as kd-trees [35] perform splits of nodes based on the placement of data points. To ensure privacy, split decisions must also be done according to DP, and significant budget may be used in the process. Typically, the exponential mechanism [35] is used to assign a merit score to each candidate split point according to some cost function (e.g., distance from median in case of kd-trees), and one value is randomly picked based on its noisy score. The budget must be split between protecting node counts and building the index structure.

The study in [141] compares tree-based methods with grid-based approaches and shows that a uniform grid tends to perform better than recursive partitioning counterparts (e.g., kd-trees and quadtrees). The paper also proposes an adaptive grid approach, where the granularity of the second-level grid is chosen based on the noisy counts obtained in the first level (sequential composition is applied). Adaptive grid is a hybrid technique which inherits the simplicity and robustness of space-based PSD, but still uses a small amount of data-dependent information in choosing the granularity for the second level.
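To make the space-based case concrete, the following minimal Python sketch builds the simplest PSD discussed above: a single-level uniform grid whose cell counts are perturbed with Laplace noise of scale $2/\epsilon$, using the per-level sensitivity of 2 quoted in the text. It is only an illustration under stated assumptions: the data domain is taken to be the unit square, and the names `grid_psd` and `laplace_noise` are hypothetical; it is not the adaptive grid of [141].

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def grid_psd(points, m, epsilon, rng, sensitivity=2.0):
    """Return an m x m grid of noisy counts for points in [0, 1) x [0, 1).

    Cells at one level do not overlap, so the whole level is released
    with one budget epsilon and noise scale sensitivity / epsilon.
    """
    counts = [[0] * m for _ in range(m)]
    for x, y in points:
        i = min(int(x * m), m - 1)  # clamp points lying on the upper border
        j = min(int(y * m), m - 1)
        counts[i][j] += 1
    scale = sensitivity / epsilon
    return [[c + laplace_noise(scale, rng) for c in row] for row in counts]


rng = random.Random(1)
workers = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
psd = grid_psd(workers, m=2, epsilon=0.5, rng=rng)
```

Note that a noisy count can be negative or positive even for an empty cell, which is the source of the uncertainty that task assignment over a PSD has to cope with.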
The same authors [142] extend the grid-based approaches to arbitrary dimensions. This study also shows that branching factors and parameter settings can greatly influence the performance of hierarchical methods.

4.2 Protecting Locations of Workers Using Trusted Third Party

4.2.1 Privacy Framework

Section 4.2.1 presents the system model and the workflow for privacy-preserving SC. Section 4.2.1 outlines the privacy model and assumptions. Section 4.2.1 discusses design challenges and associated performance metrics.

System Model

We consider the problem of privacy-preserving SC task assignment in the push mode. Figure 4.1 shows the proposed system architecture. Workers send their locations (Step 0) to a trusted cellular service provider (CSP), which collects updates and releases a PSD according to a privacy budget $\epsilon$ mutually agreed upon with the workers. The PSD is accessed by the SC-server (Step 1), which also receives tasks from a number of requesters (Step 2). For simplicity, we focus on the single-SC-server case, but our system model can support multiple SC-servers.

When the SC-server receives a task t, it queries the PSD to determine a geocast region (GR) that encloses with high probability workers close to t. Due to the uncertain nature of the PSD, this is a challenging process which will be detailed later in Section 4.2.2. Next, the SC-server initiates a geocast communication [130] process (Step 3) to disseminate t to all workers within GR. According to DP, sanitizing a dataset requires creation of fake locations in the PSD. If the SC-server is allowed to directly contact workers, then failure to establish a communication channel would breach privacy, as the SC-server is able to distinguish fake workers from real ones. Using geocast is a unique feature of our framework which is necessary to achieve protection.
Geocast can be performed either with the help of the CSP infrastructure, or through a mobile ad-hoc network where the CSP contacts a single worker in the GR, and then the message is disseminated on a hop-by-hop basis to the entire GR. The latter approach keeps CSP overhead low, and can reduce operation costs for workers. Upon receiving request t, a worker w decides whether to perform the task or not. If yes (Step 4), she sends a consent message to the SC-server confirming w's availability. If w is not willing to participate in the task, then no consent is sent, and no information about the worker is disclosed.

[Figure 4.1: Privacy framework for spatial crowdsourcing. Entities: Requesters, Workers, SC-Server, Cell Service Provider (Worker Database), GR. Steps: 0. Report Locations; 1. Sanitized Release (PSD); 2. Task Request t; 3. Geocast {t, GR}; 4. Consent.]

Privacy Model and Assumptions

Our specific objective is to protect both the location and the identity of workers during task assignment. Once a worker consents to a task, the worker herself may directly disclose information to the task requester (e.g., to enable a communication channel between worker and requester). However, such additional disclosure is outside our scope, as each worker has the right to disclose his or her individual information. Our focus is on what happens prior to consent, when worker location and identity must be protected from both task requesters and the SC server.

We emphasize that focusing on the SC assignment step is the correct approach, given that SC workers have to physically travel to the task location. Mere completion of a task discloses the fact that some worker must have been at that location, and this is unavoidable in SC. To protect her location after consent, a worker can still enjoy some form of identity protection (e.g., using pseudonyms and anonymous routing), for which solutions are already available (e.g., TOR).
On the other hand, no solution exists to date for the more challenging problem of privacy-preserving task assignment, hence we direct our efforts in this direction. Focusing on task assignment also makes sense from a disclosure volume standpoint. During assignment, all workers are candidates for participation, so locations of all workers would be exposed, absent a privacy-preserving mechanism. On the other hand, after task request dissemination, only a few workers will participate in task completion, and only if they give their explicit consent.

Workers cannot trust the SC-server, especially as there may be many such entities with diverse backgrounds, e.g., private companies, non-profits, government organizations, academic institutions, etc. On the other hand, the CSP already has a signed agreement with workers through the service contract, so there is already a trust relationship established, as well as mutually agreed-upon rules for data disclosure. Furthermore, the CSP already knows where subscribers are, e.g., using cell tower triangulation, so worker location reporting does not introduce additional disclosure. In addition, having the CSP expose a PSD release of the user location dataset can benefit applications beyond crowdsourcing. For instance, the PSD can be shared with law enforcement agencies for public safety, or with commercial organizations to increase the revenue of the CSP. Therefore, there is sufficient motivation for the CSP to provide such a location sanitization service.

However, the CSP has no expertise, and perhaps no financial interest, to host an SC service (or possibly multiple SC services with diverse requirements). Running such services requires dealing with a diverse set of issues such as interacting with various task requester categories, managing profiles (e.g., some workers may only volunteer for environmental tasks), etc.
The role of the CSP is to aggregate locations from subscribed workers, transform them according to DP, and release the data in sanitized form to one or more SC-servers for assignment. As multiple SC-servers can use the same PSD, it is practical for the CSP to provide PSDs for a small fee, e.g., a percentage of the workers' payment, or a tax incentive in case of a public-interest SC application.

Design Goals and Performance Metrics

Protecting worker location significantly complicates task assignment, and may reduce the effectiveness and efficiency of worker-task matching. Due to the nature of DP, it is possible for a region to contain no workers, even if the PSD shows a positive count. Therefore, no workers (or an insufficient number thereof) may be notified of the task request, and the task may not be completed. Alternatively, a worker may be notified of the task even though she is a long distance away from the task location, whereas a nearer worker does not receive the request. Finally, in the non-private push mode, only one selected worker, whose location and identity are known, is notified of the task request. With location protection, redundant messages need to be sent, increasing overhead. We focus on the following performance metrics:

• Assignment Success Rate (ASR). Due to PSD data uncertainty, the SC-server may incorrectly assign workers to tasks (e.g., no worker is reached, or the task is too far and workers do not accept it). ASR measures the ratio of tasks accepted by a worker to the total number of task requests. The challenge is to maintain ASR close to 100%.
• Worker Travel Distance (WTD). The SC-server is no longer able to accurately evaluate worker-task distance, hence workers may have to travel long distances to tasks. The challenge is to keep the worker travel distance low, even when exact worker locations are not known.

• System Overhead. Dealing with imprecise locations increases the complexity of assignment, which poses scalability problems. A significant metric to measure overhead is the average number of notified workers (ANW). This number affects both the communication overhead required to geocast task requests, as well as the computation overhead of the matching algorithm, which depends on how many workers need to be notified of a task request.

Table 4.1: Summary of notations
Symbol                    Definition
$\epsilon$, $\epsilon_i$  Total privacy budget and level-i budget
$\alpha$                  AG budget split, $\alpha = 0.5$ means $\epsilon_1 = \epsilon_2$
$N$                       Total number of workers
$N'$                      Noisy worker count of level-1 cells
$m_i \times m_i$          Level-i grid granularity
$\bar{n}$                 Expected noisy worker count of a level-2 cell
$t$                       A task or its location, used interchangeably
$c_i$                     A level-2 cell
$n_{c_i}$                 Noisy worker count of $c_i$
$p^a_{c_i}$               Acceptance rate of workers within $c_i$
$c'_i$                    Sub-cell of cell $c_i$

4.2.2 Protection for Static Workers' Locations

Sanitizing Workers' Locations

The first step in the proposed framework consists of building a PSD (at the CSP side) to be later used for task assignment at the SC-server. Building the PSD is an essential step, because it determines how accurate the released data is, which in turn affects ASR, WTD and ANW. In this section, we extend the state-of-the-art Adaptive Grid (AG) method proposed in [141] to address the specific requirements of the SC framework. Table 4.1 summarizes the notations used in our presentation.

PSDs based on uniform grids treat all regions in the dataset identically, despite large variances in location density. As a result, they over-partition the space in sparse regions, and under-partition in dense regions.
AG avoids these drawbacks by using a two-level grid and variable cell granularity. At the first level, AG creates a coarse-grained, fixed-size $m_1 \times m_1$ grid over the data domain. AG uses a data-independent heuristic to choose level-1 granularity as

$$m_1 = \max\left(10, \left\lceil \frac{1}{4} \sqrt{\frac{N \times \epsilon}{k_1}} \right\rceil\right)$$

where $N$ is the total number of locations (this is considered known, but a high-precision estimate can also be found using a small fraction of $\epsilon$). $k_1 = 10$ is suggested in [141]. Next, AG issues $m_1^2$ count queries, one for each level-1 cell, using a fraction of the total privacy budget: $\epsilon_1 = \epsilon \times \alpha$, where $0 < \alpha < 1$. AG then partitions each level-1 cell into $m_2 \times m_2$ level-2 cells, where $m_2$ is adaptively chosen based on the noisy count $N'$ of the level-1 cell:

$$m_2 = \left\lceil \sqrt{\frac{N' \times \epsilon_2}{k_2}} \right\rceil \qquad (4.2)$$

where $\epsilon_2 = \epsilon - \epsilon_1$ is the remaining budget, and the constant is set empirically in [141] to $k_2 = 5$. The budget parameter $\alpha$ determines how the privacy budget is divided between the two levels, and a setting of $\alpha = 0.5$ is recommended [141].

Figure 4.2 shows a snapshot of an adaptive grid, with four level-1 cells A, B, C, D. Constructing a differentially private AG requires two steps. First, the noisy counts $N'$ of A, B, C, D are computed by adding random Laplace noise with scale $\lambda_1 = 2/\epsilon_1$. Second, based on the noisy counts, level-1 cells are further split into level-2 cells. According to Eq. (4.2), cell D, which has noisy count 200, is partitioned according to a $3 \times 3$ grid, while the granularity for other cells is $2 \times 2$. Thereafter, AG adds to each level-2 cell ($c_i$, $i = 1 \ldots 21$) random Laplace noise with scale $\lambda_2 = 2/\epsilon_2$. Finally, their corresponding noisy counts $n_{c_i}$ together with the structure of the AG are published.

Although AG yields good results for general spatial queries [141], it is not directly applicable to SC, due to its rigidity in choosing parameters. Specifically, the granularity $m_2$ of the level-2 grid is too coarse, leading to large geocast areas and high communication overhead. According to Eq.
(4.2), the expected number of workers (i.e., noisy count) in a level-2 cell is:

$$\bar{n} = N'/m_2^2 = k_2/\epsilon_2$$

[Figure 4.2: A snapshot of adaptive grid ($\epsilon = 0.5$, $\alpha = 0.5$). Level 1: cells A, B, C, D with noisy counts $N'_A = 100$, $N'_B = 100$, $N'_C = 100$, $N'_D = 200$. Level 2: cells $c_1, \ldots, c_{21}$.]

(a) Original AG ($k_2 = 5$)
$\epsilon$   $\epsilon_2$   $m_2$   $\bar{n}$
1            0.5            3       11
0.5          0.25           2       25
0.1          0.05           1       100

(b) Modified AG ($k_2 = \sqrt{2}$)
$\epsilon$   $\epsilon_2$   $m_2$   $\bar{n}$
1            0.5            6       2.8
0.5          0.25           5       5.6
0.1          0.05           2       28.2

Table 4.2: Granularity $m_2$ and average count per cell $\bar{n}$ ($N' = 100$)

Table 4.2a presents different values of $m_2$ and $\bar{n}$ when varying the total budget $\epsilon$ with $\alpha = 0.5$. The values of $\bar{n}$ are large, especially for stricter privacy settings (i.e., lower $\epsilon$). For $\epsilon = 0.1$, $\bar{n}$ is 100. In practice, a geocast region is likely to include multiple PSD cells, hence 100 is a lower bound on the ANW, while its typical values can grow much higher, leading to prohibitive communication cost.

We propose a more suitable heuristic for choosing $k_2$. Recall that the primary requirement of SC task assignment is to achieve high ASR. To that end, we want to ensure that the task request is geocast in a non-empty region, i.e., the real worker count is strictly positive. According to the Laplace mechanism of DP, each PSD count is the sum of the real count and the added noise. Given the level-2 privacy budget $\epsilon_2$, we can also quantify the distribution of the added noise, which has standard deviation $\mu = 2\sqrt{2}/\epsilon_2$. Therefore, if the PSD count is larger than a significant fraction of $\mu$ (i.e., $\mu/2$), then with high probability there will be at least one worker in the level-2 cell.

Our objective is to increase the granularity $m_2$ in order to decrease overhead, but only to the point where there is at least one worker in a cell. Denote by $count_{PSD}$ the value reported by the PSD for a certain level-2 cell.
Given the $Lap(2/\epsilon_2)$ distribution, the probability that the noisy count is larger than zero is expressed as

$$p_h = 1 - \frac{1}{2} \exp\left(-\frac{count_{PSD}}{2/\epsilon_2}\right)$$

Furthermore, to include at least one real worker in the cell with high probability, we want to have the PSD count larger than the noise with high probability, i.e., $\bar{n} = k_2/\epsilon_2 \ge \sqrt{2}/\epsilon_2$, so at the limit we set $k_2 = \sqrt{2}$. The resulting probability of having non-empty cells is $p_h = 1 - \frac{1}{2}\exp(-1/\sqrt{2}) = 0.75$. According to Eq. (4.2), the corresponding granularity is $m_2 = \left\lceil \sqrt{N' \epsilon_2 / \sqrt{2}} \right\rceil$. Table 4.2b shows that this new setting significantly reduces $\bar{n}$, and as a result ANW.

Performing Task Assignment on Sanitized Data

When a request for a task t is posted, the SC-server queries the PSD and determines a geocast region GR where the task is disseminated. The goal is to obtain a high success rate for task assignment, while at the same time reducing the worker travel distance WTD and request dissemination overhead ANW.

Task Localness and Acceptance Rate

Travel distance is critical in SC, as workers need to physically visit the task locations. Workers are more likely to perform tasks closer to their home or workplace [129, 95, 13]. The work in [129] shows that 10% of all workers, denoted as super-agents, perform more than 80% of the tasks. Among super-agents, 90% have daily travel distance less than 40 miles, and the average travel distance per day is 27 miles. This property is referred to as task localness [95]. A related study [68] addresses the localness of contents posted by Flickr and Wikipedia users, and proposes a spatial content production model (SCPM) that computes the mean contribution distance (MCD) of each worker as follows:

$$MCD(w_i) = \frac{\sum_{j=1}^{n} d(L_{w_i}, L_{c_j})}{n} \qquad (4.3)$$

where $L_{w_i}$ is the location of worker $w_i$, and $L_{c_j}$ are the locations of its $n$ contributions.

Based on Eq. (4.3), we can find the maximum travel distance (MTD) that a high percentage of workers are willing to travel to perform their assigned tasks.
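As a small illustration of Eq. (4.3) and of deriving an MTD from it, the sketch below computes the mean contribution distance of a worker and then picks the distance below which a chosen fraction of workers' MCD values fall. Two points are assumptions made only for this example: distances are planar Euclidean (real deployments would likely use a geographic distance), and the MTD is taken as a simple empirical quantile of the MCD values. The function names are hypothetical.

```python
import math


def mean_contribution_distance(worker_loc, contribution_locs):
    """Eq. (4.3): average distance from a worker to its n contributions."""
    if not contribution_locs:
        raise ValueError("worker has no contributions")
    total = sum(math.dist(worker_loc, c) for c in contribution_locs)
    return total / len(contribution_locs)


def max_travel_distance(mcd_values, cumulative_ratio):
    """Assumed rule: smallest MCD covering the given fraction of workers."""
    ordered = sorted(mcd_values)
    index = max(0, math.ceil(cumulative_ratio * len(ordered)) - 1)
    return ordered[index]


mcd = mean_contribution_distance((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)])
mtd = max_travel_distance([4.0, 1.0, 3.0, 2.0], cumulative_ratio=0.5)
```

In this toy input the worker's two contributions lie at distances 5 and 10, so the MCD is their average.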
For example, the MTD of super-agents in the crowdsourcing markets studied in [129] is 40 miles with a 90% cumulative ratio of contributors. Besides communication overhead, task localness is thus another reason to impose an upper bound on geocast region size. Intuitively, the maximum geocast region is a square area with side length equal to 2 × MTD. Hereafter, we refer to MTD as both the maximum travel distance and the maximum geocast region size.

We denote by acceptance rate (AR) the probability $p^a$ ($0 \le p^a \le 1$) that a worker agrees to complete a task for which s/he receives a request. We assume that all workers are identical and independent of each other in deciding to perform tasks. The work in [129] studies reward-based SC labor markets and shows that super-agents have an average AR of 90.73%, while other agents have an acceptance rate of 69.58%. Acceptance rate is much smaller in self-incentivized SC [95], where the workers voluntarily perform tasks without receiving incentives.

A worker is more willing to accept nearby tasks, so we model acceptance rate as a decreasing function of travel distance. We consider two cases: (i) linear, where AR decreases linearly with distance starting from an initial MAR (Maximum AR) value (when the worker is co-located with the task), and (ii) Zipf, where acceptance rate follows a Zipf distribution with skewness parameter $s$. The higher $s$ is, the faster $p^a$ drops. $p^a$ is maximized when the worker is co-located with the task and becomes negligible at MTD. If the distance is larger than MTD, $p^a = 0$.

Analytical Utility Model

We develop an analytical utility model that allows the SC-server to quantify the probability that a task request disseminated in a certain GR is accepted by a worker. Intuitively, the utility depends on the AR and on the worker count $\bar{w}$ estimated to be enclosed within GR.
An SC-server will typically establish an expected utility threshold EU, which is the targeted success rate for a task (this is a system goal, rather than an outcome). Generally, EU is considerably larger than an individual worker's $p^a$, so the GR must contain multiple workers.

We define $X$ as a random variable for the event that a worker accepts a received task: $P(X = \text{True}) = p^a$ and $P(X = \text{False}) = 1 - p^a$. Assuming $w$ independent workers, the number of workers who accept follows $\mathrm{Binomial}(w, p^a)$. We define the utility of a geocast region covering $w$ workers as:

$$U = 1 - (1 - p^a)^w \tag{4.4}$$

$U$ measures the probability that at least one worker accepts the task. The utility definition can be extended to the case of redundant task assignment, where multiple workers are required to complete a task (please refer to Section 4.2.2).

Geocast Region Construction

Given task $t$, the GR construction algorithm must balance two conflicting requirements: determine a region that (i) contains sufficient workers such that task $t$ is accepted with high probability, and (ii) is small in size. The input to the algorithm is task $t$ as well as the worker PSD, consisting of the two-level AG with a noisy worker count for each grid cell. The algorithm chooses as initial GR the level-2 cell that covers the task, and determines its $U$ value. As long as the utility is lower than the threshold EU, it expands the GR by adding neighboring cells. Cells are added one at a time, based on their estimated increase in GR utility. Higher priority is given to closer cells. The algorithm stops either when the utility of the obtained GR exceeds threshold EU, or when the size of GR is larger than MTD, hence utility can no longer be increased.

The GR construction algorithm is a greedy heuristic, as it always chooses the candidate cell that produces the highest utility increase at each step. The pseudocode of the greedy algorithm is depicted in Algorithm 1.
In Line 5, $Q$ is a heap of cells $\{c_i\}$, sorted in decreasing order of cell utility $U_{c_i}$. $U_{c_i}$ is computed according to Eq. (4.4), namely $U_{c_i} = 1 - (1 - p^a_{c_i})^{n_{c_i}}$, where $n_{c_i}$ is the noisy worker count of $c_i$, and $p^a_{c_i}$ is the acceptance rate of the workers inside $c_i$. Since worker locations within a cell are not known, we assume they all have the same acceptance rate. Moreover, we assume the worker-task distance is equal to the average distance between the task and each of the four corners of cell $c_i$.

Algorithm 1 Greedy Algorithm (GDY)
 1: Input: task t, MTD, 0 < EU < 1
 2: Output: geocast region GR
 3: MTD is the square of size 2 × MTD centered at t
 4: Init GR = {}, utility U = 0
 5: Init max-heap Q = {level-2 cell that covers t}
 6: Remove {c_i, U_ci} ← Q, U_ci is computed from Eq. (4.4)
 7: If c_i = Nil, return GR {GR is larger than MTD}
 8: GR = GR ∪ c_i
 9: If U_ci ≥ 0, U = 1 − (1 − U)(1 − U_ci)
10: If U ≥ EU, return GR
11: Find neighbors = ({c_i's neighbors} − GR) ∩ MTD
12: Q = Q ∪ neighbors
13: Goto Line 6

When a candidate cell is removed from $Q$ (Line 6), it is added to GR (Line 8), and the GR utility is updated in Line 9. The updated utility $U'$ is the probability that a worker in either the current geocast region or the newly added cell performs the task:

$$U' = U(1 - U_{c_i}) + (1 - U)U_{c_i} + U\,U_{c_i} = 1 - (1 - U)(1 - U_{c_i})$$

Line 11 computes the new neighboring cells that are not in GR and are not situated farther than MTD. These cells are added to $Q$ according to their utilities. If a cell resides partially outside MTD, it is pruned to its fraction contained within the MTD, and its noisy count is updated proportionally to the pruned area.

In summary, the geocast region construction algorithm greedily expands the GR by choosing to include at each step the grid cell that results in the highest estimated increase in utility. Cell utility takes into account the noisy worker count, as well as the distance between the cell and the task location.
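The greedy expansion of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the thesis code: the `Cell` structure and all function names are hypothetical, the acceptance rate uses an assumed linear form `MAR * (1 - d / MTD)`, the worker-task distance is taken from the cell center rather than the average over the four corners, and cells are kept or dropped whole at the MTD boundary instead of being pruned proportionally.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical level-2 PSD cell; field names are illustrative."""
    center: tuple                                   # (x, y), same unit as MTD
    noisy_count: float                              # sanitized worker count
    neighbors: list = field(default_factory=list)   # ids of adjacent cells


def linear_acceptance(dist, mar, mtd):
    """Assumed linear AR: MAR at distance 0, falling to 0 at MTD."""
    return mar * (1.0 - dist / mtd) if dist < mtd else 0.0


def cell_utility(cell, task, mar, mtd):
    """Eq. (4.4) for one cell, using the cell center as the worker location."""
    dist = math.hypot(cell.center[0] - task[0], cell.center[1] - task[1])
    p_a = linear_acceptance(dist, mar, mtd)
    if cell.noisy_count <= 0 or p_a <= 0:
        return 0.0                # non-positive noisy counts add no utility
    return 1.0 - (1.0 - p_a) ** cell.noisy_count


def greedy_geocast(cells, start_id, task, eu, mar, mtd):
    """Return (ids of cells in the GR, estimated GR utility)."""
    region, utility = [], 0.0
    queued = {start_id}
    # heapq is a min-heap, so utilities are stored negated.
    heap = [(-cell_utility(cells[start_id], task, mar, mtd), start_id)]
    while heap:
        neg_u, cid = heapq.heappop(heap)            # best remaining candidate
        region.append(cid)
        utility = 1.0 - (1.0 - utility) * (1.0 + neg_u)   # 1 + neg_u = 1 - U_c
        if utility >= eu:
            break
        for nid in cells[cid].neighbors:
            cx, cy = cells[nid].center
            in_square = max(abs(cx - task[0]), abs(cy - task[1])) <= mtd
            if nid not in queued and in_square:
                queued.add(nid)
                u = cell_utility(cells[nid], task, mar, mtd)
                heapq.heappush(heap, (-u, nid))
    return region, utility


# Usage: three cells in a row, task located in the middle cell (id 0).
grid = {
    0: Cell((0.0, 0.0), 2, [1, 2]),
    1: Cell((1.0, 0.0), 10, [0]),
    2: Cell((-1.0, 0.0), 1, [0]),
}
region, u = greedy_geocast(grid, 0, (0.0, 0.0), eu=0.9, mar=0.5, mtd=4.0)
```

In this toy grid the start cell alone yields a utility of 0.75, so the sketch adds the denser neighbor (id 1) and stops once the threshold of 0.9 is exceeded.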
Next, we consider two refinements to the heuristic: first, in Section 4.2.2 we investigate a finer-grained solution search space by allowing partial cell inclusion; second, in Section 4.2.2 we consider the effect that the GR shape has on hop-by-hop task request dissemination.

Partial Cell Selection

Even though the adaptation of AG proposed in Section 4.2.2 significantly reduces the size of level-2 cells, the number of workers in a cell can still be rather large, and the resulting ANW can lead to high task request dissemination costs. Such workers may be unnecessarily included in the GR, even if the required EU could be achieved with far fewer workers. We propose an optimization that allows partial inclusion in the GR of a level-2 cell. Before adding a new cell $c_i$ to the GR (Line 10 of Algorithm 1), the optimization checks whether the utility increase provided by $c_i$ will exceed the required utility EU. If so, the algorithm computes a sub-region of $c_i$ whose utility is sufficient to reach EU.

The pseudocode of the heuristic is depicted in Algorithm 2, which includes two steps. First, it computes the percentage of $c_i$'s area (Lines 3-7) that is likely to enclose sufficient workers. Next, it finds a sub-cell with that area (Lines 8-9), which is uniquely determined by its shape and location. The optimization in Algorithm 2 can be inserted as a function call before Line 10 in the main Algorithm 1.

Algorithm 2 Partial Cell Selection Heuristic
 1: Input: task location t, last cell c_i, current utility U_curr
 2: Output: sub-cell c'_i of c_i
 3: dist = distance(t, c_i)
 4: p^a_sub = acc_rate(dist)
 5: U_required = (EU − U_curr) / (1 − U_curr)
 6: Worker count needed to achieve U_required: w_required = log(1 − U_required) / log(1 − p^a_sub)
 7: Area percentile = w_required / w_ci
 8: If c_i covers t, find sub-cell given area percentile
 9: Otherwise, find sub-cell adjacent with current region

To compute a sub-cell, two constraints need to be satisfied. First, the sub-cell needs to be completely inside the parent cell.
Second, the sub-cell must be adjacent to the current GR to form a continuous region. Therefore, depending on whether the current GR contains one or multiple cells, we use two strategies to find the sub-cell.

Figure 4.3a depicts the first case, where the GR includes only one grid cell $c_i$ (i.e., the task $t_0$ is inside $c_i$, the parent cell). Intuitively, to cover the closest workers to the task, the shape of the sub-cell $c'_i$ (dashed line) must be a square. The boundary of cell $c'_i$ can therefore be completely determined given its area. To satisfy the first constraint, the center of $c'_i$ needs to be in the shaded square, whose center is the same as that of $c_i$, and whose side length is equal to the difference between the side lengths of $c_i$ and $c'_i$. In addition, the position of $c'_i$ is such that the distance between its center and the task is minimized. The distance is zero when the task (e.g., $t_0$) is inside the shaded region (the task is co-located with the center of $c'_i$). Otherwise, if the task is outside the shaded square, the center of its closest sub-cell must be on the border of the shaded square. Subsequently, depending on the relative position of the task to the shaded square (i.e., eight possibilities $t_1$-$t_8$), we can find the corresponding sub-cell's center. For example, the closest sub-cell's center for $t_1$ is the bottom-left corner of the shaded square.

[Figure 4.3: Examples of partial cell selection. (a) Case 1: splitting $c_i$, showing sub-cell $c'_i$ and task positions $t_0$ to $t_8$; (b) Case 2: splitting cell 7, showing cells $c_1$ to $c_{21}$.]

Figure 4.3b presents the second case, in which the GR comprises multiple cells {4, 7, 10, 13}. This example is a flat version of the AG in Figure 4.2. The arrows depict the expansion process of the geocast algorithm. For example, cells 4 and 13 are expanded from cell 10, while cell 7 is expanded from cell 13.
To ensure the GR is a continuous region, we require the long edge of the sub-cell (dashed rectangle) to be adjacent to the neighbor cell (i.e., 13) that the splitting cell (i.e., 7) is expanded from. When its long edge is fixed, the sub-cell is uniquely specified given its area. The rationale behind this choice is to ensure the continuity constraint.

Communication Cost

Dissemination of a task request within the GR can be implemented in two ways:

• Infrastructure-based Mode. In this mode, the CSP sends an individual message to each worker within the GR. The cost is proportional to ANW, which may be large.

• Infrastructure-less Mode. Workers within the GR can relay the task request hop-by-hop, using a mobile ad-hoc network protocol over WiFi. In this case, the CSP only needs to send several messages to workers (one single message may suffice if the worker network is connected).

Geocasting using hop-by-hop communication is an attractive alternative. The SC-server does not know the actual worker placement, so the GR construction strategy cannot rely on detailed routing information. Fortunately, the shape of the GR is often a good predictor of ad-hoc routing performance. Intuitively, it is cheaper to geocast within a shape with less skew, such as a circle or a square, as opposed to skewed regions such as line-shaped areas, which have a large network diameter. For instance, in Figure 4.3b, the region of cells {1, 2, 3, 4} is more favorable for geocast than {2, 4, 5, 6}, despite the fact that the two areas have equal size.

We assume that the geocast cost is proportional to the minimum bounding circle that covers the GR. Thus, the more compact the GR, the lower the cost. One widely accepted measure proposed in [100] is the Digital Compactness Measurement (DCM), which measures region compactness as the ratio between the area of the region and the area of its smallest circumscribing circle.
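A minimal Python sketch of the DCM ratio is given below, assuming the GR is a set of non-overlapping axis-aligned rectangular cells and that the circumscribing circle is computed over the cell corners. For readability it finds the smallest enclosing circle by brute force over point pairs and triples, which is only practical for small regions; the function names and the rectangle encoding are hypothetical.

```python
import itertools
import math


def _circle_from_two(a, b):
    """Circle having segment ab as its diameter: (cx, cy, r)."""
    cx, cy = (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    return cx, cy, math.hypot(a[0] - b[0], a[1] - b[1]) / 2.0


def _circle_from_three(a, b, c):
    """Circumcircle of triangle abc, or None if the points are collinear."""
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    return ux, uy, math.hypot(a[0] - ux, a[1] - uy)


def smallest_enclosing_circle(points, eps=1e-9):
    """Brute-force smallest circle containing all points."""
    pts = list(set(points))
    if len(pts) == 1:
        return pts[0][0], pts[0][1], 0.0
    candidates = [_circle_from_two(a, b)
                  for a, b in itertools.combinations(pts, 2)]
    for triple in itertools.combinations(pts, 3):
        circle = _circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    feasible = [c for c in candidates
                if all(math.hypot(x - c[0], y - c[1]) <= c[2] + eps
                       for x, y in pts)]
    return min(feasible, key=lambda c: c[2])


def compactness(cells):
    """DCM of a region given as rectangles (x_min, y_min, x_max, y_max)."""
    area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cells)
    corners = [(x, y) for x0, y0, x1, y1 in cells
               for x in (x0, x1) for y in (y0, y1)]
    _, _, r = smallest_enclosing_circle(corners)
    return area / (math.pi * r * r)


# Usage: a 2x2 block of unit cells versus a 1x4 strip of the same area.
block = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
strip = [(0, 0, 1, 1), (1, 0, 2, 1), (2, 0, 3, 1), (3, 0, 4, 1)]
dcm_block, dcm_strip = compactness(block), compactness(strip)
```

The square block scores 2/π (about 0.64) while the strip scores 16/(17π) (about 0.30), which matches the intuition that line-shaped regions are less compact.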
An efficient solution to find the smallest enclosing circle is a randomized algorithm [193] whose running time is linear in the number of data points in the region. The maximum value of DCM is 1, attained when the shape is a circle.

We modify Algorithm 1 to choose new cells to add to GR based on compactness, instead of utility. At each iteration, the cell that most increases the compactness of the GR is chosen from the list of candidates. Due to the inclusion of the new cell, the potential compactness increase of all other candidates may need to be re-computed, to account for the change in shape. We also consider a hybrid method that factors in both utility and compactness in cell selection. The merit function of the hybrid is a linear combination of the resulting GR utility and compactness.

To evaluate the effectiveness of using compactness in the GR search strategy, we use as metric an estimation of the hop count required to disseminate the task request to all workers, given the communication range of the wireless network (e.g., 50-100 meters for WiFi). We approximate the hop count as the diameter of the network divided by twice the communication range:

$$\text{Hop count} = \frac{\text{farthest distance between two workers}}{2 \times \text{communication range}} \tag{4.5}$$

In practice, the worker network needs to be connected for the ad-hoc geocast to succeed. A message from any worker (i.e., seed) should be able to reach any other worker in the GR, using hop-by-hop wireless communication. Otherwise, if the network contains multiple disconnected components, the task cannot be sent to all workers from a single seed. In the latter case, the CSP would need to send the task to multiple seeds within the ad-hoc network. However, this level of detail goes beyond the scope of our work, and we restrict ourselves to using the hop count metric as an estimation of geocast cost, in conjunction with ANW.

Redundant Task Assignment

In some spatial crowdsourcing applications, multiple workers may be required to complete a task [96].
In this section, we extend our technique to support redundant assignment. Eq. (4.4) can be extended as follows, so that $K = 1$ recovers Eq. (4.4):

$$U = 1 - \sum_{i=0}^{K-1} U_i = 1 - \sum_{i=0}^{K-1} \binom{w}{i} (p^a)^i (1 - p^a)^{w-i} \tag{4.6}$$

where $K$ is the number of workers required to perform the task and $U_i = \binom{w}{i} (p^a)^i (1 - p^a)^{w-i}$ is the probability that exactly $i$ workers perform the task.

The geocast algorithm can be extended to the case where at least $K$ workers in either the current geocast region GR or the newly added cell $c_i$ perform the task. The updated utility can be computed based on the probability of having at most $K - 1$ workers perform the task. In particular, Line 9 of Algorithm 1 can be updated as follows:

$$U' = 1 - \sum_{j=0}^{K-1} U^j_{c_i} \sum_{l=0}^{K-j-1} U_l \tag{4.7}$$

where $U^j_{c_i} = \binom{n_{c_i}}{j} (p^a_{c_i})^j (1 - p^a_{c_i})^{n_{c_i}-j}$ is the probability that exactly $j$ workers in $c_i$ perform the task and $\sum_{l=0}^{K-j-1} U_l$ is the probability that at most $K - j - 1$ workers in GR perform the task. Note that $U^j_{c_i} = 0$ if $j > n_{c_i}$.

4.2.3 Protection for Dynamic Workers' Locations

Spatial crowdsourcing systems continuously receive requests for task assignment. Hence, it is important to keep track of the whereabouts of moving users and to release a sequence of worker PSDs that allow effective spatial task assignment over multiple timestamps. We extend our solution for single-snapshot PSD to the case of dynamic worker datasets.

Problem Statement and Baseline Solution

We assume a discrete time model, where the SC-server receives task requests at $T$ discrete timestamps, $\{t_1, \ldots, t_T\}$. The SC-server generates a sequence of PSDs $\{PSD_1, \ldots, PSD_T\}$, one for each timestamp $t_k$, and processes each task request received at timestamp $t_k$ according to the private decomposition $PSD_k$. Since many (or all) of the workers appear in multiple PSDs at distinct timestamps, an adversary has more opportunities to breach the location privacy of workers by correlating information from consecutive PSDs.
In the context of differential privacy, this increased knowledge available to the adversary is modeled as an increase in $L_1$-sensitivity. Specifically, in the worst case, the sensitivity grows from 2 for a single snapshot to $2 \times T$ for a set of $T$ released PSDs. On the other hand, the amount of privacy budget remains unchanged. Hence, the problem of privacy-preserving spatial task assignment for multiple snapshots is significantly more challenging than its single-snapshot counterpart.

Problem Statement. Given a set of discrete timestamps $\mathcal{T} = \{t_1, \ldots, t_T\}$, $T > 1$, and a privacy budget $\varepsilon$, release a set of private spatial decompositions $\{PSD_1, \ldots, PSD_T\}$ that satisfies $\varepsilon$-differential privacy and supports effective spatial task assignment as measured by performance metrics ASR, WTD and ANW.

A baseline solution, denoted as BasicD (basic dynamic), is to repeatedly apply our single-snapshot PSD algorithm at every timestamp, disregarding correlation between data at different timestamps. At each timestamp, only a fraction $\varepsilon/T$ of the entire privacy budget is used, which according to the sequential composition theorem [35] guarantees that the overall algorithm satisfies $\varepsilon$-differential privacy. The BasicD pseudocode is presented in Algorithm 3. However, as $T$ grows, the amount of budget received by the PSD at each timestamp decreases, resulting in a high amount of noise that needs to be added to each worker count. Consequently, the released sanitized data become unusable, and the quality of task assignment decreases significantly. Next, we propose a technique that addresses the limitations of BasicD.

Algorithm 3 BasicD Algorithm
 1: Input: T, worker location datasets {D_1, ..., D_T}, budget ε
 2: Output: {PSD_1, ..., PSD_T}
 3: For k = 1 to T:
 4:   Publish k-th PSD of D_k with budget ε/T

Multiple-Snapshot Worker PSD

We build upon our single-snapshot solution and augment it with an adaptation of FAST [46], a recent approach to continuously release private counts on top of adaptive grids.
FAST constructs an internal process model that captures the temporal correlation of aggregated location traces. It improves the accuracy of released data at each timestamp by performing posterior estimation to reduce perturbation error. Also, when $T$ is large, FAST samples the time series of aggregate data to reduce the overall perturbation cost.

The main idea of the proposed dynamic worker PSD algorithm is to spend a fraction of the privacy budget to build the structure of the adaptive grid in the first time instance. During subsequent time instances, the AG structure is unchanged, and only the counts of the level-2 cells are updated, according to the new configuration of the dataset. Figure 4.4 illustrates how FAST is integrated within the proposed worker PSD framework, and Algorithm 4 provides the pseudocode. At timestamp $t_1$, the algorithm releases the noisy counts of level-2 cells using the Laplace mechanism. In each of the following timestamps, the worker PSD structure aggregates the new true count and the old published counts of every level-2 cell as the input to the FAST component.

[Figure 4.4: Workflow for dynamic worker PSD computation. Labels in the figure: raw user location data, spatial aggregation, worker PSD, FAST (prediction, correction, perturbation, filtering, sampling; error, sampling rate, noisy observation), sanitized data.]

[Figure 4.5: Two-level grid for dynamic worker PSD. Level-1 cells A, B, C, D with noisy counts $N'_A = 100$, $N'_B = 100$, $N'_C = 100$, $N'_D = 200$; each level-2 cell keeps a series of counts over timestamps 1 to $T$ that is processed by FAST.]

Figure 4.5 illustrates the two-level grid decomposition representing the worker PSD. At the first timestamp, a fraction $\varepsilon_1 = f \times \varepsilon$ of the privacy budget is used to determine the geometry of the level-2 grid (Line 3 in Algorithm 4).
The resulting decomposition is represented by cells with continuous borders in Figure 4.5. The remaining budget, amounting to $\varepsilon_2 = (1 - f) \times \varepsilon$, is used to release noisy counts for timestamps $t_1$ to $t_T$, according to the FAST approach (Lines 4-8). With the adaptive grid, level-2 cells do not overlap, so according to [35], each set of counts corresponding to the same cell at distinct timestamps receives budget $\varepsilon_2$ (Line 7).

Algorithm 4 Dynamic Worker PSD
 1: Input: T, locations {D_1, ..., D_T}, budget ε
 2: Output: PSD {each second-level cell maintains a list of noisy counts}
 3: Build PSD structure of D_1 given budget ε × f
 4: For c_i in second-level cells of PSD:
 5:   For k = 1 to T:
 6:     c_i^k = actual count of c_i in D_k
 7:   Apply FAST to {c_i^1, c_i^2, ..., c_i^T} given budget (1 − f) × ε
 8:   PSD ← update noisy counts {n_{c_i^1}, n_{c_i^2}, ..., n_{c_i^T}} of PSD

Choosing an appropriate $f$ value. A larger $f$ value allocates more budget for PSD construction in the initial timestamp, which increases the number of level-2 cells, providing finer granularity. However, a larger $f$ also reduces the budget for later timestamps, which may lead to a high noise-to-real-count ratio. To balance these two factors, we propose a simple analytical model to select $f$. Assume a simplified version of Algorithm 4 where the noisy counts of individual PSD cells are obtained by adding Laplace noise directly with privacy budget $(1 - f)\varepsilon/T$. Denote by $count_{PSD}$ the value reported by the PSD for a certain level-2 cell. Given the $\mathrm{Lap}\left(\frac{2T}{(1-f)\varepsilon}\right)$ distribution, denote the probability that the expected noisy count is larger than zero by

$$p_h = 1 - \frac{1}{2}\exp\left(-\frac{count_{PSD}\,(1-f)\,\varepsilon}{2T}\right)$$

Our goal is to have the PSD count larger than the expected noise. From Section 4.2.2, the average number of workers in a level-2 cell is $\bar{n} = \sqrt{2}/\varepsilon_2 = 2\sqrt{2}/(f\varepsilon)$. Thus,

$$p_h = 1 - \frac{1}{2}\exp\left(-\frac{\sqrt{2}\,(1-f)}{Tf}\right)$$

Figure 4.6 illustrates the analytical dependence of $p_h$ on $f$. Note that $p_h$ asymptotically decreases to 0.5 when $f$ increases.
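The closed-form expression for $p_h$ can be evaluated directly. The short Python sketch below does so for a few values of $f$ and $T$; the function name `p_h` and the chosen sample values are illustrative only.

```python
import math


def p_h(f, T):
    """p_h = 1 - 0.5 * exp(-sqrt(2) * (1 - f) / (T * f)),
    for a budget fraction 0 < f <= 1 and T >= 1 timestamps."""
    if not 0.0 < f <= 1.0 or T < 1:
        raise ValueError("need 0 < f <= 1 and T >= 1")
    return 1.0 - 0.5 * math.exp(-math.sqrt(2.0) * (1.0 - f) / (T * f))


# Usage: tabulate p_h for several budget fractions and horizon lengths.
for T in (10, 30, 50):
    row = ", ".join(f"f={f:.1f}: {p_h(f, T):.3f}" for f in (0.1, 0.2, 0.5, 0.9))
    print(f"T={T}: {row}")
```

For a fixed $T$ the value falls toward 0.5 as $f$ approaches 1, and for a fixed $f$ it falls as $T$ grows, since each timestamp then receives less budget.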
This result suggests that to avoid an excessive noise-to-real-count ratio, especially when $T$ is large, $f$ should be set to a relatively small value, e.g., $f \in [0.1, 0.2]$. In Section 7, we explore the effect of varying $f$ on system accuracy and performance.

[Figure 4.6: The effect of $f$ on probability $p_h$, plotted for $f$ from 0.1 to 0.9 and T = 10, 20, 30, 40, 50; $p_h$ ranges between 0.5 and 0.9.]

FAST Optimizations

To improve accuracy, FAST employs two operations, filtering and sampling. Next, we describe the functionality of these steps.

Filtering. The filtering module generates estimates of monitored aggregates to improve the accuracy of counts released at each timestamp. Given the previously observed noisy values, filtering attempts to determine an optimal posterior estimate. Two operations, prediction and correction, are recursively applied during the entire movement history $T$. The prediction step generates a prediction of the worker count in a cell at each timestamp based on previous estimates and an internal process model, which provides the temporal correlation between adjacent aggregate values. The correction step combines the noisy observation, when available, with the prediction to generate a posterior estimate. The correction mechanism varies according to the filtering method. To model worker movement, we use the Kalman filter [85], which was specifically designed for tracking moving targets, and has high computational efficiency.

We denote by $\{x^i_1, x^i_2, \ldots, x^i_T\}$ the time series of true counts of a particular cell $c_i$. We assume that $x^i_k$ follows a constant process model given by the following equation, in which $\omega^i_k$ is white Gaussian noise with variance $Q^i$:

$$x^i_{k+1} = x^i_k + \omega^i_k; \quad \omega^i_k \sim N(0, Q^i) \tag{4.8}$$

To achieve differential privacy with budget $(1 - f)\varepsilon/T$, Laplace noise with scale $\frac{2T}{(1-f)\varepsilon}$ is added to each real worker count (the factor of 2 is due to the fact that the sensitivity of count queries on non-overlapping cells is 2).
The perturbed value $z^i_k$ is determined as:

$$z^i_k = x^i_k + \nu^i_k; \quad \nu^i_k \sim \mathrm{Lap}\left(0, \frac{2T}{(1-f)\varepsilon}\right) \tag{4.9}$$

Following the work in [46], we approximate the Laplace noise $\nu^i_k$ by the following white Gaussian noise so that Kalman filter-based algorithms can be used for estimation:

$$\nu^i_k \sim N(0, R) \tag{4.10}$$

In our experiments, we choose the values for $R$ according to the posterior analysis provided in [46].

Sampling. Another method to improve accuracy is to reduce the sensitivity of the computation. If one chooses to suppress the release of a cell count at a certain timestamp, and instead publishes a count that is derived from the values of previous timestamps, then sensitivity decreases. Hence, sampling decides whether the count of a specific timestamp is queried directly, or derived from previous values and the movement model considered.

In our study, we adopt the adaptive sampling algorithm with proportional, integral, and derivative (PID) errors to adjust the sampling interval dynamically for each cell count series. Following [46], we define the feedback error $E^i_k$ for cell $i$ at time $k$ between prediction and correction values, which measures how well the internal process model describes the current data dynamicity. Once the PID error $\Delta$ is updated, a new sampling interval $I'$ can be determined by the following equation:

$$I' = \max\left\{1,\; I + \theta\left(1 - e^{\frac{\Delta - \xi}{\xi}}\right)\right\} \tag{4.11}$$

where $I$ is the current sampling interval, and $\theta$ and $\xi$ are user-specified parameters. Intuitively, a small PID error results in a decrease in the sampling rate, and vice versa.

4.2.4 Performance Evaluation

Experimental Methodology

We use two real-world datasets: Gowalla and Yelp. Gowalla contains the check-in history of users in a location-based social network. For our experiments, we use the check-in data in the area of San Francisco, California. We assume that Gowalla users are the workers of the SC system, and their locations are those of the most recent check-in points.
We also model each check-in point as a task that was previously accepted for execution by a worker. Based on this model, we determine the mean contribution distances (MCDs) according to Eq. (4.3), and we compute the maximum travel distance (MTD) as the 90% MCD percentile value, leading to a value of 3.6 km. The Yelp data corresponds to the greater area of Phoenix, Arizona, and includes locations of 15,583 restaurants, 70,817 users and 335,022 user reviews. We use restaurant locations as tasks, and a user review is equivalent to accepting an SC task. Table 4.3 presents the details of the datasets.

  Name             Tasks      Workers    MTD (km)    Workers/km^2
  Gowalla (Go.)    151,075    6,160      3.6         35
  Yelp (Ye.)       15,583     70,817     13.5        4

Table 4.3: Dataset characteristics

For the moving-user experiments, we divide each dataset into $T$ time instances based on the time of activities (i.e., review time for Yelp and check-in time for Gowalla). Using this methodology, 10-30% of the worker locations are updated at each timestamp.

To evaluate the overhead of privacy, we compare our proposed solution with a non-private algorithm that has access to exact worker locations. Given a task and the actual worker locations, the algorithm keeps adding nearby workers one by one (1NN, 2NN, etc.) until the obtained utility exceeds threshold EU, or until the size of the GR is larger than MTD. The geocast query is the minimum bounding circle of the nearest workers.

We consider privacy budget $\varepsilon \in \{0.1, .., 0.5, .., 1\}$, ranging from strict to loose privacy requirements. We set the expected utility EU ∈ {0.3, 0.5, 0.7, 0.9} and the maximum acceptance rate MAR ∈ {0.1, 0.3, 0.5, 0.7}. We vary the number of time instances $T \in \{50, 60, ..., 100\}$ and the budget percentile for grid structure $f \in \{0.1, 0.2, .., 1\}$. We vary the number of workers required to perform the task $K \in \{1, 2, 3, 4, 5\}$. Default values are shown in bold. For the Zipf acceptance rate, the skew parameter $s$ is set to 1. The wireless communication range is 50 meters.
We randomly generated 1,000 tasks and measured the performance of the proposed solution with respect to the metrics in Section 4.2.1: ASR, ANW, and WTD. To compute ASR, we simulate a binomial model as discussed in Section 4.2.2. Each worker flips a biased coin and decides whether or not to accept a received task request, based on a personalized threshold $p^a$ (recall that $p^a$ takes into account the distance to the task). A task is accepted if at least one worker agrees to perform it.

We consider two WTD variations: WTD_NN and WTD_FC. In the former case, the metric value is determined as the distance from the task to the nearest worker that accepts the task, whereas in the latter, the distance to the first worker that accepts the task is considered (i.e., first-come). We also measure the average hop count HOP required for geocast, according to Eq. (4.5). Finally, we also show the results obtained for the average number of cells in a GR (CELL) and the compactness of the GR. Although these metrics are not directly perceived by the end users, they help us to better understand the underpinnings of the proposed solution. All measured results are averaged over ten random seeds.

Experimental Results

Evaluation of Single-snapshot PSD

Evaluation of GR Construction Heuristics: We evaluate the performance of the greedy algorithm for GR construction from Section 4.2.2 and its variations. GDY refers to the algorithm using the original AG PSD from [141], whereas G-GR uses our customized AG solution. The optimization allowing partial cell selection is denoted by G-PA, and the combination of both G-GR and G-PA by G-GP.

Figure 4.7 illustrates the results. G-GP generally performs best in terms of minimizing ANW, WTD and HOP in all combinations of datasets (Gowalla, Yelp) and acceptance rate functions (Linear, Zipf). Moreover, by comparing G-GP and G-PA with GDY and G-GR, we observe that the customized AG granularity contributes most of the improvement.
Partial cell selection proves useful mostly when the privacy budget is small (i.e., the resulting grid is coarse). Compared to GDY, G-GP reduces ANW by up to a factor of 5, and the improvement is more significant when the privacy budget is low. Specifically, increasing $\varepsilon$ provides a more accurate estimation of the worker counts in the PSD, and the granularity of the level-2 AG also grows. As a result, ANW can be more tightly controlled. Moreover, G-GP also reduces WTD and HOP by up to a factor of 8 and 7, respectively.

On the other hand, G-PA obtains lower ASR than the expected utility of 90%, particularly for small $\varepsilon$. This can be explained by the fact that partial cell selection tends to aggressively reduce the number of workers included in the GR, which may result in under-provisioning (i.e., an insufficient number of workers receive task requests). All other methods achieve close to the target EU of 90%, but most often this is a result of over-provisioning, which in turn increases ANW.
[Figure 4.7: Comparison of GR construction heuristics (GDY, G-GR, G-PA, G-GP) by varying ε ∈ {0.1, 0.4, 0.7, 1}. Panels: (a) ANW, (b) WTD_NN, (c) WTD_FC, (d) HOP, (e) ASR for Gowalla-Linear; (f) ANW, (g) WTD_NN, (h) WTD_FC, (i) HOP, (j) ASR for Gowalla-Zipf; (k) ANW, (l) WTD_NN, (m) WTD_FC, (n) HOP, (o) ASR for Yelp-Linear; (p) ANW, (q) WTD_NN, (r) WTD_FC, (s) HOP, (t) ASR for Yelp-Zipf.]
Figure 4.8 captures in more detail the effect of G-GP and grid granularity on ASR, as well as the under/over-provisioning tendencies. With coarser-grained grids (i.e., large k_2) over-provision occurs, whereas finer-grained grids suffer from an excessive noise-to-real-count ratio, resulting in under-provision. Note that our choice of k_2 = √2 ≈ 1.41 obtains a good trade-off: it achieves near 90% utility and also reduces ANW, WTD and HOP.

Figure 4.8: Average ASR over ε ∈ {0.1, 0.4, 0.7, 1}, varying k_2 (Gowalla-Linear, Gowalla-Zipf, Yelp-Linear, Yelp-Zipf), with the over-provision and under-provision regimes marked.

Evaluation of Compactness-Based Heuristics: We evaluate the effect of the compactness-guided heuristic for GR construction. For brevity, we only include Gowalla results (the Yelp dataset shows similar trends). As shown in Figure 4.9, the compactness-based approach (G-GP-Compact) significantly increases the compactness measure compared to its utility-based counterpart (G-GP-Pure). The hop count is also reduced, by up to 36%, particularly when the privacy budget is large. However, the compactness-only approach does not fare that well for lower privacy budgets. On the other hand, the hybrid heuristic that combines utility and compactness in the ranking of candidates (G-GP-Hybrid) manages to perform better than its counterparts for all ε values. We conclude that such a balanced approach is the best solution for GR construction.

Overhead of Achieving Privacy: We compare the proposed solution with the non-private algorithm for task assignment, described in Section 4.2.4. Figure 4.10 presents the overhead incurred by privacy when varying ε (for brevity, we only show Gowalla results). As expected, when ε increases, the PSD offers more accurate data, and the overhead (in terms of ANW, WTD and HOP) decreases. Interestingly though, ASR drops in value.
This can be explained through significant over-provisioning that occurs for lower budgets, when the greedy heuristic enlarges the GR in the quest for achieving the desired EU. As a result, more workers are notified, and the chances of task acceptance are higher. However, overhead is also much higher. We also observe that privacy does not significantly increase WTD, indicating that the greedy GR algorithm does a good job in selecting nearby workers. Table 4.4 summarizes the variation of considered metrics when adding privacy.

Figure 4.9: Comparison of compactness-based heuristics (G-GP-Pure, G-GP-Hybrid, G-GP-Compact) by varying ε. Panels: (a)-(e) CMP, HOP, ANW, WTD NN and WTD FC on Gowalla-Linear; (f)-(j) the same metrics on Gowalla-Zipf.

              ANW     HOP    WTD NN   WTD FC
Go.-Linear    161%    54%    25%      18%
Go.-Zipf      103%    30%    22%      23%
Ye.-Linear    202%    92%    19%      20%
Ye.-Zipf      132%    41%    17%      25%

Table 4.4: The average relative increase in percentage of different measurements when varying ε ∈ {0.1, 0.4, 0.7, 1}.
Note that the travel distance, which is perhaps the most important factor in SC, is not considerably impacted by privacy. We also observed that the overhead incurred is generally higher for the sparser Yelp data, which is not surprising, as differentially private algorithms are known to perform better on dense datasets. Table 4.4 also shows the effect of different acceptance rate functions. Zipf incurs lower overhead compared to Linear. The reason is that with the Zipf distribution, the acceptance rate of the workers drops faster for the same distance to the task compared with the linear case. The smaller acceptance rate leads to larger ANW in both private and non-private cases; however, ANW increases at a faster rate in the non-private case (Figure 4.10).

Figure 4.10: Overhead of privacy (G-GP-Hybrid) compared to the non-private algorithm. Panels: (a)-(e) ANW, HOP, WTD NN, WTD FC and ASR on Gowalla-Linear; (f)-(j) the same metrics on Gowalla-Zipf.

The Effect of Varying MAR: We evaluate the performance of G-GP-Hybrid on the Yelp dataset by varying the maximum acceptance rate (similar trends were observed for Gowalla). Figure 4.11 shows the results.
A higher acceptance rate yields lower overhead and shorter travel distance (workers are more willing to accept tasks). The GR size is also smaller, leading to a smaller network diameter and HOP value.

Figure 4.11: Performance of geocast algorithm (G-GP-Hybrid) when varying Acceptance Rate (Ye.-Linear). Panels: (a) ANW, (b) HOP, (c) WTD NN, (d) WTD FC, (e) CELL, for AR ∈ {0.1, 0.4, 0.7, 1} and ε ∈ {0.1, 0.4, 0.7, 1}.

Interestingly, Figures 4.11c and 4.11d show that MAR has a significant effect on decreasing WTD. This effect is more pronounced than the drop due to an increase in privacy budget ε, as observed in previous experiments. Figure 4.11e shows that the number of grid cells in the GR drops as MAR increases, due to the increased utility of each cell. For the largest MAR value, a single cell is sufficient as GR, so CELL = 1.

Effect of Varying K: We evaluate the performance of G-GP-Hybrid on the Yelp dataset by varying the number of workers required to complete a task. Figure 4.12 shows the results. As expected, higher K yields higher overhead as more workers are required to perform a task. Also, we observe that as the privacy budget gets larger, the impact of K on the performance metrics attenuates.
Figure 4.12: Performance of geocast algorithm (G-GP-Hybrid) when varying number of workers required to complete a task (Ye.-Linear). Panels: (a) ANW, (b) HOP, (c) WTD NN, (d) WTD FC, (e) CELL, for K ∈ {2, 3, 4, 5} and ε ∈ {0.1, 0.4, 0.7, 1}.

Comparison with Pull-Mode Benchmark: We evaluate the performance of our technique in comparison with a benchmark that resembles the pull mode. With the pull mode, tasks are broadcast to workers by the CSP, and there is no disclosure of worker locations to the SC server. However, each worker operates with knowledge of its own location only, which may lead to suboptimal assignment. Only workers within MTD distance of a task are notified. Table 4.5 summarizes the results. We consider as comparison metrics the average number of notified workers (ANW), average number of accepting workers (AAW), and travel distance (for the latter, our solution considers two cases, the NN and the FC approach). For brevity, we show the overhead of pull mode compared to our approach, expressed as a multiplicative factor showing how many times higher (i.e., worse) the overhead of pull mode is.

As expected, due to the lack of global location information, the benchmark performs poorly. Specifically, the number of notified workers is higher by a factor ranging from 71 (when our method receives the lowest privacy budget) to 125 (for the largest privacy budget setting of 1.0). Such a high ANW value raises serious concerns about the practicality of pull mode, as a large number of workers must be notified, leading to large consumption of communication bandwidth and battery, both scarce resources for mobile users.
The number of accepting workers is larger for the pull mode by a factor ranging from 27 to 47, resulting in an excessive level of task assignment redundancy. The average travel distance is also significantly worse, by a factor ranging from approximately 7 (in the case of FC) to 12 (for our more accurate NN method). These results show that the push-mode approach performs far better than the pull-mode method, even when dealing with sanitized (i.e., perturbed) location information.

ε      ANW      AAW     WTD NN   WTD FC
0.1    71.42    27.78   11.90    6.25
0.4    111.11   40      12.04    7.63
0.7    125      45.45   11.49    7.87
1.0    125      47.61   11.90    7.75

Table 4.5: Performance comparison with pull-mode benchmark.

Evaluation of Multiple-snapshot PSD: We evaluate the performance of the proposed algorithms for dynamic worker PSD from Section 4.2.3. In our implementation, we use G-GP-Hybrid as a single-snapshot base for the dynamic case, and we consider two multiple-snapshot instances: the first uses Kalman filtering without sampling (denoted as Kalman) and the second uses Kalman filtering in conjunction with PID adaptive sampling (denoted as KalmanPID). We compare the obtained results against two benchmarks: the non-private case, and the BasicD baseline introduced in Section 6, which divides the privacy budget equally among all time instances. For brevity, we present the results only for the linear acceptance rate function, as similar results have been observed for the Zipf case (the focus of this experiment is on user movement, so the choice of acceptance rate function does not have a significant impact).

Figure 4.13 presents the performance measurements of the considered methods when varying ε. For higher ε, the PSD offers more accurate data, and the overhead (in terms of ANW, WTD and HOP) decreases. We observe that both Kalman-based algorithms are superior to BasicD by a significant amount. Furthermore, their performance is not far behind that of the non-private approach.
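To make the contrast between BasicD and the Kalman-based variants more concrete, the following is a minimal sketch under stated assumptions, not the implementation evaluated here. It assumes that cell counts have sensitivity 1, that noise is drawn from the Laplace mechanism, and that the budget composes sequentially over timestamps; the function names and the `process_var` tuning parameter are hypothetical, and the PID sampling step of KalmanPID is omitted. The first function shows how the per-timestamp noise scale of an equal split grows linearly with the number of timestamps; the second is a textbook scalar Kalman filter with a random-walk state model that smooths a noisy count series.

```python
def equal_split_laplace_scale(total_epsilon, num_timestamps, sensitivity=1.0):
    """Laplace scale b per timestamp when the budget is split equally.

    Each timestamp receives total_epsilon / num_timestamps, so
    b = sensitivity / (total_epsilon / num_timestamps).
    """
    return sensitivity * num_timestamps / total_epsilon


def kalman_filter(noisy_counts, process_var, measurement_var):
    """Scalar Kalman filter over a series of noisy counts.

    Assumes a random-walk model: the true count changes between timestamps
    with variance process_var (a hypothetical tuning knob), and each
    observation carries noise of variance measurement_var. For Laplace
    noise with scale b, the variance is 2 * b**2.
    """
    estimate = noisy_counts[0]
    error_var = measurement_var
    filtered = [estimate]
    for observation in noisy_counts[1:]:
        # Predict: the state is carried over, uncertainty grows.
        prior_var = error_var + process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = prior_var / (prior_var + measurement_var)
        estimate = estimate + gain * (observation - estimate)
        error_var = (1.0 - gain) * prior_var
        filtered.append(estimate)
    return filtered
```

For example, with a total budget of 1 spread over 100 timestamps, `equal_split_laplace_scale(1.0, 100)` gives a noise scale of 100 per released count, which suggests why an equal split tends to force heavy over-provisioning as the number of timestamps grows.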
Interestingly, the BasicD baseline obtains a high ASR, but this occurs due to excessive over-provisioning at lower budgets. In such cases, the greedy heuristic enlarges the GR to achieve the desired utility, and the overhead is very high. We also observe that KalmanPID performs better than Kalman in all cases, and particularly when the budget is small (i.e., up to 14% in ANW, WTD and HOP). The reason for this behavior is the benefit brought by sampling in the case of KalmanPID. Due to its superior performance, we focus on KalmanPID for the rest of the experiments.

Next, we investigate the performance of the multiple-snapshot algorithm when varying the privacy budget split captured by parameter f. Recall from Section 6 that a fraction f of the budget is used in the first timestamp to determine the adaptive grid structure. A higher f results in a more accurate initial structure, but to the detriment of accuracy when later determining noisy counts for level-2 cells. Conversely, a small value of f may negatively impact the ability of the structure to capture the initial worker density, but may result in higher accuracy for individual cell counts.
Figure 4.13: Performance overhead for multiple-snapshot worker PSD (BasicD, Kalman, Kalman-PID, Non-Privacy) when varying privacy budget ε from 0.1 to 1. Panels: (a)-(e) ANW, HOP, WTD, ASR and CELL on Yelp-Linear; (f)-(j) the same metrics on Gowalla-Linear.

Figure 4.14 shows a decreasing trend for ANW, HOP and WTD as f increases. As expected, a higher f yields lower overhead (ANW) and shorter travel distance (WTD) in KalmanPID due to finer-grained grids. The GR size is also smaller, thus leading to a smaller network diameter and HOP value. However, Figures 4.14d and 4.14i show a significant drop in utility with respect to EU, i.e., 54% and 36%, respectively. To achieve the expected utility EU = 0.9, f must be set to lower values, such as 0.1.
Figure 4.14: Varying budget split f from 0.1 to 0.9 (BasicD, Kalman-PID, Non-Privacy). Panels: (a)-(e) ANW, HOP, WTD, ASR and CELL on Gowalla-Linear; (f)-(j) the same metrics on Yelp-Linear.

Figure 4.15: AG structure visualization for Gowalla dataset. Panels: (a) KalmanPID, f = 10%; (b) KalmanPID, f = 90%.

To better understand the results obtained when varying f, we perform another experiment that helps visualize the effect of f on the grid structure. Figure 4.15 shows the AG structures obtained for KalmanPID with different f values. The larger value of f = 90% produces finer-grained grids in Figure 4.15b, as ε_2 = f × ε is higher.
Finer-grained grids result in small actual counts, but also in large relative errors (i.e., excessive noise-to-real-count ratio). Thus, the geocast algorithm tends to pick only a few cells, causing under-provisioning, and thus low ASR.

Finally, we turn our attention to the influence of the movement history length T on the studied performance metrics. Figure 4.16 shows the results on the Yelp dataset (similar trends were observed for the Gowalla dataset). When increasing T, the considered performance metrics (ANW, HOP and WTD) for KalmanPID are stable (there is only a small increase, up to 10%), while the performance of BasicD decreases significantly when T increases. In addition, KalmanPID achieves high ASR, very close to the desired utility threshold EU. On the other hand, to obtain a high ASR, BasicD needs to over-provision excessively, which explains its high overhead.

Figure 4.16: Varying number of timestamps T from 50 to 100 (BasicD, Kalman-PID, Non-Privacy). Panels: (a) ANW, (b) HOP, (c) WTD, (d) ASR, (e) CELL, all on Yelp-Linear.

4.2.5 PrivGeoCrowd: A Tool for Tuning Parameters

Figure 4.17: PrivGeoCrowd main GUI integrates several component module panels

In this section we first outline the software architecture of the tool, then present the main GUI elements, and finally describe the demonstration scenario.

System Architecture

Figure 4.18 presents the architecture of PrivGeoCrowd, which is currently integrated with MediaQ [101], a geospatial crowdsourcing system developed at USC.
The GUI is provided to end-users (i.e., CSPs, SC-server administrators or task requesters) entirely in web-based form, which has the benefit of not requiring any additional software to be installed at the client. The web application is written using JavaScript, Python, PHP5, and the Google Maps API for map rendering. The GeoCrowd API provides SC services [101] such as profile management. The worker location datasets are stored in a MySQL database, which can be queried using the CSP API to generate sanitized PSDs.

Figure 4.18: Architecture of PrivGeoCrowd

Graphical User Interface

Figure 4.17 gives an overview of the main GUI, which comprises several component modules:

CSP Panel: It allows the CSP to sanitize and publish datasets according to privacy budget ε ∈ [0.1, 1]. The system ships with three datasets: Gowalla-San Francisco (SF), Gowalla-Los Angeles (LA) and Yelp-Phoenix (Yelp), but the CSP has the option of uploading additional datasets to the repository. The CSP can also specify the budget allocation split between AG levels (0.5 means equal split between levels 1 and 2). The CSP can customize the granularity of the level-1 AG grid used for constructing the worker PSD. Once the "Publish Data" button is clicked, the corresponding PSD of the dataset is generated and made available to other modules through the PSD API.

SC Panel - Dataset Selection: The SC-server administrator can select among available sanitized datasets from the drop-down list. Several statistics of the selected dataset are automatically provided on the right-hand side, and the user has the option to visualize the dataset density heatmap, as well as the dataset boundary.

SC Panel - GR Construction Tuning: The SC-server administrator is able to select one of the supported heuristics for GR construction (i.e., distance-based, compactness-based), as well as the parameters of the GR construction algorithm. The user can choose a threshold for acceptance success rate (ASR) in the range [0.6..0.9].
The maximum task acceptance rate AR threshold (i.e., when worker and task are co-located) can be varied between 0.1 and 1.0. The wireless communication range for the geocast is customizable between 25 and 100 meters.

Task Requester Panel - Geocast Region Rendering: There are three ways one can submit a task request: (1) by double clicking on the map, (2) by providing latitude/longitude values in the task text box, or (3) by selecting a particular task in the history tab. The latter task list is extracted from MediaQ [101]. When one specifies a task, its GR is computed and rendered in real-time. In Figure 4.17, the pop-up dialog of the last visited cell presents some statistics. Typically, the current utility (i.e., accumulated utility) measures the probability that at least one worker accepts the task if it is geocast. This number represents the task satisfaction of the task requester. The distance-to-task value of a cell can be considered as the satisfaction of the workers within this cell. In addition, both the compactness and the area of the GR indicate the cost to the SC-server. The smaller and the more compact the GR, the smaller the geocast cost.

Mobility Panel: The administrator can generate new datasets from existing ones by having workers move in a random direction (i.e., North/South/East/West) by a pre-defined step. The heat map of the updated dataset represents this movement. When the stop button is clicked, the current snapshot of the worker locations is uploaded to the server, and a new PSD is constructed.

Demonstration Plan

During the demonstration, we will highlight the role of PrivGeoCrowd in evaluating the effectiveness and efficiency of private spatial crowdsourcing in several prominent scenarios. Namely, we study the effect of: (i) varying AG granularity during sanitization, (ii) varying dataset density and (iii) varying the heuristic used in GR construction.

Customized AG granularity: Figure 4.19 presents the effect of AG granularity on GR size.
The work in [170] uses an adaptation of the original AG method from [141], which can considerably improve efficiency for SC. The visualization captures the finer granularity obtained using the customized AG, and highlights the corresponding GRs, which are significantly more compact.

Figure 4.19: The effect of customized granularity on GR (SF data). Panels: (a) Original AG, (b) Modified AG.

Small vs Large Budget: Figure 4.20 demonstrates the effect of a small budget ε = 0.1 on GR. The cells are much larger than those in the large budget case (ε = 1 in Fig. 4.19b). Therefore, the GRs enclose a few cells only.

Dense vs Sparse Area: Figure 4.21 shows the effect of worker density on the size of obtained GRs. One can observe that the GR obtained in the left-hand side of the figure, corresponding to a sparse suburb area of Phoenix, AZ (Yelp dataset), is much larger than the GR obtained in a denser downtown area of the same dataset. On the other hand, the denser the population, the higher the granularity of the AG. This confirms that our customized AG [170] adapts well to the worker density.

Figure 4.20: The effect of small budget (ε = 0.1) (SF data)

Figure 4.21: The effect of worker density on GR size (Yelp data)

Effect of Alternative GR Construction Heuristics: Figure 4.22 demonstrates the behavior of different heuristics on the size and shape of the GR. In Figure 4.22a we illustrate the case where GR construction is guided solely by the expected probability of task acceptance success rate (ASR). In this case, populated grid cells tend to be selected first. Figure 4.22b demonstrates the strategy of adding the nearest cell to the GR, which results in GRs that are centered at the task request. Finally, Figure 4.22c shows the third strategy, which attempts to form GRs with a balanced shape that result in a low hop count when using ad-hoc geocast communication.
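The nearest-cell strategy and the notion of accumulated utility can be illustrated with a short sketch. It is a simplified illustration under stated assumptions rather than the algorithm used in the tool: workers are assumed to accept independently, the acceptance probability is assumed to fall linearly from the maximum acceptance rate at distance zero to zero at a maximum travel distance, and every function name and parameter here is hypothetical.

```python
def linear_acceptance(distance, max_rate, max_distance):
    """Assumed linear acceptance model: max_rate at distance 0, 0 at max_distance."""
    if distance >= max_distance:
        return 0.0
    return max_rate * (1.0 - distance / max_distance)


def cell_utility(noisy_count, accept_prob):
    """Probability that at least one of the cell's workers accepts.

    Assumes independent acceptance; negative noisy counts are clamped to 0.
    """
    workers = max(0.0, noisy_count)
    return 1.0 - (1.0 - accept_prob) ** workers


def greedy_geocast_region(cells, expected_utility, max_rate, max_distance):
    """Add cells nearest-first until the accumulated utility reaches the target.

    cells: iterable of (cell_id, noisy_count, distance_to_task) tuples.
    Returns the selected cell ids and the accumulated utility.
    """
    region = []
    miss_prob = 1.0  # probability that no notified worker accepts
    for cell_id, noisy_count, distance in sorted(cells, key=lambda c: c[2]):
        if 1.0 - miss_prob >= expected_utility:
            break
        accept_prob = linear_acceptance(distance, max_rate, max_distance)
        miss_prob *= 1.0 - cell_utility(noisy_count, accept_prob)
        region.append(cell_id)
    return region, 1.0 - miss_prob
```

Under these assumptions, a low maximum acceptance rate forces the region to grow over several cells before the target utility is met, whereas a maximum acceptance rate of 1 lets a single co-located, populated cell satisfy it on its own.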
Figure 4.22: The effect of different heuristics on GR geometry. Panels: (a) ASR-centric, (b) Distance, (c) Compactness.

4.3 Protecting Locations of Workers and Tasks Without Trusted Third Party

The remainder of this section is organized as follows. Section 4.3.1 introduces SCGuard, and Section 4.3.2 details our proposed solution. Experimental results are presented in Section 4.3.3.

4.3.1 The SCGuard Privacy Framework

In this section we first present a framework for the tasking phase of SC without compromising the privacy of the locations of the individuals (both workers and requesters). We then identify potential privacy threats from the adversaries (server, requester and worker) and present countermeasures to prevent such threats from occurring. Finally, we introduce the performance metrics to evaluate and compare our different privacy-preserving algorithms.

System Model and Assumptions

We first define the notions of spatial task and worker. A spatial task t is required to be answered at a particular location l_t. This means task t can be answered by a human only if he is physically located at the task's location l_t. A worker, denoted by w, is a carrier of a mobile device who volunteers for one of the spatial tasks. Each worker has a location l_w and a spatial region R_w, wherein the worker can accept spatial tasks. R_w is represented by a circular region centered at the worker's location. Hence, R_w also refers to the reachable distance of the worker: task t is reachable from worker w iff d(w, t) ≤ R_w. During tasking, we assume that each worker performs a single task, and all workers perform every task correctly, so that every task needs to be assigned to one and only one worker. We also assume that tasks will not expire during the assignment period. We focus on online assignment, where the set of workers W is given in advance while each task in the task set T arrives online (i.e., one at a time).
Without privacy protection, the optimal algorithm in terms of maximizing the number of assigned tasks is Ranking [92]. The algorithm associates each worker with a random number (or rank) so that each task, upon its arrival, is assigned to an unmatched reachable worker of the highest rank (footnote 2). This algorithm can run entirely on the server. However, with privacy protection, locations of workers and tasks become uncertain, which complicates the task assignment.

We propose a privacy-aware framework, named SCGuard, for online tasking that involves three distinct stages as follows (see Figure 4.23). In the first stage, to ensure privacy, each worker w perturbs his location l_w with a specified privacy level (ε, r) according to the Geo-I mechanism and sends the noisy location l_w' together with his reachable distance R_w to the server. Upon the arrival of task t, the requester of t perturbs its location l_t under (ε, r)-Geo-I with a different privacy level and sends the perturbed location l_t' to the server. The main role of the server is to identify a set of candidate workers for the task and then forward their information (i.e., l_w' and R_w) to the requester. We refer to the first stage as uncertain-to-uncertain (U2U) because the locations of both workers and tasks are uncertain to the server; the server knows only perturbed locations of workers and tasks. There is no location disclosure at this stage because locations of both task and worker are hidden from all three adversaries.

The requester, on receipt of the information of the workers, conducts the second stage of SCGuard without any communication with the server. We isolate this stage from the server to ensure no disclosure to the server. During this stage, the requester sends her task location to the candidate worker w who is most likely reachable within distance R_w.
We call this stage uncertain-to-exact (U2E) because the requester knows the exact location of her task and needs to make a decision as to whether a particular candidate worker is reachable to the task given the worker's perturbed location. This process repeats until the task is assigned or no candidate worker is left. We discuss three approaches for selecting a candidate worker in Section 4.3.2.

The third and final stage, exact-to-exact (E2E), is performed by the selected candidate workers. At this point, the requester releases the actual location of the task to the worker, resulting in some disclosure. The worker can verify if the task is reachable by comparing his distance to the task, i.e., d(w, t) ≤ R_w. If the task is reachable, the worker performs the task, and we consider that the task is assigned; otherwise, he rejects the task. Table 4.6 summarizes the notations used in the following sections of the thesis.

Notation          Description
w, t, W, T        a worker, a task, a worker set and a task set
w', t', W, T      a worker and a task after perturbation
R_w               reachable distance that w is willing to travel
l_w, l_t          actual locations of worker w and task t
l_w', l_t'        noisy (perturbed or observed) locations of w' and t'
d(w, t)           the distance between l_w and l_t
d, d'             notations of actual and noisy distances
w_max             the worker of the highest rank
(ε, r)            parameters that specify the privacy level of Geo-I
N_j               a set of candidate workers to t_j
α, β              reachability thresholds during U2U and U2E

Table 4.6: Summary of notations.

Figure 4.23: SCGuard: privacy-aware framework for spatial crowdsourcing. The diagram shows the three stages between task and worker: U2U by Server, U2E by Requester, and E2E by Worker.

Footnote 2: We also consider distance-based ranking for travel cost optimization in Section 4.3.2.

Our approach for task dissemination in U2E is sequential, meaning that the requester sends her task location to one worker in the candidate set at a time.
However, one may argue for a parallel approach, where the server first sends the perturbed task location to all workers in the candidate set. Subsequently, the workers simultaneously and independently evaluate whether the task is reachable or not. If so, they send their locations to the requester, who performs the final stage (E2E). This parallel approach may be more efficient but can potentially result in more disclosures. This is because multiple candidate workers may find the task reachable and together send their locations to the requester. Hence, we do not consider this parallel optimization any further.

Another alternative design choice for the U2E stage is for the server to rank the candidate workers rather than the requester. In this case, the candidate workers receive the perturbed task location from the server and return their likelihoods to perform the task to the server. Thereafter, the server matches the task to the worker who most likely will perform the task. This scenario seems to be more efficient in terms of communication cost than the proposed U2E stage; however, the server may be able to learn extra information about the task by observing the responses of one or multiple candidate workers during U2E (reachable or not). The challenge is that these responses are no longer independent of each other since they are computed from the same task. Hence, to ensure the same privacy level, the privacy guarantee needs to be extended for a location set [16]. Ensuring Geo-I for a location set drastically reduces the utility of the privacy mechanism as the amount of noise increases linearly with the size of the location set. Thus, we do not consider this design option further.

Adversary Model

We assume the semi-honest model, which has two core assumptions. First, all participating entities (server, worker, requester) are curious but not malicious.
This means that each entity may learn from what is exposed during the three stages of SCGuard, but they comply with the protocol. Second, the entities do not collude with each other to gain information about the third. These assumptions are realistic in our setting since there is little incentive for the requesters and workers to act maliciously if they want to get the tasks done. We briefly discuss the potential malicious behaviors of different parties and how to mitigate them in Section 4.3.4.

In what follows, we analyze potential disclosure of location information during each stage of the protocol. During U2U the server takes as input the perturbed locations of both workers and tasks to perform effective task assignment; the amount of disclosure to the server is strictly controlled by a given level of privacy according to the Geo-I mechanism. Furthermore, locations of both workers and tasks are protected from the server in all stages of the protocol. This is because the server does not participate in U2E and E2E. Specifically, the server recommends a set of candidate workers to a requester so that they can establish a direct communication channel among themselves. From the perspective of the server, the requester and the candidate workers autonomously decide on whether to accept the recommendation from the server.

During U2E and E2E, special emphasis is on limiting the disclosure between the workers and the requesters since they may learn each other's exact locations during the course of the protocol. From the viewpoint of a curious requester, locations of the candidate workers are not revealed to the requester during U2E. However, the requester may learn the proximity of the worker to the task (but not the exact location of the worker) by observing the response of a candidate worker during U2E (reachable or not).
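The perturbation that bounds what the server learns during U2U can be sketched with planar Laplace noise, a standard way of achieving Geo-I. This is a minimal sketch under stated assumptions and not the thesis's implementation: the function name and the single parameter `epsilon_per_unit` (privacy parameter per unit of distance) are hypothetical, coordinates are treated as planar, and how the (ε, r) pair maps onto this single parameter is left open here. The sketch relies on the fact that, for planar Laplace noise, the direction is uniform and the radius follows a Gamma distribution with shape 2 and scale 1/ε.

```python
import math
import random


def planar_laplace_perturb(x, y, epsilon_per_unit, rng=random):
    """Return (x, y) displaced by planar Laplace noise.

    The noise density is proportional to exp(-epsilon * distance), so the
    angle is uniform on [0, 2*pi) and the radius is Gamma(shape=2,
    scale=1/epsilon). rng may be the random module or a random.Random
    instance (useful for reproducible experiments).
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.gammavariate(2.0, 1.0 / epsilon_per_unit)
    return x + radius * math.cos(theta), y + radius * math.sin(theta)
```

With this distribution the expected displacement is 2/ε, so halving the privacy parameter doubles the average distance between the actual and the reported location, which is the source of the uncertainty the later stages must cope with.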
From the viewpoint of a curious worker, due to the uncertainty of workers' locations during U2E, a task location can be disclosed to multiple candidate workers in E2E before being assigned. This kind of disclosure among requesters and workers, named false hit, is quantified in Section 4.3.1. After completing an assigned task, the worker reports the result of the task either to the server for quality control, payment, etc. or directly to the requesters to limit the disclosure to the server even further. Nevertheless, privacy threats during reporting are beyond the scope of this protocol.

Performance Metrics

Protecting locations of both workers and tasks significantly complicates task assignment and may reduce the effectiveness and efficiency of task assignment. Due to the noise introduced by Geo-I, a worker-task match observed as reachable in the noisy domain may be unreachable in the actual domain, or vice versa. Both cases may result in tasks remaining unassigned. Thus, to find a reachable worker for a task (i.e., a valid match), multiple messages may need to be sent between the requester and workers, which increases the amount of location disclosure. To measure these, we introduce the following end-to-end performance metrics:

• Utility. The performance of SCGuard is measured by the number of assigned tasks. Due to data uncertainty, the server may incorrectly identify candidate workers for a task. The challenge is to obtain a high number of assigned tasks in the presence of uncertainty.
• Travel Cost. With imprecise locations, the server is no longer able to accurately estimate the distances between workers and tasks. Hence, workers may have to travel long distances to tasks. The challenge is to keep the worker travel distance low, even when exact locations are unknown.
• System Overhead. Dealing with imprecise locations increases the complexity of assignment algorithms, which poses scalability problems.
A significant metric for measuring overhead is the size of the worker candidate set for a task. This number represents both the communication overhead (the server sends the candidate set to the requester) and the computational overhead (the requester ranks the workers in the candidate set based on certain criteria).

In addition, we also report per-stage metrics to better understand the performance as well as the potential location disclosure of each stage of SCGuard.

• Precision/Recall (U2U). For each task, we measure the fraction of candidate workers who are actually reachable (precision) and the fraction of reachable workers who are included in the candidate set (recall).
• False Hit/False Dismissal (U2E). A false hit is a privacy leak, occurring when a requester estimates an unreachable worker as reachable; it is measured by the number of times a task location is revealed to a candidate worker who eventually does not perform the task. A false dismissal occurs when a requester misses a reachable worker.

4.3.2 Online Task Assignment

Baseline Solution

In this section we show why existing solutions to the online tasking problem in the non-private setting may not be effective in the private setting. Hence, we introduce a baseline algorithm for the private setting.

Ranking Algorithm

With the online assignment, workers are known and tasks arrive online (one-by-one). Upon arrival, each task needs to be immediately matched to an unmatched worker; the goal is to maximize the number of assigned tasks. A well-known solution to the online assignment problem is the Ranking algorithm [92]. Ranking randomly permutes the workers and assigns a random priority (or rank) to them. When a task arrives, it is matched to the worker who is reachable to the task and has the highest rank. The expected size of the matching obtained by Ranking is at least $(1 - 1/e)|T| \approx 0.63|T|$, where $|T|$ is the total number of tasks. This result is optimal in the non-private setting [92].
In other words, the competitiveness of any online bipartite matching algorithm is bounded above by 0.63.

[Figure 4.24: Example of online tasking with three known workers. Each task arrives one-by-one in the order $t_1 \to t_2 \to t_3$. Panels: (a) Exact reachability graph; (b) Noisy layout; (c) Noisy reachability graph.]

Figure 1.1 shows the exact layout of the workers and the tasks. However, the ranking algorithm may not work well in the privacy setting. The reason is that a reachable worker-task pair can be observed as unreachable after perturbation, and vice versa. Figure 4.24b shows the reachability graph of the workers and tasks given their layout shown in Figure 1.1. The reachable pair $(w_1, t_3)$ becomes unreachable in the noisy domain, while unreachable pairs $(w_1, t_1)$ and $(w_3, t_3)$ become reachable. In addition, Figure 4.24c shows an optimal matching in the noisy domain: $(w'_2, t'_1)$, $(w'_1, t'_2)$ and $(w'_3, t'_3)$. Nonetheless, the assignment is actually not optimal because $(w'_3, t'_3)$ is unreachable in the actual domain.

Baseline Algorithm

Algorithm 5 presents the oblivious algorithm, which considers observed locations as true ones. First, locations of workers and requesters' tasks are locally perturbed according to a specified privacy level $(\epsilon, r)$-Geo-I [16] (Lines 3–4). During U2U, the server identifies candidate workers for each task such that the observed location of the task is reachable from the observed locations of the workers (Line 7). The server then forwards the candidate workers to the requester, who performs the U2E phase (Line 9). As mentioned in Section 4.3.1, the reason U2E is performed by the requester instead of the server is that multiple candidate workers may need to be selected until a reachable worker for the task is found. These back-and-forth communications with the workers need to be secure from the server to ensure privacy protection.
During U2E, the requester, once receiving the candidate workers from the server, ranks those workers with respect to certain criteria, such as a random rank [92] or the worker-task distance (aka the nearest-neighbor strategy) (Line 12). Thereafter, the task is matched to the worker of the highest rank (Line 10), who subsequently receives the actual location of the task (Line 13). During E2E, the selected candidate worker confirms whether the task is actually reachable (Line 14). If so, this is a valid assignment. Otherwise, the worker rejects the task so that the requester can send the task to the candidate worker of the second highest rank (Line 17). This matching process (U2E and E2E) continues until either the task is assigned or no candidate worker is left (Line 8). This best-effort strategy clearly provides more opportunity for the task to be assigned, but at the expense of disclosing its location to more workers. This trade-off is illustrated in the following example.

In Figure 4.24c, when task $t_1$ arrives, $w_1$ and $w_2$ are candidate workers. Location $l_{t_1}$ is sent to $w_1$ because it has a higher rank. $w_1$ finds $t_1$ unreachable and declines $t_1$, introducing a false hit. Subsequently, $l_{t_1}$ is sent to the remaining candidate worker of the highest rank, $w_2$. $w_2$ finds $t_1$ reachable and performs $t_1$. Next, $t_2$ arrives and has two candidates, $w_1$ and $w_3$. $w_1$ performs $t_2$ as they are reachable. Finally, $t_3$ arrives and is matched to candidate $w_3$; nevertheless, $w_3$ finds $t_3$ unreachable and rejects $t_3$. In sum, the two assigned tasks $t_1$ and $t_2$ are performed by $w_2$ and $w_1$, respectively. Here, the two false hits are $w_1$ knowing $l_{t_1}$ and $w_3$ knowing $l_{t_3}$.
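The example above can be replayed with a small Python sketch of the best-effort U2E/E2E loop. This is only an illustration, not the thesis implementation: the function name `oblivious_assign`, the numeric ranks, and the set of truly reachable pairs are assumptions chosen to match the narrative (only the pairs mentioned in the text are encoded).

```python
def oblivious_assign(tasks, candidates, rank, truly_reachable):
    """Replay the best-effort loop: try candidates by rank until one is truly reachable.

    tasks           -- task ids in arrival order
    candidates      -- task id -> worker ids that look reachable in the noisy domain (U2U)
    rank            -- worker id -> rank (higher is tried first)
    truly_reachable -- set of (worker, task) pairs reachable in the actual domain
    """
    available = set(rank)            # workers not yet matched
    matches, false_hits = {}, []
    for t in tasks:
        # U2E: the requester orders the still-available candidates by rank.
        pool = sorted((w for w in candidates[t] if w in available),
                      key=rank.get, reverse=True)
        for w in pool:
            # E2E: the worker sees the exact task location and checks reachability.
            if (w, t) in truly_reachable:
                matches[t] = w
                available.discard(w)
                break
            false_hits.append((w, t))  # task location disclosed, task not performed
    return matches, false_hits


# Hypothetical encoding of the example (ranks are illustrative: w1 > w2 > w3).
tasks = ["t1", "t2", "t3"]
candidates = {"t1": ["w1", "w2"], "t2": ["w1", "w3"], "t3": ["w3"]}
rank = {"w1": 3, "w2": 2, "w3": 1}
truly_reachable = {("w2", "t1"), ("w1", "t2"), ("w1", "t3")}

matches, false_hits = oblivious_assign(tasks, candidates, rank, truly_reachable)
print(matches)     # {'t1': 'w2', 't2': 'w1'}
print(false_hits)  # [('w1', 't1'), ('w3', 't3')]
```

The sketch makes the trade-off visible: every extra iteration of the inner loop is one more disclosure of the exact task location.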
Algorithm 5 Oblivious Algorithm (Baseline)
1: Input: $W, T, R_{w_i}, \epsilon, r$ (refer to the notations in Table 4.6)
2: Output: a set of valid worker-task matches
3: Perturb locations of workers and tasks using Geo-I [16]:
4:     $l_{w_i} \to l_{w'_i}$, $l_{t_j} \to l_{t'_j}$
5: For $t_j \in T$ do: {assign it to the highest-rank worker}
6:     U2U: Server identifies candidate workers $N_j$ for $t_j$:
7:         $N_j = \{w_i : d(w'_i, t'_j) \le R_{w_i}\}$
8:     If $N_j = \emptyset$: $t_j$ remains unassigned; go to Line 5
9:     Server forwards candidate workers $N_j$ to $t_j$'s requester
10:    U2E: Requester matches $t_j$ to $w_{max}$, where
11:        $w_{max} = \arg\max\{Rank(w_i) : w_i \in N_j\}$
12:        $Rank(w_i)$ = precomputed random[0, 1] (or $1/d(w_i, t_j)$)
13:    Requester sends exact task location $l_{t_j}$ to $w_{max}$
14:    E2E: Worker $w_{max}$ checks if $d(w_{max}, t_j) \le R_{w_{max}}$:
15:        If so, match $(t_j, w_{max})$ is a valid assignment
16:        Update $W$: $W = W - \{w_{max}\}$; go to Line 5
17:        Otherwise, update $N_j$: $N_j = N_j - \{w_{max}\}$; go to Line 10

We emphasize that the oblivious algorithm guarantees $(\epsilon, r)$-Geo-I to both workers and tasks from the untrusted server. The reason is that the process of finding the candidate workers for a task (after location perturbation) is considered post-processing, which does not affect the privacy guarantees of differentially private mechanisms [119].

Quantifying Worker-Task Pair Reachability

An issue with the oblivious algorithm is that the reachability between a worker and a task is a binary decision based on the perturbed locations, reachable or not, which does not utilize the planar Laplace distribution of the perturbed locations (see Section 4.1.2). As a result, the baseline solution may include non-reachable workers and miss reachable workers in the candidate set, e.g., $(w_1, t_3)$ becomes unreachable in Figure 4.24. Therefore, the worker-task pair reachability should be quantified by the probability of reachability between the worker and the task.
This allows a requester to accurately compare the reachability to her task from multiple candidate workers. The objective is to compute the reachability probability of a worker-task pair given their observed distance, i.e., $\Pr\big(d(w, t) \le R_w \mid d(w', t')\big)$ (refer to the notations in Table 4.6). We present two approaches to this problem: one is based on approximation analysis (for efficient computation), while the other is based on empirical results (and requires precomputation on synthetic or historic data).

Analytical Approach

An intuitive approach is to derive the distribution (pdf) of the actual distance $d(w, t)$ between the locations of workers and tasks given the perturbed locations $l_{w'}$, $l_{t'}$. Recall that the locations are perturbed using the planar Laplace distribution. Once the pdf is derived, the reachability probability can be computed efficiently with numerical tools such as Python, R and MATLAB.

The problem of finding the pdf of the distance between two uncertain points is related to a family of line picking problems, such as disk line picking (see footnote 3). The disk line picking problem is to choose two points at random in a unit disk and find the distribution of the distances between the two points. Such problems have closed form solutions [180]. However, in our setting, the two points are drawn from planar Laplace distributions with different centers rather than uniformly distributed on the same disk. This makes our problem more challenging, and a closed form solution may not exist. Because the planar Laplace distribution is difficult to analyze, we propose a two-phase method to parameterize the pdf: 1) approximating the planar Laplace distribution by a bivariate normal distribution (BND), and 2) deriving the closed form solution to the pdf of $d(w, t)$.

Approximated BND: According to [16], the pdf of the noise-adding mechanism follows a planar Laplace distribution (see Section 4.1.2) with center at the true location.
We approximate the planar Laplace distribution by a BND with the same mean and variance. These are the first two moments, which represent the most important information of a distribution. Since the planar Laplace distribution is symmetric about its center, the approximated BND should be symmetric about the same center (i.e., a circular bivariate normal distribution). Subsequently, the approximated distribution is $BND(\mu, \Sigma)$, where $\mu$ is a 2-dimensional mean vector $(w_x, w_y)$ (see footnote 4) representing the worker location, and $\Sigma = \mathrm{diag}(\sigma^2, \sigma^2)$ is a diagonal covariance matrix, where $\sigma = \sqrt{2}\, r/\epsilon$ is the standard deviation of the planar Laplace distribution ($\epsilon$ and $r$ are privacy parameters).

Footnote 3: http://mathworld.wolfram.com/DiskLinePicking.html
Footnote 4: The subscripts x and y represent the corresponding axis.

Consequently, the distribution of the distance between the perturbed location and its original location is approximated by a normal distribution $N(0, 2r^2/\epsilon^2)$. This means that when the perturbed (observed) location is known, the original location is approximated by $BND(\mu, \Sigma)$, centering at the observed point with mean $\mu = (w'_x, w'_y)$ and covariance $\Sigma = \mathrm{diag}(2r^2/\epsilon^2,\ 2r^2/\epsilon^2)$. Given observed $w'$ and $t'$, we can approximate the original locations of both $w$ and $t$ with BNDs. Next, we derive the pdfs of $d(w, t)$ for both U2U and U2E stages.

PDF of $d(w, t)$ for U2U: In the U2U stage, given the uncertain locations $l_{w'}$ and $l_{t'}$, our goal is to estimate the pdf of $d(w, t)$, the actual distance between the original locations $l_w$ and $l_t$ (see Figure 4.25a). As presented, $l_w$ is approximated by $BND(\mu_w, \Sigma_w)$, centering at the observed worker location $l_{w'}$, and $l_t$ is approximated by $BND(\mu_t, \Sigma_t)$, centering at the observed task location $l_{t'}$. We have $d = d(w, t) = \sqrt{z_x^2 + z_y^2}$, where $z$ is the difference vector $z = l_w - l_t$, which follows $BND(\mu_w - \mu_t, \Sigma_w + \Sigma_t)$.
We approximate the distribution of $d^2 = z_x^2 + z_y^2$, and then use the following lemma to derive the reachability probability of $d$: $\Pr(d \le R_w)$.

Lemma 1. $\Pr(\sqrt{X} \le \sqrt{C})$ is equal to $\Pr(X \le C)$, where $X$ is a non-negative random variable and $C$ is a non-negative constant.

Proof. This is true because $\sqrt{X} \le \sqrt{C} \iff X \le C$ for $X, C \ge 0$. This means that the set of events $\{A \in \Omega : \sqrt{X} \le \sqrt{C}\}$ equals the set $\{A' \in \Omega : X \le C\}$, and so do their probabilities $\Pr(\sqrt{X} \le \sqrt{C})$ and $\Pr(X \le C)$.

Applying the lemma to our context, where $X = d^2$ and $C = R_w^2$, we have the reachability probability of a worker-task pair $\Pr(d \le R_w) = \Pr(d^2 \le R_w^2)$.

Since $d^2$ has a quadratic form in the bivariate random variable $z$, the moment-generating function (mgf) of $D = d^2$ has the following form [116]:

$$M_D(t) = E[e^{tD}] = \exp\!\left(t \sum_{j=1}^{2} \frac{b_j^2 \lambda_j}{1 - 2\lambda_j t}\right) \prod_{j=1}^{2} (1 - 2\lambda_j t)^{-1/2}$$

where $b_j$ is a linear function of $\mu$ ($b_1 = \mu_{wx} - \mu_{tx}$, $b_2 = \mu_{wy} - \mu_{ty}$) and $\lambda_j$ are the eigenvalues of $\Sigma = \Sigma_w + \Sigma_t$.

Given the mgf of $d^2$, the mean and variance of $d^2$ can be calculated as follows. The mean $\mu$ equals the first derivative of the mgf at $t = 0$: $\mu = E[D] = M'_D(0)$. The variance $\sigma^2$ can be computed by evaluating the second derivative of the mgf at $t = 0$: $\sigma^2 = E[D^2] - (E[D])^2 = M''_D(0) - (M'_D(0))^2$. Consequently, we approximate the pdf of $d^2$ by a normal distribution $N(\mu, \sigma^2)$, where the mean and variance can be computed efficiently using the Python library SciPy.

For simplicity, we derive the mean and variance for a special case where both the worker and the task's requester set the same privacy level $(\epsilon, r)$. In such a case, the eigenvalues are equal, $\lambda = 4r^2/\epsilon^2$, and the mgf can be derived as follows, where $\nu$ is the observed worker-task distance $\nu = d(w', t') = \sqrt{b_1^2 + b_2^2}$.
$$M_D(t) = \exp\!\left(\frac{\nu^2 \lambda t}{1 - 2\lambda t}\right) \frac{1}{1 - 2\lambda t} \qquad (4.12)$$

We derive the first and second derivatives: $M'_D(0) = \lambda(2 + \nu^2)$ and $M''_D(0) = 8\lambda^2 + 8\lambda^2\nu^2 + \lambda^2\nu^4$; thus $d^2$ follows approximately a normal distribution with mean and variance $\mu = \lambda(2 + \nu^2)$ and $\sigma^2 = 4\lambda^2(1 + \nu^2)$.

PDF of $d(w, t)$ for U2E: In the U2E stage, given the true location of the task $l_t$ and the perturbed location of the worker $l_{w'}$, we need to estimate the pdf of the actual distance $d(w, t)$. When the task location $l_t$ is fixed and the worker location $l_w$ follows $BND(\mu, \Sigma)$ centering at the observed worker location $l_{w'}$, the distance between $w$ and $t$ follows the Rice distribution [157] with parameters $(\nu, \sigma)$. Here $\nu = d(w', t)$ is the distance from the task's location $l_t$ (i.e., the reference point) to the center of the approximated BND, $l_{w'}$ (see Figure 4.25b), and $\sigma$ is the scale parameter, which equals the square root of the variance of the approximated BND, $\sqrt{2}\, r/\epsilon$. The pdf of the Rice distribution is:

$$f(x \mid \nu, \sigma) = \frac{x}{\sigma^2} \exp\!\left(\frac{-(x^2 + \nu^2)}{2\sigma^2}\right) I_0\!\left(\frac{x\nu}{\sigma^2}\right)$$

where $I_0(\cdot)$ is the modified Bessel function of the first kind with order zero [10]. We used the SciPy library to efficiently compute the pdf of the Rice distribution.

[Figure 4.25: Estimation of the pdf of $d(w, t)$ for each stage. Panels: (a) For U2U; (b) For U2E.]

Empirical Approach

The analytical approach above provides a fast but approximate way to compute the reachability probability. We present an empirical approach that computes the probability from synthetically generated or past data. We describe the simulation for each stage of SCGuard as follows.

For U2U, we generate random locations for a large number of worker-task pairs in a certain region of interest (i.e., Beijing City). All generated locations are perturbed according to $(\epsilon, r)$-Geo-I using random seeds. For each worker-task pair, both the actual distance $d$ and the corresponding noisy distance $d'$ are calculated. The noisy distances are grouped into disjoint ranges: $[0, s)$, $[s, 2s)$, ..., $[120s, \infty)$, where $s = 100$ meters; each range maps to a set of actual distances. We first compute a distribution of $d$ for each range of $d'$. For example, Figure 4.26a shows the distribution of $d$ for a particular range of $d'$ ($1900 \le d' < 2000$). This distribution centers at the corresponding range of $d'$. Then, the distribution of $d$ can be precomputed for every range of $d'$ and every privacy level $(\epsilon, r)$. Consequently, given $\epsilon$, $r$ and $d'$, from the distribution of $d$ we can compute the reachability probability between a worker and a task as $\Pr\big(d(w, t) \le R_w \mid \epsilon, r, d(w', t')\big)$.

For U2E, we need to precompute the distribution of $d$ for every range of $\epsilon$, $r$ and $d'$, but now $d'$ is between a random pair of perturbed and exact locations: $\Pr\big(d(w, t) \le R_w \mid \epsilon, r, d(w', t)\big)$. For E2E, the reachability probability becomes a step function because $d(w, t)$ is exactly computed at this stage (see Figure 4.26b).

[Figure 4.26: Distributions of $d$ and $\Pr(d \le R_w \mid d')$. Panels: (a) Distribution of $d$ when $1900 \le d' < 2000$ (U2U), number of samples vs. actual distance $d$ in meters; (b) Reachability probability vs. observed distance $d'$ in meters for U2U, U2E and E2E (binary), with $R_w = 2000$ meters, $\epsilon = 0.7$, $r = 700$ meters.]

Figure 4.26b illustrates the reachability probability when varying the observed distance $d'$. Unlike the E2E stage, where the worker-task pair reachability is a step function (binary model), the probabilities of U2U and U2E decrease linearly with the increase of $d'$. We also observe that, compared to U2E, U2U underestimates when $d'$ is small but overestimates when $d'$ is large.

It is worth noting that our empirical approach for precomputing the worker-task pair reachability uses synthetic location datasets, and therefore does not breach individual information.
The precomputed reachability information is used to enhance the performance of task assignment but is not useful to an adversary in gaining information about any individual's location. It is also possible to use either public datasets or past assignment data of completed tasks in the precomputation.

Probability-based Solution

Given the reachability probability, we present a probability-based algorithm that enhances Algorithm 5 at both the U2U and U2E stages. The key improvement is to use the probabilistic model (either the analytical or the empirical approach), rather than the binary model, for quantifying reachability between a worker and a task.

Improvement to U2U

With Algorithm 5, a worker is a candidate for a task if his observed distance to the task is upper-bounded by the reachable distance of the worker (Line 7). Hence, Line 7 of Algorithm 5 may introduce false positives and false negatives. Figure 4.27 shows the precision/recall scores by varying privacy guarantee $r$. Ensuring high recall is important because low recall means most reachable workers are not in the candidate set, likely resulting in large utility loss.

Increasing the recall score: Algorithm 5 can be updated to ensure high recall during the U2U stage as follows. The candidate workers are selected such that their probability of reachability to a task is at least a given threshold $\alpha$, termed the U2U threshold (see Line 7 of Algorithm 6). By decreasing $\alpha$, recall becomes higher but at the cost of lower precision, resulting in an increase in the ratio of unreachable workers in the candidate set. This may incur penalties in the later stage (U2E), such as higher system overhead. We will evaluate the impact of varying $\alpha$ in Section 4.3.3.
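The thresholded U2U selection can be sketched in Python using the normal approximation of $d^2$ from the analytical approach for the equal-privacy case ($\lambda = 4r^2/\epsilon^2$, mean $\lambda(2 + \nu^2)$, variance $4\lambda^2(1 + \nu^2)$). This is a minimal sketch under stated assumptions, not the thesis code: the function names are hypothetical, and $\nu$ is taken here as the observed distance divided by $\sqrt{\lambda}$, an assumption made so that the mean and variance carry consistent units (they then match the moments of a scaled noncentral chi-square variable).

```python
import math


def reach_prob_u2u(d_obs, R_w, eps, r):
    """Approximate Pr(d(w,t) <= R_w | d(w',t') = d_obs) via a normal fit to d^2.

    Assumes both parties use the same privacy level (eps, r).
    """
    lam = 4.0 * r * r / (eps * eps)        # eigenvalue of Sigma_w + Sigma_t
    nu_sq = d_obs * d_obs / lam            # ASSUMPTION: nu is d_obs in units of sqrt(lam)
    mean = lam * (2.0 + nu_sq)             # approximate mean of d^2
    std = 2.0 * lam * math.sqrt(1.0 + nu_sq)  # sqrt of variance 4*lam^2*(1 + nu^2)
    z = (R_w * R_w - mean) / std           # Lemma 1: Pr(d <= R_w) = Pr(d^2 <= R_w^2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def u2u_candidates(observed, eps, r, alpha):
    """Keep workers whose approximate reachability probability is at least alpha.

    observed -- worker id -> (observed distance to the task, reachable distance R_w)
    """
    return [w for w, (d_obs, R_w) in observed.items()
            if reach_prob_u2u(d_obs, R_w, eps, r) >= alpha]


# Illustrative values only (distances in meters).
observed = {"a": (500.0, 3000.0), "b": (6000.0, 1000.0)}
print(u2u_candidates(observed, eps=0.7, r=700.0, alpha=0.1))
```

Lowering `alpha` admits more workers into the candidate list, which mirrors the recall/precision trade-off described above.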
Algorithm 6 Probability-based Algorithm
1: Input: $W, T, R_{w_i}, \epsilon, r, \alpha, \beta$ (refer to the notations in Table 4.6)
2: Output: a set of valid worker-task matches
3: Perturb locations of workers and tasks using Geo-I [16]:
4:     $l_{w_i} \to l_{w'_i}$, $l_{t_j} \to l_{t'_j}$
5: For $t_j \in T$ do: {assign it to the highest-rank worker}
6:     U2U: Server identifies candidate workers $N_j$ for $t_j$:
7:         $N_j = \{w_i : \Pr\big(d(w_i, t_j) \le R_{w_i} \mid d(w'_i, t'_j)\big) \ge \alpha\}$
8:     If $N_j = \emptyset$: $t_j$ remains unassigned; go to Line 5
9:     Server forwards candidate workers $N_j$ to $t_j$'s requester
10:    U2E: Requester matches $t_j$ to $w_{max}$, where
11:        $w_{max} = \arg\max\{Rank(w_i) : w_i \in N_j\}$
12:        $Rank(w_i) = \Pr\big(d(w_i, t_j) \le R_{w_i}\big)$ given $d' = d(w'_i, t_j)$
13:    If $Rank(w_{max}) < \beta$: go to Line 5
14:    Requester sends exact task location $l_{t_j}$ to $w_{max}$
15:    E2E: Worker $w_{max}$ checks if $d(w_{max}, t_j) \le R_{w_{max}}$:
16:        If so, match $(t_j, w_{max})$ is a valid assignment
17:        Update $W = W - \{w_{max}\}$; go to Line 5
18:        Otherwise, update $N_j = N_j - \{w_{max}\}$; go to Line 10

Improving runtime performance: Line 7 of Algorithm 6 linearly checks all workers, which may be time-consuming for a large number of workers. Hence, we propose a technique to quickly prune workers who are most likely not reachable to a particular task. The technique has two steps.

First, each worker (or task) corresponds to a disk with radius $r_R$, centering at the perturbed location, such that the actual location is within the disk with probability at least $\gamma$ (see Section 5 in [16] on how to compute $r_R$). The disks are depicted by the solid circles in Figure 4.28, denoted as $disk(l_{w'}, r_R)$ and $disk(l_{t'}, r_R)$. Hence, the outer dashed circle represents a region $disk(l_{w'}, r_R + R_w)$ that encloses any point in the worker's spatial region $R_w$ with probability at least $\gamma$. Subsequently, we approximate the worker by the larger minimum bounding rectangle (MBR) and the task by the smaller MBR in Figure 4.28.
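The first pruning step can be illustrated with a short Python sketch. It assumes the radius $r_R$ has already been computed as described in [16] and is passed in as a plain number; the names `MBR`, `square_mbr`, `overlaps` and `prune_candidates` are hypothetical, and the linear scan stands in for whatever spatial index is used in practice.

```python
from typing import NamedTuple


class MBR(NamedTuple):
    """Axis-aligned minimum bounding rectangle."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def square_mbr(center, half_side):
    """MBR of a disk with the given center and radius (half_side)."""
    x, y = center
    return MBR(x - half_side, y - half_side, x + half_side, y + half_side)


def overlaps(a, b):
    """True if the two rectangles intersect (touching edges count as overlap)."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax and
            a.ymin <= b.ymax and b.ymin <= a.ymax)


def prune_candidates(workers, task_loc, r_R):
    """Keep workers whose enlarged MBR overlaps the task's MBR.

    workers  -- worker id -> (perturbed location (x, y), reachable distance R_w)
    task_loc -- perturbed task location (x, y)
    r_R      -- confidence radius around a perturbed location (assumed precomputed)
    """
    task_box = square_mbr(task_loc, r_R)                  # smaller MBR
    return [wid for wid, (loc, R_w) in workers.items()
            if overlaps(square_mbr(loc, r_R + R_w), task_box)]  # larger MBR


# Illustrative coordinates in meters.
workers = {"near": ((0.0, 0.0), 1000.0), "far": ((10000.0, 0.0), 1000.0)}
print(prune_candidates(workers, task_loc=(1500.0, 0.0), r_R=500.0))  # ['near']
```

Only the workers that survive this cheap rectangle test would need the more expensive probability evaluation of Line 7.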
In the second step, we build an index to quickly prune far-away workers from a task without a full linear scan by applying existing techniques in the uncertain database field [163, 19]. For example, the fuzzy search technique in [163] is suitable when both data and query (workers and tasks in our context) are represented by rectangles. Applying this pruning technique on the approximated MBRs of workers and tasks gives us a lower bound of $\gamma$ on the reachability probability during U2U. This is because a worker and a task are not reachable from each other if their MBRs do not overlap. Note that pruning workers during U2U makes sense since it is the most time-consuming stage of SCGuard.

[Figure 4.27: Accuracy of the baseline algorithm. Precision and recall for $r \in \{200, 800, 1400, 2000\}$, with $\epsilon = 0.7$, $R_w = 1400$ meters.]

[Figure 4.28: Pruning during U2U.]

Improvement to U2E

The oblivious algorithm ranks candidate workers by either their distances to a task or a random rank associated with each worker (Line 12 of Algorithm 5). However, the nearest (or a random) worker may not be reachable to the task. This is because the reachability probability between worker $w$ and task $t$ depends not only on their distance $d(w, t)$ but also on the worker's spatial region $R_w$. To illustrate, in Figure 4.24b, $t'_2$ is closer to $w'_1$ than to $w'_3$; however, $t'_2$ is more likely to be reachable from $w'_3$ than from $w'_1$ because $R_{w_3}$ is much greater than $R_{w_1}$.

Ranking candidate workers probabilistically: We argue that the candidate workers should be ranked based on their reachability to a task. The reason for this is to reduce the number of workers who are being notified of the task location, which results in a small number of false hits (task location disclosures) while reducing travel cost.

Therefore, we modify the U2E stage of Algorithm 5 to capture the probability of reachability, quantified in Section 4.3.2. Particularly, the requester evaluates the reachability probability between all candidate workers and her task.
Subsequently, each candidate worker is associated with a rank $Rank(w_i)$ that equals the corresponding probability of reachability (see Line 12 of Algorithm 6). The task is matched to the candidate worker of the highest rank $w_{max}$ (Line 10).

Reducing false hits: Algorithm 5 may disclose a task's location to a large number of candidate workers before the task gets assigned. In the worst case, the task's location can be disclosed to all candidate workers, yet none of them are reachable to the task. Thus, to reduce the number of false hits (or privacy disclosures) during U2E, we propose a thresholding heuristic as follows. A requester cancels her task when the reachability probability from the highest-rank worker $w_{max}$ is smaller than a given threshold $\beta$ (Line 13 of Algorithm 6). Subsequently, the requester does not send her task's location to more candidate workers.

The choice of $\beta$ affects the number of assigned tasks, false hits and false dismissals. The smaller $\beta$, the more likely the requester sends the actual location of her task to the candidate workers. This results in the task having a higher chance of being assigned, but at the expense of a higher number of false hits. On the other hand, high $\beta$ may lead to false dismissals. We empirically find a good value of $\beta$ in Section 4.3.3.

4.3.3 Performance Evaluation

We conducted several experiments on real-world data to evaluate the performance of our proposed framework, SCGuard. Below, we present the experimental setup in Section 4.3.3, followed by results in Section 4.3.3.

Experimental Setup

We performed experiments on the T-Drive dataset [208]. We used one day of the data on Jan 11, 2012, which contains trajectories of more than 9,019 taxis and hundreds of thousands of passengers. We assumed that T-Drive drivers were SC workers and T-Drive passengers were SC requesters. The workers' locations were those of the most recent drop-off locations while tasks were at the pick-up locations.
The arrival order of the tasks was determined by sorting their pick-up times. In all of our experiments, we randomly sampled 500 tasks and 500 workers from T-Drive. These numbers are relatively small when compared to the size of the dataset because we focus on privacy and utility trade-offs rather than runtime performance. We chose typical ranges of values for $\epsilon$, $r$, $R_w$ as follows. Without loss of generality, we assumed the requesters and the workers have the same privacy level $(\epsilon, r)$, where $\epsilon \in \{0.1, 0.4, 0.7, 1.0\}$ and $r \in \{2000, 1400, 800, 200\}$ in meters, ranging from strict to loose privacy requirements. We set the reachable distance of each worker to a random value in meters, $1000 \le R_w \le 3000$. We varied the U2U threshold $\alpha \in \{.05, .1, .15, .2, .25, .3, .35, .4\}$ and the U2E threshold $\beta \in \{.1, .15, .2, .25, .3, .35, .4\}$. Default values are shown in boldface. It is our intention to have different default values for $\alpha$ and $\beta$. The reason for this is that, in Algorithm 6, the U2U threshold is applied prior to the U2E threshold; therefore, the values of $\alpha$ must be upper-bounded by the default value of $\beta$, and the values of $\beta$ must be at least equal to the default value of $\alpha$.

In the following, we compare the performance of the proposed algorithms in terms of the performance metrics in Section 4.3.1. In particular, we reported the total number of assigned tasks (utility) and the average travel distance (travel cost) across the assigned tasks. We also calculated the average number of candidate workers per task (system overhead) as well as the average of the precision and recall scores (utility during U2U). We measured the total number of false hits (aka privacy leak or disclosure) and the total number of false dismissals (system overhead during U2E) over all tasks. All measured results were averaged over ten random seeds.
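As an illustration of the per-task U2U metrics reported here, the following Python sketch computes precision and recall of a candidate set against the set of truly reachable workers. The function name and the convention of returning 0.0 for an empty set are assumptions made for this example, not details taken from the evaluation.

```python
def precision_recall(candidates, reachable):
    """Precision and recall of a U2U candidate set for one task.

    candidates -- worker ids the server selected in the noisy domain
    reachable  -- worker ids that are truly reachable in the actual domain
    """
    candidates, reachable = set(candidates), set(reachable)
    hits = len(candidates & reachable)  # reachable workers that were selected
    # ASSUMPTION: define the score as 0.0 when the denominator is empty.
    precision = hits / len(candidates) if candidates else 0.0
    recall = hits / len(reachable) if reachable else 0.0
    return precision, recall


# Three candidates, four reachable workers, two in common.
p, r = precision_recall({"w1", "w2", "w3"}, {"w2", "w3", "w4", "w5"})
print(p, r)  # precision 2/3, recall 0.5
```

Averaging these two scores over all tasks and over the random seeds gives the kind of aggregate shown in the precision/recall plots.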
Experimental Results

We compare the performance of the variations of the three algorithms in Section 4.3.2: the oblivious algorithm, the probability-based algorithm and the Ranking algorithm that has access to exact location information (ground truth). First, the ground truth has two variants: GroundTruth-RR, which uses the random rank strategy, and GroundTruth-NN, which uses the nearest-neighbor strategy. Second, Oblivious-RR and Oblivious-RN refer to Algorithm 5 using the corresponding strategy (random rank or nearest-neighbor) to rank the candidate workers. Third, Probabilistic-Model and Probabilistic-Data are two variants of Algorithm 6, which correspond to the analytical and empirical approaches for quantifying the worker-task pair reachability in Section 4.3.2.

Overview of Results

We present the overview of results for comparing 1) analytical vs. empirical approaches for quantifying the reachability, 2) random rank vs. nearest-neighbor strategies for ranking candidate workers, and 3) performance of the algorithms.

We first compare the analytical and empirical approaches. The graphs in the first row of Figure 4.29 show the results by varying privacy guarantee $r$. We observe that Probabilistic-Model performs as well as Probabilistic-Data in terms of utility, and even slightly better in terms of travel cost and privacy leak. This result shows that the analytical model is as accurate as the empirical counterpart for estimating the worker-task pair reachability. Therefore, we use Probabilistic-Model from now on because it does not require precomputation.

Next, we compare the two strategies for ranking candidate workers, a random rank and the nearest-neighbor. The graphs in the second row of Figure 4.29 show the results by varying privacy guarantee $r$. We observe that GroundTruth-NN yields marginally lower utility when compared to GroundTruth-RR (321 tasks for GroundTruth-RR vs. 314 tasks for GroundTruth-NN in Figure 4.29d) but at a much smaller travel cost (1353 meters for GroundTruth-RR vs. 700 meters for GroundTruth-NN when $r = 200$ in Figure 4.29e). This is because GroundTruth-RR focuses solely on the competitive ratio (see footnote 5) without any spatial consideration, such as distance or reachability of a worker-task pair. For the same reason, when compared to Oblivious-RR, Oblivious-RN yields slightly lower utility at significantly lower travel cost and lower privacy leak. Hence, we will use GroundTruth-NN as the ground truth and Oblivious-RN as the baseline.

Footnote 5: The ratio between its performance and the offline algorithm's performance in terms of utility.

[Figure 4.29: Comparison of the variants of the algorithms by varying privacy guarantee $r \in \{200, 800, 1400, 2000\}$. First row (Probabilistic-Model vs. Probabilistic-Data): (a) Utility (#Tasks); (b) Travel cost (m); (c) #False hits. Second row (GroundTruth-RN, GroundTruth-RR, Oblivious-RN, Oblivious-RR): (d) Utility (#Tasks); (e) Travel cost (m); (f) #False hits (Oblivious-RN and Oblivious-RR only).]

Last but not least, Figure 4.30 compares the performance of different algorithms by varying privacy guarantee $\epsilon$. We report two main results. First, Probabilistic-Model outperforms Oblivious-RN in all key metrics, including higher utility (×2 in Figure 4.30a), smaller travel cost (×2/3 in Figure 4.30b), and much lower disclosure of task location (/500 in Figure 4.30c), with only a slight increase in overhead (20% in Figure 4.30d). These improvements are more significant with higher privacy level (low $\epsilon$). The results confirm that the proposed probabilistic models are superior to the binary model in estimating the worker-task pair reachability. Second, when compared to the ground truth, privacy provided by the probability-based algorithm does not significantly affect utility (Figure 4.30a) and travel cost (Figure 4.30b), showing that tasks can be effectively assigned to nearby workers without compromising the key metrics. This result is significant because utility and travel cost are perhaps the most important factors in SC.

[Figure 4.30: Comparison of the algorithms by varying $\epsilon \in \{0.1, 0.4, 0.7, 1\}$. Panels: (a) Utility (#Tasks); (b) Travel cost (meters); (c) Privacy leak (#False hits); (d) Overhead (#Workers); (e) Precision/recall. Series: GroundTruth-RN, Oblivious-RN, Probabilistic-Model.]

Details of Results

We further compare the algorithms with respect to each performance metric.

Utility (Number of Assigned Tasks)

Figure 4.30a shows the results when varying privacy loss $\epsilon$. GroundTruth-NN achieves the highest utility, followed by Probabilistic-Model, which obtains up to 200% higher utility than Oblivious-RR, especially with higher privacy level (smaller $\epsilon$). We also observe that when $\epsilon$ increases (less privacy), the utility of both Probabilistic-Model and Oblivious-RN asymptotically increases to the utility of GroundTruth-NN. This is because when less noise is injected, the perturbed locations tend to be closer to the actual ones. This yields a higher number of assigned tasks.

Worker Travel Cost

Figure 4.30b shows the results when varying privacy loss $\epsilon$. It is expected that GroundTruth-NN obtains the lowest travel cost as it has access to actual location data. Probabilistic-Model achieves significantly lower travel cost (up to 30%) when compared to Oblivious-RN.
The improvement is higher at smaller ε (higher privacy). We also observe that as ε grows (less privacy), the worker travel cost of both Probabilistic-Model and Oblivious-RN asymptotically reduces to the travel cost of GroundTruth-NN.

Overhead and Privacy Leak

Figures 4.30c and 4.30d show the results when varying the privacy loss ε. Although the overhead of Probabilistic-Model is slightly higher than that of Oblivious-RN (i.e., up to 500 workers vs. up to 400 workers in the candidate set), Probabilistic-Model discloses much less location information (i.e., up to 2 false hits vs. up to 1500 false hits). This means that before a task can be assigned, Oblivious-RN needs to send the task to ∼4.75 workers on average, while the number is ∼1.04 for Probabilistic-Model. Unlike Oblivious-RN, Probabilistic-Model usually identifies a candidate worker who can reach the task at the first try, without the need to send the task to multiple workers during U2E. This result shows that our proposed approaches, the probability-based ranking and the thresholding heuristic in Section 4.3.2, effectively limit the disclosure of location information. This is crucial because the disclosure of location information (among requesters and workers) is the only privacy leak in SCGuard. We also show the impact of the privacy loss ε on system overhead and privacy leak. As expected, when ε increases (less privacy), system overhead and privacy leak decrease while both precision and recall increase.

Effect of Parameter Settings

We evaluate the performance of Probabilistic-Model by varying the U2U and U2E thresholds (α, β).
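The roles of the two thresholds and of the reported metrics can be illustrated with a small sketch. It is not Algorithm 6: the names `Candidate`, `u2u_filter`, `u2e_assign` and `precision_recall`, the toy probabilities, and the metric definitions used here (precision and recall of the U2U candidate set against the truly reachable workers; a false hit as a contacted worker who cannot reach the task; a false dismissal as a reachable candidate whose estimate falls below β) are assumptions made for illustration. The sketch also assumes that the requester contacts candidates one at a time, in decreasing order of estimated reachability.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    worker_id: str
    reach_prob: float  # estimated reachability from perturbed locations (toy values)
    reachable: bool    # ground truth, only revealed when the worker answers (E2E)


def u2u_filter(workers: List[Candidate], alpha: float) -> List[Candidate]:
    """U2U (assumed form): keep workers whose estimated reachability is at least alpha."""
    return [w for w in workers if w.reach_prob >= alpha]


def precision_recall(workers: List[Candidate],
                     candidates: List[Candidate]) -> Tuple[float, float]:
    """Precision and recall of the candidate set against the truly reachable workers."""
    reachable_ids = {w.worker_id for w in workers if w.reachable}
    candidate_ids = {w.worker_id for w in candidates}
    hits = len(reachable_ids & candidate_ids)
    precision = hits / len(candidate_ids) if candidate_ids else 0.0
    recall = hits / len(reachable_ids) if reachable_ids else 0.0
    return precision, recall


def u2e_assign(candidates: List[Candidate],
               beta: float) -> Tuple[Optional[Candidate], int, int]:
    """U2E followed by E2E (assumed form): contact candidates in decreasing order of
    estimated reachability, never contacting anyone whose estimate is below beta.

    Returns (assigned worker or None, false hits, false dismissals)."""
    # Reachable candidates that the beta threshold prevents from ever being contacted.
    false_dismissals = sum(1 for w in candidates if w.reach_prob < beta and w.reachable)
    ranked = sorted(candidates, key=lambda w: w.reach_prob, reverse=True)
    false_hits = 0
    for w in ranked:
        if w.reach_prob < beta:
            break                      # remaining candidates are all below beta
        if w.reachable:
            return w, false_hits, false_dismissals   # worker accepts the task
        false_hits += 1                # task location disclosed to a worker who rejects
    return None, false_hits, false_dismissals


# Toy data: four workers with hypothetical reachability estimates.
workers = [
    Candidate("w1", 0.90, False),
    Candidate("w2", 0.60, True),
    Candidate("w3", 0.30, True),
    Candidate("w4", 0.05, False),
]
candidates = u2u_filter(workers, alpha=0.1)
print(precision_recall(workers, candidates))
print(u2e_assign(candidates, beta=0.25))
```

In this toy setting, lowering `alpha` enlarges the candidate set (more overhead, higher recall), while raising `beta` reduces the number of workers who see the task location at the risk of dismissing a worker who could have reached the task.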
Figure 4.31: Performance of Probabilistic-Model by decreasing U2U threshold (α). (Panels: (a) Countable metrics (utility in #tasks, overhead in #workers), (b) Travel cost (meters) for ε = 0.7 and ε = 1.0, (c) U2U metrics (precision, recall), (d) U2E metrics (false hit, false dismissal), (e) Average runtime (s) of U2U and U2E; α ranges from 0.05 to 0.4.)

Impact of varying U2U threshold α

We first show the impact of varying α on both the U2U and U2E stages (see Figure 4.31). The main observation is that by decreasing the value of α, the algorithm achieves higher utility (Figure 4.31a) and lower travel distance (Figure 4.31b) at the expense of higher system overhead (Figure 4.31a). The reason is that the smaller the U2U threshold α, the larger the worker candidate set (see Line 7 of Algorithm 6), which gives the task more chances to be assigned in the U2E phase. The impact of α on recall in Figure 4.31c confirms our intuition of achieving higher recall in Section 4.3.2. Note that α does not directly impact false hits and false dismissals because they are U2E metrics (Figure 4.31d). Consequently, to achieve high SC utility, the value of the U2U threshold α should be as small as possible, as long as the size of the worker candidate set remains manageable by the requester in terms of runtime (e.g., α = 0.1). Figure 4.31e shows the runtime of the U2E stage when varying α (without the runtime optimization proposed in Section 4.3.2). As expected, the smaller α, the higher the runtime due to having a larger candidate set per task.

Impact of varying U2E threshold β

We present the impact of varying the U2E threshold β on the U2E stage (see Figure 4.32); β has no impact on system overhead or the precision/recall scores because they are U2U metrics. In most cases, we observe that as the U2E threshold β grows, utility, travel cost and the number of false dismissals decrease slightly while location disclosure decreases linearly. This result confirms our aim of reducing privacy leak by introducing the U2E threshold β in Section 4.3.2. However, false dismissal increases at a certain value of β (i.e., 0.25), which obviously decreases utility. The reason is that the higher the U2E threshold β, the more likely a requester misses a candidate worker who can reach a task. In sum, to reduce privacy leak, the value of the U2E threshold β should be as large as possible without incurring significant utility loss (e.g., β = 0.25).

Figure 4.32: Performance of Probabilistic-Model by increasing U2E threshold (β). (Panels: (a) Countable metrics (utility in #tasks, overhead in #workers), (b) U2E metrics (false hit, false dismissal), (c) Travel cost (m) for ε = 0.7 and ε = 1.0; β ranges from 0.1 to 0.4.)

4.3.4 Extensions and Open Problems

Protection from malicious adversaries: Our framework assumes the semi-honest model, which is an important first step towards constructing protocols with stronger security under the malicious model. Using general cryptographic tools such as zero-knowledge proofs, the protocols can usually be transformed into secure protocols under the malicious model. Under the malicious model, requesters, for example, can send multiple fake tasks to estimate the workers’ locations. We can avoid or mitigate such threats with complementary measures such as (1) a reputation system that rates requesters (or workers) and helps detect those who send fake locations, and (2) a payment scheme that requires a payment for each task and thus increases the cost of attacks.
Recent studies proposed techniques for detecting fake events in the spatial crowdsourcing app Waze [72], which can potentially be adapted to our setting to detect malicious requesters or workers. We believe these are important challenges for future research.

Protection for dynamic workers and tasks: Our current framework deals with task assignment at one time point, i.e., given the locations of workers and tasks at that time point. When the locations of tasks and workers change dynamically, if we assume that the locations of each worker or task are independent of each other, the same guarantees of geo-indistinguishability still hold. However, the locations can be correlated in practice; for example, the workers’ traces can follow a specific (movement or driving) pattern, and the task locations of individual requesters can be in the proximity of their whereabouts. In this case, we can use the extended geo-indistinguishability, as discussed in Section III.E of [16], which gives a privacy guarantee for the location set; the amount of noise for each location then increases linearly with the size of the location set. This will inevitably reduce the utility of the system drastically. We note that our three-stage framework is orthogonal to the underlying privacy mechanism and can be extended to work with other privacy notions for better utility in dynamic situations. For example, if we use δ-location set-based differential privacy [198], we only need to modify the components for quantifying worker-task reachability and estimating worker-task distance (Section IV.B) based on the corresponding perturbation mechanisms, which can be an interesting future research direction.

Redundant task assignment: Our framework focuses on SC apps (e.g., Uber, TaskRabbit, GigWalk, FieldAgent) where performing a spatial task does not require multiple workers, such as taxi ride sharing, package delivery, house cleaning and home improvement.
However, there are spatial tasks that may need to be performed redundantly to ensure the quality of the response, such as taking a picture of a particular restaurant or reporting how crowded a restaurant or a gym is at a certain time. In those cases, our proposed algorithm can be extended such that one task can be performed by multiple workers. In particular, the U2E stage of Algorithm 6 can be updated while the U2U and E2E stages stay the same. During U2E, given the perturbed locations of the candidate workers N_j, a requester identifies the K most likely reachable workers and sends the task location to these workers, where K is the number of workers required to perform the task. If K is greater than or equal to |N_j|, the requester sends the task location to all candidate workers. Thereafter, during E2E, the selected workers accept the task if it is enclosed within their spatial regions; otherwise, they reject the task.

Chapter 5

Conclusion

With the popularity of mobile devices, spatial crowdsourcing is rising as a framework that enables human workers to solve tasks in the physical world. With spatial crowdsourcing, requesters outsource a set of spatiotemporal tasks to a set of workers, i.e., individuals with mobile devices who perform the tasks by physically traveling to the specified locations of interest. However, current solutions require a worker to disclose his location to the server and/or to other requesters even before accepting a task, or a requester to disclose his tasks’ locations, which can be used to infer his own location, to untrusted entities. In this thesis, we identified potential privacy threats from the adversaries (server, requester and worker) and presented countermeasures to prevent such threats from occurring. We introduced novel privacy-aware frameworks to protect the locations of both workers and tasks in spatial crowdsourcing. The first framework [170] enables the participation of workers without compromising their location privacy.
We identified geocasting as a needed step to ensure that privacy is protected prior to workers consenting to a task. We also provided heuristics and optimizations for determining effective geocast regions that achieve high task assignment rate with low overhead. Our experimental results on real data demonstrated that the proposed techniques are effective, and the cost of privacy is practical. The second framework [175] protects locations of both workers and tasks in spatial crowdsourcing without any trusted entity. We proposed models for quantifying the probability of reachability between a worker and a task, from which the probability-based algorithm was introduced to assign tasks to workers in an online manner. We introduced the performance metrics to evaluate and compare our different privacy-preserving algorithms. Our experimental results on real data demonstrated that the proposed techniques, algorithms, and heuristics achieve high utility, small worker travel cost, and low disclosure of location information.

As future directions, we plan to use the optimal geo-indistinguishable mechanisms [21] and adopt the elastic distinguishability metric [26] to further improve the utility of the task assignment. The reason is that Geo-I is based on Euclidean distance, meaning that privacy protection is uniform in space. However, in spatial crowdsourcing, we may want to tune the amount of noise injected to each location, such as less noise to dense areas and more noise to sparse areas. Another promising direction is to consider powerful adversaries with knowledge about temporal correlations of a moving user’s locations [198]. We may also consider collusion between workers and the server; for example, some workers may work for the spatial crowdsourcing company or the company may use driverless cars.

Reference List

[1] Openstreetmap: http://www.openstreetmap.org/, 2004.
[2] Amazon mechanical turk: http://www.mturk.com/, 2005.
[3] oDesk: https://www.odesk.com/, 2005.
[4] Wikimapia: http://wikimapia.org/, 2006. [5] Googlemapmaker: http://www.google.com/mapmaker/, 2008. [6] UCB: http://traffic.berkeley.edu/, 2008. [7] Crowdflower: http://www.crowdflower.com/, 2009. [8] Field Agent: http://www.fieldagent.net/, 2010. [9] Gigwalk: http://gigwalk.com, 2010. [10] M. Abramowitz, I. A. Stegun, et al. Handbook of mathematical functions. Applied mathe- matics series, 55(62):39, 1966. [11] A. Alfarrarjeh, T. Emrich, and C. Shahabi. Scalable spatial crowdsourcing: A study of distributed algorithms. In Mobile Data Management (MDM), 2015 16th IEEE International Conference on, volume 1, pages 134–144. IEEE, 2015. [12] O. Alonso, D. E. Rose, and B. Stewart. Crowdsourcing for relevance evaluation. In ACM SigIR Forum, volume 42, pages 9–15. ACM, 2008. [13] F. Alt, A. S. Shirazi, A. Schmidt, U. Kramer, and Z. Nawaz. Location-based crowdsourcing: extending crowdsourcing to the real world. In Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries, pages 13–22. ACM, 2010. [14] F. Alt, A. S. Shirazi, A. Schmidt, U. Kramer, and Z. Nawaz. Location-based crowdsourc- ing: extending crowdsourcing to the real world. In Proc. 6th Nord. Conf. Human-Computer Interact. Extending Boundaries - Nord. ’10, page 13, 2010. [15] N. An, R. Wang, Z. Luan, D. Qian, J. Cai, and H. Zhang. Adaptive assignment for quality- aware mobile sensing network with strategic users. In Proc. - 2015 IEEE 17th Int. Conf. High Perform. Comput. Commun. 2015 IEEE 7th Int. Symp. Cybersp. Saf. Secur. 2015 IEEE 12th Int. Conf. Embed. Softw. Syst. H, pages 541–546. IEEE, aug 2015. 97 [16] M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi. Geo- indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 901–914. ACM, 2013. [17] M. Asghari, D. Deng, C. Shahabi, U. Demiryurek, and Y. Li. 
Price-aware real-time ride- sharing at scale: an auction-based approach. In The 24th ACM SIGSPATIAL, page 3. ACM, 2016. [18] M.Asghari, C.Shahabi, L.Fan, S.Rallapalli, H.Qiu, A.Bency, S.Karthikeyan, R.Govindan, B.Manjunath, R.Urgaonkar, etal. Auction-sc–anauction-basedframeworkforreal-timetask assignment in spatial crowdsourcing. Resource, 12:928. [19] T. Bernecker, T. Emrich, H.-P. Kriegel, N. Mamoulis, M. Renz, and A. Züfle. A novel probabilistic pruning approach to speed up similarity queries in uncertain databases. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 339–350. IEEE, 2011. [20] K. Bessai and F. Charoy. Optimization of Orchestration of Geocrowdsourcing Activities. In Third Int. Conf. Inf. Syst. Cris. Response Manag. Mediterr. Ctries. (ISCRAM-med 2016), pages 1 – 15. Springer, 2016. [21] N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi. Optimal geo-indistinguishable mechanisms for location privacy. In The 2014 ACM SIGSAC (CCS 2014), pages 251–262. ACM, 2014. [22] I. Boutsis and V. Kalogeraki. Privacy preservation for participatory sensing data. In 2013 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 103–113. IEEE, mar 2013. [23] I. Boutsis and V. Kalogeraki. On task assignment for real-time reliable crowdsourcing. In Proc. - Int. Conf. Distrib. Comput. Syst., pages 1–10, 2014. [24] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with crowdsearcher. In Proceedings of the 21st international conference on World Wide Web, pages 1009–1018. ACM, 2012. [25] M. F. Bulut, Y. S. Yilmaz, and M. Demirbas. Crowdsourcing location-based queries. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2011 IEEE International Conference on, pages 513–518. IEEE, 2011. [26] K. Chatzikokolakis, C. Palamidessi, and M. Stronati. Constructing elastic distinguishability metrics for location privacy. In PET 2015, 2015(2):156–170, 2015. [27] C. Chen, S.-f. Cheng, A. 
Gunawan, and A. Misra. TRACCS: Trajectory-Aware Coordinated Urban Crowd-Sourcing. Proc. Second AAAI Conf. Hum. Comput. Crowdsourcing (HCOMP 2014) TRACCS, (Hcomp):30–40, 2014. [28] C. Chen, S. F. Cheng, H. C. Lau, and A. Misra. Towards city-scale mobile crowdsourcing: Task recommendations under trajectory uncertainties. In IJCAI Int. Jt. Conf. Artif. Intell., volume 2015-Janua, pages 1113–1119, 2015. 98 [29] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei. A crowdsourceable qoe evaluation framework for multimedia content. In Proceedings of the 17th ACM international conference on Multimedia, pages 491–500. ACM, 2009. [30] Z. Chen, R. Fu, Z. Zhao, Z. Liu, L. Xia, L. Chen, P. Cheng, C. C. Cao, Y. Tong, and C. J. Zhang. gmission: A general spatial crowdsourcing platform. Proceedings of the VLDB Endowment, 7(13):1629–1632, 2014. [31] P. Cheng, X. Lian, L. Chen, J. Han, and J. Zhao. Task assignment on multi-skill oriented spatial crowdsourcing. IEEE Trans. Knowl. Data Eng., 28(8):2201–2215, 2016. [32] P. Cheng, X. Lian, Z. Chen, R. Fu, L. Chen, J. Han, and J. Zhao. Reliable diversity-based spatial crowdsourcing by moving workers. Proceedings of the VLDB Endowment, 8(10):1022– 1033, 2015. [33] S. Choi, G. Ghinita, H.-S. Lim, and E. Bertino. Secure knn query processing in untrusted cloud environments. Knowledge and Data Engineering, IEEE Transactions on, 26(11):2818– 2831, 2014. [34] D. Christin. Privacy in mobile participatory sensing: Current trends and future challenges. In Journal of Systems and Software, volume 116, pages 57–68, 2016. [35] G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In ICDE, pages 20–31, 2012. [36] C. Cornelius, A. Kapadia, D. Kotz, D. Peebles, M. Shin, and N. Triandopoulos. AnonySense: privacy-aware people-centric sensing. In Intl. Conf. on Mobile systems, applications, and services, pages 211–224, 2008. [37] H. Dang, T. Nguyen, and H. To. 
Maximum complex task assignment: Towards tasks corre- lation in spatial crowdsourcing. In Proceedings of International Conference on Information Integration and Web-based Applications & Services, IIWAS ’13, pages 77:77–77:81, New York, NY, USA, 2013. ACM. [38] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 3:1376, 2013. [39] G. Demartini, B. Trushkowsky, T. Kraska, and M. J. Franklin. Crowdq: Crowdsourced query understanding. In CIDR, 2013. [40] D. Deng, C. Shahabi, and U. Demiryurek. Maximizing the number of worker’s self-selected tasks in spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 324–333. ACM, 2013. [41] D. Deng, C. Shahabi, U. Demiryurek, and L. Zhu. Task selection in spatial crowdsourcing from worker’s perspective. Geoinformatica, 20(3):529–568, jul 2016. [42] D. Deng, C. Shahabi, and L. Zhu. Task matching and scheduling for multiple workers in spatial crowdsourcing, 2015. 99 [43] R. Dewri, I. Ray, and D. Whitley. Query m-invariance: Preventing query disclosures in continuous location-based services. In Mobile Data Management (MDM), 2010 Eleventh International Conference on, pages 95–104. IEEE, 2010. [44] C. Dwork. Differential privacy. In Automata, languages and programming, pages 1–12. Springer, 2006. [45] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pages 265–284. Springer, 2006. [46] L.FanandL.Xiong. Anadaptiveapproachtoreal-timeaggregatemonitoringwithdifferential privacy. Knowledge and Data Engineering, IEEE Transactions on, 26(9):2094–2106, Sept 2014. [47] Y. Fan, H. Sun, and X. Liu. Truthful incentive mechanisms for dynamic and heterogeneous tasks in mobile crowdsourcing. In Proc. - Int. Conf. Tools with Artif. Intell. ICTAI, volume 2016-Janua, pages 881–888. 
IEEE, nov 2016. [48] Y. Fan, H. Sun, Y. Zhu, X. Liu, and J. Yuan. A Truthful Online Auction for Tempo-spatial Crowdsourcing Tasks. In 2015 IEEE Symp. Serv. Syst. Eng., pages 332–338. IEEE, mar 2015. [49] Z. Feng, Y. Zhu, Q. Zhang, L. M. Ni, and A. V. Vasilakos. TRAC: Truthful auction for location-aware collaborative sensing in mobile crowdsourcing. In Proc. - IEEE INFOCOM, pages 1231–1239, 2014. [50] A. S. Fonteles. Heuristics for Task Recommendation in Spatiotemporal Crowdsourcing Sys- tems. Proc. 13th Int. Conf. Adv. Mob. Comput. Multimed. - MoMM 2015, pages 1–5, 2015. [51] A. S. Fonteles, S. Bouveret, and J. Gensel. Towards matching improvement between spatio- temporal tasks and workers in mobile crowdsourcing market systems. In Proc. Third ACM SIGSPATIAL Int. Work. Mob. Geogr. Inf. Syst. - MobiGIS ’14, pages 43–50, 2014. [52] A. S. Fonteles, S. Bouveret, and J. Gensel. Opportunistic trajectory recommendation for task accomplishment in crowdsourcing systems. In Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), volume 9080, pages 178–190. Springer International Publishing, 2015. [53] M.J.Franklin, D.Kossmann, T.Kraska, S.Ramesh, andR.Xin. Crowddb: answeringqueries with crowdsourcing. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 61–72. ACM, 2011. [54] D. Gao, Y. Tong, J. She, T. Song, L. Chen, and K. Xu. Top-k Team Recommendation in Spatial Crowdsourcing. In Web-Age Inf. Manag., pages 191–204. Springer International Publishing, 2016. [55] B.GedikandL.Liu. Protectinglocationprivacywithpersonalizedk-anonymity: Architecture and algorithms. Mobile Computing, IEEE Transactions on, 7(1):1–18, 2008. [56] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K.-L. Tan. Private queries in location based services: anonymizers are not necessary. In SIGMOD, pages 121–132, 2008. 100 [57] Y. Gong, L. Wei, Y. Guo, C. Zhang, and Y. Fang. 
Optimal task recommendation for mobile crowdsourcing with privacy control. Internet of Things Journal, 2015. [58] Y. Gong, C. Zhang, Y. Fang, and J. Sun. Protecting Location Privacy for Task Allocation in Ad Hoc Mobile Cloud Computing. In IEEE Transactions on Emerging Topics in Computing, pages 1–1, 2015. [59] Y. Gong, C. Zhang, Y. Fang, and J. Sun. Protecting location privacy for task allocation in ad hoc mobile cloud computing. IEEE Transactions on Emerging Topics in Computing, 2015. [60] M.C.Gonzalez, C.A.Hidalgo, andA.-L.Barabasi. Understandingindividualhumanmobility patterns. Nature, 453(7196):779–782, 2008. [61] S. Goryczka and L. Xiong. A comprehensive comparison of multiparty secure additions with differential privacy. In IEEE TDSC 2015, 2015. [62] M. Gruteser and D. Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In Proceedings of the 1st international conference on Mobile systems, applications and services, pages 31–42. ACM, 2003. [63] B. Guo, H. Chen, Z. Yu, W. Nan, X. Xie, D. Zhang, and X. Zhou. TaskMe: Toward a Dynamic and Quality-Enhanced Incentive Mechanism for Mobile Crowd Sensing. Int. J. Hum. Comput. Stud., 2016. [64] B. Guo, Y. Liu, W. Wu, Z. Yu, and Q. Han. ActiveCrowd: A Framework for Optimized Multitask Allocation in Mobile Crowdsensing Systems. IEEE Trans. Human-Machine Syst., pages 1–12, 2016. [65] I. Guy, A. Perer, T. Daniel, O. Greenshpan, and I. Turbahn. Guess who?: enriching the social graph through a crowdsourcing game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1373–1382. ACM, 2011. [66] M.Hadano, M.Nakatsuji, H.Toda, andY.Koike. AssigningTaskstoWorkersbyReferringto Their Schedules in Mobile Crowdsourcing. Third AAAI Conf. Hum. Comput. Crowdsourcing, 2015. [67] S. He, D. H. Shin, J. Zhang, and J. Chen. Toward optimal allocation of location dependent tasks in crowdsensing. In Proc. - IEEE INFOCOM, pages 745–753, 2014. [68] B. J. Hecht and D. Gergle. 
On the localness of user-generated content. In Proceedings of the 2010 ACM conference on Computer supported cooperative work, pages 229–232. ACM, 2010. [69] K. Heimerl, B. Gawalt, K. Chen, T. Parikh, and B. Hartmann. CommunitySourcing: Engag- ing Local Crowds to Perform Expert Work Via Physical Kiosks. In Proc. 2012 ACM Annu. Conf. Hum. Factors Comput. Syst. - CHI ’12, page 1539, 2012. [70] S. Hettich and M. J. Pazzani. Mining for proposal reviewers: lessons learned at the national science foundation. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 862–871. ACM, 2006. 101 [71] K. Hill. ’god view’: Uber allegedly stalked users for party-goers’ viewing pleasure (updated). 2014. [72] K. Hill. If you use waze, hackers can stalk you. http://fusion.net/, 2016. [73] R. Hirson. Uber: The big data company. 2015. [74] H. Hu, Y. Zheng, Z. Bao, G. Li, J. Feng, and R. Cheng. Crowdsourced POI labelling: Location-aware result inference and Task Assignment. In 2016 IEEE 32nd Int. Conf. Data Eng. ICDE 2016, pages 61–72. IEEE, may 2016. [75] J. Hu, L. Huang, L. Li, M. Qi, and W. Yang. Protecting Location Privacy in Spatial Crowd- sourcing. In Asia-Pacific Web Conference, pages 113–124. Springer International Publishing, 2015. [76] J. Hu, L. Huang, L. Li, M. Qi, and W. Yang. Protecting location privacy in spatial crowd- sourcing. In Web Technologies and Applications, pages 113–124. Springer, 2015. [77] L. Hu and C. Shahabi. Privacy assurance in mobile sensing networks: go beyond trusted servers. In Pervasive Computing and Communications, pages 613–619, 2010. [78] K. L. Huang, S. S. Kanhere, and W. Hu. Towards privacy-sensitive participatory sensing. In Pervasive Computing and Communications, pages 1–6, 2009. [79] B. Hull, V. Bychkovsky, Y. Zhang, K. Chen, M. Goraczko, A. Miu, E. Shih, H. Balakrishnan, and S. Madden. Cartel: a distributed mobile sensor computing system. 
In Proceedings of the 4th international conference on Embedded networked sensor systems, pages 125–138. ACM, 2006. [80] N. Jaiman, R. Tandriansyah, T. Kandappu, and A. Misra. A campus-scale mobile crowd- tasking platform. In Proc. 2016 ACM Int. Jt. Conf. Pervasive Ubiquitous Comput. Adjun. - UbiComp ’16, pages 297–300, New York, New York, USA, 2016. ACM Press. [81] L. G. Jaimes, I. Vergara-Laurens, and M. a. Labrador. A location-based incentive mechanism for participatory sensing systems with budget constraints. 2012 IEEE Int. Conf. Pervasive Comput. Commun., (March):103–108, mar 2012. [82] H. Jin, L. Su, D. Chen, K. Nahrstedt, and J. Xu. Quality of Information Aware Incentive Mechanisms for Mobile Crowd Sensing Systems. In Proc. 16th ACM Int. Symp. Mob. Ad Hoc Netw. Comput. - MobiHoc ’15, pages 167–176, New York, New York, USA, 2015. ACM Press. [83] X.Jin, R.Zhang, Y.Chen, T.Li, andY.Zhang. DPSense: Differentiallyprivatecrowdsourced spectrum sensing. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 296–307. ACM, 2016. [84] J. V.-D. Julia Angwin. Apple, google collect user data. http://www.wsj.com, 2011. [85] R. E. Kalman et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960. 102 [86] P. Kalnis, G. Ghinita, K. Mouratidis, and D. Papadias. Preventing location-based identity inference in anonymous spatial queries. Knowledge and Data Engineering, IEEE Transactions on, 19(12):1719–1733, 2007. [87] B. Kalyanadundaram and K. R. Pruhs. Online weighted matching. Journal of Algorithms, pages 478–488, 1993. [88] B. Kalyanasundaram and K. R. Pruhs. An optimal deterministic algorithm for online b- matching. Theoretical Computer Science, 233(1):319–325, 2000. [89] T. Kandappu, N. Jaiman, R. Tandriansyah, A. Misra, S.-F. Cheng, C. Chen, H. C. Lau, D. Chander, and K. Dasgupta. TASKer: Behavioral Insights via Campus-based Experimental Mobile Crowd-sourcing. In Proc. 
2016 ACM Int. Jt. Conf. Pervasive Ubiquitous Comput. - UbiComp ’16, pages 392–402, New York, New York, USA, 2016. ACM Press. [90] T. Kandappu, A. Misra, S.-f. Cheng, N. Jaiman, and R. Tandriansiyah. Campus-Scale Mobile Crowd-Tasking: Deployment&BehavioralInsights. InProc. 19th ACM Conf. Comput. Coop. Work\& Soc. Comput., pages 798–810, New York, New York, USA, 2016. ACM Press. [91] Y. Kang, X. Miao, K. Liu, L. Chen, and Y. Liu. Quality-aware online task assignment in mobile crowdsourcing. In Proc. - 2015 IEEE 12th Int. Conf. Mob. Ad Hoc Sens. Syst. MASS 2015, pages 127–135. IEEE, oct 2015. [92] R. M. Karp, U. V. Vazirani, and V. V. Vazirani. An optimal algorithm for on-line bipar- tite matching. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, pages 352–358. ACM, 1990. [93] L. Kazemi and C. Shahabi. A privacy-aware framework for participatory sensing. ACM SIGKDD Explorations Newsletter, 13(1):43–51, 2011. [94] L. Kazemi and C. Shahabi. Towards preserving privacy in participatory sensing. In Pervasive Computing and Communications, pages 328–331. IEEE, 2011. [95] L. Kazemi and C. Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems. ACM, 2012. [96] L. Kazemi, C. Shahabi, and L. Chen. Geotrucrowd: trustworthy query answering with spatial crowdsourcing. In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 314–323. ACM, 2013. [97] A. Khoshgozaran and C. Shahabi. Blind evaluation of nearest neighbor queries using space transformation to preserve location privacy. In Advances in Spatial and Temporal Databases, pages 239–257. Springer, 2007. [98] S. Khuller, S. G. Mitchell, and V. V. Vazirani. On-line algorithms for weighted bipartite matching and stable marriages. Theoretical Computer Science, pages 255–267, 1994. [99] H. Kido, Y. Yanagisawa, and T. Satoh. 
An anonymous communication technique using dummies for location-based services. In Pervasive Services, 2005. ICPS’05. Proceedings. International Conference on, pages 88–97. IEEE, 2005. 103 [100] C. E. Kim and T. A. Anderson. Digital disks and a digital compactness measure. In Proceed- ings of the sixteenth annual ACM symposium on Theory of computing, pages 117–124. ACM, 1984. [101] S. H. Kim, Y. Lu, G. Constantinou, C. Shahabi, G. Wang, and R. Zimmermann. Mediaq: mobile multimedia management system. In Proceedings of the 5th ACM Multimedia Systems Conference, pages 224–235. ACM, 2014. [102] A. Kittur, J. V. Nickerson, M. Bernstein, E. Gerber, A. Shaw, J. Zimmerman, M. Lease, and J. Horton. The future of crowd work. In Proceedings of the 2013 conference on Computer supported cooperative work, pages 1301–1318. ACM, 2013. [103] J. Krumm. Inference attacks on location tracks. In Pervasive Computing, pages 127–143. Springer, 2007. [104] E. L. Lawler, J. K. Lenstra, A. R. Kan, and D. B. Shmoys. The traveling salesman problem: a guided tour of combinatorial optimization, volume 3. Wiley New York, 1985. [105] J. S. Lee and B. Hoh. Dynamic pricing incentive for participatory sensing. In Pervasive Mob. Comput., volume 6, pages 693–708, 2010. [106] Q. Li and G. Cao. Providing privacy-aware incentives in mobile sensing systems. In IEEE Trans. Mob. Comput., volume 15, pages 1485–1498, 2016. [107] Y.Li, M.L.Yiu, andW.Xu. Orientedonlinerouterecommendationforspatialcrowdsourcing task workers. In Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), volume 9239, pages 137–156. Springer International Publishing, 2015. [108] A. Liu, Z.-X. Li, G.-F. Liu, K. Zheng, M. Zhang, Q. Li, and X. Zhang. Privacy-preserving task assignment in spatial crowdsourcing. Journal of Computer Science and Technology, 32(5):905–918, 2017. [109] A. Liu, W. Wang, S. Shang, Q. Li, and X. Zhang. 
Efficient task assignment in spatial crowdsourcing with worker and task privacy protection. GeoInformatica, pages 1–28, 2017. [110] B. Liu, L. Chen, X. Zhu, Y. Zhang, C. Zhang, and W. Qiu. Protecting location privacy in spatial crowdsourcing using encrypted data. In EDBT, pages 478–481, 2017. [111] Q. Liu, T. Abdessalem, H. Wu, Z. Yuan, and S. Bressan. Cost minimization and social fairness for spatial crowdsourcing tasks. In Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), volume 9642, pages 3–17. Springer International Publishing, 2016. [112] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. Cdas: a crowdsourcing data analytics system. Proceedings of the VLDB Endowment, 5(10):1040–1051, 2012. [113] Y. Liu, B. Guo, Y. Wang, W. Wu, Z. Yu, and D. Zhang. TaskMe: Multi-Task Allocation in Mobile Crowd Sensing. In Proc. 2016 ACM Int. Jt. Conf. Pervasive Ubiquitous Comput. - UbiComp ’16, pages 403–414, New York, New York, USA, 2016. ACM Press. [114] Z. Liu, X. Niu, X. Lin, T. Huang, Y. Wu, and H. Li. A task-centric cooperative sensing scheme for mobile crowdsourcing systems. Sensors (Switzerland), 16(5), 2016. [115] A. Marcus, E. Wu, D. Karger, S. Madden, and R. Miller. Human-powered sorts and joins. Proceedings of the VLDB Endowment, 5(1):13–24, 2011. [116] A. M. Mathai and S. B. Provost. Quadratic forms in random variables: theory and applications. Dekker, 1992. [117] R. Mcmillan. The hidden privacy threat of ... flashlight apps? www.wired.com, 2014. [118] F. McSherry and I. Mironov. Differentially private recommender systems: building privacy into the net. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 627–636. ACM, 2009. [119] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30. ACM, 2009. [120] A. Mehta, A. Saberi, U. 
Vazirani, and V. Vazirani. Adwords and generalized online matching. Journal of the ACM (JACM), 54(5):22, 2007. [121] C. Miao, H. Yu, Z. Shen, and C. Leung. Balancing quality and budget considerations in mobile crowdsourcing. Decis. Support Syst., 2016. [122] P. Micholia, M. Karaliopoulos, I. Koutsopoulos, L. M. Aiello, G. D. F. Morales, and D. Quercia. Incentivizing Social Media Users for Mobile Crowdsourcing. Int. J. Hum. Comput. Stud., 2016. [123] D. Mimno and A. McCallum. Expertise modeling for matching papers with reviewers. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 500–509. ACM, 2007. [124] mobiThinking. Global mobile statistics 2014 part a: Mobile subscribers; handset market share; mobile operators. May 16 2014. [125] P. Mohan, V. N. Padmanabhan, and R. Ramjee. Nericell: rich monitoring of road and traffic conditions using mobile smartphones. In Proceedings of the 6th ACM conference on Embedded network sensor systems, pages 323–336. ACM, 2008. [126] M. F. Mokbel, C.-Y. Chow, and W. G. Aref. The new Casper: query processing for location services without compromising privacy. In Proceedings of the 32nd international conference on Very large data bases, pages 763–774. VLDB Endowment, 2006. [127] P. Mrazovic, M. Matskin, and N. Dokoohaki. Trajectory-Based Task Allocation for Reliable Mobile Crowd Sensing Systems. In Proc. - 15th IEEE Int. Conf. Data Min. Work. ICDMW 2015, pages 398–406, 2016. [128] M. Musthag and D. Ganesan. The Role of Super Agents in Mobile Crowdsourcing. In Work. Twenty-Sixth AAAI Conf. ..., pages 143–149, 2012. [129] M. Musthag and D. Ganesan. Labor dynamics in a mobile micro-task market. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 641–650. ACM, 2013. [130] J. C. Navas and T. Imielinski. Geocast: geographic addressing and routing. 
In Proceedings of the 3rd annual ACM/IEEE international conference on Mobile computing and networking, pages 66–76. ACM, 1997. [131] P. Nguyen. iRain: new mobile app to promote citizen-science and support water management: http://en.unesco.org/news/irain-new-mobile-app-promote-citizen-science-and-support-water-management. 2016. [132] J. Ni, X. Lin, K. Zhang, and Y. Yu. Secure and deduplicated spatial crowdsourcing: A fog-based approach. In Global Communications Conference (GLOBECOM), 2016 IEEE, pages 1–6. IEEE, 2016. [133] B. Palanisamy and L. Liu. Mobimix: Protecting location privacy with mix-zones over road networks. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 494–505. IEEE, 2011. [134] A. Parameswaran, A. D. Sarma, H. Garcia-Molina, N. Polyzotis, and J. Widom. Human-assisted graph search: it’s okay to ask questions. Proceedings of the VLDB Endowment, 4(5):267–278, 2011. [135] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: Algorithms for filtering data with humans. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 361–372. ACM, 2012. [136] D. Perry. Sex and uber’s ’rides of glory’: The company tracks your one-night stands – and much more. 2014. [137] A. Pham, I. Dacosta, B. Jacot-Guillarmod, K. Huguenin, T. Hajar, F. Tramèr, V. Gligor, and J.-P. Hubaux. PrivateRide: A privacy-enhanced ride-hailing service. Proceedings on Privacy Enhancing Technologies, 2017(2):38–56, 2017. [138] L. Pournajaf, D. A. Garcia-Ulloa, L. Xiong, and V. Sunderam. Participant privacy in mobile crowd sensing task management: A survey of methods and challenges. SIGMOD Record, 44(4):23, 2015. [139] L. Pournajaf, L. Xiong, and V. Sunderam. Dynamic data driven crowd sensing task assignment. Procedia Computer Science, 29:1314–1323, 2014. [140] L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. 
In Mobile Data Management (MDM), 2014 IEEE 15th International Conference on, volume 1, pages 73–82. IEEE, 2014. [141] W. Qardaji, W. Yang, and N. Li. Differentially private grids for geospatial data. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on. IEEE, 2013. [142] W. Qardaji, W. Yang, and N. Li. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 6(14):1954–1965, 2013. [143] A. J. Quinn and B. B. Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1403–1412. ACM, 2011. [144] J. P. Rula, V. Navda, F. E. Bustamante, R. Bhagwan, and S. Guha. No "one-size fits all": Towards a principled approach for incentives in mobile crowdsourcing. In Proc. 15th Work. Mob. Comput. Syst. Appl. - HotMobile ’14, pages 1–5, New York, New York, USA, 2014. ACM Press. [145] A. Sadilek, J. Krumm, and E. Horvitz. Crowdphysics: Planned and Opportunistic Crowdsourcing for Physical Tasks. In Proc. Seventh Int. AAAI Conf. Weblogs Soc. Media, pages 536–545, 2013. [146] A. Sales Fonteles, S. Bouveret, and J. Gensel. Trajectory recommendation for task accomplishment in crowdsourcing – a model to favour different actors. J. Locat. Based Serv., 10(2):125–141, apr 2016. [147] H. Samet. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006. [148] M. Sauter. Beyond 3g: Bringing networks, terminals and the web together: Lte, wimax, ims, 4g devices and the mobile web 2.0. 2011. [149] J. Scheck. Stalkers exploit cellphone GPS. http://www.wsj.com, 2010. [150] H. Shah-Mansouri and V. W. S. Wong. Profit maximization in mobile crowdsourcing: A truthful auction mechanism. In IEEE Int. Conf. Commun., volume 2015-Septe, pages 3216–3221. IEEE, jun 2015. [151] Y. Shen, L. Huang, L. Li, X. Lu, S. Wang, and W. Yang. Towards preserving worker location privacy in spatial crowdsourcing. 
In 2015 IEEE Global Communications Conference, GLOBECOM 2015, 2016. [152] M. Shin, C. Cornelius, D. Peebles, A. Kapadia, D. Kotz, and N. Triandopoulos. AnonySense: A system for anonymous opportunistic sensing. Pervasive and Mobile Computing, 7(1):16–30, 2011. [153] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pages 254–263. Association for Computational Linguistics, 2008. [154] T. Song, Y. Tong, L. Wang, J. She, B. Yao, L. Chen, and K. Xu. Trichromatic online matching in real-time spatial crowdsourcing. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 1009–1020. IEEE, 2017. [155] Z. Song, C. H. Liu, J. Wu, J. Ma, and W. Wang. QoI-aware multitask-oriented dynamic participant selection with budget constraints. IEEE Trans. Veh. Technol., 63(9):4618–4632, 2014. [156] A. Sorokin and D. Forsyth. Utility data annotation with amazon mechanical turk. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on, pages 1–8. IEEE, 2008. [157] G. L. Stüber. Principles of mobile communication, volume 2. Springer, 2001. [158] D. Sun, Y. Gao, and D. Yu. Efficient and Load Balancing Strategy for Task Scheduling in Spatial Crowdsourcing. In Int. Conf. Web-Age Inf. Manag., pages 161–173. Springer International Publishing, 2016. [159] Y. Sun, A. Liu, Z. Li, G. Liu, L. Zhao, and K. Zheng. Anonymity-based privacy-preserving task assignment in spatial crowdsourcing. In International Conference on Web Information Systems Engineering, pages 263–277. Springer, 2017. [160] Y.-H. Sun, J. Ma, Z.-P. Fan, and J. Wang. A hybrid knowledge and model approach for reviewer assignment. Expert Systems with Applications, 34(2):817–824, 2008. [161] K. Tan and Q. Tao. Market-Driven Optimal Task Assignment in Spatial Crowdsouring. In Int. 
Conf. Web-Age Inf. Manag., pages 224–235. Springer International Publishing, 2016. [162] W. Tang, J. Tang, T. Lei, C. Tan, B. Gao, and T. Li. On optimization of expertise matching with various constraints. Neurocomputing, 76(1):71–83, 2012. [163] Y. Tao, X. Xiao, and R. Cheng. Range search on multidimensional uncertain data. ACM Transactions on Database Systems (TODS), 32(3):15, 2007. [164] R. Teodoro, P. Ozturk, M. Naaman, W. Mason, and J. Lindqvist. The motivations and experiences of the on-demand mobile workforce. In The 17th CSCW, pages 236–247. ACM, 2014. [165] J. Thebault-Spieker, L. G. Terveen, and B. Hecht. Avoiding the south side and the suburbs: The geography of mobile crowdsourcing markets. In The 18th CSCW, pages 265–275. ACM, 2015. [166] H. To, M. Asghari, D. Deng, and C. Shahabi. Scawg: A toolbox for generating synthetic workload for spatial crowdsourcing. In Pervasive Computing and Communication Workshops (PerCom Workshops), 2016 IEEE International Conference on, pages 1–6. IEEE, 2016. [167] H. To, L. Fan, L. Tran, and C. Shahabi. Real-time task assignment in hyperlocal spatial crowdsourcing under budget constraints. In 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), pages 1–8. IEEE, 2016. [168] H. To, R. Geraldes, C. Shahabi, S. H. Kim, and H. Prendinger. An empirical study of workers’ behavior in spatial crowdsourcing. In Proceedings of the Third International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data, page 8. ACM, 2016. [169] H. To, G. Ghinita, L. Fan, and C. Shahabi. Differentially Private Location Protection for Worker Datasets in Spatial Crowdsourcing. In IEEE Transactions on Mobile Computing, volume PP, pages 1–1, 2016. [170] H. To, G. Ghinita, and C. Shahabi. A framework for protecting worker location privacy in spatial crowdsourcing. Proceedings of the VLDB Endowment, 7(10):919–930, 2014. [171] H. To, G. Ghinita, and C. Shahabi. 
PrivGeoCrowd: A toolbox for studying private spatial Crowdsourcing. In Proceedings - International Conference on Data Engineering, volume 2015-May, pages 1404–1407. IEEE, apr 2015. [172] H. To, S. H. Kim, and C. Shahabi. Effectively crowdsourcing the acquisition and analysis of visual data for disaster response. In Big Data (Big Data), 2015 IEEE International Conference on, pages 697–706. IEEE, 2015. [173] H. To and C. Shahabi. Location privacy in spatial crowdsourcing. Book Chapter in Springer Handbook on Mobile Data Privacy (to appear), 2018. [174] H. To, C. Shahabi, and L. Kazemi. A server-assigned spatial crowdsourcing framework. ACM Transactions on Spatial Algorithms and Systems, 1(1):2, 2015. [175] H. To, C. Shahabi, and L. Xiong. Privacy-preserving online task assignment in spatial crowdsourcing with untrusted server. 34th IEEE International Conference on Data Engineering (revision), 2018. [176] Y. Tong, J. She, B. Ding, L. Chen, T. Wo, and K. Xu. Online minimum matching in real-time spatial data: experiments and analysis. Proceedings of the VLDB Endowment, 9(12):1053–1064, 2016. [177] Y. Tong, J. She, B. Ding, L. Wang, and L. Chen. Online mobile micro-task allocation in spatial crowdsourcing. In ICDE 2016, pages 49–60. IEEE, 2016. [178] P. Toth and D. Vigo. The vehicle routing problem. Siam, 2001. [179] L. Tran, H. To, L. Fan, and C. Shahabi. A real-time framework for task assignment in hyperlocal spatial crowdsourcing. In ACM TIST 2017, 2017. [180] S.-J. Tu and E. Fischbach. Random distance distribution for spherical objects: general theory and applications to physics. Journal of Physics A: Mathematical and General, 35(31):6557, 2002. [181] U. ul Hassan and E. Curry. A multi-armed bandit approach to online spatial task assignment. In 11th IEEE International Conference on Ubiquitous Intelligence and Computing (UIC 2014), 2014. [182] U. ul Hassan and E. Curry. Flag-verify-fix: adaptive spatial crowdsourcing leveraging location-based social networks. Proc. 
23rd SIGSPATIAL Int. Conf. Adv. Geogr. Inf. Syst. - GIS ’15, pages 1–4, 2015. [183] U. ul Hassan and E. Curry. Efficient task assignment for spatial crowdsourcing: A combinatorial fractional optimization approach with semi-bandit learning. Expert Systems with Applications, 2016. [184] K. R. Varadarajan. A divide-and-conquer algorithm for min-cost perfect matching in the plane. In Foundations of Computer Science, 1998. Proceedings. 39th Annual Symposium on, pages 320–329. IEEE, 1998. [185] L. Von Ahn and L. Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58–67, 2008. [186] K. Vu, R. Zheng, and J. Gao. Efficient algorithms for K-anonymous location privacy in participatory sensing. In Proceedings - IEEE INFOCOM, pages 2399–2407, 2012. [187] G. Wang, B. Wang, T. Wang, A. Nika, H. Zheng, and B. Y. Zhao. Defending against Sybil Devices in Crowdsourced Mapping Services. Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services - MobiSys ’16, pages 179–191, 2016. [188] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment, 5(11):1483–1494, 2012. [189] L. Wang, D. Yang, X. Han, T. Wang, D. Zhang, and X. Ma. Location privacy-preserving task allocation for mobile crowdsensing with differential geo-obfuscation. In Proceedings of the 26th International Conference on World Wide Web, pages 627–636. International World Wide Web Conferences Steering Committee, 2017. [190] L. Wang, D. Zhang, A. Pathak, C. Chen, H. Xiong, D. Yang, and Y. Wang. CCS-TA-Quality-Guaranteed Online Task Allocation in Compressive Crowdsensing. In Proc. 2015 ACM Int. Jt. Conf. Pervasive Ubiquitous Comput. - UbiComp ’15, pages 683–694, 2015. [191] Y. Wang, D. Zhang, Q. Liu, F. Shen, and L. H. Lee. Towards enhancing the last-mile delivery: An effective crowd-tasking model with scalable solutions. Transp. Res. Part E Logist. Transp. Rev., 93:279–293, 2016. [192] E. 
Weise and J. Guynn. Uber tracking raises privacy concerns. 2014. [193] E. Welzl. Smallest enclosing disks (balls and ellipsoids). Springer, 1991. [194] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043, 2009. [195] R. C.-W. Wong, Y. Tao, A. W.-C. Fu, and X. Xiao. On efficient spatial matching. In Proceedings of the 33rd international conference on Very large data bases, pages 579–590. VLDB Endowment, 2007. [196] M. Xiao, J. Wu, L. Huang, Y. Wang, and C. Liu. Multi-task assignment for crowdsensing in mobile social networks. In Proc. - IEEE INFOCOM, volume 26, pages 2227–2235, 2015. [197] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. Knowledge and Data Engineering, IEEE Transactions on, 23(8):1200–1214, 2011. [198] Y. Xiao and L. Xiong. Protecting locations with differential privacy under temporal correlations. In The 22nd ACM SIGSAC (CCS 2015), pages 1298–1309. ACM, 2015. [199] Y. Xiao, L. Xiong, and C. Yuan. Differentially private data release through multidimensional partitioning. In Secure Data Management, pages 150–168. Springer, 2010. [200] H. Xiong, D. Zhang, G. Chen, L. Wang, and V. Gauthier. CrowdTasker: Maximizing coverage quality in Piggyback Crowdsensing under budget constraint. In 2015 IEEE Int. Conf. Pervasive Comput. Commun. PerCom 2015, pages 55–62, 2015. [201] M. Xue, P. Kalnis, and H. K. Pung. Location diversity: Enhanced privacy protection in location based services. In Location and Context Awareness, pages 70–87. Springer, 2009. [202] T. Yan, V. Kumar, and D. Ganesan. Crowdsearch: exploiting crowds for accurate real-time image search on mobile phones. In Proceedings of the 8th international conference on Mobile systems, applications, and services, pages 77–90. ACM, 2010. [203] D. Yang, G. Xue, X. Fang, and J. Tang. 
Crowdsourcing to Smartphones: Incentive Mechanism Design for Mobile Phone Sensing. 18th Annu. Int. Conf. Mob. Comput. Netw. (MobiCom 2012), pages 173–184, 2012. [204] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 733–744. IEEE, 2013. [205] M. L. Yiu, C. S. Jensen, X. Huang, and H. Lu. Spacetwist: Managing the trade-offs among location privacy, query performance, and query accuracy in mobile services. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on, pages 366–375. IEEE, 2008. [206] M. L. Yiu, K. Mouratidis, N. Mamoulis, et al. Capacity constrained assignment in spatial databases. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 15–28. ACM, 2008. [207] H. Yu, C. Miao, Z. Shen, and C. Leung. Quality and budget aware task allocation for spatial crowdsourcing. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1689–1690. International Foundation for Autonomous Agents and Multiagent Systems, 2015. [208] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-Drive: driving directions based on taxi trajectories. In Proceedings of the 18th SIGSPATIAL International conference on advances in geographic information systems, pages 99–108. ACM, 2010. [209] B. Zhang, C. H. Liu, J. Lu, Z. Song, Z. Ren, J. Ma, and W. Wang. Privacy-preserving QoI-aware participant coordination for mobile crowdsourcing. In Computer Networks, volume 101, pages 29–41, 2016. [210] D. Zhang, H. Xiong, L. Wang, and G. Chen. CrowdRecruiter: selecting participants for piggyback crowdsensing under probabilistic coverage constraint. Proc. 2014 ACM Int. Jt. Conf. Pervasive Ubiquitous Comput. - UbiComp ’14 Adjun., pages 703–714, 2014. [211] H. Zhang, Z. Xu, X. Du, Z. Zhou, and J. Shi. CAPR: Context-aware participant recruitment mechanism in mobile crowdsourcing, oct 2016. 
[212] L. Zhang, X. Lu, P. Xiong, and T. Zhu. A Differentially Private Method for Reward-Based Spatial Crowdsourcing. In International Conference on Applications and Techniques in Information Security, pages 153–164. Springer Berlin Heidelberg, 2015. [213] L. Zhang, X. Lu, P. Xiong, and T. Zhu. A differentially private method for reward-based spatial crowdsourcing. In Applications and Techniques in Information Security, pages 153–164. Springer, 2015. [214] X. Zhang, Z. Yang, Y. Gong, Y. Liu, and S. Tang. SpatialRecruiter: Maximizing Sensing Coverage in Selecting Workers for Spatial Crowdsourcing. IEEE Trans. Veh. Technol., pages 1–1, 2016. [215] X. Zhang, Z. Yang, Y. Liu, J. Li, and Z. Ming. Towards Efficient Mechanisms for Mobile Crowdsensing. IEEE Trans. Veh. Technol., 9545(c):1–1, 2016. [216] X. Zhang, Z. Yang, Y. Liu, and S. Tang. On Reliable Task Assignment for Spatial Crowdsourcing. IEEE Trans. Emerg. Top. Comput., pages 1–1, 2016. [217] Y. Zhao, Y. Li, Y. Wang, H. Su, and K. Zheng. Destination-aware task assignment in spatial crowdsourcing. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, November 06 - 10, 2017, pages 297–306, 2017. [218] Y. Zhao, C. C. Liao, T. Y. Lin, J. Yin, N. Do, C. H. Hsu, and N. Venkatasubramanian. SmartSource: A Mobile Q&A Middleware Powered by Crowdsourcing. In Proc. - IEEE Int. Conf. Mob. Data Manag., volume 1, pages 145–156, 2015. [219] Z. Zhao, W. Ng, and Z. Zhang. Crowdseed: query processing on microblogs. In Proceedings of the 16th International Conference on Extending Database Technology, pages 729–732. ACM, 2013. [220] B. Zhu, S. Zhu, X. Liu, Y. Zhong, and H. Wu. A novel location privacy preserving scheme for spatial crowdsourcing. In 2016 6th Int. Conf. Electron. Inf. Emerg. Commun., pages 34–37. IEEE, jun 2016.
Asset Metadata
Creator: To, Hien (author)
Core Title: Location privacy in spatial crowdsourcing
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 01/31/2018
Defense Date: 10/24/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: data privacy, differential privacy, geospatial data management, OAI-PMH Harvest, privacy, spatial crowdsourcing
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Shahabi, Cyrus (committee chair), Georgiou, Panayiotis (committee member), Nakano, Aiichiro (committee member), Xiong, Li (committee member)
Creator Email: hto@usc.edu, ubriela@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-465447
Unique identifier: UC11268095
Identifier: etd-ToHien-5979.pdf (filename), usctheses-c40-465447 (legacy record id)
Legacy Identifier: etd-ToHien-5979.pdf
Dmrecord: 465447
Document Type: Dissertation
Rights: To, Hien
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA