USC Computer Science Technical Reports, no. 968 (2016)
Differentially Private Publication of Location Entropy (Technical Report)

Hien To (hto@usc.edu), Kien Nguyen (kien.nguyen@usc.edu), Cyrus Shahabi (shahabi@usc.edu)
Department of Computer Science, University of Southern California, Los Angeles, CA 90007

ABSTRACT

Location entropy (LE) is an eminent metric for measuring the popularity of various locations (e.g., points-of-interest). It is used in numerous applications in geo-marketing, crime analysis, epidemiology, traffic incident analysis, spatial crowdsourcing, and geosocial networks. Unlike other metrics computed from only the number of (unique) visits to a location, namely frequency, LE also captures the diversity of the users' visits, and is thus more accurate than other metrics. Current solutions for computing LE require full access to the past visits of users to locations, which poses privacy threats. This paper discusses, for the first time, the problem of perturbing location entropy for a set of locations according to differential privacy. The problem is challenging, inasmuch as removing a single user from the dataset will impact multiple records of the database, i.e., all the visits made by that user to various locations. Towards this end, we first derive non-trivial, tight bounds for both local and global sensitivity of LE, and show that to satisfy ε-differential privacy, a large amount of noise must be introduced, rendering the published results useless. Hence, we propose a thresholding technique to limit the number of users' visits, which significantly reduces the perturbation error but introduces an approximation error. To achieve better utility, we extend the technique by adopting two weaker notions of privacy: smooth sensitivity (slightly weaker) and crowd-blending (strictly weaker). Extensive experiments on synthetic and real-world datasets show that our proposed techniques preserve the original data distribution without compromising location privacy.
Categories and Subject Descriptors

H.2.4 [Database Management]: Database Applications - Spatial databases and GIS; H.1.1 [Models and Principles]: Systems and Information Theory - Information theory; F.2.2 [Theory of Computation]: Analysis of Algorithms and Problem Complexity - Nonnumerical Algorithms and Problems

Keywords

Differential Privacy, Location Entropy, Algorithms, Theory

1. INTRODUCTION

Due to the pervasiveness of GPS-enabled mobile devices and the popularity of location-based services such as mapping and navigation apps (e.g., Google Maps, Waze), spatial crowdsourcing apps (e.g., Uber, TaskRabbit, Gigwalk), apps with geo-tagging (e.g., Twitter, Picasa, Instagram, Flickr), and apps with check-in functionality (e.g., Foursquare, Facebook), numerous industries are now collecting fine-grained location data from their users. While the collected location data can be used for many commercial purposes by these industries (e.g., geo-marketing), other companies and non-profit organizations (e.g., academia, CDC) could also be empowered if they could use the location data for the greater good (e.g., research, preventing the spread of disease). Unfortunately, despite the usefulness of the data, industries do not publish their location data due to the sensitivity of their users' location information. However, many of these organizations do not need access to the raw location data; aggregate or processed location data would satisfy their needs. One example of using location data is to measure the popularity of a location, which can be used in many application domains such as public health, criminology, urban planning, policy, and social studies. One accepted metric to measure the popularity of a location is location entropy (or LE for short).

* These authors contributed equally to this work.
LE captures both the frequency of visits (how many times each user visited a location) and the diversity of visits (how many unique users visited a location), without looking at the functionality of that location, e.g., whether it is a private home or a coffee shop. Hence, LE has been shown to better quantify the popularity of a location than the number of unique visits or the number of check-ins to the location [9]. For example, [9] shows that LE is more successful in accurately predicting friendship from location trails than simpler models based only on the number of visits. LE is also used to improve online task assignment in spatial crowdsourcing [24, 43] by giving priority to workers situated in less popular locations, because there may be no available worker visiting those locations in the future.

Obviously, LE can be computed from the raw location data collected by various industries; however, the raw data cannot be published due to serious location privacy implications [18, 11, 42]. Without privacy protection, a malicious adversary can stage a broad spectrum of attacks such as physical surveillance and stalking, and breach sensitive information such as an individual's health issues (e.g., presence in a cancer treatment center), alternative lifestyles, or political and religious preferences (e.g., presence in a church). Hence, in this paper we propose an approach based on differential privacy (DP) [12] to publish LE for a set of locations without compromising users' raw location data. DP has emerged as the de facto standard, with strong protection guarantees, for publishing aggregate data. It has been adopted by major industries for various tasks without compromising individual privacy, e.g., data analytics with Microsoft [15], discovering users' usage patterns with Apple^1, and crowdsourcing statistics from end-user client software [15] and training of deep neural networks [1] with Google.
DP ensures that an adversary is not able to reliably learn from the published sanitized data whether or not a particular individual is present in the original data, regardless of the adversary's prior knowledge. It is sufficient to achieve ε-DP (ε is the privacy loss) by adding Laplace noise with mean zero and scale proportional to the sensitivity of the query (LE in this study) [12]. The sensitivity of LE is intuitively the maximum amount by which one individual can impact the value of LE. The higher the sensitivity, the more noise must be injected to guarantee ε-DP. Even though DP has been used before to compute Shannon entropy [4] (the formulation adopted in LE), the main challenge in differentially private publication of LE is that adding (or dropping) a single user from the dataset would impact multiple entries of the database, resulting in a high sensitivity of LE. To illustrate, consider a user who has contributed many visits to a single location; adding or removing this user would significantly change the value of LE for that location. Alternatively, a user may contribute visits to multiple locations and hence impact the entropy of all those visited locations. Another unique challenge in publishing LE (vs. simply computing the Shannon entropy) is the skewness and sparseness of real-world location datasets, where the majority of locations have small numbers of visits.

Towards this end, we first compute a non-trivial tight bound for the global sensitivity of LE. Given the bound, a sufficient amount of noise is introduced to guarantee ε-DP. However, the injected noise increases linearly with the maximum number of locations visited by a user (denoted by M) and monotonically with the maximum number of visits a user contributes to a location (denoted by C), and such an excessive amount of noise renders the published results useless. We refer to this algorithm as Baseline.
Accordingly, we propose a technique, termed Limit, to limit user activity by thresholding M and C, which significantly reduces the perturbation error. Nevertheless, limiting an individual's activity entails an approximation error in calculating LE. These two conflicting factors require the derivation of appropriate values for M and C to obtain satisfactory results. We empirically find such optimal values.

Furthermore, to achieve better utility, we extend Limit by adopting two weaker notions of privacy: smooth sensitivity [31] (slightly weaker) and crowd-blending [17] (strictly weaker). We denote these techniques Limit-SS and Limit-CB, respectively. Limit-SS provides a slightly weaker privacy guarantee, i.e., (ε, δ)-differential privacy, by using local sensitivity with a much smaller noise magnitude. We propose an efficient algorithm to compute the local sensitivity of a particular location, which depends on C and the number of users visiting the location (represented by n), such that the local sensitivity of all locations can be precomputed, regardless of the dataset. Thus far, we publish entropy for all locations; however, the ratio of noise to the true value of LE (the noise-to-true-entropy ratio) is often excessively high when the number of users n visiting a location is small (the entropy of a location is bounded by log(n)). For example, given a location visited by only two users with an equal number of visits (LE is log 2), removing one user from the database drops the entropy of the location to zero. To further reduce the noise-to-true-entropy ratio, Limit-CB aims to publish the entropy of locations with at least k users (n ≥ k) and suppress the other locations. By thresholding n, the global sensitivity of LE drops significantly, implying much less noise. We prove that Limit-CB satisfies (k, ε)-crowd-blending privacy.

^1 https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
We conduct an extensive set of experiments on both synthetic and real-world datasets. We first show that the truncation technique (Limit) reduces the global sensitivity of LE by two orders of magnitude, thus greatly enhancing the utility of the perturbed results. We also demonstrate that Limit preserves the original data distribution after adding noise. Thereafter, we show the superiority of Limit-SS and Limit-CB over Limit in terms of achieving higher utility (measured by KL-divergence and mean squared error metrics). Particularly, Limit-CB performs best on sparse datasets, while Limit-SS is recommended over Limit-CB on dense datasets. We also provide insights on the effects of various parameters (ε, C, M, and k) on the effectiveness and utility of our proposed algorithms. Based on these insights, we provide a set of guidelines for choosing appropriate algorithms and parameters.

The remainder of this paper is organized as follows. In Section 2, we define the problem of publishing LE according to differential privacy. Section 3 presents the preliminaries. Section 4 introduces the baseline solution and our thresholding technique. Section 5 presents our utility enhancements by adopting weaker notions of privacy. Experimental results are presented in Section 6, followed by a survey of related work in Section 7, and conclusions in Section 8.

2. PROBLEM DEFINITION

In this section we present the notations and the formal definition of the problem. Each location l is represented by a point in two-dimensional space and a unique identifier l (-180 ≤ l_lat ≤ 180 and -90 ≤ l_lon ≤ 90)^2. Hereafter, l refers to both the location and its unique identifier. For a given location l, let O_l be the set of visits to that location. Thus, c_l = |O_l| is the total number of visits to l. Also, let U_l be the set of distinct users that visited l, and O_{l,u} be the set of visits that user u has made to the location l. Thus, c_{l,u} = |O_{l,u}| denotes the number of visits of user u to location l.
The probability that a random draw from O_l belongs to O_{l,u} is p_{l,u} = c_{l,u} / c_l, which is the fraction of total visits to l that belongs to user u. The location entropy for l is computed from Shannon entropy [37] as follows:

H(l) = H(p_{l,u_1}, p_{l,u_2}, ..., p_{l,u_{|U_l|}}) = -Σ_{u ∈ U_l} p_{l,u} log p_{l,u}    (1)

In our study the natural logarithm is used. A location has a higher entropy when the visits are distributed more evenly among visiting users, and vice versa. Our goal is to publish the location entropy of all locations L = {l_1, l_2, ..., l_{|L|}}, where each location is visited by a set of users U = {u_1, u_2, ..., u_{|U|}}, while preserving the location privacy of users. Table 1 summarizes the notations used in this paper.

Table 1: Summary of notations.

l, L, |L|    a location, the set of all locations, and its cardinality
H(l)         location entropy of location l
Ĥ(l)         noisy location entropy of location l
ΔH_l         sensitivity of location entropy for location l
ΔH           sensitivity of location entropy for all locations
O_l          the set of visits to location l
u, U, |U|    a user, the set of all users, and its cardinality
U_l          the set of distinct users who visit l
O_{l,u}      the set of visits that user u has made to location l
c_l          the total number of visits to l
c_{l,u}      the number of visits that user u has made to location l
C            maximum number of visits of a user to a location
M            maximum number of locations visited by a user
p_{l,u}      the fraction of total visits to l that belongs to user u

3. PRELIMINARIES

^2 l_lat, l_lon are real numbers with ten digits after the decimal point.

We now present properties of the Shannon entropy and the differential privacy notion that will be used throughout the paper.

3.1 Shannon Entropy

Shannon [37] introduces entropy as a measure of the uncertainty in a random variable with a probability distribution U = (p_1, p_2, ..., p_{|U|}):

H(U) = -Σ_i p_i log p_i    (2)

where Σ_i p_i = 1.
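Equations (1) and (2) map directly to a few lines of code. The sketch below (the function name and input format are our own, not the paper's) computes H(l) from a raw list of per-visit user ids, using the natural logarithm as in the paper:

```python
import math
from collections import Counter

def location_entropy(visits):
    """Shannon entropy H(l) = -sum p_{l,u} * log p_{l,u} over users visiting l.

    `visits` is a list of user ids, one entry per visit to the location."""
    counts = Counter(visits)        # c_{l,u} for each user u
    total = sum(counts.values())    # c_l, the total number of visits
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

For instance, two users with one visit each yield log 2 ≈ 0.693, while a location visited by a single user (however many times) has entropy zero.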
H(U) is maximal if all the outcomes are equally likely:

H(U) ≤ H(1/|U|, ..., 1/|U|) = log |U|    (3)

Additivity Property of Entropy: Let U_1 and U_2 be non-overlapping partitions of a database U including users who contribute visits to a location l, and let π_1 and π_2 be the probabilities that a particular visit belongs to partition U_1 and U_2, respectively. Shannon discovered that using a logarithmic function preserves the additivity property of entropy:

H(U) = π_1 H(U_1) + π_2 H(U_2) + H(π_1, π_2)

Subsequently, adding a new person u into U changes its entropy to:

H(U^+) = (c_l / (c_l + c_{l,u})) H(U) + H(c_{l,u} / (c_l + c_{l,u}), c_l / (c_l + c_{l,u}))    (4)

where U^+ = U ∪ {u}, c_l is the total number of visits to l, and c_{l,u} is the number of visits to l contributed by user u. Equation (4) can be derived from the additivity property if we consider U^+ as two non-overlapping partitions {u} and U, with associated probabilities c_{l,u}/(c_l + c_{l,u}) and c_l/(c_l + c_{l,u}). We note that the entropy of a single user is zero, i.e., H(u) = 0. Similarly, removing a person u from U changes its entropy as follows:

H(U^-) = (c_l / (c_l - c_{l,u})) (H(U) - H(c_{l,u}/c_l, (c_l - c_{l,u})/c_l))    (5)

where U^- = U \ {u}.

3.2 Differential Privacy

Differential privacy (DP) [12] has emerged as the de facto standard in data privacy, thanks to its strong protection guarantees rooted in statistical analysis. DP is a semantic model which provides protection against realistic adversaries with background information. Releasing data according to DP ensures that an adversary's chance of inferring any information about an individual from the sanitized data will not substantially increase, regardless of the adversary's prior knowledge. DP ensures that the adversary does not know whether an individual is present or not in the original data. DP is formally defined as follows.

Definition 1 (ε-indistinguishability [13]). Consider a database that produces a set of query results D̂ on the set of queries Q = {q_1, q_2, ..., q_{|Q|}}, and let ε > 0 be an arbitrarily small real constant.
Then, a transcript U produced by a randomized algorithm A satisfies ε-indistinguishability if for every pair of sibling datasets D_1, D_2 that differ in only one record, it holds that

| ln( Pr[Q(D_1) = U] / Pr[Q(D_2) = U] ) | ≤ ε

In other words, an attacker cannot reliably learn whether the transcript was obtained by answering the query set Q on dataset D_1 or D_2. Parameter ε is called the privacy budget, and specifies the amount of protection required, with smaller values corresponding to stricter privacy protection. To achieve ε-indistinguishability, DP injects noise into each query result, and the amount of noise required is proportional to the sensitivity of the query set Q, formally defined as:

Definition 2 (L1-Sensitivity [13]). Given any arbitrary sibling datasets D_1 and D_2, the sensitivity of query set Q is the maximum change in the query results of D_1 and D_2:

Δ(Q) = max_{D_1, D_2} ||Q(D_1) - Q(D_2)||_1

An essential result from [13] shows that a sufficient condition to achieve DP with parameter ε is to add to each query result randomly distributed Laplace noise with mean 0 and scale λ = Δ(Q)/ε.

4. PRIVATE PUBLICATION OF LE

In this section we present a baseline algorithm based on the global sensitivity of LE [13], and then introduce a thresholding technique to reduce the global sensitivity by limiting an individual's activity.

4.1 Global Sensitivity of LE

To achieve ε-differential privacy, we must add noise proportional to the global sensitivity (or sensitivity for short) of LE. Thus, to minimize the amount of injected noise, we first propose a tight bound for the sensitivity of LE, denoted by ΔH. ΔH represents the maximum change of LE across all locations when the data of one user is added to (or removed from) the dataset. In the following theorem, the sensitivity bound is a function of the maximum number of visits a user contributes to a location, denoted by C (C ≥ 1).

Theorem 1. The global sensitivity of location entropy is

ΔH = max{log 2, log C - log(log C) - 1}

Proof.
We prove this theorem by first deriving a tight bound for the sensitivity of a particular location l (visited by n users), denoted by ΔH_l (Theorem 2). The bound is a function of C and n. Thereafter, we generalize the bound to hold for all locations as follows: we take the derivative of the bound derived for ΔH_l with respect to variable n and find the extremal point where the bound is maximized. The detailed proof can be found in Appendix A.1.

Theorem 2. The local sensitivity of a particular location l with n users is:

- log 2, when n = 1;
- log((n+1)/n), when C = 1;
- max{ log((n-1)/(n-1+C)) + (C/(n-1+C)) log C,
       log(n/(n+C)) + (C/(n+C)) log C,
       log(1 + 1/exp(H(C, n, c_u))) }, when n > 1, C > 1,

where C is the maximum number of visits a user contributes to a location (C ≥ 1) and H(C, n, c_u) = log(n-1) - (log C)/(C-1) + log((log C)/(C-1)) + 1.

Proof. We prove the theorem by considering both cases: when a user is added to and when a user is removed from the database. We first derive a proof for the adding case by using the additivity property of entropy from Equation (4). Similarly, the proof for the removing case can be derived from Equation (5). The detailed proofs can be found in Appendix A.2.

Baseline Algorithm: We now present a baseline algorithm that publishes location entropy for all locations (see Algorithm 1). Since adding (or removing) a single user from the dataset would impact the entropy of all locations he visited, the change caused by adding (or removing) a user is bounded by M_max ΔH, where M_max is the maximum number of locations visited by a user. Thus, Line 6 adds randomly distributed Laplace noise with mean zero and scale λ = M_max ΔH / ε to the actual value of location entropy H(l). It has been proved [13] that this simple mechanism is sufficient to achieve differential privacy.

Algorithm 1 Baseline Algorithm
1: Input: privacy budget ε, a set of locations L = {l_1, l_2, ..., l_{|L|}}, maximum number of visits of a user to a location C_max, maximum number of locations a user visits M_max.
2: Compute sensitivity ΔH from Theorem 1 for C = C_max.
3: For each location l in L:
4:    Count the visits each user made to l (c_{l,u}) and compute p_{l,u}
5:    Compute H(l) = -Σ_{u ∈ U_l} p_{l,u} log p_{l,u}
6:    Publish noisy LE: Ĥ(l) = H(l) + Lap(M_max ΔH / ε)

4.2 Reducing the Global Sensitivity of LE

4.2.1 Limit Algorithm

Limitation of the Baseline Algorithm: Algorithm 1 provides privacy; however, the added noise is excessively high, rendering the results useless. To illustrate, Figure 1 shows the bound on the global sensitivity (Theorem 1) as C varies. The figure shows that the bound monotonically increases as C grows. Therefore, the noise introduced by Algorithm 1 increases as C and M increase. In practice, C and M can be large, because a user may have visited either many locations or a single location many times, resulting in large sensitivity. Furthermore, Figure 2 depicts the noise magnitudes (in log scale) used in our various algorithms as the number of users visiting a location, n, varies. The graph shows that the noise magnitude of the baseline is too high to be useful (see Table 2).

Figure 1: Global sensitivity bound of location entropy when varying C.

Figure 2: Noise magnitude in natural log scale for Baseline, Limit, Limit-SS, and Limit-CB (ε = 5, C_max = 1000, M_max = 100, C = 20, M = 5, δ = 10^-8, k = 25).

Improving Baseline by Limiting User Activity: To reduce the global sensitivity of LE, and inspired by [27], we propose a thresholding technique, named Limit, to limit an individual's activity by truncating C and M. Our technique is based on the following two observations. First, Figure 3b shows the maximum number of visits a user contributes to a location in the Gowalla dataset, which will be used in Section 6 for evaluation.
Although most users have one and only one visit, the sensitivity of LE is determined by the worst-case scenario: the maximum number of visits^3. Second, Figure 3a shows the number of locations visited by a user. The figure confirms that there are many users who contribute to more than ten locations.

Figure 3: Gowalla, New York. (a) A user may visit many locations. (b) The largest number of visits a user contributes to a location.

Since the introduced noise increases linearly with M and monotonically with C, the noise can be reduced by capping them. First, to truncate M, we keep the first M location visits of the users who visit more than M locations and throw away the rest of their locations' visits. As a result, adding or removing a single user in the dataset affects at most M locations. Second, we cap at C the number of visits of the users who have contributed more than C visits to a particular location. Figure 2 shows that the noise magnitude used in Limit drops by two orders of magnitude when compared with the baseline's sensitivity.

At a high level, Limit (Algorithm 2) works as follows. Line 3 limits user activity across locations, while Line 7 limits user activity at a single location. The impact of Line 3 is the introduction of approximation error in the published data. This is because the number of users visiting some locations may be reduced, which alters their actual LE values. Consequently, some locations may be thrown away without being published. Furthermore, Line 7 also alters the value of location entropy, but by trimming the number of visits of a user to a location. The actual LE value of location l (after thresholding M and C) is computed in Line 8.

^3 This suggests that users tend not to check in at places that they visit the most, e.g., their homes, because if they did, the peak of the graph would not be at 1.
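The whole Limit pipeline (truncating M, capping C, then perturbing) can be sketched as follows. The data layout, function names, and the injectable `rng` are our own choices, not the paper's; `delta_H` stands for the Theorem 1 sensitivity, and the Laplace sampler uses standard inverse-CDF sampling:

```python
import math
import random
from collections import Counter, defaultdict

def laplace(scale, rng=random):
    """Sample Lap(0, scale) by inverse CDF (avoids a numpy dependency)."""
    u = rng.random() - 0.5
    while u == -0.5:                 # guard the log(0) edge of the half-open interval
        u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def limit_publish(user_visits, C, M, epsilon, delta_H, rng=random):
    """Sketch of Limit: keep each user's visits to their first M distinct
    locations, cap per-location visit counts at C, then perturb each
    location's entropy with Lap(M * delta_H / epsilon) noise."""
    per_location = defaultdict(Counter)   # l -> {u: c_{l,u}}
    for user, visits in user_visits.items():
        kept = []                         # first M distinct locations of this user
        for loc in visits:
            if loc not in kept:
                if len(kept) >= M:
                    continue              # visit to an (M+1)-th location: dropped
                kept.append(loc)
            per_location[loc][user] += 1
    published = {}
    for loc, counts in per_location.items():
        capped = [min(c, C) for c in counts.values()]   # threshold C
        total = sum(capped)
        h = -sum((c / total) * math.log(c / total) for c in capped)
        published[loc] = h + laplace(M * delta_H / epsilon, rng)
    return published
```

With C = M = 1, a user with visits ["x", "x", "y"] contributes a single capped visit to x and none to y, so only x survives with two one-visit users.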
Consequently, the noisy LE is published in Line 9, where Lap(M ΔH / ε) denotes a random variable drawn independently from the Laplace distribution with mean zero and scale parameter M ΔH / ε.

Algorithm 2 Limit Algorithm
1: Input: privacy budget ε, a set of locations L = {l_1, l_2, ..., l_{|L|}}, maximum threshold C on the number of visits of a user to a location, maximum threshold M on the number of locations a user visits
2: For each user u in U:
3:    Truncate M: keep the first M locations' visits of the users who visit more than M locations
4: Compute sensitivity ΔH from Theorem 1.
5: For each location l in L:
6:    Count the visits each user made to l (c_{l,u}) and compute p_{l,u}
7:    Threshold C: c_{l,u} = min(C, c_{l,u}), then compute p_{l,u}
8:    Compute H(l) = -Σ_{u ∈ U_l} p_{l,u} log p_{l,u}
9:    Publish noisy LE: Ĥ(l) = H(l) + Lap(M ΔH / ε)

The performance of Algorithm 2 depends on how we set C and M, and there is a trade-off in the choice of their values: small values of C and M introduce a small perturbation error but a large approximation error, and vice versa. Hence, in Section 6, we empirically find the values of M and C that strike a balance between noise and approximation errors.

4.2.2 Privacy Guarantee of the Limit Algorithm

The following theorem shows that Algorithm 2 is differentially private.

Theorem 3. Algorithm 2 satisfies ε-differential privacy.

Proof. Let L_1 be any subset of L, and let T = {t_1, t_2, ..., t_{|L_1|}} ∈ Range(A) denote an arbitrary possible output. Then we need to prove the following:

Pr[A(O_{1(org)}, ..., O_{|L_1|(org)}) = T] / Pr[A(O_{1(org)} \ O_{l,u(org)}, ..., O_{|L_1|(org)} \ O_{l,u(org)}) = T] ≤ exp(ε)

The details of the proof and the notations used can be found in Appendix A.3.

5. RELAXATION OF PRIVATE LE

This section presents our utility enhancements by adopting two weaker notions of privacy: smooth sensitivity [31] (slightly weaker) and crowd-blending [17] (strictly weaker).

5.1 Relaxation with Smooth Sensitivity

We aim to extend Limit to publish location entropy with smooth sensitivity (or SS for short).
We first present the notion of smooth sensitivity and the Limit-SS algorithm. We then show how to precompute the SS of location entropy.

5.1.1 Limit-SS Algorithm

Smooth sensitivity is a technique that calibrates the noise magnitude not only to the function one wants to release (i.e., location entropy), but also to the database itself. The idea is to use the local sensitivity bound of each location rather than the global sensitivity bound, resulting in smaller injected noise. However, simply adopting the local sensitivity to calibrate noise may leak information about the number of users visiting that location. Smooth sensitivity is defined as follows.

Let x, y ∈ D^N denote two databases, where N is the number of users. Let l^x, l^y denote the location l in databases x and y, respectively. Let d(l^x, l^y) be the Hamming distance between l^x and l^y, which is the number of users at location l on which x and y differ, i.e., d(l^x, l^y) = |{i : l^x_i ≠ l^y_i}|, where l^x_i represents the information contributed by one individual. The local sensitivity of location l^x, denoted by LS(l^x), is the maximum change of location entropy when a user is added or removed.

Definition 3 (Smooth sensitivity [31]). For β > 0, the β-smooth sensitivity of location entropy is:

SS_β(l^x) = max_{l^y ∈ D^N} LS(l^y) e^{-β d(l^x, l^y)} = max_{k=0,1,...,N} e^{-βk} max_{y : d(l^x, l^y) = k} LS(l^y)

The smooth sensitivity of LE at location l^x can be interpreted as the maximum over LS(l^x) and the values LS(l^y), where the effect of a database y at distance k from x is damped by a factor of e^{-βk}. Thereafter, the smooth sensitivity of LE can be plugged into Line 3 of Algorithm 2, producing the Limit-SS algorithm.
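Definition 3 can be evaluated by scanning distances k and damping local sensitivities by e^{-βk}. A generic sketch follows; the function name is ours, and `local_sens` is a stand-in argument into which the paper's LS(C, n) from Theorem 2 would be plugged:

```python
import math

def smooth_sensitivity(local_sens, n, beta, N):
    """beta-smooth sensitivity at a location with n users (Definition 3):
    max over k of e^{-beta*k} * max(LS(n-k), LS(n+k)).

    local_sens(m): local sensitivity of a location visited by m users.
    N: maximum possible number of users."""
    best = 0.0
    for k in range(0, N + 1):
        damp = math.exp(-beta * k)
        candidates = [local_sens(m) for m in (n - k, n + k) if 1 <= m <= N]
        if candidates:
            best = max(best, damp * max(candidates))
    return best
```

Note that the k = 0 term is LS at n itself, so the result is always at least the local sensitivity, as the definition requires.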
Algorithm 3 Limit-SS Algorithm
1: Input: privacy budget ε, privacy parameter δ, L = {l_1, l_2, ..., l_{|L|}}, C, M
2: Copy Lines 2-8 from Algorithm 2
3: Publish noisy LE: Ĥ(l) = H(l) + (2M · SS_β(l) / ε) · η, where η ∼ Lap(1) and β = ε / (2 ln(2/δ))

5.1.2 Privacy Guarantee of Limit-SS

The noise of Limit-SS is specific to a particular location, as opposed to that of the Baseline and Limit algorithms. Limit-SS has a slightly weaker privacy guarantee: it satisfies (ε, δ)-differential privacy, where δ is a privacy parameter (δ = 0 in the case of Definition 1). The choice of δ is generally left to the data releaser; typically, δ < 1/(number of users) (see [31] for details).

Theorem 4 (Calibrating noise to smooth sensitivity [31]). If β ≤ ε / (2 ln(2/δ)) and δ ∈ (0, 1), the algorithm l ↦ H(l) + (2 SS_β(l) / ε) · η, where η ∼ Lap(1), is (ε, δ)-differentially private.

Theorem 5. Limit-SS is (ε, δ)-differentially private.

Proof. Using Theorem 4, A_l satisfies (0, 0)-differential privacy when l ∉ L_1 ∩ L(u), and satisfies (ε/M, δ/M)-differential privacy when l ∈ L_1 ∩ L(u).

5.1.3 Precomputation of Smooth Sensitivity

This section shows that the smooth sensitivity of a location visited by n users can be efficiently precomputed. Figure 2 illustrates the precomputed local sensitivity for a fixed value of C.

Let LS(C, n) and SS(C, n) be the local sensitivity and the smooth sensitivity of all locations visited by n users, respectively. LS(C, n) is defined in Theorem 2. Let GS(C) be the global sensitivity of location entropy given C, as defined in Theorem 1. Algorithm 4 computes SS(C, n). At a high level, the algorithm computes the effect of all locations at every possible distance k from n, which is non-trivial. Thus, to speed up computation, we propose several stopping conditions based on the following observations.

Let n_x and n_y be the numbers of users who visited l^x and l^y, respectively. If n_x > n_y, Algorithm 4 stops when e^{-βk} GS(C) is less than the current value of the smooth sensitivity (Line 7).
If n_x < n_y, given that LS(l^y) starts to decrease when n_y > C/(log C - 1) + 1, and e^{-βk} also decreases as k increases, Algorithm 4 stops when n_y > C/(log C - 1) + 1 (Line 8). In addition, the algorithm tolerates a minimum value ξ of smooth sensitivity. Thus, when n > n_0, where n_0 satisfies LS(C, n_0) < ξ, the precomputation of SS(C, n) is stopped and SS(C, n) is taken to be ξ for all n > n_0 (Line 8).

Algorithm 4 Precompute Smooth Sensitivity
1: Input: privacy parameters ε, δ, ξ; C; maximum number of possible users N
2: Set β = ε / (2 ln(2/δ))
3: For n = [1, ..., N]:
4:    SS(C, n) = 0
5:    For k = [1, ..., N]:
6:       SS(C, n) = max(SS(C, n), e^{-βk} max(LS(C, n-k), LS(C, n+k)))
7:       Stop when e^{-βk} GS(C) < SS(C, n) and n + k > C/(log C - 1) + 1
8:    Stop when n > C/(log C - 1) + 1 and LS(C, n) < ξ

5.2 Relaxation with Crowd-Blending Privacy

5.2.1 Limit-CB Algorithm

Thus far, we publish entropy for all locations; however, the noise-to-true-entropy ratio is often excessively high when the number of users n visiting a location is small (Equation (3) shows that the entropy of a location is bounded by log(n)). A large noise-to-true-entropy ratio renders the published results useless, since the introduced noise outweighs the actual value of LE. This is an inherent issue with the sparsity of real-world datasets. For example, Figure 4 summarizes the number of users contributing visits to each location in the Gowalla dataset. The figure shows that most locations have check-ins from fewer than ten users. These locations have LE values of less than log(10), which are particularly prone to the noise-adding mechanism in differential privacy.

Figure 4: Sparsity of location visits (Gowalla, New York).

Figure 5: Global sensitivity bound when varying n (shown for C = 10 and C = 20).
Therefore, to reduce the noise-to-true-entropy ratio, we propose a small sensitivity bound of location entropy that depends on the minimum number of users visiting a location, denoted by k. Subsequently, we present Algorithm 5, which satisfies (k, ε)-crowd-blending privacy [17]; we prove this in Section 5.2.2. The algorithm aims to publish the entropy of locations with at least k users (n ≥ k) and throw away the other locations. We refer to the algorithm as Limit-CB. Lines 3-6 publish the entropy of each location according to (k, ε)-crowd-blending privacy: we publish the entropy of the locations with at least k users and suppress the others. The following theorem shows that for locations with at least k users we have a tighter bound on ΔH, which depends on C and k. Figure 2 shows that the sensitivity used in Limit-CB is significantly smaller than Limit's sensitivity.

Theorem 6. The global sensitivity of location entropy for locations with at least k users, where k ≥ C/(log C - 1) + 1 and C is the maximum number of visits a user contributes to a location, is the local sensitivity at n = k.

Proof. We prove the theorem by showing that the local sensitivity decreases when the number of users n ≥ C/(log C - 1) + 1. Thus, when n ≥ C/(log C - 1) + 1, the global sensitivity equals the local sensitivity at the smallest value of n, i.e., n = k. The detailed proof can be found in Appendix A.2.

Algorithm 5 Limit-CB Algorithm
1: Input: all users U, privacy budget ε; C, M, k
2: Compute global sensitivity ΔH based on Theorem 6.
3: For each location l ∈ L:
4:    Count the number of users who visit l, n_l
5:    If n_l ≥ k, publish Ĥ(l) according to Algorithm 2 with budget ε, using the tighter bound on ΔH
6:    Otherwise, do not publish the data

5.2.2 Privacy Guarantee of Limit-CB

Before proving the privacy guarantee of Limit-CB, we first present the notion of crowd-blending privacy, a strict relaxation of differential privacy [17].
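The gating step of Algorithm 5 is a simple filter over user counts. A sketch follows; the names are ours, and `publish_fn` stands in for Algorithm 2's ε-DP release applied to a single location:

```python
def limit_cb_publish(location_users, k, publish_fn):
    """Sketch of Limit-CB's gate: publish a noisy value only for locations
    visited by at least k distinct users; suppress the rest entirely.

    location_users: maps a location id to its list of per-visit user ids.
    publish_fn: the per-location differentially private release."""
    out = {}
    for loc, visits in location_users.items():
        if len(set(visits)) >= k:       # at least k distinct users: publish
            out[loc] = publish_fn(visits)
        # otherwise the location is suppressed, not published with noise
    return out
```

Suppression (rather than publishing a noisy zero) is what allows the tighter Theorem 6 bound to apply only where n ≥ k.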
A (k, ε)-crowd-blending private sanitization of a database requires each individual in the database to blend with k other individuals in the database. This concept is related to k-anonymity [39], since both are based on the notion of "blending in a crowd." However, unlike k-anonymity, which only restricts the published data, crowd-blending privacy imposes restrictions on the noise-adding mechanism. Crowd-blending privacy is defined as follows.

Definition 4 (Crowd-blending privacy). An algorithm A is (k, ε)-crowd-blending private if for every database D and every individual t ∈ D, either t ε-blends in a crowd of k people in D, or A(D) ≈_ε A(D ∖ {t}) (or both).

A result from [17] shows that differential privacy implies crowd-blending privacy.

Theorem 7 (DP → Crowd-blending privacy [17]). Let A be any ε-differentially private algorithm. Then A is (k, ε)-crowd-blending private for every integer k ≥ 1.

The following theorem shows that Algorithm 5 is (k, ε)-crowd-blending private.

Theorem 8. Algorithm 5 is (k, ε)-crowd-blending private.

Proof. First, if there are at least k people in a location, then individual u ε-blends with k people in U. This is because Line 5 of the algorithm satisfies ε-differential privacy, which implies (k, ε)-crowd-blending privacy (Theorem 7). Otherwise, we have A(D) ≈_0 A(D ∖ {t}), since A suppresses each location with fewer than k users.

                                  Sparse     Dense      Gowalla
# of locations                    10,000     10,000     14,058
# of users                        100K       10M        5,800
Max LE                            9.93       14.53      6.45
Min LE                            1.19       6.70       0.04
Avg. LE                           3.19       7.79       1.45
Variance of LE                    1.01       0.98       0.6
Max #locations per user           100        100        1,407
Avg. #locations per user          19.28      19.28      13.5
Max #visits to a loc. per user    20,813     24,035     162
Avg. #visits to a loc. per user   2,578.0    2,575.8    7.2
Avg. #users per loc.              192.9      19,278     5.6
Table 2: Statistics of the datasets.

6. PERFORMANCE EVALUATION
We conduct several experiments on real-world and synthetic datasets to compare the effectiveness and utility of our proposed algorithms. Below, we first discuss our experimental setup.
Next, we present our experimental results.

6.1 Experimental Setup
Datasets: We conduct experiments on one real-world dataset (Gowalla) and two synthetic datasets (Sparse and Dense). The statistics of the datasets are shown in Table 2. Gowalla contains the check-in history of users in a location-based social network. For our experiments, we use the check-in data in an area covering the city of New York. For synthetic data generation, in order to study the impact of the density of the dataset, we consider two cases: Sparse and Dense. Sparse contains 100,000 users, while Dense has 10 million users. The Gowalla dataset is sparse as well; we add the Dense synthetic dataset to emulate the case of large industries, such as Google, who have access to large and fine-granule user location data. To generate visits, without loss of generality, the location with id x ∈ {1, 2, ..., 10,000} has probability 1/x of being visited by a user. This means that locations with smaller ids tend to have higher location entropy, since more users visit these locations. In the same fashion, the user with id y ∈ {1, 2, ..., 100,000} (Sparse) is selected with probability 1/y. This follows the real-world characteristic of location data, where a small number of locations are very popular and many locations have a small number of visits.

In all of our experiments, we use five values of the privacy budget, ε ∈ {0.1, 0.5, 1, 5, 10}. We vary the maximum number of visits a user contributes to a location, C ∈ {1, 2, ..., 5, ..., 50}, and the maximum number of locations a user visits, M ∈ {1, 2, 5, 10, 20, 30}. We also vary the threshold k ∈ {10, 20, 30, 40, 50}. Default values are shown in boldface.

Metrics: We use KL-divergence as one measure of how well the original data distribution is preserved after adding noise.
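The synthetic-data generation described in this subsection can be sketched as follows. This is an illustrative reconstruction, not the authors' script: the event count and random generator are arbitrary, and each "event" draws one user and one location with the 1/y and 1/x probabilities described above.

```python
import random

def generate_visits(num_users, num_locations, num_events, rng):
    # Each visit event picks user y with probability proportional to 1/y and
    # location x with probability proportional to 1/x (Zipf-like skew).
    user_ids = range(1, num_users + 1)
    loc_ids = range(1, num_locations + 1)
    user_w = [1.0 / y for y in user_ids]
    loc_w = [1.0 / x for x in loc_ids]
    visits = {}  # (user, location) -> number of visits
    for _ in range(num_events):
        u = rng.choices(user_ids, weights=user_w)[0]
        x = rng.choices(loc_ids, weights=loc_w)[0]
        visits[(u, x)] = visits.get((u, x), 0) + 1
    return visits
```

Locations with small ids then accumulate visits from many distinct users and hence exhibit higher location entropy, matching the intended skew of Sparse and Dense.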
Given two discrete probability distributions P and Q, the KL-divergence of Q from P is defined as follows:

D_KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )    (6)

In this paper, P(i) and Q(i) are the probabilities that location l_i is chosen when a location is randomly selected, proportional to its published and actual LE, respectively; i.e., P and Q are the published and actual LE values of the locations after normalization, so that each sums to unity.

We also use the mean squared error (MSE) over a set of locations L as the metric of accuracy, using Equation 7:

MSE = (1/|L|) Σ_{l ∈ L} ( LE_a(l) − LE_n(l) )²    (7)

where LE_a(l) and LE_n(l) are the actual and noisy entropy of location l, respectively.

Since Limit-CB throws away more locations than Limit and Limit-SS, we consider both cases: 1) the KL-divergence and MSE metrics are computed on all locations L, where the entropy of the suppressed locations is set to zero (the default case); 2) the metrics are computed on the subset of locations that Limit-CB publishes, termed Throwaway.

6.2 Experimental Results
We first evaluate our algorithms on the synthetic datasets.

6.2.1 Overall Evaluation of the Proposed Algorithms
We evaluate the performance of Limit from Section 4.2.1 and its variants (Limit-SS and Limit-CB). We do not include the results for Baseline, since the excessively high amount of injected noise renders the perturbed data useless. Figure 6 illustrates the distributions of noisy vs. actual LE on Dense and Sparse. The actual distributions of the dense (Figure 6a) and sparse (Figure 6e) datasets confirm our method of generating the synthetic datasets: locations with smaller ids have higher entropy, and the entropy of locations in Dense is higher than in Sparse. We observe that Limit-SS generally performs best in preserving the original data distribution for Dense (Figure 6c), while Limit-CB performs best for Sparse (Figure 6h).
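The two utility metrics of Equations 6 and 7 can be computed directly; a small sketch (inputs are the actual and published LE vectors, normalized internally for KL as described above, assuming Q(i) > 0 wherever P(i) > 0):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q), Equation 6; p and q are nonnegative vectors, normalized here.
    ps, qs = float(sum(p)), float(sum(q))
    return sum((pi / ps) * math.log((pi / ps) / (qi / qs))
               for pi, qi in zip(p, q) if pi > 0)

def mse(actual, noisy):
    # Mean squared error over a set of locations, Equation 7.
    return sum((a - b) ** 2 for a, b in zip(actual, noisy)) / len(actual)
```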
Note that, as we show later, Limit-CB performs better than Limit-SS and Limit given a small budget ε (see Section 6.2.2).

Due to the truncation technique, some locations may be discarded. Thus, we report the percentage of perturbed locations, named the published ratio. The published ratio is computed as the number of perturbed locations divided by the total number of eligible locations. A location is eligible for publication if it contains check-ins from at least K users (K ≥ 1). Figure 7 shows the effect of k on the published ratio of Limit-CB. Note that the published ratio of Limit and Limit-SS is the same as that of Limit-CB when k = K. The figure shows that the ratio is 100% for Dense, while that of Sparse is less than 10%. The reason is that with Dense, each location is visited by a large number of users on average (see Table 2); thus, limiting M and C reduces the average number of users visiting a location, but not to the point where the locations are suppressed. This result suggests that our truncation technique performs well on large datasets.

6.2.2 Privacy-Utility Trade-off (Varying ε)
We compare the trade-off between privacy and utility by varying the privacy budget ε. The utility is captured by the KL-divergence metric introduced in Section 6.1. We also use the MSE metric. Figure 8 illustrates the results.
As expected, when ε increases, less noise is injected, and the values of KL-divergence and MSE decrease. Interestingly, though, KL-divergence and MSE saturate at ε = 5, where reducing the privacy level (increasing ε) only marginally increases utility. This can be explained by a significant approximation error in our thresholding technique that outweighs the impact of a smaller perturbation error. Note that the approximation errors are constant in this set of experiments, since the parameters C, M and k are fixed. Another observation is that the observed errors are generally higher for Dense (Figures 8b vs. 8c), which is surprising, as differentially private algorithms often perform better on dense datasets. The reason is that limiting M and C has a larger impact on Dense, resulting in a large perturbation error. Furthermore, we observe that the improvements of Limit-SS and Limit-CB over Limit are more significant for small ε.

Figure 6: Comparison of the distributions of noisy vs. actual location entropy on the dense and sparse datasets ((a)-(d): Actual, Limit, Limit-SS, Limit-CB on Dense; (e)-(h): Actual, Limit, Limit-SS, Limit-CB on Sparse).
Figure 7: Published ratio of Limit-CB when varying k (K = 20).
In other words, Limit-SS and Limit-CB have more impact at higher levels of privacy protection. Note that these enhancements come at the cost of weaker privacy protection.

Figure 8: Varying ε ((a) Dense, KL-divergence; (b) Dense, MSE; (c) Sparse, MSE; (d) Sparse, Throwaway, MSE).

6.2.3 The Effect of Varying M and C
We first evaluate the performance of our proposed techniques by varying the threshold M. For brevity, we present the results only for MSE, as similar results were observed for KL-divergence. Figure 9 indicates the trade-off between the approximation error and the perturbation error. Our thresholding technique decreases M to reduce the perturbation error, but at the cost of increasing the approximation error. As a result, at a particular value of M, the technique balances the two types of errors and thus minimizes the total error. For example, in Figure 9a, Limit performs best at M = 5, while Limit-SS and Limit-CB work best at M ≥ 30. In Figure 9b, however, Limit-SS performs best at M = 10 and Limit-CB performs best at M = 20.

Figure 9: Varying M ((a) Dense; (b) Sparse, Throwaway).

We then evaluate the performance of our techniques by varying the threshold C. Figure 10 shows the results. For brevity, we only include KL-divergence results (the MSE metric shows similar trends). The graphs show that KL-divergence increases as C grows. This observation suggests that C should be set to a small number (less than 10).
By comparing the effect of varying M and C, we conclude that M has more impact on the trade-off between the approximation error and the perturbation error.

Figure 10: Varying C ((a) Dense; (b) Sparse).

6.2.4 Results on the Gowalla Dataset
In this section we evaluate the performance of our algorithms on the Gowalla dataset. Figure 11 shows the distributions of noisy vs. actual location entropy. Note that we sort the locations based on their actual values of LE, as depicted in Figure 11a. As expected, due to the sparseness of Gowalla (see Table 2), the published values of LE in Limit and Limit-SS are scattered, while those in Limit-CB preserve the trend in the actual data, at the cost of throwing away more locations (Figure 11d). Furthermore, we conducted experiments varying the various parameters (i.e., ε, C, M, k) and observed trends similar to the Sparse dataset; for brevity, we only show the impact of varying ε and M in Figure 12.

Recommendations for Data Releases: We summarize our observations and provide guidelines for choosing appropriate techniques and parameters. Limit-CB generally performs best on sparse datasets, because it focuses on publishing only the locations with many visits. Alternatively, if the dataset is dense, Limit-SS is recommended over Limit-CB, since there are sufficient locations with many visits. A dataset is dense if most locations (e.g., 90%) have at least n_CB users, where n_CB is the threshold for choosing Limit-CB. Particularly, given fixed parameters C, ε, δ, k, the value n_CB can be found by comparing the global sensitivity of Limit-CB with the precomputed smooth sensitivity. In Figure 2, n_CB is the value of n at which SS(C, n_CB) becomes smaller than the global sensitivity of Limit-CB.
In other words, the noise magnitude required for Limit-SS is then smaller than that for Limit-CB.

Figure 11: Comparison of the distributions of noisy vs. actual location entropy on Gowalla, M = 5 ((a) Actual; (b) Limit; (c) Limit-SS; (d) Limit-CB).
Figure 12: Varying ε and M (Gowalla).

Regarding the choice of parameters, to guarantee strong privacy protection, ε should be as small as possible while the measured utility metrics remain practical. Finally, the value of C should be small (≤ 10), while the value of M can be tuned to achieve maximum utility.

7. RELATED WORK
Location privacy has largely been studied in the context of location-based services [19, 30, 26, 18], participatory sensing [8, 6] and spatial crowdsourcing [42, 34]. Most studies use the model of spatial k-anonymity [39], where the location of a user is hidden among k other users [19, 30]. However, there are known attacks on k-anonymity, e.g., when all k users are at the same location. Other studies focused on space transformation to preserve location privacy [26]. Nevertheless, such techniques assume a centralized architecture with a trusted third party, which is a single point of attack. Consequently, a technique using cryptographic private information retrieval was proposed that does not rely on a trusted third party to anonymize locations [18].
Recent studies on location privacy have focused on leveraging differential privacy (DP) to protect the privacy of users [2, 7, 35, 42, 40, 47, 22].

Location entropy has been extensively used in various areas of research, including multi-agent systems [45], wireless sensor networks [46], geosocial networks [9, 5, 33, 32], personalized web search [28], image retrieval [49] and spatial crowdsourcing [24, 43, 41]. The work that most closely relates to ours is on privacy-preserving location-based services [3, 29, 48, 44], in which location entropy is used as the measure of privacy or of the attacker's uncertainty [3, 48, 44]. In [48], a privacy model is proposed that discloses a location on behalf of a user only if the location has at least the same popularity (quantified by location entropy) as a public region specified by the user. Indeed, locations with high entropy are more likely to be shared (checked in) than places with low entropy [44]. However, directly using location entropy compromises the privacy of individuals. For example, an adversary can infer information about the people visiting a location from its entropy value: a low value means that a small number of people visit the location, and if they are all in a small geographical area, their privacy is compromised. Unlike the studies that use the true value of LE [3, 29, 48, 44], our techniques add noise to the actual LE so that an attacker cannot reliably learn whether or not a particular individual is present in the original data. Although DP has been used to publish various kinds of statistical data, such as counting queries [13], 1D [21] and 2D [7] histograms, high-dimensional data [50], trajectories [38, 22], time series [36, 16, 25], and graphs [20, 27, 23, 10], to the best of our knowledge, no prior study uses differential privacy for publishing location entropy.

8.
CONCLUSIONS
We introduced the problem of publishing the entropy of a set of locations according to differential privacy. A baseline algorithm was proposed based on a derived tight bound for the global sensitivity of location entropy. We showed that the baseline solution requires an excessively high amount of noise to satisfy ε-differential privacy, which renders the published results useless. A simple yet effective truncation technique was then proposed to reduce the sensitivity bound by two orders of magnitude, which enabled the publication of location entropy with reasonable utility. The utility was further enhanced by adopting smooth sensitivity and crowd-blending. We conducted extensive experiments and concluded that the proposed techniques are practical.

Acknowledgement
We would like to thank Prof. Aleksandra Korolova for her constructive feedback during the course of this research. This research has been funded in part by NSF grants IIS-1320149 and CNS-1461963, the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201500003B, the USC Integrated Media Systems Center, and unrestricted cash gifts from Google, Northrop Grumman and Oracle. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the sponsors.

9. REFERENCES
[1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. arXiv:1607.00133, 2016.
[2] M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi. Geo-indistinguishability: Differential privacy for location-based systems. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 901-914. ACM, 2013.
[3] A. R. Beresford and F. Stajano. Location privacy in pervasive computing. IEEE Pervasive Computing, 2(1):46-55, 2003.
[4] A. Blum, C. Dwork, F. McSherry, and K.
Nissim. Practical privacy: The SuLQ framework. In PODS, pages 128-138. ACM, 2005.
[5] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In SIGKDD, pages 1082-1090. ACM, 2011.
[6] D. Christin, A. Reinhardt, S. S. Kanhere, and M. Hollick. A survey on privacy in mobile participatory sensing applications. Journal of Systems and Software, 84(11):1928-1946, 2011.
[7] G. Cormode, C. Procopiuc, D. Srivastava, E. Shen, and T. Yu. Differentially private spatial decompositions. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), pages 20-31. IEEE, 2012.
[8] C. Cornelius, A. Kapadia, D. Kotz, D. Peebles, M. Shin, and N. Triandopoulos. AnonySense: privacy-aware people-centric sensing. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, pages 211-224. ACM, 2008.
[9] J. Cranshaw, E. Toch, J. Hong, A. Kittur, and N. Sadeh. Bridging the gap between physical location and online social networks. In UbiComp. ACM, 2010.
[10] W.-Y. Day, N. Li, and M. Lyu. Publishing graph degree distribution with node differential privacy. In Proceedings of the 2016 International Conference on Management of Data, pages 123-138. ACM, 2016.
[11] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, 2013.
[12] C. Dwork. Differential privacy. In Automata, Languages and Programming, pages 1-12. Springer, 2006.
[13] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265-284. Springer, 2006.
[14] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[15] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In SIGSAC, pages 1054-1067. ACM, 2014.
[16] L. Fan and L. Xiong.
Real-time aggregate monitoring with differential privacy. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2169-2173. ACM, 2012.
[17] J. Gehrke, M. Hay, E. Lui, and R. Pass. Crowd-blending privacy. In Advances in Cryptology, pages 479-496. Springer, 2012.
[18] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K.-L. Tan. Private queries in location based services: anonymizers are not necessary. In SIGMOD, pages 121-132. ACM, 2008.
[19] M. Gruteser and D. Grunwald. Anonymous usage of location-based services through spatial and temporal cloaking. In MobiSys, pages 31-42. ACM, 2003.
[20] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In 2009 Ninth IEEE International Conference on Data Mining, pages 169-178. IEEE, 2009.
[21] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021-1032, 2010.
[22] X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and D. Srivastava. DPT: differentially private trajectory synthesis using hierarchical reference systems. Proceedings of the VLDB Endowment, 8(11):1154-1165, 2015.
[23] Z. Jorgensen, T. Yu, and G. Cormode. Publishing attributed social graphs with formal privacy guarantees. In Proceedings of the 2016 International Conference on Management of Data, pages 107-122. ACM, 2016.
[24] L. Kazemi and C. Shahabi. GeoCrowd: enabling query answering with spatial crowdsourcing. In SIGSPATIAL 2012, pages 189-198. ACM, 2012.
[25] G. Kellaris, S. Papadopoulos, X. Xiao, and D. Papadias. Differentially private event sequences over infinite streams. Proceedings of the VLDB Endowment, 7(12):1155-1166, 2014.
[26] A. Khoshgozaran and C. Shahabi. Blind evaluation of nearest neighbor queries using space transformation to preserve location privacy. In International Symposium on Spatial and Temporal Databases, pages 239-257.
Springer, 2007.
[27] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas. Releasing search queries and clicks privately. In WWW, pages 171-180. ACM, 2009.
[28] K. W.-T. Leung, D. L. Lee, and W.-C. Lee. Personalized web search with location preferences. In ICDE, pages 701-712. IEEE, 2010.
[29] J. Meyerowitz and R. Roy Choudhury. Hiding stars with fireworks: location privacy through camouflage. In Proceedings of the 15th Annual International Conference on Mobile Computing and Networking, pages 345-356. ACM, 2009.
[30] M. F. Mokbel, C.-Y. Chow, and W. G. Aref. The new Casper: query processing for location services without compromising privacy. In VLDB, pages 763-774, 2006.
[31] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, pages 75-84. ACM, 2007.
[32] H. Pham and C. Shahabi. Spatial influence - measuring followship in the real world. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 529-540, May 2016.
[33] H. Pham, C. Shahabi, and Y. Liu. Inferring social strength from spatiotemporal data. ACM Trans. Database Syst., 41(1):7:1-7:47, Mar. 2016.
[34] L. Pournajaf, L. Xiong, V. Sunderam, and S. Goryczka. Spatial task assignment for crowd sensing with cloaked locations. In 2014 IEEE 15th International Conference on Mobile Data Management, volume 1, pages 73-82. IEEE, 2014.
[35] W. Qardaji, W. Yang, and N. Li. Understanding hierarchical methods for differentially private histograms. Proceedings of the VLDB Endowment, 6(14):1954-1965, 2013.
[36] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 735-746. ACM, 2010.
[37] C. E. Shannon and W. Weaver. A mathematical theory of communication, 1948.
[38] H. Su, K. Zheng, H. Wang, J. Huang, and X. Zhou. Calibrating trajectory data for similarity-based analysis.
In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 833-844. ACM, 2013.
[39] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557-570, 2002.
[40] H. To, L. Fan, and C. Shahabi. Differentially private H-tree. In 2nd Workshop on Privacy in Geographic Information Collection and Analysis, 2015.
[41] H. To, L. Fan, L. Tran, and C. Shahabi. Real-time task assignment in hyperlocal spatial crowdsourcing under budget constraints. In PerCom. IEEE, 2016.
[42] H. To, G. Ghinita, and C. Shahabi. A framework for protecting worker location privacy in spatial crowdsourcing. VLDB, 7(10):919-930, 2014.
[43] H. To, C. Shahabi, and L. Kazemi. A server-assigned spatial crowdsourcing framework. TSAS, 1(1):2, 2015.
[44] E. Toch, J. Cranshaw, P. H. Drielsma, J. Y. Tsai, P. G. Kelley, J. Springfield, L. Cranor, J. Hong, and N. Sadeh. Empirical models of privacy in location sharing. In UbiComp, pages 129-138. ACM, 2010.
[45] H. Van Dyke Parunak and S. Brueckner. Entropy and self-organization in multi-agent systems. In AAMAS, pages 124-130. ACM, 2001.
[46] H. Wang, K. Yao, G. Pottie, and D. Estrin. Entropy-based sensor selection heuristic for target localization. In IPSN, pages 36-45. ACM, 2004.
[47] Y. Xiao and L. Xiong. Protecting locations with differential privacy under temporal correlations. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1298-1309. ACM, 2015.
[48] T. Xu and Y. Cai. Feeling-based location privacy protection for location-based services. In CCS, pages 348-357. ACM, 2009.
[49] K. Yanai, H. Kawakubo, and B. Qiu. A visual analysis of the relationship between word concepts and geographical locations. In CIVR, page 13. ACM, 2009.
[50] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao. PrivBayes: Private data release via Bayesian networks.
In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1423-1434. ACM, 2014.

APPENDIX
A. PROOFS OF THEOREMS

A.1 Proof of Theorem 1
We prove the bound on global sensitivity for two cases, c = 1 and c > 1. It is obvious that when c = 1, the maximum change of location entropy is log 2. When c > 1, log(1 + 1/exp(H(C ∖ c_u))) decreases when n increases; thus it is maximized when n = 1, and the maximum change is log 2. We also have:

[ log((n−1)/(n−1+C)) + (C/(n−1+C)) log C ]'_n
  = 1/(n−1) − 1/(n−1+C) − C log C/(n−1+C)²
  = [ (n−1+C)² − (n−1)(n−1+C) − (n−1)C log C ] / [ (n−1)(n−1+C)² ]
  = [ C² + (n−1)C − (n−1)C log C ] / [ (n−1)(n−1+C)² ]

Hence,

[ log((n−1)/(n−1+C)) + (C/(n−1+C)) log C ]'_n ≥ 0
  ⇔ C² + (n−1)C − (n−1)C log C ≥ 0
  ⇔ C ≥ (n−1)(log C − 1)

If log C − 1 ≥ 0, i.e., C is not less than the base of the logarithm (which is e in this work), then at n = C/(log C − 1) + 1 the expression log((n−1)/(n−1+C)) + (C/(n−1+C)) log C is maximized, and:

log((n−1)/(n−1+C)) + (C/(n−1+C)) log C
  = log( (C/(log C − 1)) / (C/(log C − 1) + C) ) + ( C / (C/(log C − 1) + C) ) log C
  = log(1/log C) + log C − 1
  = log C − log(log C) − 1

If log C − 1 < 0, then log((n−1)/(n−1+C)) + (C/(n−1+C)) log C always increases with n. Since

lim_{n→∞} [ log((n−1)/(n−1+C)) + (C/(n−1+C)) log C ] = 0,

it follows that log((n−1)/(n−1+C)) + (C/(n−1+C)) log C < 0 for all n. Similarly, we can prove that:

max_n [ log(n/(n+C)) + (C/(n+C)) log C ] = log C − log(log C) − 1

A.2 Proof of Theorem 2
In this section, we derive the bound for the sensitivity of the location entropy of a particular location l when the data of a user is removed from the dataset. For convenience, let n be the number of users visiting l, n = |U_l|, and c_u = |O_{l,u}|. Let C = {c_1, c_2, ..., c_n} be the set of numbers of visits of all users to location l. Let S = |O_l| = Σ_u c_u, and let S_u = S − c_u be the sum of the numbers of visits of all users to location l after removing user u.
From Equation 1, we have:

H(l) = H(O_l) = H(C) = −Σ_u (c_u/S) log(c_u/S)

By removing a user u, the entropy of location l becomes:

H(C ∖ c_u) = −Σ_{u'≠u} (c_{u'}/S_u) log(c_{u'}/S_u)

From Equation 4, we have:

H(C) = (S_u/S) H(C ∖ c_u) + H(S_u/S, c_u/S)

Subsequently, the change of location entropy when a user u is removed is:

ΔH_u = H(C ∖ c_u) − H(C) = (c_u/S) H(C ∖ c_u) − H(S_u/S, c_u/S)    (8)

Taking the derivative of ΔH_u w.r.t. c_u (note that S = S_u + c_u and that H(C ∖ c_u) does not depend on c_u):

(ΔH_u)'_{c_u} = (S_u/S²) H(C ∖ c_u) − (S_u/S²) log(S_u/S) − S_u/S² + (S_u/S²) log(c_u/S) + S_u/S²
  = (S_u/S²) ( H(C ∖ c_u) − log(S_u/c_u) )

We have:

(ΔH_u)'_{c_u} = 0 ⇔ H(C ∖ c_u) = log(S_u/c_u)

Therefore, ΔH_u decreases when c_u ≤ c*_u and increases when c_u > c*_u, where c*_u is the value such that H(C ∖ c_u) = log(S_u/c*_u). In addition, because H(C ∖ c_u) ≤ log(n−1) and n−1 < S_u when C > 1, we have ΔH_u < 0 when c_u = 1. Hence |ΔH_u| is maximized when c_u = c*_u or c_u = C.

Case 1: c_u = C. If ΔH_u < 0, then |ΔH_u| is maximized when c_u = c*_u. If ΔH_u > 0, we have:

0 ≤ ΔH_u = (c_u/S) H(C ∖ c_u) − H(S_u/S, c_u/S) ≤ (C/S) log(n−1) − H(S_u/S, C/S)

Taking the derivative of the right-hand side w.r.t. S_u:

[ (C/(S_u+C)) log(n−1) − H(S_u/(S_u+C), C/(S_u+C)) ]'_{S_u}
  = −(C/(S_u+C)²) log(n−1) + (C/(S_u+C)²) log(S_u/C)
  = (C/(S_u+C)²) log( S_u / ((n−1)C) ) ≤ 0

where the last inequality holds since S_u ≤ (n−1)C. Hence the bound is maximized when S_u is minimized, i.e., S_u = n−1. Therefore, when c_u = C and ΔH_u > 0:

ΔH_u ≤ (C/(n−1+C)) log(n−1) − H( (n−1)/(n−1+C), C/(n−1+C) )
  = (C/(n−1+C)) log(n−1) + ((n−1)/(n−1+C)) log((n−1)/(n−1+C)) + (C/(n−1+C)) log(C/(n−1+C))
  = log((n−1)/(n−1+C)) + (C/(n−1+C)) log C

Case 2: c_u = c*_u. Then:

ΔH_u = (c_u/S) log(S_u/c_u) − H(S_u/S, c_u/S)
  = (c_u/S) log(S_u/c_u) + (S_u/S) log(S_u/S) + (c_u/S) log(c_u/S)
  = (c_u/S) log(S_u/S) + (S_u/S) log(S_u/S)
  = log(S_u/S)

so that

|ΔH_u| = log(S/S_u) = log( (S_u+c_u)/S_u ) = log(1 + c_u/S_u) = log(1 + 1/(S_u/c_u)) = log(1 + 1/exp(H(C ∖ c_u)))

Hence |ΔH_u| is maximized when H(C ∖ c_u) is minimized.

Lemma 1.
Given a set of n numbers C = {c_1, c_2, ..., c_n}, 1 ≤ c_i ≤ C, the entropy H(C) is minimized when c_i = 1 or c_i = C for all i = 1, ..., n.

Proof. When the value of a number c_u is changed and the others are fixed, from Equation 8:

H(C)'_{c_u} = −(ΔH_u)'_{c_u} = (S_u/S²) ( log(S_u/c_u) − H(C ∖ c_u) )

Therefore, H(C) increases when c_u ≤ c*_u and decreases when c_u > c*_u, where c*_u is the value such that H(C ∖ c_u) = log(S_u/c*_u). Hence H(C) is minimized at c_u = 1 or c_u = C.

Lemma 2 (Minimum entropy). Given a set of n numbers C = {c_1, c_2, ..., c_n}, 1 ≤ c_i ≤ C, the minimum entropy is H(C) = log n + log( C log C/(C−1) ) − C log C/(C−1) + 1.

Proof. By Lemma 1, the entropy H(C) is minimized when C consists of n−k ones and k copies of C. Let S = Σ_i c_i = n − k + kC. We have:

H(C) = −((n−k)/S) log(1/S) − (kC/S) log(C/S) = log S − (kC/S) log C

Since S'_k = C − 1, taking the derivative of H(C) w.r.t. k:

(H(C))'_k = (C−1)/S − C log C · (S − k(C−1))/S²
  = [ (n−k+kC)(C−1) − nC log C ] / S²

(H(C))'_k = 0 ⇔ k = n(C log C − C + 1)/(C−1)²

At k = 0 and k = n, H(C) is maximized; thus H(C) is minimized at k = n(C log C − C + 1)/(C−1)². Strictly, k must be an integer, so k is chosen as ⌊k⌋ or ⌊k⌋ + 1, whichever yields the smaller H(C). Ignoring rounding, we substitute:

S = n + k(C−1) = n + n(C log C − C + 1)/(C−1) = nC log C/(C−1)

(kC/S) log C = (C log C − C + 1)/(C−1) = C log C/(C−1) − 1

H(C) = log( nC log C/(C−1) ) − C log C/(C−1) + 1 = log n + log( C log C/(C−1) ) − C log C/(C−1) + 1

Using Lemma 2, when c_u = c*_u:

|ΔH_u| ≤ log(1 + 1/exp(H(C ∖ c_u)))

where H(C ∖ c_u) = log(n−1) + log( C log C/(C−1) ) − C log C/(C−1) + 1. Thus, the maximum change of location entropy when a user is removed equals:
- log(n/(n−1)) when C = 1;
- max( log((n−1)/(n−1+C)) + (C/(n−1+C)) log C, log(1 + 1/exp(H(C ∖ c_u))) ), with H(C ∖ c_u) as above, when n > 1 and c > 1.
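The closed form of Lemma 2 (which ignores the integrality of k) can be validated numerically against a brute-force search over the {1, C}-valued configurations that Lemma 1 permits; a verification sketch:

```python
import math

def entropy(counts):
    # Shannon entropy of a set of visit counts.
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

def min_entropy_bruteforce(n, C):
    # By Lemma 1 the minimizer uses only values 1 and C; try every count k of C's.
    return min(entropy([1] * (n - k) + [C] * k) for k in range(n + 1))

def min_entropy_closed(n, C):
    # Lemma 2: log n + log(C log C/(C-1)) - C log C/(C-1) + 1, for C > 1.
    t = C * math.log(C) / (C - 1)
    return math.log(n) + math.log(t) - t + 1
```

The closed form is a slight lower bound on the brute-force minimum, since the optimal continuous k is rounded to an integer in the discrete search.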
Similarly, the maximum change of location entropy when a user is added equals:
- log((n+1)/n) when C = 1;
- max( log(n/(n+C)) + (C/(n+C)) log C, log(1 + 1/exp(H(C ∖ c_u))) ), where H(C ∖ c_u) = log(n−1) + log( C log C/(C−1) ) − C log C/(C−1) + 1, when n > 1 and c > 1.

Thus, we have the proof of Theorem 2.

A.3 Proof of Theorem 3
In this section, we prove Theorem 3, i.e., that Algorithm 2 satisfies ε-differential privacy. We prove the case when a user is removed from the database; the case when a user is added is similar. Let l be an arbitrary location and A_l : L → R be Algorithm 2 when only location l is considered for perturbation. Let O_l(org) be the original set of observations at location l; O_l,u(org) be the original set of observations of user u at location l; O_l,u be the set of observations after limiting c_u to C and limiting each user to at most M locations; O_l be the set of all O_l,u for all users u that visit location l; and C be the corresponding set of visit counts. Let O_l ∖ O_l,u be the set of observations at location l when user u is removed from the dataset. Let b = MΔH/ε.

If O_l,u(org) = ∅:

Pr[A_l(O_l(org)) = t_l] / Pr[A_l(O_l(org) ∖ O_l,u(org)) = t_l] = 1

If O_l,u(org) ≠ ∅:

Pr[A_l(O_l(org)) = t_l] / Pr[A_l(O_l(org) ∖ O_l,u(org)) = t_l]
  = Pr[A_l(O_l) = t_l] / Pr[A_l(O_l ∖ O_l,u) = t_l]
  = Pr[H(C) + Lap(b) = t_l] / Pr[H(C ∖ c_l,u) + Lap(b) = t_l]

If H(C) ≤ H(C ∖ c_l,u):

Pr[H(C) + Lap(b) = t_l] / Pr[H(C ∖ c_l,u) + Lap(b) = t_l]
  = Pr[H(C) + Lap(b) = t_l] / Pr[H(C) + ΔH(l) + Lap(b) = t_l]
  = Pr[Lap(b) = t_l − H(C)] / Pr[Lap(b) = t_l − H(C) − ΔH(l)]
  = exp( (1/b) ( |t_l − H(C) − ΔH(l)| − |t_l − H(C)| ) )
  ≤ exp( |ΔH(l)|/b ) ≤ exp( ΔH/b ) = exp(ε/M)

If H(C) > H(C ∖ c_l,u), similarly:

Pr[H(C) + Lap(b) = t_l] / Pr[H(C ∖ c_l,u) + Lap(b) = t_l] ≤ exp(ε/M)

Similarly, we can prove that:

Pr[A_l(O_l(org) ∖ O_l,u(org)) = t_l] / Pr[A_l(O_l(org)) = t_l] ≤ exp(ε/M)

Therefore, A_l satisfies 0-differential privacy when O_l,u(org) = ∅, and satisfies (ε/M)-differential privacy when O_l,u(org) ≠ ∅. For all locations, let A : L → R^{|L|} be Algorithm 2, and let L_1 be any subset of L.
Let $T = \{t_1, t_2, \ldots, t_{|L_1|}\} \in \mathrm{Range}(\mathcal{A})$ denote an arbitrary possible output; thus, $\mathcal{A}$ is the composition of all $\mathcal{A}_l$, $l \in L$. Let $L(u)$ be the set of all locations that $u$ visits, with $|L(u)| \le M$. Applying composition theorems [14] to the $\mathcal{A}_l$, where $\mathcal{A}_l$ satisfies $0$-differential privacy when $l \notin L_1 \cap L(u)$ and $\frac{\epsilon}{M}$-differential privacy when $l \in L_1 \cap L(u)$ (so at most $M$ of the composed mechanisms each consume $\epsilon/M$ of the budget), $\mathcal{A}$ satisfies $\epsilon$-differential privacy. $\square$
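Per location, the mechanism analyzed above reduces to adding Laplace noise of scale $b = M\Delta H/\epsilon$ to $H(\mathcal{C})$. The sketch below illustrates this step only; the function and variable names are ours, and the sensitivity $\Delta H$ is taken as a given input rather than computed from Theorem 2:

```python
import math
import random

def location_entropy(counts):
    """Entropy of the truncated visit counts c_{l,u} at one location."""
    s = sum(counts)
    return -sum(c / s * math.log(c / s) for c in counts)

def sample_laplace(scale):
    """Draw Lap(scale) via inverse CDF (the stdlib has no Laplace sampler)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def publish_entropy(counts, eps, M, delta_H):
    """Release H(C) + Lap(b) with b = M * delta_H / eps.

    Each perturbed location is then (eps / M)-differentially private; since
    a user contributes to at most M locations, the composition argument in
    the proof makes the full release eps-differentially private.
    """
    b = M * delta_H / eps
    return location_entropy(counts) + sample_laplace(b)
```

Averaging many releases recovers the true entropy, while any single release is noisy in proportion to $M\Delta H/\epsilon$, which is why the paper bounds $M$ and $\Delta H$ so aggressively.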
Hien To, Kien Nguyen, and Cyrus Shahabi. "Differentially private publication of location entropy." Computer Science Technical Reports (Los Angeles, California, USA: University of Southern California. Department of Computer Science) no. 968 (2016).