A Privacy Engineering Framework for Big Data

Ranjan Pal, Chien-Lun Chen, Leana Golubchik
University of Southern California
{rpal, chienlun, leana}@usc.edu

ABSTRACT
In the Big Data era, the collection and mining of high-dimensional user data has become a fast growing and common practice by a large number of private and public institutions. In some cases, these institutions share data with third parties without an anonymity-preferring user's consent or awareness, thereby risking his privacy. In other cases, data is released voluntarily by a user not necessarily preferring anonymity to specific data collecting entities in order to get a service in return, e.g., targeted recommendations, but with the intention of keeping some sensitive data attributes private. In either case, privacy risks can arise from the leakage of private user data through inference from correlated non-private data. In this paper we propose HIDE, a computationally efficient information-theoretic privacy engineering framework for high-dimensional datasets. A salient advantage of HIDE is its ability to generate optimal utility-privacy tradeoffs even when the privacy preserving entity, in the worst case, has no prior information linking a user's private data with his public data. We validate the efficacy of HIDE via extensive experiments conducted on synthetic and real world datasets.

Categories and Subject Descriptors
H.2.0 [Security, Integrity, and Protection]: Privacy Protection

Keywords
privacy; correlation; high-dimensionality; mutual information

1. INTRODUCTION
As information technology and electronic communications re-define almost every sphere of human activity, including commerce, medicine, and social networking, the risk of accidental or intentional disclosure of sensitive private information has increased. The concomitant creation of large centralized searchable data repositories, and the deployment of applications that use them, has made 'leakage' of private information such as medical data, credit card information, and power consumption data highly probable, and this poses an important and urgent societal problem. In some applications (e.g., health data collection by a hospital), the collection of information (e.g., smokers having cancer), its analysis (determining the correlation between smoking and having cancer), and its sharing with third parties (e.g., insurance agencies) is sometimes done without an anonymity-preferring user's consent or awareness, thereby breaching user privacy. A well known example of such a breach is the de-anonymized health record of the governor of Massachusetts in 1997 [1].
In some other applications (e.g., cus- tomer product rating collection by a retail organization such as Wal- mart), users (customers) do not bother if their identity information is shared by an organization with third parties with whom the or- ganization has business relationships (and who can provide price customer discounts for shopping on Walmart), as long as customer product information is kept private from these third parties. Failure to ensure such a privacy results in embarassing breaches such as the well known Target teenage pregnancy information breach of 2012. 1.1 Research Motivation The main challenge to ensuring that private information remains private is the fact that non-sensitive public information is correlated with private information, and this correlation can lead to an infer- ence attack whereby an adversary can have access to the private data of users in a dataset. A popular solution approach to solving the inference attack problem is to distort the release of public in- formation in a manner so as to prevent the leaking of private user data from public data. However, the distortion should be done in a manner to provide significant utility to the party, e.g., third party, receiving the distorted public data. In this section, we motivate our research with respect to the principles adopted in existing research on appropriate ways to distort publicly released information. Differential Privacy - There have been a plethora of public data distortion solutions (see Section 6) based on the differential privacy concept introduced by Dwork [2], where statistically distorted pub- lic data is released in the form of a query output on a given statis- tical database. However, all these solutions are aimed towards pre- serving user anonymity. In addition, differential privacy (DP) gen- erally adds noise, i.e., a probability distribution, to a true query out- put (be it scalar (single dimensional) or vector (multi-dimensional)) without explicitly factoring in the correlations between the private sensitive and the non-sensitive data attributes. In fact, DP does not account for the distribution of user data. As a result, the statistical noise added to preserve DP might not be well calibrated to ensure a given level of user privacy, i.e., the noise amount added to pre- serve DP might be more than required for a given privacy level of user privacy, thereby leading to a less than possible utility for the intended parties. Thus, one of our goals is to design a distortion method for publicly released data that explicitly accounts for the correlations between the private and public data of users and sub- sequently provides better utility than DP to interested parties, for a given level of user privacy. Information-Theoretic Privacy - Several approaches have used information-theoretic (see Section 6) tools to model privacy-utility tradeoffs. Indeed, information theory, and more specifically rate- distortion theory, appear as natural frameworks to analyze tradeoffs in privacy and utility resulting from the distortion of public user data that is statistically correlated with private user data. However, most of these approaches have modeled the privacy-utility trade- off problem as a mathematical optimization problem, where the solution relies on the fundamental assumption that the prior joint distribution that links users’ private data and data to be released is known and can be fed as an input to the optimization problem. 
In practice, this true prior distribution may not be known, but rather some prior statistics may be estimated in the best case from a set of sample data that can be observed. Thus, one of our goals is to design a solution method for privacy-utility tradeoff optimization problems, where the above mentioned prior distribution may in the worst case be unknown to the privacy preserving entity. High-Dimensional Datasets - For high-dimensional datasets hav- ing a large number of attributes, which are common to many real- world settings, the application of the aforementioned privacy-utility optimization framework to ensure optimal privacy-utility tradeoffs comes with a practical scalability challenge, i.e., information the- ory based optimization problems take a large amount of time to solve when accounting for high-dimensional user data, the primary reason being the large number of variables any solver has to deal with. This holds true even in the simplistic case when the prior joint distribution that links users’ private data and data to be re- leased is known and can be fed as an input to the optimization problems. Thus, a matter of concern is the time required to solve the optimization problems for high-dimensional datasets when the prior joint distribution is unknown. In this regard, one of our goals is to design scalable methods that help solve privacy-utility trade- off optimization problems in a fast manner for high-dimensional datasets, and under the additional assumption that the prior joint distribution that links users’ private data and data to be released might be unknown. Combined Goal - Our combined goal in this paper is to design a mechanism that satisfies each of the three aforementioned goals. 1.2 Research Contributions We make the following research contributions in this paper. 1. For a given dataset, we propose our information-theoretic privacy-utility tradeoff model, where the optimal privacy- utility tradeoffs are obtained via solutions to optimizations problems. The dataset could either pertain to multiple users or a single user. Our model takes into account the fact that the prior joint distribution that links users’ private data and data to be released might be unknown in the worst case (see Section 2). 2. For high-dimensional datasets, we develop a computation- ally efficient mechanism, HIDE, that consists of two compo- nents, and leads to optimal privacy-utility tradeoffs via the solution of optimization problems that model the tradeoffs. As a first component of HIDE, we propose a general solution methodology to solve the information theory driven privacy- utility tradeoff optimization problems when the prior joint statistical distribution between users’ private and public data is unknown in the worst case to the optimizer. We first show that these optimization problems are convex in the functional space for a certain general class of objective functions and constraints, where the functional space is over certain prob- ability distribution functions. For such problems, we pro- vide details of our solution methodology that uses properties of Hilbert spaces in functional analysis theory, and guaran- tees a unique optimal solution for a given optimization prob- lem. Second, we show that for a certain relevant special class of constraints, the privacy-utility tradeoff problems are non-convex in the probability distribution functional space. For such problems, we provide details of a greedy solution methodology. 
Finally, we prove tight bounds for the mini- mum amount of private information disclosed, i.e., a mea- sure of privacy leakage, for a given utility guarantee to a data analyst (see Section 3). 3. In Section 4, we describe in detail the second component of HIDE, that is a novel computationally efficient mechanism to choose the most relevant dataset attributes from a large set of private and public user data attributes on which to apply our mathematical framework (in section 3) to optimize privacy- utility tradeoffs. This mechanism merges information-theoretic concepts and dimensionality reduction techniques from un- supervised machine learning to arrive at a significantly smaller set of statistically relevant data attributes. 4. We validate the theory behind HIDE with extensive experi- ments conducted on practical datasets. As part of our main experimental results, we (i) show that the performance, i.e., privacy-utility tradeoffs, obtained via HIDE are better than those obtained from differential privacy (DP), for both sin- gle and compositional queries, and (ii) show that the com- putationally efficient mechanism in Section 4 speeds up the running time of HIDE by significant margins (upto eight times) when compared to a mechanism which accounts for all attributes of a dataset (see Section 5). 2. PRIV ACY-UTILITY TRADEOFF MODEL In this section, we describe in detail the privacy-utility tradeoff model for a given dataset. In this regard, we first mention our pri- vacy setting. Second, we describe our threat model and the privacy metric. Third, we explain our utility metric, and finally we propose the mathematical optimization problem to capture the notion of a privacy-utility tradeoff. 2.1 The Privacy Setting We consider a multi-user database setting where each user has some private data, represented by the random variable (vector)S that is correlated with his non-private data, represented via a ran- dom variable (vector)X. We assume that the database is owned and managed by a non-malicious database manager (DBM). Both X andS can take values from either the numeric or the non-numeric domain. Each user in the database might be interested to share his non-private (public) data (via the DBM), that could include his name or id, with an analyst to get some value (e.g., price discounts on future purchases) in return. We capture the correlation between S and X via the joint dis- tribution PS;X . Due to this correlation, releasing X to the ana- lyst would enable him to draw some inference on the private data S of a user. In order to reduce the inference threat on S that would arise from the observation of X, rather than releasing X, the database owner (DBM) releases a distorted version of X de- noted by a random variable (vector) Y . The distorted data Y is generated by passing X through a randomized mechanism (e.g., additive white Gaussian noise (AWGN) [3]) with a conditional dis- tributionP YjX , called the privacy mapping. Throughout the paper, we assumeS!X!Y form a Markov chain, i.e., givenX,Y is independent ofS. Therefore, once the distributionP YjX is fixed, the joint distributionPS;X;Y =P YjX PS;X is defined. We discuss the assumptions on the availability to DBM ofP YjX andPS;X in Section 2.4. The analyst is a legitimate recipient of dataY , which it can use to provide utility to a user, e.g. some personalized service. How- ever, the analyst can also act as an adversary by usingY to illegit- imately infer private dataS about the user. 
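To make the setting concrete, the following minimal sketch (with a hypothetical two-valued private attribute S, a three-valued public attribute X, and numbers that are not from the paper) builds the joint distribution P_{S,X,Y} = P_{Y|X} P_{S,X} induced by a candidate privacy mapping; this joint is the object from which the privacy and utility metrics below are computed.

import numpy as np

# Hypothetical prior P(S, X): rows index the private value s, columns the public value x.
P_SX = np.array([[0.30, 0.10, 0.05],
                 [0.05, 0.15, 0.35]])

# A candidate privacy mapping P(Y|X): row x is a distribution over the released value y.
P_Y_given_X = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.1, 0.1, 0.8]])

# Markov chain S -> X -> Y: P(s, x, y) = P(s, x) * P(y | x).
P_SXY = P_SX[:, :, None] * P_Y_given_X[None, :, :]
assert np.isclose(P_SXY.sum(), 1.0)

P_SY = P_SXY.sum(axis=1)   # what the adversary can exploit to infer S
P_XY = P_SXY.sum(axis=0)   # what determines the distortion, and hence the utility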
The privacy mapping aims at balancing the tradeoff between utility and privacy, i.e., the privacy mapping should be designed so as to decrease the inference threat on private data S by reducing the dependency between Y and S, while at the same time preserving the utility of Y by maintaining the dependency between Y and X.

2.2 Threat Model and the Privacy Metric
Threat Model - We consider threats that are inferential in nature, i.e., threats that allow an adversary with unbounded computational power to infer S from Y. For the purpose of this work we use the terms 'analyst' and 'adversary' interchangeably. We assume that the DBM may have complete, partial, or no knowledge of P_{S,X}. The adversary, however, has complete knowledge of P_{S,X} in the form of side information. We also assume that the adversary has complete knowledge of the privacy mapping, i.e., of P_{Y|X}. Thus, we ensure that the adversary has the worst case statistical side information about S. Moreover, the adversary knows that the input S is drawn from a distribution space P_S, and it chooses q ∈ P_S that minimizes an expected inference cost function C(S, q). In other words, the adversary chooses a belief distribution q over the private information S prior to observing Y as

q_0 = arg min_{q ∈ P_S} E_S[C(S, q)],

and a revised belief distribution after observing Y = y as

q_y = arg min_{q ∈ P_S} E_{S|Y}[C(S, q) | Y = y].

Using the chosen belief distribution q, the adversary can produce an estimate of the input S, e.g., using a Maximum a Posteriori (MAP) estimator. Note here that the analyst might or might not have information about the distribution space P_S.

Privacy Metric - Let c_0 and c_y be the minimum average cost of inferring S without observing Y, and after observing Y = y, respectively. We represent these quantities mathematically as

c_0 = min_{q ∈ P_S} E_S[C(S, q)]  and  c_y = min_{q ∈ P_S} E_{S|Y}[C(S, q) | Y = y].

The average inference cost gain

ΔC = c_0 − E_{P_Y}[c_Y]

measures the improvement in the quality of the inference of private data S due to the observation of Y. We adopt ΔC as the privacy metric in this work. The goal of the DBM is to design the privacy mapping P_{Y|X} so as to reduce ΔC, i.e., it should aim to bring the inference cost given the observation Y closer to the initial inference cost c_0 without observing Y.

Cost Function - An important aspect of the privacy metric is the design of the cost function C. From a practical viewpoint, we expect the cost function to be consistent with a rational adversary, i.e., the prior and posterior distributions that minimize E_S[C(S, q)] and E_{S|Y}[C(S, q) | Y = y], respectively, should be the adversary's most likely estimates of the true distributions that characterize S and S|Y. In this regard, we find the self-information (SI) cost function [4], defined as

C(S, q) = −log q(S),

to be one of the most appropriate cost functions for the adversary, and we use it in our work (see rationale below). From a practical viewpoint, the self-information cost function captures the degree of uncertainty of the random variable (vector) S: the greater the uncertainty in S, the greater the cost the adversary incurs in inferring S. The adversary would thus hope for a significant change in the amount of uncertainty between trying to infer q ∈ P_S with and without observing Y, and thereby effect a privacy breach. In a technical sense, our rationale for using the self-information cost function throughout this work is as follows:
1. For the self-information cost function, the prior and posterior distributions that minimize the expectations E_S[−log q(S)] and E_{S|Y}[−log q(S) | Y = y], respectively, are the true distributions that characterize S and S|Y [5]. Moreover, by the Shannon-McMillan-Breiman theorem [5], under certain ergodicity assumptions this holds not only in the expected value sense, but also in the almost sure sense.

2. We show that ΔC under the self-information cost function is the mutual information I(S;Y) between S and Y, and that it characterizes the privacy leak of private data S from the public output Y (see Lemma 1 below). Here, we define the mutual information between S and Y in the information-theoretic sense [3] as

I(S;Y) = Σ_{s,y} p(s,y) [log p(s,y) − log p(s) − log p(y)].

It is evident that when S and Y are independent, I(S;Y) is zero, and no information about S can be obtained from the release of Y.

3. We prove that any bounded C(S, q) leads to a ΔC that is a function of I(S;Y) and a constant factor (see Theorem 1 below). Thus, it is safe to use the self-information cost function as a representative of all bounded C(S, q) in order to capture the privacy leakage via ΔC.

We now state Lemma 1 and Theorem 1; the proofs are in the Appendix.

LEMMA 1. The average inference cost gain ΔC under the self-information cost function is the mutual information between S and Y: ΔC = I(S;Y).

Lemma 1 relates the privacy leakage, modeled as the average inference cost gain ΔC of the adversary under the self-information cost function, to the mutual information between S and Y.

THEOREM 1. Let C(s, q) be a bounded cost function such that L = sup_{s ∈ S, q ∈ P_S} |C(s, q)| < ∞. Then the following result holds:

ΔC = c_0 − E_Y[c_Y] ≤ 2√2 · L · √I(S;Y).

Theorem 1 relates the privacy leakage, modeled as the average inference cost gain ΔC under any bounded cost function, to the mutual information between S and Y.

2.3 Utility Metric
One of the goals of the DBM is to provide considerable utility to the analyst. In this work, we define the utility of the analyst via the average distortion of Y from X, i.e., E_{X,Y}[d(X,Y)], where d is a distortion measure taking non-negative real values. Thus, the smaller the expected distortion, the higher the utility for the analyst. For a given scalar D ≥ 0, the utility constraint is given as

E_{X,Y}[d(X,Y)] ≤ D.

When d(X,Y) is the SI cost function, we have d(x, y) = −log P(X = x | Y = y), and

E_{X,Y}[d(X,Y)] = E_{X,Y}[−log P_{X|Y}] = H(X|Y).

Thus, the constraint E_{X,Y}[d(X,Y)] ≤ D is equivalent to the constraint H(X|Y) ≤ D for a given distortion level D. Given P_X, and therefore H(X), the utility constraint can be rewritten as

I(X;Y) ≥ T,  where T = H(X) − D.

Basically, this utility constraint says that the amount of information Y conveys about X (the true public data) should be above a given threshold T to satisfy the data analyst.

2.4 Privacy-Utility Tradeoff Problem
Our goal is to design a privacy-preserving mapping P_{Y|X} that minimizes ΔC for a given distortion level D.

Problem Formulation - Given complete knowledge by the database manager (DBM) of the joint distribution P_{S,X}, which captures the correlation between S and X, we represent our goal as the optimization task

arg min_{P_{Y|X}} ΔC   subject to   E_{X,Y}[d(X,Y)] ≤ D.

Here, the objective function represents the privacy requirement of the DBM, and the constraint represents the utility requirement of the analyst, i.e., the average distortion between X and Y should not be greater than a pre-specified level D.
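As a numeric sanity check on Lemma 1 (with a hypothetical joint P_{S,Y}, not taken from the paper), the sketch below computes the inference cost gain under the self-information cost as ΔC = c_0 − E_Y[c_Y] = H(S) − H(S|Y) and confirms that it coincides with I(S;Y).

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(P):
    # I(A;B) in bits for a joint distribution P[a, b].
    Pa, Pb = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    m = P > 0
    return (P[m] * np.log2(P[m] / (Pa @ Pb)[m])).sum()

# Hypothetical joint of private data S and released data Y.
P_SY = np.array([[0.28, 0.07, 0.10],
                 [0.09, 0.14, 0.32]])

# Under the self-information cost, c_0 = H(S) and E_Y[c_Y] = H(S|Y),
# so Delta-C = H(S) - H(S|Y), which Lemma 1 says equals I(S;Y).
H_S = entropy(P_SY.sum(axis=1))
H_S_given_Y = sum(P_SY[:, y].sum() * entropy(P_SY[:, y] / P_SY[:, y].sum())
                  for y in range(P_SY.shape[1]))
assert np.isclose(H_S - H_S_given_Y, mutual_information(P_SY))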
For the SI cost function, the above optimization problem can be expressed as

arg min_{P_{Y|X}} I(S;Y)   subject to   E_{X,Y}[d(X,Y)] ≤ D,

which further simplifies to

arg min_{P_{Y|X}} E_Y[D(P_{S|Y} || P_S)]   subject to   E_{X,Y}[d(X,Y)] ≤ D.

The latter formulation of the privacy-utility tradeoff problem has an interesting and intuitive interpretation: if we consider the KL-divergence as a metric for the distance between two distributions, then the objective function states that the revised distribution after observing Y should be as close as possible to the a priori distribution in terms of the KL-divergence, to ensure minimum privacy leakage.

Partial/No Information Structure - In practice, the DBM may not have access to the exact joint probability distribution P_{S,X}, but rather, in the best case, only some partial knowledge of the joint statistics of S and X. For example, the prior joint distribution could be estimated from a set of users who do not have privacy concerns and publicly release both their private and non-private data. Alternatively, when the private data cannot be observed, the marginal distribution of the data to be released, or simply its second order statistics, may be estimated from a set of users who only release their non-private data. In practice, there may also exist a mismatch between the estimated prior statistics and the true prior statistics, for example due to a small number of observable samples, or due to the incompleteness of the samples. Under the scenario where the DBM has partial information about P_{S,X}, the privacy-utility tradeoff problem can be formulated as

min_{P_{Y|X}} max_{P_{S|X}} I(S;Y)   subject to   E_{X,Y}[d(X,Y)] ≤ D,

where the DBM has complete knowledge of the marginal P_X without knowing the joint P_{S,X}. Note that the search space of the max component of the objective function is over P_{S|X}, since P_{S,X} = P_{S|X} P_X and P_X is known. Unlike the objective functions above, where P_{S,X} was known, this objective aims to capture the maximum mutual information possible (instead of the exact MI) between S and Y when P_{S,X} is unknown. Similarly, when the DBM has no knowledge of the statistics of S and/or X, the privacy-utility tradeoff problem becomes

min_{P_{Y|X}} max_{P_{S,X}} I(S;Y)   subject to   E_{X,Y}[d(X,Y)] ≤ D.

Here, since both P_{S|X} and P_X are unknown, the search space of the max component of the objective function is over P_{S,X}.

3. ANALYSIS OF TRADEOFF PROBLEMS
In this section we analyze the structures of the optimization problems (i.e., whether a problem is convex or not) characterizing privacy-utility tradeoffs, and propose computational methods to solve them. In addition, we provide tight bounds on the minimum amount of private information disclosed, i.e., a measure of the privacy leakage, for a given utility guarantee to a data analyst.

3.1 Structures of Optimization Problems
In this section, we first analyze the structure of the privacy-utility tradeoff optimization problems when the DBM has complete information about P_{S,X}. We then analyze the structure of the optimization problems when the DBM has partial or no information about P_{S,X}.

Full Information Case: Let us denote by OPT1 the following optimization problem:

arg min_{P_{Y|X}} ΔC = I(S;Y)   subject to   E_{X,Y}[d(X,Y)] ≤ D,

where the objective function represents the privacy requirement of the DBM, and the constraint represents the utility requirement of the analyst. We have the following theorem regarding the structure of OPT1, the proof of which is in the Appendix.

THEOREM 2.
GivenPS;X ,d(;), and are known to the DBM, OPT1 is a convex optimization problem whend(;) is not a func- tion of the statistical properties ofX andY , and is a non-convex problem otherwise. The implication of Theorem 2 is that it lays the foundation be- hind the design of computationally efficient algorithms to solve the privacy-utility tradeoff optimization problems when the DBM has complete information aboutPS;X . As a special case of the full in- formation case, the DBM might have access toS in which caseX is a deterministic function ofS andY is determined directly from S. In this case,OPT1 can be written as arg min P YjX;S ;P YjX ;P YjS X y2Y X s2S P YjS (yjs)PS (s) log P YjS (yjs) PY (y) ; (1) subject to X y2Y X x2X P YjX (yjx)PX (x)d(y;x) ; (2) X x2X P XjS (xjs)P YjX;S (yjx;s) =P YjS (yjs)8y;s; (3) X s2S P YjS (yjs)PS (s) =PY (y)8y; (4) X s2S P SjX (sjx)P YjX;S (yjx;s) =P YjX (yjx)8y;x: (5) We have the following corollary derived from Theorem 2, the proof of which is in the Appendix. COROLLARY 1. IfX is a deterministic function ofS andS! X!Y ,OPT1 simplifies toOPT2, that is the following optimiza- tion problem. arg min P YjX C =I(X;Y ) subject to EX;Y [d(X;Y )]D; and is convex whend(;) is not a function of the statistical prop- erties ofX andY , and is a non-convex problem otherwise. The corollary signifies the generality ofOPT1 to be applicable also whenX is a deterministic function ofS. Partial/No Information Case: Before we analyze the structure of optimization problems for this case, we state some definitions and results from existing work that will be useful to model optimization problems in a manner that leads to the design of computationally efficient solutions. Definition 1. [6] Given a joint distributionPX;Y , we define S (X;Y ) = sup Q X 6=P X D(QYjjPY ) D(QXjjPX ) 1; whereQY is the marginal probability ofY resulting from the joint distributionP YjX QX . THEOREM 3. (due to [7]) If S ! X ! Y form a Markov chain, thenI(S;Y ) S (X;Y )I(S;X), and the bound is tight as we varyS. In order words, assumingI(S;X)6= 0, we have sup S:S!X!Y I(S;Y ) I(S;X) =S (X;Y ): (6) Theorem 3 is important to our work as it decouples the depen- dency of Y and S into two terms, one relating S and X and the other relatingX andY . Thus, the DBM can upper bound the pri- vacy leakage (on releasing Y ) even without knowing PS;X . The intuition is as follows: assume that the DBM does not use any pri- vacy preserving mapping to release true public attributesX, it can guarantee I(S;X) H(S) - privacy, whereH(S) is the Shannon entropy of S and 0 I(S;X) H(S) 1. On the other hand, for a desired privacy levels2 [0; 1], the DBM can guarantees I(S;X) H(S) - privacy. Thus, irrespective of the DBM having no knowledge about the joint dis- tribution on (S;X), it can guarantees-divergence privacy. Here, D(PS;YjjPSPY ) =H(S)I(S;Y )I(S;X)S (X;Y ); where (a) the latter equivalence follows from Theorem 3, (b)2 [0; 1] is the privacy leakage factor, (c) divergence privacy is equal to I(S;X) H(S) , and (d)s I(S;X) H(S) correlates with the amount of privacy leakage. 
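To make this guarantee concrete, consider a small worked example with illustrative numbers (not taken from the paper): suppose the private attribute S is uniform over four values, so H(S) = 2 bits, and the prior satisfies I(S;X) = 0.5 bits, i.e., a leakage factor of I(S;X)/H(S) = 0.25. If the chosen mapping yields S*(X;Y) = 0.4, then Theorem 3 gives

I(S;Y) ≤ S*(X;Y) · I(S;X) = 0.4 × 0.5 = 0.2 bits,

so the release Y can resolve at most a tenth (0.2/2) of the uncertainty in S. Moreover, since I(S;X) ≤ H(S) always holds, the weaker bound I(S;Y) ≤ S*(X;Y) · H(S) = 0.8 bits applies to every private S that forms the Markov chain S → X → Y, which is the sense in which the DBM can bound the leakage without knowing P_{S,X}.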
Based on Theorem 3, under the scenario when the DBM has partial information aboutPS;X (only has knowledge ofPX ), the privacy-utility tradeoff problem (OPT3) is formulated as follows: OPT3 : min P YjX max P SjX I(S;Y ) min P YjX max P X S (X;Y ) subject to EX;Y [d(X;Y )]D: Under the scenarion when the DBM has no information aboutPS;X , the privacy-utility tradeoff problem (OPT4) is formulated as fol- lows: OPT4 : min P YjX max P S;X I(S;Y ) min P YjX S (X;Y ) subject to EX;Y [d(X;Y )]D: In order to solveOPT3 andOPT4, we first need the following definition from [8]: Definition 2. Given two random variablesX andY , the maxi- mal correlation,m(X;Y ), betweenX andY is given by m(X;Y ) = max ( 1 (X); 2 (Y))2T E[1(X)2(Y )]; (7) whereT is the collection of pairs of real-valued random vari- ables1(X);2(Y ) such thatE[1(X)] = E[2(Y )] = 0, and E[1(X) 2 ] =E[2(Y ) 2 ] = 1. We now take help of the following result from [6] that connects quantitiesS (X;Y ) andm(X;Y ): max P X 2 m (X;Y ) = max P X S (X;Y ); (8) when sup S:S!X!Y I(S;Y) I(S;X) is not achievable. i.e., when sup S:S!X!Y I(S;Y ) I(S;X) 6= max S:S!X!Y I(S;Y ) I(S;X) : Under these conditions, OPT3 and OPT4 can respectively be reformulated as OPT3 : min P YjX max P X 2 m (X;Y ) subject to EX;Y [d(X;Y )]D; and OPT4 : min P YjX 2 m (X;Y ) subject to EX;Y [d(X;Y )]D: It is a well known result from [9] thatm(X;Y ) is characterized by the second largest singular value of the matrix Q with entries Qx;y = P(x;y) p P(x)P(y) . Thus,OPT3 can be solved numerically using the standard power iteration algorithm [10]. In Section 3.2, we explain the details behind the context and motivation to use the power algorithm. We now have the following theorem stating the structure of optimization problemOPT4, the proof of which is in the Appendix. THEOREM 4. For a given distributionPX , let p PX denote a vector with entries equal to the square root of the entries ofPX . If (a)Q is ajXjjYj matrix satisfying the following constraints: (i) Q 0 (entry-wise), and (ii)QQ T p PX = p PX , and (b)d(;) is not a function of the statistical properties ofX andY , thenOPT4 is convex, and non-convex otherwise. The implication of Theorem 4 is that it lays the foundation be- hind the design of computationally efficient algorithms to solve the privacy-utility tradeoff optimization problems when the DBM in the worst case has no information aboutPS;X . 3.2 Computing Optimal Solutions In this section, we first state a result characterizing a non-trivial guarantee on the probability of error in inferring private dataS by an adversary. We then propose methods for computing optimal so- lutions to privacy-utility tradeoff optimization problems. Error Probability in Inferring S: One natural and related ques- tion is whether a privacy mapping designed to minimize informa- tion leakage by solving problemsOPT f1;2;3;4g also provides guar- antees on the probability of error in inferringS fromY . It turns out that a DBM can guarantee a tight lower bound on this error prob- ability based on Fano’s inequality [3], for any inference algorithm used by the adversary. We have the following proposition from [3]: PROPOSITION 1. If |S| > 2, andI(S;Y )H(S), we have P [ S(Y )6=S] (1)H(S) 1 log(jSj 1) ; (9) andH(Pe) (1)H(S), forjSj = 2. Here S(Y ) is the estimator ofS inferred fromY , and2 [0; 1] is the privacy leakage factor. Next, we focus on algorithms for computing solutions to the optimization problems. 
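Before turning to those algorithms, it may help to see what the inner maximization in OPT3 amounts to. The following brute-force sketch (a naive Monte-Carlo baseline with hypothetical alphabet sizes and a Dirichlet sampler of our own choosing, not the method developed in this section) evaluates the worst-case leakage of one fixed mapping by sampling many conditionals P(S|X); the spectral reformulation via S*(X;Y) and maximal correlation is what replaces such a search in practice.

import numpy as np

rng = np.random.default_rng(0)

def mutual_information(P):
    Pa, Pb = P.sum(axis=1, keepdims=True), P.sum(axis=0, keepdims=True)
    m = P > 0
    return (P[m] * np.log2(P[m] / (Pa @ Pb)[m])).sum()

# Known public marginal P(X) and a fixed candidate privacy mapping P(Y|X).
P_X = np.array([0.5, 0.3, 0.2])
P_Y_given_X = np.array([[0.7, 0.2, 0.1],
                        [0.2, 0.6, 0.2],
                        [0.1, 0.2, 0.7]])

def worst_case_leakage(P_X, P_Y_given_X, n_s=2, trials=20000):
    # Monte-Carlo approximation of  max over P(S|X) of I(S;Y)  for the fixed mapping:
    # draw random conditionals P(S|X), build P(S,Y) through the chain S -> X -> Y,
    # and keep the largest leakage seen.
    worst = 0.0
    for _ in range(trials):
        P_S_given_X = rng.dirichlet(np.ones(n_s), size=len(P_X)).T   # shape (|S|, |X|)
        P_SX = P_S_given_X * P_X                                     # joint P(s, x)
        P_SY = P_SX @ P_Y_given_X                                    # joint P(s, y)
        worst = max(worst, mutual_information(P_SY))
    return worst

print(worst_case_leakage(P_X, P_Y_given_X))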
Computing Optimal Solutions:OPT f1;2;4g are convex optimiza- tion problems whend(;) is not a function of the statistical prop- erties ofX andY . They can either be solved using (a) a dual mini- mization procedure analogous to the Arimoto-Blahut algorithm [3] by starting at a fixed marginal probabilityPY (y), solving a convex minimization at each step (with an added linear constraint com- pared to the original algorithm), and updating the marginal distribu- tion, or (b) the use of interior point methods [11]. In this work, we decide on the use of interior point methods due to their generality for computing the optimal solutions to convex optimization prob- lems. We however note here that the Arimoto-Blahut algorithm is an iterative procedure that converges to the global optimum of a convex program. The crux behind computing a solution toOPT3 lies in an effi- cient way to representm(X;Y ) that enables the design of a com- putationally feasible algorithm to solve OPT3 . We achieve this via the use of orthonormal bases [12] in Hilbert Spaces. We have the following definition [13] of a Hilbert space: Definition 3. For a given real-valued random variable X, we define a Hilbert spaceH as: H =f(X)jE[(X)] = 0;E[((X)) 2 ]<1;g where(X) is measurable. In our work, we first represent discrete random variables X and Y as orthornormal bases,f 1;ig jXj i=1 ,f 2;jg jYj j=1 of Hilbert spaces H1 andH2 respectively. This is possible since every Hilbert space has an orthonormal basis (Theorem 2.4 [14]). m(X;Y ) is then expressed via the following optimization problem: min X i;j a1;ia2;jE[ 1;i(X) 2;i(Y )] X 8j a 2 i;j = 1; i = 1; 2; X 8j a1;jE[ 1;j (X)] = 0; X 8j a2;jE[ 2;j (Y )] = 0; where1(X) = P jXj i=1 a1;i 1;i(x);2(Y ) = P jYj i=1 a2;i 2;i(y), for two sequences of coefficientsfa1;ig jXj i=1 ;fa2;ig jYj i=1 . We now choose the following orthonormal bases forH1 andH2: 1;i(x) = 1fx =ig 1 p PX (i) ; 2;j (y) = 1fy =jg 1 p PY (j) ; via which we get E[ 1;i(X); 2;i(Y )] = PX 1 ;X 2 (i;j) p PX (i)PY (j) : Thus, we can expressm(X;Y ) as: max a 1 ;a 2 a T 1 Q a2 subject to jjaijj2 = 1; i = 1; 2; ai? p p i ; i = 1; 2; where a1 = (a1;1;a1;2;:::::;a 1;jXj ) T ; a2 = (a2;1;a2;2;:::::;a 2;jYj ) T ; p p 1 = ( p pX (1); p pX (2);:::::; p pX (jXj)) T , p p 2 = ( p pY (1); p pY (2);:::::; p pY (jYj)) T ; and Q(i;j) = PX 1 ;X 2 (i;j) p PX (i)PY (j) : This formulation ofm(X;Y ) as a matrix spectral decomposition problem [15] enables the use of the power method [16] to compute the optimal solution. This method converges to a unique optimal solution. In practice, we can only have access to a sample value ofm(X;Y ) derived from i.i.d. samples drawn according to a joint probability distributionPX;Y . Thus, it is essential that we provide guarantees on the deviation of the sample maximal correlation, p m (X;Y ), from the true maximal correlationm(X;Y ), wherep is the num- ber of drawn samples. We have the following theorem characteriz- ing the limits of this deviation (see Appendix for the proof). THEOREM 5. For any distributionP , and any > 0,P[j p m (X;Y ) m(X;Y )j>]! 0, exponentially fast. More precisely, if p 3 (P ) 2 p R log( 24 ); then P[j p m (X;Y )m(X;Y )j>]>]; or P[j p m (X;Y )m(X;Y )j>]>] 1 24 e p (P) 2 3 p D ; whereR = maxfjXj;jYjg, and(P ) = minf1;2g> 0, where 1 = arg min x PX (x); 2 = arg min y PY (y): The implication of this theorem is that the sample maximal corre- lation converges exponentially fast in the number of samples to the true maximal correlation. 
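For intuition, the spectral characterization used above takes only a few lines to evaluate when the joint P_{X,Y} is small: the maximal correlation is the second largest singular value of the matrix Q with Q(x,y) = P(x,y)/√(P(x)P(y)), the largest singular value being always 1. The sketch below, with a hypothetical 3×3 joint, uses a direct SVD for brevity; the power method referenced in the text plays the same role when the alphabets are large.

import numpy as np

def maximal_correlation(P_XY):
    # Maximal correlation of a discrete joint P[x, y]: the second largest singular
    # value of Q with Q[x, y] = P(x, y) / sqrt(P(x) * P(y)); the largest is always 1.
    P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
    Q = P_XY / np.sqrt(np.outer(P_X, P_Y))
    singular_values = np.linalg.svd(Q, compute_uv=False)   # sorted in decreasing order
    return singular_values[1]

# Hypothetical joint distribution of the true public data X and the released data Y.
P_XY = np.array([[0.30, 0.05, 0.05],
                 [0.05, 0.25, 0.05],
                 [0.02, 0.03, 0.20]])
print(maximal_correlation(P_XY))   # rho_m(X;Y); its square enters OPT3 and OPT4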
OPT_{1,2,4} are non-convex optimization problems when d(·,·) is the self-information distortion function. We design a greedy algorithm (see Algorithm 1) based on the agglomerative information method [17]. Based on this method, we first merge y_i ∈ Y and y_j ∈ Y to form a merged element y_{ij}. We then iteratively update P_{Y|X} as P(y_{ij}|x) = p(y_i|x) + p(y_j|x) for all x. It is evident that the minimum of I(S;Y) is a decreasing function of I(X;Y) and is achieved for a mapping P_{Y|X} that satisfies I(X;Y) = T = H(X) − D.

Algorithm 1: Greedy algorithm to solve OPT_{1,2,4} (non-convex case)
Input: T
Output: P_{Y|X}
1  Initialization: Y = X, P_{Y|X}(y|x) = 1{y = x}
2  while there exist i', j' such that I(X; Y_{i'j'}) ≥ T do
3      {y_i, y_j} = arg max over {y_{i'}, y_{j'}} ∈ Y of [ I(S;Y) − I(S; Y_{i'j'}) ]; merge {y_i, y_j} → y_{ij}
4      update Y = (Y \ {y_i, y_j}) ∪ {y_{ij}} and P_{Y|X}
5  return P_{Y|X}

For a given mutual information T, there are many conditional probability distributions P_{Y|X} achieving I(X;Y) = T, out of which there exists one that gives the minimum I(S;Y) and one that gives the maximum I(S;Y). Algorithm 1 converges to a local minimum of I(S;Y) for a given I(X;Y) = T, and allows us to approximately characterize the range of values I(S;Y) can take for a given value of I(X;Y) as those between the local minimum and the local maximum.

3.3 Bounds on Privacy Leakage and Utility
In this section, we characterize tight privacy leakage bounds under a given utility constraint. More specifically, our goal here is (i) to investigate whether perfect privacy is achievable with non-negligible utility, and (ii) to characterize bounds on the largest amount of useful information disclosed under perfect privacy. We first state a few definitions and results that will be useful in deriving the bounds.

Definition 4. For 0 ≤ t ≤ H(X) and a joint distribution P_{S,X}, we define the privacy function (PF) as

PF(t; P_{S,X}) = inf { I(S;Y) | I(X;Y) ≥ t, S → X → Y },    (10)

where the infimum is over all mappings P_{Y|X}. The privacy function lower bounds the privacy leakage for a given utility of at least t. We now have the following lemma related to PF(t; P_{S,X}), whose proof is in the Appendix.

LEMMA 2. PF(t; P_{S,X}) = min_{P_{Y|X}} { I(S;Y) | I(X;Y) ≥ t, S → X → Y }.    (11)

The lemma states the existence of a P_{Y|X} achieving the infimum in Equation 10.

Definition 5. The optimal privacy-utility coefficient (PUC) for a joint distribution P_{S,X} is given by

PUC_opt(P_{S,X}) = inf_{P_{Y|X}} I(S;Y) / I(X;Y) = lim_{t→0} PF(t; P_{S,X}) / t.    (12)

Definition 5 draws the relationship between PUC_opt(P_{S,X}) and PF(t; P_{S,X}).

LEMMA 3. (due to [7]) Let Q_S denote the distribution of S. Then

PUC_opt(P_{S,X}) = inf_{Q_X ≠ P_X} D(Q_S || P_S) / D(Q_X || P_X).    (13)

The lemma gives the divergence characterization of PUC_opt. If PUC_opt = 0, it may be possible to disclose some information about X without revealing any information about S, i.e., there exists a t strictly bounded away from 0 such that PF(t; P_{X,S}) = 0. This would represent the ideal privacy setting, since from Lemma 2 there would exist a privacy-assuring mapping that allows the disclosure of a non-negligible amount of useful information with I(S;Y) = 0. This, in turn, would mean that perfect privacy is achievable with non-negligible utility regardless of the specific privacy metric used, since S and Y would be independent. In this regard, we have the following result, the proof of which is in the Appendix.

THEOREM 6. For any P_{S,X} with finite support, we have:

1. PUC_opt(P_{S,X}) ≤ ν(P_{S,X}),    (14)

where ν(P_{S,X}) = σ²_{m−1}(Q) is the square of the (m−1)-th singular value of Q, m = min{|S|, |X|}, and Q = R_S^{−1/2} [P_{S,X}] R_X^{−1/2}.

2.
PUC opt (PS;X ) = 0 () (PS;X ) = 0; (15) givenH(X)> 0. 3.9 a mappingP YjX such thatS! X! Y ,I(X;Y ) > 0, andI(S;Y ) = 0 if and only if(PS;X ) = 0. The result show that if the optimal privacy-utility coefficient is 0, then there exists a privacy-assuring mapping that allows the dis- closure of non-trivial amount of useful information while guaran- teeing perfect privacy. We now present an explicit lower bound for the largest amount of useful information that can be disclosed while guaranteeing perfect privacy. We have the following theorem, the proof of which is in the Appendix. THEOREM 7. For a fixedPS;X , let F =ff :X!RjE[f(X)] = 0;jjf(X)jj 2 = 1;jjE[f(X)jS]jj 2 = 0g[f 0 ; wheref0 is the trivial function that mapsX tof0g. Then,PUC opt (P S;X ) is 0 fort2 [0;t ], where t 1max f2F E h b 1 2 + f(X) 2jjfjj1 ; (16) whereh b () is binary entropy function [3]. Furthermore, the lower bound for t is sharp when (P S;X ) = 0, i.e., there exists a P S;X such that t > 0, andPUC opt (P S;X ) = 0 if and only ift2 [0;t ]. The theorem derives the lower bound of the amount of useful infor- mation disclosed with perfect privacy. 4. DIMENSIONALITY REDUCTION In this section, we study the dimensionality reduction component of HIDE. This component is based on the subspace search concept in unsupervised learning theory. To this end, we first investiagte the notion of jointly correlated dimensions of a subspace, and then propose our heuristic for subspace selection. 4.1 Jointly Correlated Dimensions Given a database of sizeN, and dimensionalityD, we first aim to measure the diversity of any lower dimensional subspaceS with dimensionality 1dD, where the diversity measure quantifies the difference of S w.r.t. the baseline of d independent and ran- domly distributed dimensions. This measurement is important to characterize the existence and magnitude of jointly correlated di- mensions of a subspace. To this end, we define a diversity measure of subspaces as a function,DM, of the following form: DM :P(F )!R: where F = fX1;:::::;XDg is the full space of all dimensions, fXig’s are real-valued random variables with density distributions fpX i (xi)g,P(F ) is the power set ofF , andS2P(F ). We de- sire the following properties forDM: (i) for subspacesS1 andS2 such that dim(S1) = dim(S1), ifS1 is more correlated thanS2 then DM(S1)>DM(S2), (ii)DM(S) = 0 if and only if the dimen- sions ofS are mutually independent, i.e., for ad dimensional sub- spaceS,p(xi;::::x d ) =p(x1):::::p(x d ), and (iii)DM(S) is small if the dimensions ofS arem-independent but not mutually inde- pendent, where for a givenmd,X1;:::::;X d arem-independent if and only if any subsetfXi 1 ;:::::;Xim gfX1;:::::;X d g is mu- tually independent. Moreover,DM should be directly applicable to continuous data. Should HIDE satisfy all these properties, it will be unique in the sense that existing standard subspace search methods ENCLUS [18], PODM [19], and HiCS [20] fail to sat- isfy all these properties. More specifically, ENCLUS only satisfies properties (i) to (iii) for discretized data with a proper grid res- olution, which is data-dependent. PODM fails to satisfy (ii) and (iii), whereas HiCS fails to satisfy (ii) and does not address prop- erty (iii). In this paper, we propose Aggregate Mutual Information (AMI) to be a type ofDM measure that jointly satisfies properties (i) to (iii). 
Given continuous random variablesX1;::::::;X d , their AMI, denoted byAMI(X1;::::;X d ) is defined as: AMI(X1;::::;X d ) = d X i=2 p(xi)p(xijx1;:::::;xi1): (17) Intuitively, AMI(X1;::::;X d ) measures the mutual information of X1;::::;X d by aggregating the difference between p(xi) and p(xijx1;:::::;xi1) for 2id. However, in practice, the prob- ability distributions are not available at hand, and can only be es- timated via data discretization resulting in loss of information. In order to alleviate this problem, we instantiate AMI by means of cumulative entropy [21] and conditional cumulative entropy. The cumulative entropy (CE) for a continuous random variableX, de- noted ashCE (X), is defined as: hCE (X) = Z dom(X) P (Xx) logP (Xx)dx: (18) The conditional CE of any continuous random variableX given a random vectorV 2R B is given as: hCE (XjV =v) = Z dom(X) P (Xxjv) logP (Xxjv)dx: (19) Using CE, we have p(xi)p(xijx1;:::::;xi1) =hCE (xi)hCE (XijX1;:::::;Xi1); and we have AMI(X1;::::;X d ) = d X i=2 hCE (xi) d X i=2 hCE (XijX1;:::::;Xi1); (20) wherehCE (XijX1;:::::;Xi1) ishCE (XijV );V = (X1;::;Xi1) being a random vector. If X1;:::::;X d are m-independent, then AMI(X1;:::::;X d ) is low as hCE (Xi)hCE (Xij:::) vanishes forim, and AMI = 0 if and only ifX1;:::;X d are mutually in- dependent (from (20)). In addition, similar to mutual information, the more correlatedX1;:::;X d , the smaller the conditional CE val- ues, and subsequently larger the AMI values. Thus, AMI satisfies properties (i) to (iii). 4.2 Heuristic for Subspace Selection Our heuristic mechanism for subspace selection involves three main steps: (i) designing a permutation-independent AMI measure, (ii) scalable subspace exploration, and (iii) efficient diversity com- putation. Permutation-Independent AMI - Since AMI is permutation de- pendent, for a given subspace S = fX1;::;X d g, a brute force search overd! permutations to compute a maximal diversity is im- practical. In contrast we propose the following greedy heuristic method to approximate the optimal diversity: we first pick a pair of dimensions Xa and X b where 1 a 6= b d such that hCE (X b )hCE (X b jXa) is maximal among the possible pairs. We then continue to select the next dimensionXc(c6= a; c6= b) such thathCE (X b )hCE (X b jXa;X b ) is maximal among the re- maining dimensions. Likewise, at each step, assuming a setI ofk dimensions have already been picked anddk dimensions remain to be picked, we select the dimension Xr such that hCE (Xr) hCE (X b jI) is maximal. The process goes on until no dimension is left to be picked. Subspace Exploration - In order to mine high diversity subspaces, we rely on the intuition that a high diversity high dimensional sub- space likely has its high diversity reflected in it its lower dimen- sional projections, analogous to an idea from subspace clustering where subspace clusters tend to have their data points clustered in all of their lower dimensional projections [22] [23]. One can them apply a levelwise scheme to mine subspaces of diversity larger than a pre-specified value. More specifically, starting with two- dimensional subspaces, in each step we use top M subspaces of high diversity to generate new candidates in a levelwise manner. A newly generated candidate is only considered if all its child sub- spaces have high diversity. 
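The cumulative-entropy quantities above can be estimated directly from samples. The sketch below uses a simple empirical estimator of hCE built from the sorted sample and the empirical CDF, together with a deliberately crude conditional estimate that bins on a single conditioning variable (a stand-in for conditioning on a full vector, not the clustering-based estimator described next). It only illustrates that the per-term AMI contribution hCE(X) − hCE(X|V) is large for correlated pairs and near zero for independent ones.

import numpy as np

def cumulative_entropy(x):
    # Empirical estimate of hCE(X) = -integral P(X <= t) log P(X <= t) dt,
    # using the empirical CDF, which is constant between consecutive sorted samples.
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    F = np.arange(1, n) / n          # empirical CDF value between consecutive order statistics
    gaps = np.diff(x)                # lengths of those intervals
    return -(gaps * F * np.log(F)).sum()

def conditional_ce(x, v, bins=8):
    # Crude hCE(X | V): average hCE of X within equal-frequency bins of a single
    # conditioning variable V.
    order = np.argsort(v)
    n = len(x)
    return sum(len(idx) / n * cumulative_entropy(x[idx])
               for idx in np.array_split(order, bins))

rng = np.random.default_rng(0)
v = rng.normal(size=2000)
x_corr = v + 0.1 * rng.normal(size=2000)   # strongly correlated with v
x_ind = rng.normal(size=2000)              # independent of v

# hCE(X) - hCE(X|V): large for the correlated pair, close to zero for the independent one.
for x in (x_corr, x_ind):
    print(cumulative_entropy(x) - conditional_ce(x, v))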
Efficient Diversity Computation - Since the number of data tuples N is limited, computing diversity using the exact formula of the conditional CE might lead to inaccurate results, since the expected number of tuples contained in the hypercube [x_2 − δ, x_2 + δ] × ... × [x_d − δ, x_d + δ], which equals N(2δ)^{d−1}, approaches 0 as δ → 0+. This follows from the following result:

hCE(X_1 | x_2, ..., x_d) = L,    (21)

where L = lim_{δ→0+} hCE(X_1 | x_2 − δ ≤ X_2 ≤ x_2 + δ, ..., x_d − δ ≤ X_d ≤ x_d + δ). To alleviate this problem, we propose data summarization by clustering, which ensures that we have enough points for a meaningful diversity computation. Since the number of clusters is generally much smaller than the original data size, we may have more data points per cluster. Given a clustering algorithm A that, applied to the data set projected onto {X_1, ..., X_d}, yields Q clusters {C_1, ..., C_Q}, we estimate hCE(X_1 | X_2, ..., X_d) by

hCE(X_1 | X_2, ..., X_d) = Σ_{i=1}^{Q} (|C_i| / N) hCE(X_1 | C_i).    (22)

If Q is kept small, we will have enough points for a meaningful computation of hCE(X_1 | C_i) regardless of the dimensionality d.

5. EXPERIMENTAL EVALUATION
In this section we provide a detailed experimental evaluation of the HIDE mechanism. We experiment on both real-world and synthetic data sets to answer the following questions:
- How does HIDE's subspace search method compare (with increasing data dimensionality) to existing subspace search methods in terms of quality, outlier detection, and running time?
- How does HIDE perform, i.e., minimize privacy leakage, under given distortion constraints between X and Y?
- How does HIDE perform under the presence of adversarial collusion, in the information-theoretic privacy setting?
- How does the utility-privacy tradeoff performance of HIDE compare to that of differential privacy?
We first present our experimental setting and then follow up with explanations of the obtained answers to these questions.

Figure 1: Subspace quality of HIDE for (a) SD4 (left), (b) SD5 (left middle), (c) RWD1 (right middle), and (d) RWD2 (right) datasets, measured via clustering quality (F1, ACC, E4SC) against HiCS, ENCLUS, PODM, DBSCAN, and FB.

5.1 Experimental Setting
We choose real world databases from the UCI Machine Learning Repository for our experiments. Specifically, we consider the US 1990 Census dataset (RWD1) with 68 attributes and the Diabetes dataset (RWD2) with 20 attributes. We generate synthetic databases according to the method in [24] to experiment on high dimensional (dimensionality more than 100) and very high dimensional (dimensionality greater than 1000) data, and to artificially create data outliers for low to high-dimensional datasets. We consider five synthetic datasets (representing low to high dimensional datasets) of 6000 instances having 20, 40, 60, 100, and 120 dimensions, respectively; we denote them SD1, SD2, SD3, SD4, and SD5. We also consider two synthetic datasets (representing very high dimensional datasets) of 6000 instances having 1000 and 2000 dimensions, respectively; we denote them SD6 and SD7. For each data set we fix five private data attributes, and execute HIDE's subspace search method on the remaining public attributes.
For the synthetic datasets, subspace clusters are embedded in randomly selected d-dimensional subspaces, where we choose d 2 [2; 10]. We choose this interval range to avoid excessive run time due to the exponential number of subspaces for datasets of a given dimensionality. For each synthetic dataset, we create 100 outliers deviating from these clusters. We use the LOF [25], ENCLUS [18], PODM [19], HiCS [20], and FB [26] methods to compare the outlier detection quality of HIDE, where outlier de- tection results are assessed using the well known Area Under the ROC Curve (AUC) metric [26] [24]. We use the DBSCAN [27], ENCLUS, PODM, HiCS, and FB methods to compare the cluster- ing quality of HIDE, where cluster quality results are assessed us- ing three well known metrics: F1, Accuracy, and E4SC [28] [23]. As a distortion function betweenX andY , we consider the Ham- ming distance and l2 distance metrics. For the purpose of com- paring the privacy-preserving strength of HIDE with those of - differential privacy (DP) and (;) -DP, we consider a count query on a given attribute of both real world and synthetic datasets, and choose the well-known Laplacian and the Gaussian mechanisms as the-DP and (;) -DP [29] noise generation mechanisms re- spectively. Here,2 [0; 1] for the DP setting equals the value adopted in [30], and is equal to the 2 [0; 1] value used in the information-theoretic privacy models introduced in this paper. We choose2 [0:01; 0:1] for the purpose of simulation. We chooseQ = 10 for all our experiments, and use the clustering strategy intro- duced in [31] for relational databases. Figure 2: Outlier Detection Quality of HIDE for (a) SD1 to SD5 datasets (left), and (b) RWD1 and RWD2 (right) datasets 5.2 Dimensionality Reduction via HIDE We evaluate the dimensionality reduction capability of HIDE with respect to (a) quality, (b) outlier detection, and (c) running time parameters, and compare their values to those obtained from existing methods in literature. We observe from Figure 1 that HIDE’s quality of subspace output as measured via the clustering quality is the best amongst existing methods, for both synthetic and real world datasets, and for each of F1, Accuracy, and E4SC metrics. In the interest of space, we choose to represent results for SD4 and SD5 datasets only, in the plots of Figure 1. Results for SD1, SD2, and SD3 datasets are quite similar. An interesting thing to note is the high E4SC values of HIDE that indicate that the latter performs well in selecting subspaces that contain clusters and outliers. The quality of outlier detection is measured by the Area Un- der the ROC Curve (AUC) metric. We observe from Figure 2 that HIDE outperforms peer methods for both synthetic and real world datasets. In addition, to illustrate the robustness of HIDE’s out- lier detection ability with respect to increasing dimensionality of subspaces, we record in Figure 3, maxA d minA d maxA d values for syn- thetic (SD5) and real world (RWD2) datasets, whereA d is the set of diversity scores of alld2f2; 4; 6; 8; 10g-dimensional spaces. Note that maxA d minA d maxA d lies between 0 and 1. The plotted re- sults show that HiCS, ENCLUS, and PODM do not scale well with higher dimensionality. In contrast, HIDE is more robust to dimen- sionality and yields discriminative diversity scores even for high dimensional spaces. 
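For reference, the ε-DP baseline used in the comparison reported in Section 5.3 (Figure 8) is the standard Laplace mechanism applied to a count query. A minimal, generic sketch follows; the records, the predicate, and the ε value are placeholders, and the actual query attributes and parameter ranges are those described in Section 5.1.

import numpy as np

rng = np.random.default_rng(0)

def dp_count(records, predicate, eps):
    # epsilon-DP count query via the Laplace mechanism: a count has sensitivity 1,
    # so adding Laplace noise with scale 1/eps suffices.
    true_count = sum(predicate(r) for r in records)
    return true_count + rng.laplace(scale=1.0 / eps)

# Placeholder records and predicate.
records = rng.integers(0, 100, size=6000)
print(dp_count(records, lambda r: r > 50, eps=0.5))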
We plot the total running times of HIDE (dimensionality reduction with privacy preservation) in Figure 4, for both synthetic and real world datasets, and compare them with the running times when HIDE does not use dimensionality reduction. For high dimensional datasets, we observe a significant (nearly 300%) increase in the running times when HIDE does not perform dimensionality reduction. For very high dimensional datasets (SD6 and SD7), we observe from Figure 5 a large (nearly 800%) increase in the running times when HIDE does not perform dimensionality reduction. The trend is the same when the dimensionality reduction component of HIDE is replaced with other existing dimensionality reduction algorithms in the literature, except that the increases in running times are smaller, which justifies the relatively higher efficiency of HIDE's dimensionality reduction component.

Figure 3: Robustness of HIDE with increasing dimensionality of subspaces for (a) the SD5 dataset (left), and (b) the RWD1 dataset (right)

5.3 Privacy Preserving Performance
The privacy preserving performance of HIDE under various expected distortion levels as constraints is plotted in Figure 6. Note that for the Hamming distance as a distortion metric, the expected distortion between X and Y lies between 0 and 1. In the interest of space, we only plot results for the Hamming distortion metric; the results for the l2 distortion metric are quite similar. We observe (in accordance with our intuition) that the optimal value of the privacy leakage, i.e., the mutual information between the private attributes S and the perturbed public attributes Y, decreases at a rapid rate with increasing expected distortion values. In Figure 7, we plot the performance of HIDE under collusion, for a given representative mean distortion value of 0.6. We observe (as expected) that the value of the mutual information I(S;Y) increases with the number of colluding agents; however, the rate of increase in I(S;Y) starts decreasing once the number of agents exceeds a certain threshold and begins to converge, implying that collusion becomes ineffective beyond the threshold. In Figure 8, we compare the performance of HIDE with differential privacy on a count query for the real world data set RWD1. For the ε-DP case, we observe that HIDE provides increasingly better privacy guarantees (in terms of average leakage in bits) compared to DP for increasing ε values, under the Laplacian noise mechanism. For the case of (ε, δ)-DP, for a given δ value of 0.05 (a representative value in the interval [0.01, 0.1]) for the Gaussian noise mechanism, we observe HIDE to outperform DP up to a certain ε ∈ [0, 1] value, beyond which DP outperforms HIDE. This holds for all values of δ we experiment with. Thus, HIDE is clearly a better substitute for ε-DP under the Laplacian noise mechanism for providing tight privacy guarantees. On the other hand, whether HIDE outperforms a Gaussian noise driven (ε, δ)-DP mechanism in providing tight privacy guarantees depends on the value of ε.

Figure 5: Running time ratio of HIDE with and without dimensionality reduction (full/reduced) for the very high-dimensional data sets SD6 (1000 dimensions) and SD7 (2000 dimensions), across subspace search techniques (HIDE, HiCS, ENCLUS, PODM, FB)

Figure 8: HIDE comparison with DP on RWD1 via (a) ε-DP [Laplacian] (left), and (b) (ε, δ)-DP [Gaussian] (right)

6. RELATED WORK
In this section, we briefly review existing research on preserving privacy in datasets, as applicable to our work.
We segment this section into two broad categories: (i) preserving privacy in datasets, and (ii) dimensionality reduction in high-dimensional datasets. 6.1 Preserving Privacy in Datasets Earliest attempts at systematic privacy preservation in datasets resulted first in adhoc techniques like subsampling, aggregation, and suppression (e.g., [32], [33] and the references therein.) Then came the methods ofk-anonymity [1] (only protects identity dis- closures), t-closeness [34], and l-diversity [35]. The former two preserve both identity disclosures as well as attribute-based disclo- sures; however none of the last three methods are universal, i.e., they are only robust against limited adversaries. The first uni- versal formalism for privacy was proposed in differential privacy (DP) [2]; since then researchers have focussed on using DP for inference algorithms, transporting, and querying data [36]. More recent works [37], [38] focused on the relation of differential pri- vacy with statistical inference. Other frameworks similar to dif- ferential privacy exist such as the Pufferfish framework [39], that however does not focus on utility preservation. Many approaches rely on information-theoretic techniques to model and analyze the privacy-accuracy tradeoff, such as [40] [41] [42] [43] [44] [45]. In particular, the works in [41] [42] [43] [44] focus mainly on col- lective privacy for all or subsets of the entries of a database, and provide fundamental and asymptotic results on the rate-distortion- equivocation region as the number of data samples grows arbitrar- ily large. In contrast, the framework studied in [46], models non- asymptotic privacy guarantees in terms of the inference cost gain that an adversary achieves by observing the released output. In this work, we follow the framework in [46], and use the log-loss cost to model the inference threat as the mutual information be- tween private data and released data, as in [40]. Composition of privacy guarantees under differential privacy has been studied in theory by [47] [48] . In contrast, we empirically study composition of privacy under information-theoretic privacy. Drawbacks - The main drawback of the above mentioned information- theoretic works is the guaranteeing of privacy under the often im- practical assumption that the prior joint distribution between users’ private and public data is known to the privacy preserving entity. In this work we proposed a general solution methodology to solve in- formation theory driven privacy-utility tradeoff optimization prob- lems when the prior joint statistical distribution between users’ pri- vate and public data is unknown in the worst case to the optimizer. 6.2 Dimensionality Reduction Dimensionality reduction techniques (non-randomized and ran- domized) [49] [50], including PCA, are not aware of locally clus- tered projections; they only measure the (non-)linear dependence between dimensions, meaning that they consider one (global) pro- jection, and may hence miss interesting local projections contain- ing subspace clusters and outliers. Our method provides multiple projections for clustering and outlier mining. With respect to de- pendencies between dimensions, well known correlation measures (e.g., Spearman, Kendall), and modern variants [51] are only aimed at pairwise correlations. However, mutual dependence among sev- eral dimensions might be missed in the process. In contrast, our method is not limited to pairwise correlations. 
In regard to feature selection for dimensionality reduction, current methods [52], [53] are specifically bound to clustering, where most approaches select a single projection of the data space that might miss local projections and outliers, whereas HIDE mines multiple, possibly overlapping, subspaces. In order to overcome the drawbacks of single subspace selection, several recent methods focus on multiple subspace projections with arbitrary dimensionality. However, these methods rely on discretization of continuous dimensions [18], [19], or only work with binary [54] and/or discrete data [55]. ENCLUS [18] and PODM [19] detect subspaces with low entropy and high interest, discretizing continuous dimensions into equi-width bins in order to compute the entropy measure. By requiring discretization, these methods have unintuitive parameters, and are hence inherently susceptible to knowledge loss and to the curse of dimensionality. To some extent, these limitations have been tackled by HiCS [20], which works directly on continuous data. It quantifies the difference between the marginal and conditional distribution in a random dimension of the considered subspace; by its random nature it may hence miss relevant subspaces. On the other hand, HIDE can reliably score contrast, regardless of subspace dimensionality. Furthermore, for each subspace we find the permutation of dimensions that yields optimal contrast. With regard to outlier detection, existing methods [24], [56], [57] based on subspace clustering choose relevant dimensions specific to each individual outlier. In contrast, HIDE detects outliers using a general method that does not target specific outliers.

Figure 4: Total running time of HIDE for (a) SD1 to SD5 datasets (left), and (b) RWD1 and RWD2 datasets (right)

Figure 6: Preserving privacy with HIDE via (a) OPT1 (left), (b) OPT3 (middle), and (c) OPT4 (right)

7. CONCLUSIONS

In this paper we proposed HIDE, a computationally efficient information theory based privacy preserving mechanism for high dimensional datasets that leads to optimal utility-privacy tradeoffs when the privacy preserving entity, in the worst case, might have no prior information that links a user's private data with the public data he releases. As the first component of HIDE, we proposed a general solution methodology to solve information theory driven privacy-utility tradeoff optimization problems when the prior joint statistical distribution between users' private and public data is, in the worst case, unknown to the optimizer. We then described in detail the second component of HIDE, a novel computationally efficient mechanism to choose the most relevant dataset attributes from a large set of private and public user data attributes. This mechanism merges information-theoretic concepts and dimensionality reduction techniques from unsupervised machine learning to arrive at a significantly smaller set of statistically relevant data attributes. We validated the theory behind HIDE with extensive experiments conducted on practical datasets. As part of our main experimental results, we showed that (i) the performance, i.e., the privacy-utility tradeoffs, obtained via HIDE is better than that obtained from differential privacy (DP), for both single and compositional queries, and (ii) HIDE's computationally efficient mechanism speeds up the total running time of HIDE by significant margins (up to eight times) compared to a mechanism that accounts for all attributes of a dataset.

8. REFERENCES
[1] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557-570, 2002.
[2] C. Dwork, "Differential privacy," in Automata, Languages and Programming, pp. 1-12, Springer, 2006.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[4] N. Merhav and M. Feder, "Universal prediction," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2124-2147, 1998.
[5] J. W. Miller, R. Goodman, and P. Smyth, "On loss functions which minimize to conditional expected values and posterior probabilities," IEEE Transactions on Information Theory, vol. 39, no. 4, pp. 1404-1408, 1993.
[6] R. Ahlswede and P. Gács, "Spreading of sets in product spaces and hypercontraction of the Markov operator," The Annals of Probability, pp. 925-939, 1976.
[7] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, "On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover," arXiv preprint arXiv:1304.6133, 2013.
[8] H. O. Hirschfeld, "A connection between correlation and contingency," in Mathematical Proceedings of the Cambridge Philosophical Society, vol. 31, pp. 520-524, Cambridge University Press, 1935.
[9] H. S. Witsenhausen, "On sequences of pairs of dependent random variables," SIAM Journal on Applied Mathematics, vol. 28, no. 1, pp. 100-113, 1975.
[10] K. I. Diamantaras and S. Y. Kung, Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, Inc., 1996.
[11] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[12] G. Strang, Introduction to Linear Algebra, 2011.
[13] L. Breiman and J. H. Friedman, "Estimating optimal transformations for multiple regression and correlation," Journal of the American Statistical Association, vol. 80, no. 391, pp. 580-598, 1985.
[14] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2009.
[15] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," 1999.
[17] N. Slonim and N. Tishby, "Document clustering using word clusters via the information bottleneck method," in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 208-215, ACM, 2000.
[18] C.-H. Cheng, A. W. Fu, and Y. Zhang, "Entropy-based subspace clustering for mining numerical data," in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84-93, ACM, 1999.
[19] M. Ye, X. Li, and M. E. Orlowska, "Projected outlier detection in high-dimensional mixed-attributes data set," Expert Systems with Applications, vol. 36, no. 3, pp. 7104-7113, 2009.
[20] F. Keller, E. Müller, and K. Böhm, "HiCS: High contrast subspaces for density-based outlier ranking," in 2012 IEEE 28th International Conference on Data Engineering, pp. 1037-1048, IEEE, 2012.
[21] A. Di Crescenzo and M. Longobardi, "On cumulative entropies," Journal of Statistical Planning and Inference, vol. 139, no. 12, pp. 4072-4087, 2009.
[22] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, vol. 27. ACM, 1998.
[23] E. Müller, S. Günnemann, I. Assent, and T. Seidl, "Evaluating clustering in subspace projections of high dimensional data," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1270-1281, 2009.
[24] E. Müller, M. Schiffer, and T. Seidl, "Statistical selection of relevant subspace projections for outlier ranking," in 2011 IEEE 27th International Conference on Data Engineering, pp. 434-445, IEEE, 2011.
[25] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: Identifying density-based local outliers," in ACM SIGMOD Record, vol. 29, pp. 93-104, ACM, 2000.
[26] A. Lazarevic and V. Kumar, "Feature bagging for outlier detection," in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 157-166, ACM, 2005.
[27] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, pp. 226-231, 1996.
[28] S. Günnemann, I. Färber, E. Müller, I. Assent, and T. Seidl, "External evaluation measures for subspace clustering," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 1363-1372, ACM, 2011.
[29] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, "Our data, ourselves: Privacy via distributed noise generation," in Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486-503, Springer, 2006.
[30] A. Ghosh, T. Roughgarden, and M. Sundararajan, "Universally utility-maximizing privacy mechanisms," SIAM Journal on Computing, vol. 41, no. 6, pp. 1673-1693, 2012.
[31] C. Ordonez and E. Omiecinski, "Efficient disk-based k-means clustering for relational databases," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 8, pp. 909-921, 2004.
[32] T. E. Raghunathan, J. P. Reiter, and D. B. Rubin, "Multiple imputation for statistical disclosure limitation," Journal of Official Statistics, vol. 19, no. 1, p. 1, 2003.
[33] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, M. West, et al., "Assessing the risk of disclosure of confidential categorical data," in Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, p. 125, Oxford University Press, USA, 2003.
[34] N. Li, T. Li, and S. Venkatasubramanian, "t-closeness: Privacy beyond k-anonymity and l-diversity," in 2007 IEEE 23rd International Conference on Data Engineering, pp. 106-115, IEEE, 2007.
[35] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 3, 2007.
[36] C. Dwork, "A firm foundation for private data analysis," Communications of the ACM, vol. 54, no. 1, pp. 86-95, 2011.
[37] L. Wasserman and S. Zhou, "A statistical framework for differential privacy," Journal of the American Statistical Association, vol. 105, no. 489, pp. 375-389, 2010.
[38] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy and statistical minimax rates," in Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pp. 429-438, IEEE, 2013.
[39] D. Kifer and A. Machanavajjhala, "A rigorous and customizable framework for privacy," in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 77-88, ACM, 2012.
[40] D. Rebollo-Monedero, J. Forne, and J. Domingo-Ferrer, "From t-closeness-like privacy to postrandomization via information theory," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 11, pp. 1623-1636, 2010.
[41] I. S. Reed, "Information theory and privacy in data banks," in Proceedings of the June 4-8, 1973, National Computer Conference and Exposition, pp. 581-587, ACM, 1973.
[42] H. Yamamoto, "A source coding problem for sources with additional outputs to keep secret from the receiver or wiretappers (corresp.)," IEEE Transactions on Information Theory, vol. 29, no. 6, pp. 918-923, 1983.
[43] L. Sankar, S. R. Rajagopalan, and H. V. Poor, "Utility-privacy tradeoffs in databases: An information-theoretic approach," IEEE Transactions on Information Forensics and Security, vol. 8, no. 6, pp. 838-852, 2013.
[44] R. Tandon, L. Sankar, and H. V. Poor, "Discriminatory lossy source coding: Side information privacy," IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5665-5677, 2013.
[45] A. Evfimievski, J. Gehrke, and R. Srikant, "Limiting privacy breaches in privacy preserving data mining," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 211-222, ACM, 2003.
[46] F. du Pin Calmon and N. Fawaz, "Privacy against statistical inference," in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 1401-1408, IEEE, 2012.
[47] C. Dwork, G. N. Rothblum, and S. Vadhan, "Boosting and differential privacy," in Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pp. 51-60, IEEE, 2010.
[48] A. Blum, K. Ligett, and A. Roth, "A learning theory approach to noninteractive database privacy," Journal of the ACM (JACM), vol. 60, no. 2, p. 12, 2013.
[49] J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. Springer Science & Business Media, 2007.
[50] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, "Randomized dimensionality reduction for k-means clustering," IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045-1062, 2015.
[51] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, "Detecting novel associations in large data sets," Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
[52] J. G. Dy and C. E. Brodley, "Feature selection for unsupervised learning," Journal of Machine Learning Research, vol. 5, no. Aug, pp. 845-889, 2004.
[53] M. H. Law, M. A. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1154-1166, 2004.
[54] X. Zhang, F. Pan, W. Wang, and A. Nobel, "Mining non-redundant high order correlations in binary data," Proceedings of the VLDB Endowment, vol. 1, no. 1, pp. 1178-1188, 2008.
[55] P. Chanda, J. Yang, A. Zhang, and M. Ramanathan, "On mining statistically significant attribute association information," in SDM, pp. 141-152, SIAM, 2010.
[56] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in ACM SIGMOD Record, vol. 30, pp. 37-46, ACM, 2001.
[57] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "Outlier detection in axis-parallel subspaces of high dimensional data," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 831-838, Springer, 2009.

Appendix
September 12, 2016

Proof of Lemma 1. We have $c_0 = \min_{q \in \mathcal{P}_S} E_S[-\log q(S)]$, and for any $q$,
$E_S[-\log q(S)] = E_S[-\log p(S)] + D(p \| q)$,
where $D(p \| q)$ is the Kullback-Leibler (KL) divergence between the distributions $p$ and $q$, and $p$ is the real distribution that generates $S$. Since $D(p \| q) \ge 0$, with equality if $p = q$, we have $c_0 = H(S)$, where $H(S) = E_S[-\log p(S)]$ is the Shannon entropy of $S$, denoting the degree of randomness in $S$ [1]. Similarly, we have $c_y = H(S \mid Y = y)$. Thus, we get
$\Delta C = H(S) - E_Y[H(S \mid Y = y)] = I(S;Y)$.
Hence, we have proved Lemma 1.

Proof of Theorem 1. The proof of Theorem 1 needs the following lemma.

L1. Let $C(s,q)$ be a bounded cost function such that $L = \sup_{s \in \mathcal{S},\, q \in \mathcal{P}_S} |C(s,q)| < \infty$. Then, for any given $Y = y$, we have
$E_{S|Y}[C(S,q_0) - C(S,q_y) \mid Y = y] \le 2\sqrt{2}\, L \sqrt{D(P_{S|Y=y} \| P_S)}$.

Proof. We have $E_{S|Y}[C(S,q_0) - C(S,q_y) \mid Y = y] = \sum_s p(s|y) K$, where $K = C(s,q_0) - C(s,q_y)$, and $\sum_s p(s|y) K$ simplifies to
$\sum_s \big(p(s|y) - p(s) + p(s)\big) K = \sum_s \big(p(s|y) - p(s)\big) K + \sum_s p(s) K \le 2L \sum_s |p(s|y) - p(s)| + \big(E_S[C(S,q_0)] - E_S[C(S,q_y)]\big) \le 2L \sum_s |p(s|y) - p(s)| = 4L\, \|P_{S|Y=y} - P_S\|_{TV}$,
where $\|\cdot\|_{TV}$ denotes the total variation distance (one half of the $L_1$ distance) between $P_{S|Y=y}$ and $P_S$; here we used $C(s,q_0) - C(s,q_y) \le 2L$ and $E_S[C(S,q_0)] - E_S[C(S,q_y)] \le 0$, since $q_0$ minimizes $E_S[C(S,q)]$. Thus, from Pinsker's inequality [1], we have
$4L\, \|P_{S|Y=y} - P_S\|_{TV} \le 4L \sqrt{\tfrac{1}{2} D(P_{S|Y=y} \| P_S)}$.
Thus, we have proved L1.

We now prove Theorem 1. We have
$\Delta C = E_S[C(S,q_0)] - E_Y\big[E_{S|Y}[C(S,q_y) \mid Y = y]\big] = E_Y\big[E_{S|Y}[K \mid Y = y]\big] \le 2\sqrt{2}\, L\, E_Y\big[\sqrt{D(P_{S|Y=y} \| P_S)}\big] \le 2\sqrt{2}\, L \sqrt{E_Y[D(P_{S|Y=y} \| P_S)]} = 2\sqrt{2}\, L \sqrt{I(S;Y)}$,
where the first inequality follows from L1, and the second from the concavity of the square root function (Jensen's inequality). Hence, we have proved Theorem 1.

Proof of Theorem 2. OPT1 can be rewritten as
$\arg\min_{P_{Y|X},\, P_{Y|S}} \sum_{y \in \mathcal{Y}} \sum_{s \in \mathcal{S}} P_{Y|S}(y|s)\, P_S(s) \log \frac{P_{Y|S}(y|s)}{P_Y(y)}$ (1)
subject to
$\sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{Y|X}(y|x)\, P_X(x)\, d(y,x) \le D$, (2)
$\sum_{x \in \mathcal{X}} P_{X|S}(x|s)\, P_{Y|X}(y|x) = P_{Y|S}(y|s) \quad \forall y, s$, (3)
$\sum_{s \in \mathcal{S}} P_{Y|S}(y|s)\, P_S(s) = P_Y(y) \quad \forall y$. (4)
In order to prove the convexity of the objective function, first note that $a\, x \log x$ is convex for fixed $a \ge 0$ and $x \ge 0$; thus the perspective $g(x,z;a) = a\, x \log(x/z)$ is also convex in $(x,z)$ for $z \ge 0$ and $a \ge 0$ [2]. Since the objective function in (1) can be written as $\sum_{y \in \mathcal{Y}} \sum_{s \in \mathcal{S}} g\big(P_{Y|S}(y|s), P_Y(y); P_S(s)\big)$, it follows that the objective function is convex. OPT1 is convex when $d(\cdot,\cdot)$ is a function of $X$ and $Y$ but not of their statistical properties, in which case $E_{X,Y}[d(X,Y)]$ is linear in $P_{Y|X}$. In the special case when $d(\cdot,\cdot)$ is a function of $X$ and $Y$ and of their statistical properties, e.g., when $d(x,y) = -\log P(X = x \mid Y = y)$ is the self-information (SI) function, $E_{X,Y}[d(X,Y)]$ is not linear in $P_{Y|X}$, and hence OPT1 is not a convex optimization problem.

Proof of Corollary 1. We have $I(S;Y) = I(S,X;Y) - I(X;Y \mid S)$. Since $X$ is a deterministic function of $S$, we have
$I(S;Y) = I(X;Y) + I(S;Y \mid X) - I(X;Y \mid S) = I(X;Y)$
(both conditional terms vanish, since $X$ is determined by $S$ and $Y$ is generated from $X$ alone). The decision of whether OPT2 is convex is arrived at via a rationale similar to the one used in the proof of Theorem 2 for OPT1.
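As a concrete illustration of the convexity established above, the following minimal Python sketch solves a small OPT1-style instance with cvxpy, i.e., it minimizes $I(S;Y)$ over the privacy mapping $P_{Y|X}$ subject to an expected Hamming-distortion budget, in the simpler setting where the joint prior $P_{S,X}$ is known (the consistency constraints (3)-(4) are substituted directly). The alphabets, the prior p_sx, and the budget D are illustrative assumptions; this is not the HIDE implementation, which additionally handles the case of an unknown prior.

import numpy as np
import cvxpy as cp

# Illustrative known prior P(S, X) (an assumption for this sketch only).
p_sx = np.array([[0.35, 0.05],
                 [0.10, 0.50]])              # rows: s, columns: x
p_x = p_sx.sum(axis=0)
n_x = n_y = 2
d = 1.0 - np.eye(n_x)                        # Hamming distortion d(x, y)
D = 0.2                                      # distortion budget (assumed)

# Decision variable: the privacy mapping P(Y|X), one row per x.
p_y_given_x = cp.Variable((n_x, n_y), nonneg=True)

# P(S,Y) and P(S)P(Y) are affine in P(Y|X), so I(S;Y) = D(P_SY || P_S P_Y)
# is convex via the relative-entropy atom (computed in nats).
p_sy = p_sx @ p_y_given_x
p_y = cp.sum(p_sy, axis=0, keepdims=True)
p_s = p_sx.sum(axis=1, keepdims=True)
mutual_info_nats = cp.sum(cp.rel_entr(p_sy, p_s @ p_y))

constraints = [
    cp.sum(p_y_given_x, axis=1) == 1,                         # rows are distributions
    cp.sum(cp.multiply(p_x[:, None] * d, p_y_given_x)) <= D,  # E[d(X,Y)] <= D (affine)
]
prob = cp.Problem(cp.Minimize(mutual_info_nats), constraints)
prob.solve()
print("optimal leakage I(S;Y) =", mutual_info_nats.value / np.log(2), "bits")

When $d(\cdot,\cdot)$ is itself a function of the statistical properties of $X$ and $Y$ (e.g., the SI distortion), the distortion constraint is no longer affine and this disciplined-convex formulation no longer applies, which mirrors the non-convex case identified in the proof of Theorem 2.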
Proof of Theorem 4. OPT4 can be cast as
OPT4: $\min_{Q :\; Q Q^T \sqrt{P_X} = \sqrt{P_X}} \; \sigma_2(Q)$ subject to $E_{X,Y}[d(X,Y)] \le D$,
where $\sigma_2(Q)$ is the second largest singular value of $Q \ge 0$ (entry-wise), and the expectation operator in the constraint is over the joint probability induced by the matrix $Q$. Since the constraint is quadratic in the entries of $Q$, and is linear in $P_{Y|X}$, we have in OPT4 a quadratically constrained convex program when $d(\cdot,\cdot)$ is not a function of the statistical properties of $X$ and $Y$, and a non-convex program otherwise.

Proof of Theorem 5. The proof of Theorem 5 involves the use of lemmas L2 to L6 (stated below).

L2. Let $P$ and $\bar P$ be the matrix forms of two joint probability distributions on $(X,Y)$, such that $P_{X,Y}(x,y) = [P]_{x,y}$ and $\bar P_{X,Y}(x,y) = [\bar P]_{x,y}$. We bound the difference between $Q$ and $\bar Q$ by the difference between $P$ and $\bar P$, as follows:
(i) $\|Q - \bar Q\|_2 = \|\Lambda\|_2 \le \frac{1}{\sqrt{\lambda_X(P)}} \|\Delta_Y\|_2 + \frac{1}{\sqrt{\lambda_X(P)}} \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_2 + \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|\bar P\|_2 \|\Delta_X\|_2$, (5)
where $\Lambda = R_X(P)^{-1/2} P R_Y(P)^{-1/2} - R_X(\bar P)^{-1/2} \bar P R_Y(\bar P)^{-1/2}$, $\Delta_X = R_X(P)^{-1/2} - R_X(\bar P)^{-1/2}$, $\Delta_Y = R_Y(P)^{-1/2} - R_Y(\bar P)^{-1/2}$, and $Q = R_X(P)^{-1/2} P_{X,Y} R_Y(P)^{-1/2}$; here $R_X(P)$ and $R_Y(P)$ denote the diagonal matrices of the marginals $P_X$ and $P_Y$, $\lambda_X(P) = \min_x P_X(x)$, and $\lambda_Y(P) = \min_y P_Y(y)$.
(ii) We also have
$\|Q - \bar Q\|_2 \le \frac{\sqrt{R}}{\sqrt{\lambda_X(P)}} \frac{\|\delta_Y\|_\infty}{2\, \lambda_Y(P,\bar P)^{3/2}} + \frac{1}{\sqrt{\lambda_X(P)} \sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_1 + \frac{\sqrt{R}}{\sqrt{\lambda_Y(\bar P)}} \frac{\|\delta_X\|_\infty}{2\, \lambda_X(P,\bar P)^{3/2}}$, (6)
where $\delta_X = R_X(P) - R_X(\bar P)$, $\delta_Y = R_Y(P) - R_Y(\bar P)$, $R = \max\{|\mathcal{X}|, |\mathcal{Y}|\}$, $\lambda_Y(P,\bar P) = \min\{\lambda_Y(P), \lambda_Y(\bar P)\}$, and $\lambda_X(P,\bar P) = \min\{\lambda_X(P), \lambda_X(\bar P)\}$.

Proof. We have
$\|Q - \bar Q\|_2 = \|\Lambda\|_2 \le \|R_X(P)^{-1/2} P R_Y(P)^{-1/2} - R_X(P)^{-1/2} P R_Y(\bar P)^{-1/2}\|_2 + \|R_X(P)^{-1/2} P R_Y(\bar P)^{-1/2} - R_X(\bar P)^{-1/2} \bar P R_Y(\bar P)^{-1/2}\|_2 \le \|R_X(P)^{-1/2} P\|_2 \|\Delta_Y\|_2 + \|R_Y(\bar P)^{-1/2}\|_2 \,\|R_X(P)^{-1/2} P - R_X(\bar P)^{-1/2} \bar P\|_2$,
and splitting the last factor as $\|R_X(P)^{-1/2} P - R_X(P)^{-1/2} \bar P\|_2 + \|R_X(P)^{-1/2} \bar P - R_X(\bar P)^{-1/2} \bar P\|_2$ gives
$\|Q - \bar Q\|_2 \le \|R_X(P)^{-1/2} P\|_2 \|\Delta_Y\|_2 + \|R_Y(\bar P)^{-1/2}\|_2 \|R_X(P)^{-1/2}\|_2 \|P - \bar P\|_2 + \|R_Y(\bar P)^{-1/2}\|_2 \|\bar P\|_2 \|\Delta_X\|_2$. (7)
For any distribution $P_{X,Y}$, we have
$\|R_X(P)^{-1/2}\|_2 = \max_{x \in \mathcal{X}} \frac{1}{\sqrt{P_X(x)}} = \frac{1}{\sqrt{\lambda_X(P)}}$ (8)
and
$\|R_Y(P)^{-1/2}\|_2 = \max_{y \in \mathcal{Y}} \frac{1}{\sqrt{P_Y(y)}} = \frac{1}{\sqrt{\lambda_Y(P)}}$. (9)
Substituting (8) and (9) into (7), and using $\|P\|_2 \le 1$, we get
$\|Q - \bar Q\|_2 \le \frac{1}{\sqrt{\lambda_X(P)}} \|\Delta_Y\|_2 + \frac{1}{\sqrt{\lambda_X(P)}} \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_2 + \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|\bar P\|_2 \|\Delta_X\|_2$, (10)
which proves part (i) of L2. Further, for a joint probability distribution matrix $P$, we have $\|P\|_1 \le 1$ and $\|P\|_\infty \le 1$. In addition, we have the following result from [3]:

L3 (due to [3]). For a given $m \times n$ matrix $A$, we have $\|A\|_2 \le \sqrt{m}\, \|A\|_\infty$ and $\|A\|_2 \le \sqrt{n}\, \|A\|_1$.

Thus, using L3 together with $\|P\|_\infty \le 1$ and $\|\bar P\|_1 \le 1$, we have $\|P\|_2 \le \sqrt{R}$ and $\|\bar P\|_2 \le \sqrt{R}$. Using this result, and bounding the (diagonal) entries of $\Delta_X$ and $\Delta_Y$ via the mean value theorem, we have
$\|Q - \bar Q\|_2 \le \frac{\sqrt{R}}{\sqrt{\lambda_X(P)}} \frac{\|\delta_Y\|_\infty}{2\, \lambda_Y(P,\bar P)^{3/2}} + \frac{1}{\sqrt{\lambda_X(P)} \sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_1 + \frac{\sqrt{R}}{\sqrt{\lambda_Y(\bar P)}} \frac{\|\delta_X\|_\infty}{2\, \lambda_X(P,\bar P)^{3/2}}$. (11)
Thus, we have proved L2.

L4. Let $\|P - \bar P\|_\infty \le \epsilon$ (entry-wise) for some $\epsilon > 0$. Then we have
$|\sigma_2(Q) - \sigma_2(\bar Q)| \le \frac{2\epsilon R \sqrt{R}}{\lambda^2}$, (12)
where $\lambda = \min\{\lambda_X(P,\bar P), \lambda_Y(P,\bar P)\}$.

Proof. To prove this lemma, we first need the following result from [4]:

L5 (due to [4]). Let $A_1$ and $A_2$ be two matrices of the same size. We have
$|\sigma_i^1 - \sigma_i^2| \le \|A_1 - A_2\|_2$, (13)
where $\sigma_i^1$ and $\sigma_i^2$ are the $i$-th largest singular values of $A_1$ and $A_2$, respectively.

Using L5 and part (i) of L2, we have
$|\sigma_2(Q) - \sigma_2(\bar Q)| \le \|Q - \bar Q\|_2 \le \frac{1}{\sqrt{\lambda_X(P)}} \|\Delta_Y\|_2 + \frac{1}{\sqrt{\lambda_X(P)}} \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_2 + \frac{1}{\sqrt{\lambda_Y(\bar P)}} \|\bar P\|_2 \|\Delta_X\|_2$, (14)
where $\sigma_2(Q)$ and $\sigma_2(\bar Q)$ are the second largest singular values of $Q$ and $\bar Q$. Furthermore, using part (ii) of L2, we have
$|\sigma_2(Q) - \sigma_2(\bar Q)| \le \frac{\sqrt{R}}{\sqrt{\lambda_X(P)}} \frac{\|\delta_Y\|_\infty}{2\, \lambda_Y(P,\bar P)^{3/2}} + \frac{1}{\sqrt{\lambda_X(P)} \sqrt{\lambda_Y(\bar P)}} \|P - \bar P\|_1 + \frac{\sqrt{R}}{\sqrt{\lambda_Y(\bar P)}} \frac{\|\delta_X\|_\infty}{2\, \lambda_X(P,\bar P)^{3/2}}$. (15)
Since $\|P - \bar P\|_\infty \le \epsilon$, we have
$|\sigma_2(Q) - \sigma_2(\bar Q)| \le \frac{\epsilon R \sqrt{R}}{2\lambda^2} + \frac{\epsilon \sqrt{R}}{\lambda} + \frac{\epsilon R \sqrt{R}}{2\lambda^2} \le \frac{2\epsilon R \sqrt{R}}{\lambda^2}$. (16)
Thus, we have proved L4.

We now state the following lemma (see [5] for a proof) for its use in the proof of Theorem 5.

L6. Let $P$ be a probability distribution over $\mathcal{X}$, and let $P^{(p)}$ denote the empirical probability distribution of $X$ obtained from $p$ i.i.d. samples $\{x_i\}_{i=1}^p$ drawn according to $P$. Then we have
$\Pr\big[\|P^{(p)} - P\|_1 > \epsilon\big] \le 4 e^{-p \epsilon^2}$. (17)

We now proceed with the proof of Theorem 5. Using L2 and the union bound, we have, for a given number $p$ of i.i.d. samples,
$\Pr\big[|\sigma_2(Q_p) - \sigma_2(Q)| > \epsilon\big] \le \Pr[\mathcal{E}_1] + \Pr[\mathcal{E}_2] + \Pr[\mathcal{E}_3]$, (18)
where
$\mathcal{E}_1 = \Big\{\tfrac{1}{\sqrt{\lambda_X(P)}}\, \|R_Y(P^{(p)})^{-1/2} - R_Y(P)^{-1/2}\|_2\, \|P\|_2 > \tfrac{\epsilon}{3}\Big\}$,
$\mathcal{E}_2 = \Big\{\tfrac{1}{\sqrt{\lambda_X(P)}\sqrt{\lambda_Y(P^{(p)})}}\, \|P^{(p)} - P\|_2 > \tfrac{\epsilon}{3}\Big\}$, and
$\mathcal{E}_3 = \Big\{\tfrac{1}{\sqrt{\lambda_Y(P^{(p)})}}\, \|R_X(P^{(p)})^{-1/2} - R_X(P)^{-1/2}\|_2\, \|P^{(p)}\|_2 > \tfrac{\epsilon}{3}\Big\}$,
which further leads us to
$\Pr\big[|\sigma_2(Q_p) - \sigma_2(Q)| > \epsilon\big] \le \Pr[\tilde{\mathcal{E}}_1] + \Pr[\tilde{\mathcal{E}}_2] + \Pr[\tilde{\mathcal{E}}_3]$, (19)
where
$\tilde{\mathcal{E}}_1 = \Big\{\|R_Y(P^{(p)})^{-1/2} - R_Y(P)^{-1/2}\|_\infty > \tfrac{\epsilon}{3}\, \sqrt{\lambda_X(P)}\, \tfrac{1}{\sqrt{R}}\Big\}$,
$\tilde{\mathcal{E}}_2 = \Big\{\|P^{(p)} - P\|_1 > \tfrac{\epsilon}{3}\, \sqrt{\lambda_X(P)}\sqrt{\lambda_Y(P^{(p)})}\, \tfrac{1}{\sqrt{R}}\Big\}$, and
$\tilde{\mathcal{E}}_3 = \Big\{\|R_X(P^{(p)})^{-1/2} - R_X(P)^{-1/2}\|_\infty > \tfrac{\epsilon}{3}\, \sqrt{\lambda_Y(P^{(p)})}\, \tfrac{1}{\sqrt{R}}\Big\}$.
Using L6, we have
$\Pr\big[\|P^{(p)}(Y) - P(Y)\|_1 > \tfrac{1}{2}\lambda_Y(P)\big] \le 4 e^{-p(\frac{1}{2}\lambda_Y(P))^2}$. (20)
This probability is at most $\delta/2$ if $p \ge \frac{1}{(\frac{1}{2}\lambda_Y(P))^2}\log\frac{8}{\delta}$. Moreover, given $\|P^{(p)}(Y) - P(Y)\|_1 \le \tfrac{1}{2}\lambda_Y(P)$, we obtain $\lambda_Y(P^{(p)}) > \tfrac{1}{2}\lambda_Y(P)$. We also have
$\Pr\big[|\sigma_2(Q_p) - \sigma_2(Q)| > \epsilon\big] \le \Pr\big[|\sigma_2(Q_p) - \sigma_2(Q)| > \epsilon \,\big|\, \|P^{(p)}(Y) - P(Y)\|_1 \le \tfrac{1}{2}\lambda_Y(P)\big] + \Pr\big[\|P^{(p)}(Y) - P(Y)\|_1 > \tfrac{1}{2}\lambda_Y(P)\big]$.
Let $p \ge \frac{1}{(\frac{1}{2}\lambda_Y(P))^2}\log\frac{8}{\delta}$ to make the second term at most $\delta/2$. Next, we choose $p$ sufficiently large such that the first term is no greater than $\delta/2$. Using (15), the first term is bounded by
$\Pr\big[\|R_Y(P^{(p)}) - R_Y(P)\|_\infty > \epsilon_1\big] + \Pr\big[\|P^{(p)} - P\|_1 > \epsilon_2\big] + \Pr\big[\|R_X(P^{(p)}) - R_X(P)\|_\infty > \epsilon_3\big]$,
where $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are the thresholds of $\tilde{\mathcal{E}}_1$, $\tilde{\mathcal{E}}_2$, and $\tilde{\mathcal{E}}_3$ with $\lambda_Y(P^{(p)})$ replaced by $\tfrac{1}{2}\lambda_Y(P)$. Now using L6, each term on the right hand side goes to zero exponentially fast in $p$; in particular,
$\Pr\big[|\sigma_2(Q_p) - \sigma_2(Q)| > \epsilon\big] \le 12\, e^{-p\,\epsilon'^2}$,
where $\epsilon' = \min\{\epsilon_1, \epsilon_2, \epsilon_3\}$. Therefore, in order to have the first term at most $\delta/2$, it suffices to have $p \ge \frac{1}{\epsilon'^2}\log\frac{24}{\delta}$. Thus, overall it suffices to have $p \ge \frac{1}{\epsilon'^2}\log\frac{24}{\delta}$. Hence, we have proved Theorem 5.
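The sample-complexity statement above concerns the plug-in estimate of $\sigma_2(Q)$ computed from an empirical joint distribution. The minimal Python sketch below shows the quantity being estimated; the "true" joint distribution P and the sample size p are illustrative assumptions chosen only to make the estimate well defined, not values used in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" joint distribution P(X, Y) on small alphabets (assumed).
P = np.array([[0.30, 0.10, 0.05],
              [0.05, 0.25, 0.25]])

def second_singular_value(joint):
    """sigma_2 of Q = R_X(P)^(-1/2) P R_Y(P)^(-1/2) for a joint pmf matrix."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    Q = np.diag(px ** -0.5) @ joint @ np.diag(py ** -0.5)
    return np.linalg.svd(Q, compute_uv=False)[1]   # the largest singular value is 1

# Draw p i.i.d. samples from P and form the empirical joint distribution P^(p).
# (Assumes every symbol pair has been observed, so the empirical marginals are
# nonzero; this holds with high probability for this p.)
p = 5000
flat = rng.choice(P.size, size=p, p=P.ravel())
P_hat = np.bincount(flat, minlength=P.size).reshape(P.shape) / p

print("sigma_2(Q)   =", second_singular_value(P))
print("sigma_2(Q_p) =", second_singular_value(P_hat))

Theorem 5 quantifies how large p must be for the gap between these two printed values to be smaller than a target epsilon with probability at least 1 - delta.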
Proof of Theorem 6. Let $P_{S|X}$ be fixed, and define $g_\lambda(P_X) = H(S) - \lambda H(X)$, where $H(S)$ and $H(X)$ are the entropies of $S$ and $X$, respectively, when $(S,X) \sim P_{S|X} P_X$. For $0 < \epsilon \ll 1$, let $P_X^\epsilon(i) = P_X(i)(1 + \epsilon f(i))$ be a perturbed version of $P_X$, where $E[f(X)] = 0$ and, w.l.o.g., $\|f(X)\|_2 = 1$. The second derivative of $g_\lambda(P_X^\epsilon)$ at $\epsilon = 0$ is
$\frac{\partial^2 g_\lambda(P_X^\epsilon)}{\partial \epsilon^2}\Big|_{\epsilon=0} = \log_2(e)\,\big(\lambda - \|E[f(X)|S]\|_2^2\big)$. (21)
The above equation follows directly from the identity
$\frac{\partial^2}{\partial \epsilon^2}\, a(1+\epsilon b)\log_2 a(1+\epsilon b) = b^2 a \log_2(e)$. (22)
We now need the following lemma to move forward with the proof.

L7. For a given $P_{S,X}$, we have
$\lambda(P_{S,X}) = \min\big\{\|E[f(X)|S]\|_2^2 \,:\, f:\mathcal{X}\to\mathbb{R},\ E[f(X)] = 0,\ \|f(X)\|_2 = 1\big\}$. (23)

Proof. Let $f:\mathcal{X}\to\mathbb{R}$ with $E[f(X)] = 0$ and $\|f(X)\|_2^2 = 1$, and let $\mathbf{f} \in \mathbb{R}^{|\mathcal{X}|}$ be the vector with entries $f_i = f(i)$ for $i \in \mathcal{X}$. Observe that
$\|E[f(X)|S]\|_2^2 = \sum_s P_S(s)\, E[f(X)|S=s]^2 = \mathbf{f}^T P_{X|S}^T D_S P_{X|S}\, \mathbf{f} = \mathbf{f}^T D_X^{1/2} Q^T Q\, D_X^{1/2} \mathbf{f} \ge \lambda(P_{S,X})$,
where the last inequality follows by noting that $x = D_X^{1/2}\mathbf{f}$ satisfies $\|x\|_2 = 1$, and that $\lambda(P_{S,X})$ is the smallest eigenvalue of the positive semi-definite matrix $Q^T Q$, where $Q = D_S^{-1/2} P_{S,X} D_X^{-1/2}$. Hence we have proved L7.

Thus, from L7, if $\lambda \le \lambda(P_{S,X})$, then for any sufficiently small perturbation of $P_X$, (21) is non-positive; consequently, if $\lambda > \lambda(P_{S,X})$, we can find a perturbation $f(X)$ such that (21) is positive. Therefore, $g_\lambda(P_X)$ has a negative semi-definite Hessian if and only if $0 \le \lambda \le \lambda(P_{S,X})$. For any $S \to X \to Y$, we have
$\frac{I(S;Y)}{I(X;Y)} \ge PUC_{opt}(P_{S,X})$,
and consequently, for $0 \le \lambda_y \le PUC_{opt}(P_{S,X})$, we have $g_{\lambda_y}(P_X) \ge H(S|Y) - \lambda_y H(X|Y)$, i.e., $g_{\lambda_y}(P_X)$ touches the upper concave envelope of $g_{\lambda_y}$ at $P_X$. Consequently, $g_{\lambda_y}$ has a negative semi-definite Hessian at $P_X$ and, from (21), $\lambda_y \le \lambda(P_{S,X})$. Since this holds for any $0 \le \lambda_y \le PUC_{opt}(P_{S,X})$, we find $PUC_{opt}(P_{S,X}) \le \lambda(P_{S,X})$. We have thus proved Theorem 6.1.

In order to prove Theorem 6.2, we need the following lemma.

L8. Let $Q_S$ denote the distribution of $S$ when $P_{S|X}$ is fixed and $X \sim Q_X$. Then
$PUC_{opt}(P_{S,X}) = \inf_{Q_X \ne P_X} \frac{D(Q_S \| P_S)}{D(Q_X \| P_X)}$.

Proof. For fixed $P_{Y|X}$ and $P_{S|X}$, we have
$\frac{I(S;Y)}{I(X;Y)} = \frac{\sum_{y\in\mathcal{Y}} P_Y(y)\, D(P_{S|Y=y}\|P_S)}{\sum_{y\in\mathcal{Y}} P_Y(y)\, D(P_{X|Y=y}\|P_X)} \ge \min_{y\in\mathcal{Y}:\, D(P_{X|Y=y}\|P_X)>0} \frac{D(P_{S|Y=y}\|P_S)}{D(P_{X|Y=y}\|P_X)} \ge \inf_{Q_X\ne P_X} \frac{D(Q_S\|P_S)}{D(Q_X\|P_X)}$.
Now let $d$ be the infimum on the right hand side, and let $Q_X$ satisfy $\frac{D(Q_S\|P_S)}{D(Q_X\|P_X)} = d + \delta$, where $\delta > 0$. For any $\epsilon > 0$ sufficiently small, let $P_{Y|X}$ be such that $\mathcal{Y} = \{1,2\}$, $P_Y(1) = \epsilon$, $P_{X|Y}(x|1) = Q_X(x)$, and
$P_{X|Y}(x|2) = \frac{1}{1-\epsilon} P_X(x) - \frac{\epsilon}{1-\epsilon} Q_X(x)$.
Since for any distribution $r_X$ with support $\mathcal{X}$ we have $D((1-\epsilon)P_X + \epsilon r_X \| P_X) = o(\epsilon)$, we find
$I(S;Y) = \epsilon\, D(P_{S|Y=1}\|P_S) + (1-\epsilon)\, D(P_{S|Y=2}\|P_S) = \epsilon\, D(Q_S\|P_S) + o(\epsilon)$,
and, equivalently, $I(X;Y) = \epsilon\, D(Q_X\|P_X) + o(\epsilon)$. Consequently,
$\frac{I(S;Y)}{I(X;Y)} = \frac{\epsilon\, D(Q_S\|P_S) + o(\epsilon)}{\epsilon\, D(Q_X\|P_X) + o(\epsilon)} \to d + \delta$
as $\epsilon \to 0$. Since this holds for any $\delta > 0$, we get $PUC_{opt}(P_{S,X}) \le d$, proving L8.

It follows immediately from Theorem 6.1 that $\lambda(P_{S,X}) = 0 \Rightarrow PUC_{opt}(P_{S,X}) = 0$. For the converse, since $D(Q_X\|P_X) \le -\min_i \log_2 P_X(i)$ and $\mathcal{X}$ is finite, it follows from L8 that for any $\delta > 0$ there exist $Q_X$ and $0 < \theta \le -\min_i \log_2 P_X(i)$ such that $D(Q_X\|P_X) = \theta > 0$ and $D(Q_S\|P_S) < \delta$. We can then construct a sequence $Q_X^1, Q_X^2, \ldots$ such that $Q_X^k \ne P_X$, $D(Q_S^k\|P_S) \le \delta_k$, and $\lim_{k\to\infty}\delta_k = 0$. Let $Q_S^k$ denote the vector whose entries are $Q_S^k(\cdot)$. Then, from Pinsker's inequality, we have
$\delta_k \ge \tfrac{1}{2}\|Q_S^k - P_S\|_1^2 \ge \tfrac{1}{2}\|Q_S^k - P_S\|_2^2$. (24)
Defining $x_k = Q_X^k - P_X$, observe that $0 < \|x_k\|_2^2 \le 2$ and, from (24), $\|P_{S|X}\, x_k\|_2 \le \sqrt{2\delta_k}$. Hence, we have
$\lim_{k\to\infty} \frac{\|P_{S|X}\, x_k\|_2^2}{\|x_k\|_2^2} = 0$. (25)
In addition, denoting $s_m = \min_{s} P_S(s)$ and $x_M = \max_{x\in\mathcal{X}} P_X(x)$, for each $k$ we have
$\frac{\|P_{S|X}\, x_k\|_2^2}{\|x_k\|_2^2} \ge \min_{\|y\|_2^2 > 0} \frac{\|P_{S|X}\, y\|_2^2}{\|y\|_2^2}$ (26)
$= \min_{\|y\|_2^2 > 0} \frac{\|P_{S,X} D_X^{-1/2}\, y\|_2^2}{\|D_X^{1/2}\, y\|_2^2}$ (27)
$\ge \min_{\|y\|_2^2 > 0} \frac{s_m\, \|D_S^{-1/2} P_{S,X} D_X^{-1/2}\, y\|_2^2}{x_M\, \|y\|_2^2}$ (28)
$= \frac{s_m}{x_M} \min_{\|y\|_2^2 > 0} \frac{\|Q y\|_2^2}{\|y\|_2^2} = \frac{s_m\, \lambda(P_{S,X})}{x_M}$. (29)
In the derivation above, (27) follows from $D_X$ being invertible (by definition), via the change of variables $y \mapsto D_X^{1/2} y$; (28) is a direct consequence of $\|D_S^{-1/2} y\|_2^2 \le s_m^{-1}\|y\|_2^2$ and $\|D_X^{1/2} y\|_2^2 \le x_M\|y\|_2^2$ for any $y$; and (29) follows from the definitions of $Q$ and $\lambda(P_{S,X})$, respectively. Combining (29) with (25), it follows that $\lambda(P_{S,X}) = 0$, proving Theorem 6.2.

The direct part of Theorem 6.3 follows directly from the definition of $PUC_{opt}(P_{S,X})$ and Theorem 6.2. Assume that $\lambda(P_{S,X}) = 0$. Then, from L7, there exists $f:\mathcal{X}\to\mathbb{R}$ such that $\|f(X)\|_2 = 1$, $E[f(X)] = 0$, and $\|E[f(X)|S]\|_2 = 0$; consequently, $E[f(X)|S=s] = 0$ for all $s\in\mathcal{S}$. Fix $\mathcal{Y} = \{1,2\}$, and for $\epsilon > 0$ sufficiently small, let
$P_{Y|X}(y|x) = \tfrac{1}{2} + (-1)^y\, \epsilon f(x), \quad y \in \{1,2\}$. (30)
Note that it is sufficient to choose $\epsilon = (2\max_{x\in\mathcal{X}}|f(x)|)^{-1}$, so $\epsilon$ is strictly bounded away from 0. In addition, $P_Y(1) = \tfrac{1}{2}$. Thus, we have
$I(X;Y) = 1 - \sum_{x\in\mathcal{X}} P_X(x)\, h_b\big(\tfrac{1}{2} + \epsilon f(x)\big) > 0$, (31)
where $h_b(x) = -x\log_2 x - (1-x)\log_2(1-x)$ is the binary entropy function. Since $S \to X \to Y$, we have
$P_{Y|S}(y|s) = \sum_{x\in\mathcal{X}} P_{Y|X}(y|x)\, P_{X|S}(x|s) = \sum_{x\in\mathcal{X}} \big(\tfrac{1}{2} + (-1)^y\epsilon f(x)\big) P_{X|S}(x|s) = \tfrac{1}{2} + (-1)^y\epsilon\, E[f(X)|S=s] = \tfrac{1}{2}$, (32)
and, consequently, $S$ and $Y$ are independent. Then $I(S;Y) = 0$, and hence we have proved Theorem 6.3.

Proof of Theorem 7. If $\lambda(P_{S,X}) = 0$, then the lower bound on $t$ follows directly from the proof of Theorem 6.3, and in particular from (15) in the main paper. If $\lambda(P_{S,X}) > 0$, then $\mathcal{F} = \{f_0\}$, and the lower bound (16) in the main paper reduces to the trivial bound $t \ge 0$. In order to prove that the lower bound is sharp, consider $S$ being an unbiased bit drawn from $\{1,2\}$, and $X$ the result of sending $S$ through an erasure channel with erasure probability $\tfrac{1}{2}$ and $\mathcal{X} = \{1,2,3\}$, with 3 playing the role of the erasure symbol. Let $f(x)$ equal 1 when $x\in\{1,2\}$ and $-1$ when $x = 3$. Then $f\in\mathcal{F}$, $h_b\big(\tfrac{1}{2} + \frac{f(x)}{2\|f\|_\infty}\big) = 0$ for all $x\in\mathcal{X}$, and $t = 1$. But from Theorem 6.3, we have $t \le H(X|S) = 1$. Thus, the result in Theorem 7 follows.

References
[1] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[2] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[3] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[4] L. Mirsky, "Symmetric gauge functions and unitarily invariant norms," The Quarterly Journal of Mathematics, vol. 11, no. 1, pp. 50-59, 1960.
[5] L. Devroye, "The equivalence of weak, strong and complete convergence in L1 for kernel density estimates," The Annals of Statistics, pp. 896-904, 1983.