Differentially Private Learned Models for Location Services

by

Ritesh Ahuja

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2022

Copyright 2022 Ritesh Ahuja

Acknowledgements

Foremost, my sincere appreciation for the teachings of my advisor, Prof. Cyrus Shahabi. Throughout the years, he has provided unfaltering support so that I may perform at my very best, whether in research, in presentation, or in my personal life. I also thank my dissertation committee members, Prof. Bhaskar Krishnamachari and Prof. Aleksandra Korolova, for their valuable feedback. They have been especially accommodating and kind. A special thanks to my co-author, Prof. Gabriel Ghinita, for his dedicated mentorship. I have learned significantly from him, especially his writing style. I am grateful to my co-author and good friend Sepanta Zeighami for the many hours of intense discussions. I admire his unique ability to always take my argument in good faith, even when it may not be.

I was fortunate to have made good friends at Infolab. Their company made my time at USC more fun than it had any right to be: Dimitris Stripelis, Sepanta Zeighami, Hien To, Sina Shaham, Haowen Lin, Giorgos Constantinou, Chrysovalantis Anastasiou, Mingxuan Yue and Yaguang Li.

Last but not least, I thank my wife and best friend, Xiaoyang Long, for her warm encouragement and endless patience. When things get tough, I know I have you to take care of me. My heartfelt thanks to my family for their tremendous support throughout my education. When it got stressful and frustrating, they were always there to calm and support me through it.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Summary of thesis work
    1.3.1 On spatial range count query
    1.3.2 On spatio-temporal range count query, hotspot discovery, and POI visit forecasting
    1.3.3 On Next-POI prediction
  1.4 Thesis Statement
  1.5 Thesis Outline

Chapter 2: Related Work
  2.1 On publishing spatial and spatio-temporal histograms
  2.2 On location recommendation

Chapter 3: A Neural Database for Differentially Private Spatial Range Queries
  3.1 Preliminaries
    3.1.1 Differential Privacy
    3.1.2 Problem Definition
  3.2 Spatial Neural Histograms (SNH)
    3.2.1 Baseline Solution using DP-SGD
    3.2.2 A different learning paradigm for RCQs
    3.2.3 Proposed approach: SNH
  3.3 Technical Details
    3.3.1 Step 1: Data Collection
    3.3.2 Step 2: SNH Training
    3.3.3 Model Utilization
  3.4 End-to-End System Aspects
    3.4.1 System Tuning with ParamSelect
      3.4.1.1 ParamSelect for ρ
      3.4.1.2 Generalizing ParamSelect to any system parameter
    3.4.2 Privacy and Security Discussion
  3.5 Experimental Evaluation
    3.5.1 Experimental Settings
      3.5.1.1 Datasets
      3.5.1.2 SNH system parameters
      3.5.1.3 Other experimental settings
    3.5.2 Comparison with Baselines
    3.5.3 Ablation Study for SNH
      3.5.3.1 Modeling choices
      3.5.3.2 Balancing Uniformity Errors
      3.5.3.3 ParamSelect and ρ
      3.5.3.4 SNH Learning Ability in Non-Uniform Datasets
  3.6 Supplementary Experimental Results
    3.6.1 Comparison against all baselines
    3.6.2 Data Augmentation: Uniformity error or Large Scale Noise
    3.6.3 Benefit of ParamSelect
    3.6.4 System parameters analysis
    3.6.5 Further GMM Visualizations
  3.7 Differentially Private STHoles Implementation
  3.8 ParamSelect Feature Engineering and Feature Selection

Chapter 4: A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy
  4.1 Background
  4.2 VAE-based Density Release (VDR)
    4.2.1 VDR Overview
    4.2.2 Data Collection
    4.2.3 Learned Denoising
      4.2.3.1 Design Principles
      4.2.3.2 Denoising with Convolutional VAEs
    4.2.4 Statistical Refinement
      4.2.4.1 Estimation with Differential Privacy
      4.2.4.2 Estimation algorithm
  4.3 System Design and Analysis
    4.3.1 Privacy Analysis
    4.3.2 VDR Design Choices
    4.3.3 System Parameter Selection
      4.3.3.1 Refinement Factor and Sampling Parameter
      4.3.3.2 Discretization Granularity
    4.3.4 Data Release over Time
  4.4 Experimental Evaluation
    4.4.1 Experimental Settings
    4.4.2 Comparison with Baselines
      4.4.2.1 Range Count Query
      4.4.2.2 Forecasting Query
      4.4.2.3 Hotspot Query
    4.4.3 System Analysis
      4.4.3.1 Modeling choices
      4.4.3.2 Data release over time
    4.4.4 User-level privacy and statistical refinement
    4.4.5 Learning Ability on Non-Uniform Datasets

Chapter 5: Differentially-Private Next-Location Prediction with Neural Networks
  5.1 Background
    5.1.1 Differential Privacy
    5.1.2 Neural Networks
    5.1.3 Differentially Private-SGD (DP-SGD)
  5.2 System Architecture
    5.2.1 Problem Statement
    5.2.2 Learning Model
    5.2.3 Model Utilization
  5.3 Private Location Prediction (PLP)
    5.3.1 Private Location Prediction (PLP)
    5.3.2 Privacy Analysis
  5.4 Experiments
    5.4.1 Experimental Settings
    5.4.2 Comparison with Baseline
    5.4.3 Hyper-parameter Tuning

Chapter 6: Conclusions

Bibliography

List of Tables

3.1 Summary of Notations in SNH
3.2 Urban datasets characteristics
3.3 Validation set error of ParamSelect in predicting ρ
5.1 Summary of Notations in PLP

List of Figures

1.1 Location Services
1.2 Location Services
3.1 Spatial Neural Histogram System
3.2 SNH Overview
3.3 Data Collection: map view (left), true cell count heatmap (middle), ε-DP heatmap with noisy counts (right)
3.4 Model Training: Augmented query sets of size r_1 to r_k (top) are used to learn neural network models (bottom)
3.5 Model utilization: 30m query answered from 25m network (left), 90m query from 100m network (right)
3.6 Impact of privacy budget: VS, SPD-VS and CABS datasets
3.7 Impact of privacy budget: GW dataset
3.8 Impact of data and query size
3.9 Study of modeling choice
3.10 Impact of uniformity assumption
3.11 Impact of ρ and ParamSelect
3.12 Impact of data skewness (ε = 0.2)
3.13 SNH learns patterns on GMM dataset of 16 components. Color shows number of data points.
3.14 Milwaukee (VS), ε = 0.2, n = 100k
3.15 Replacing uniformity error with noise
3.16 Study of ParamSelect
3.17 Impact of k
3.18 Impact of model depth
3.19 SNH on GMM, ε = 0.05, σ = 14
3.20 SNH on GMM, ε = 0.2, σ = 14
3.21 SNH on GMM, ε = 0.05, σ = 7
3.22 SNH on GMM, ε = 0.2, σ = 7
3.23 SNH on GMM, ε = 0.05, σ = 3.5
3.24 SNH on GMM, ε = 0.2, σ = 3.5
4.1 (a) and (b): real-world complete and sampled dataset of location reports over time in Houston. (c) and (d): exact and noisy 3-d histograms created from the sampled dataset; higher brightness shows higher density.
4.2 Veraset Houston Data Statistics
4.3 Spatial patterns over time on histogram slices
4.4 A histogram slice with varying coarseness
4.5 Training set preparation
4.6 Model Training
4.7 Impact of privacy budget on range count query (RCQ) accuracy
4.8 Impact of query size on RCQ accuracy
4.9 Impact of privacy budget on forecast query error (sMAPE)
4.10 Impact of privacy budget on the hotspot query for the 4SQ Tokyo dataset at various thresholds
4.11 Impact of privacy budget on hotspot query accuracy
4.12 Impact of varying learning period
4.13 Out-of-sample denoising
4.14 Learning Period Analysis
4.15 Multi Resolution Learning
4.16 Model regularization analysis
4.17 Sampling Error and Noise Error at varying sampling rate
4.18 Effect of scaling factor g on SE and SE+NE
4.19 Varying refinement constant C for VDR
4.20 Benefit of VDR refinement with C = 5e-5
4.21 Gaussian Mixture Model visualization at σ varying from 3 (top), 5, 7, 9 (bottom)
4.22 Varying σ²
4.23 Varying bottleneck
5.1 System Model
5.2 Architecture of the location-recommendation model
5.3 Data sampling and grouping
5.4 Sensitivity of Gaussian Sum Query over U sample users: (a) ω = 1, a single user's data is placed in exactly one bucket; (b) ω = 2, a single user's data is split across two buckets. Since gradients computed over the generated buckets H_1, ..., H_4 are bounded by C, a user can contribute at most 2C to the computed sum.
5.5 Non-private model hyperparameter tuning
5.6 Non-private model performance
5.7 PLP vs DP-SGD: varying privacy budget ε
5.8 PLP vs DP-SGD: varying sampling ratio q
5.9 Running time, varying grouping factor λ
5.10 Effect of varying λ
5.11 Effect of varying σ
5.12 Effect of varying ℓ_2 clipping norm
5.13 Effect of varying neg

Abstract

The emergence of mobile apps (e.g., location-based services, geosocial networks, ride-sharing) led to the collection of vast amounts of location data. Publishing aggregate information about users' movements benefits research on traffic optimization, context-aware notifications and public health (e.g., disease spread). While the benefits provided by location data are indisputable, preserving location privacy is essential, since even aggregate statistics (e.g., in the form of population density maps) can leak details about individual whereabouts.
To protect against privacy risks, the data curator may publish a noisy version of the dataset, transformed according to Dif- ferential Privacy (DP) [36], the de-facto standard for releasing statistical data. The goal of a DP mechanism is to ensure privacy while keeping the query answers as accurate as possible. Conventional approaches build DP-compliant representation of a spatial dataset by partitioning the data domain into bins, and then publishing a histogram with the noisy count of points that fall within each bin. These solutions fall short of properly capturing skewness inherent to sparse location datasets, and as a result yield poor accuracy. Instead, in this work, we propose a paradigm shift towards learned representations of data. We learn powerful machine learning (ML) models that exploit patterns within location datasets to provide more accurate location services. We focus on key location queries that are the building blocks of many processing tasks. xi For population-density maps that support range count queries on snapshot releases, where each individual contributes a single location report, we design a neural database system called Spatial Neural Histograms (SNH). We model spatial data such that density features are preserved, even when DP-compliant noise is added. As such, learning can be used to also combat data mod- elling errors, present in DP setting. SNH employs a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. More often however, spatio-temporal density information is required for utility (e.g., in modeling COVID hotspots). As a result, the released statistics must continually capture population counts in small areas for short time periods. When releasing multiple snapshots, individuals may contribute multiple reports to the same dataset. The ability of an adversary to breach privacy increases significantly, and a shift to user- level privacy is necessitated. We employ the pattern recognition power of neural networks, specifically Variational Auto-Encoders (VAE), to reduce the noise introduced by DP mechanisms such that accuracy is increased, while the privacy requirement is still satisfied. The system called VAE based Data Release (VDR) enables longitudinal release of location data. In addition, by limit- ing the number of location reports from any single user, we reduce the noise needed by DP mech- anisms, while ensuring data utility is not compromised. As a post-processing step we propose statistical estimators to adjust density information to account for the fact that they are calculated on a subset of the actual data. Lastly, recommending a user the next-location to visit is fundamentally more challenging. When considering trajectories exhibiting short and non-repetitive spatial and temporal regular- ity, capturing user-user correlations requires learning sophisticated ML models that have high dimensionality in the intermediate layers of the neural networks. We propose a technique called xii Private Location Prediction (PLP). Central to our approach is the use of the skip-gram model, and its negative sampling technique. Our work is the first to propose differentially-private learning with skip-grams. In addition, we devise data grouping techniques within the skip-gram frame- work that pool together trajectories from multiple users in order to accelerate learning and im- prove model accuracy. 
Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approaches—SNH, VDR and PLP— significantly outperform the state of the art. xiii Chapter1 Introduction 1.1 Motivation The last decade witnessed an exponential growth in the use of mobile technology, which has become an ubiquitous part of our daily life. The penetration rate of mobile devices, in the form of smart-phones or wearable technology, has sky-rocketed. Almost all online services are cus- tomized to mobile users’ whereabouts. Location data are used when individuals search for points- of-interest (POI) in their proximity, when they access news or social media, when they shop on- line, or when they use transportation services. More recently, we have witnessed the important role that access to location data plays when dealing with global pandemics, such as COVID-19, as contact tracing becomes an essential tool in controlling the disease spread. In this work, we focus on key location services such as those depicted in Figure 1.1. First we focus onrange count queries (RCQ), which are the most popular query type on location data, used as building blocks in most processing tasks. This information can typically be published by divid- ing a map into fine-grained cells, and then releasing the count (i.e., number of users) associated 1 A) Population-density maps User location updates over time in Houston. B) High-resolution spatio-temporal density maps User location updates in San Francisco 5pm 12am 6am 12pm D) Personalized POI recommendation C) Hotspot discovery and POI visit forecasting Next-POI Figure 1.1: Location Services to each of the cells. Population density maps (Figure 1.1(A)) support RCQs by publishing popu- lation statistics in the form of snapshot high resolution counts. These data are useful for many important applications such as improving POI placement. More often however, spatio-temporal density information is required for utility (e.g., in modeling COVID hotspots as shown in Figure 1.1(C)). As a result, the released statistics must continually capture population counts in small areas for short time periods (e.g., Figure 1.1(B)). Density maps collected over time also reveal POI popularity which can help improve the accuracy of POI recommendation and route planning for its visitors. Lastly we target next-POI recommendation, which is the task of predicting where a certain user is likely to visit in the next time period based on his visiting history (Figure 1.1(D)). This is a more sophisticated query since it not only correlates the users with POIs visited, but also identifies the influence of the current users’ locations on their future movements, while taking the context of the next destination into account as the POI types to be recommended. While the benefits provided by location data are indisputable, it is important to protect the privacy of the individuals who generated them, to prevent malicious attackers from deriving sensitive details about one’s health status, political orientation, etc. 
Preserving location privacy 2 Business Analytics Next-POI Recommendation Location updates Model Publishing Trust barrier Model distribution Trust barrier Mobile users Per-user information exchange Untrusted service provider Or parameter server user 1 user 2 user n user 3 user 4, 5 Offline Publishing : Online Publishing : Trusted data aggregator or Cloud parameter server Figure 1.2: Location Services is essential, since even aggregate statistics (in the form of population-density maps) can leak details about individual whereabouts. In Figure 1.2, we distinguish between two scenarios in which location privacy of individuals contributing data must be preserved. Theoffline publishing scenario assumes that the data curator such as the location based service provider istrusted, and that after receiving the locations of its users, it wishes to publicly release population statistics, while preserving their location privacy. This is the focus of our work. We contrast with the online scenario which assumes the location based service provider is untrusted, and that a user, to protect her location privacy, must obfuscate her position/whereaboutsbefore sending it to the service provider in exchange for location services. We do not consider this application scenario in this work. 3 1.2 Challenges To protect against privacy risks, the data curator may publish a noisy version of the dataset, transformed according to differential privacy (DP) [36], the de-facto standard for releasing sta- tistical data. The goal of a DP mechanism is to ensure privacy while keeping the query answers as accurate as possible. ForSpatialRangeCountQueries(RCQ). When consideringsnapshot releases, where each individual contributes a single location report [139, 123, 78, 52, 96, 29, 140, 69, 5, 97], a DP- compliant representation of a spatial dataset is created by partitioning the data domain into bins, and then publishing ahistogram with the noisy count of points that fall within each bin. Domain partitioning is commonly adopted [69, 96, 140, 52, 123, 29], e.g., uniform and adaptive grids [96] or hierarchical partitioning [140, 29]. At query time, the noisy histogram is used to compute answers, by considering the counts in all bins that overlap the query. When a query partially overlaps with a bin, the uniformity assumption is used to estimate what fraction of the bin’s count should be added to the answer. Since DP mechanisms release only the (noisy) count for each bin, it is assumed that data points are distributed uniformly within the partition, hence the estimate is calculated as the product of the bin count and the ratio of the overlapping area to the total area of the bin. This is often a poor estimate, since location datasets tend to be highly skewed in space (e.g., a shopping mall in a suburb increases mobile user density in an otherwise sparse region). Thus, in addition to DP sanitization noise, uniformity error is a major cause of inaccuracy for existing work on DP release of spatial data. 4 For Spatio-temporal data release. When releasing multiple snapshots, individuals con- tribute multiple reports to the same dataset. The ability of an adversary to breach privacy in- creases significantly, and a shift to user-level privacy [33] is required. To protect privacy in this scenario, an increased amount of noise is needed, which often grows linearly in the number of user contributions. 
Only a handful of techniques [4, 112] considered spatio-temporal location data release, and none of them is able to preserve data accuracy for any practical spatio-temporal res- olutions. In the absence of approaches for high resolution spatio-temporal data release, industry projects used basic DP mechanisms that simply add noise to the population density information without taking into account specific dataset characteristics [6, 39, 105, 14, 59]. The amount of privacy budget spent for such data releases is often not reported, or it is excessive [14, 59, 3], thus providing insufficient protection. In addition, reports of incorrect privacy accounting in such releases [14, 59] further necessitate a thorough end-to-end study of custom DP algorithms for spatio-temporal data. There are two key aspects that must be addressed. First, one needs to boundsensitivity (please see Chapter 4 for a formal definition) by limiting the number of location reports from any single user, which can be achieved through sampling. However, this must be done carefully, such that data utility is not compromised. Furthermore, density information must be adjusted to account for the fact that they are calculated on a subset of the actual data. Second, the effect of noise added by DP mechanisms must be addressed. Such mechanisms consider the worst-case scenario over all possible data distributions and query configurations, and err on the safe side, adding more noise than strictly necessary, compromising accuracy. For next-POI prediction. The next-location prediction task is fundamentally more chal- lenging. When considering trajectories exhibiting short and non-repetitive spatial and temporal 5 regularity, capturing user-user correlations requires learning sophisticated ML models that have high dimensionality in the intermediate layers of the neural networks. Building accurate models for such data, while at the same time preserving the privacy of the individuals who contribute data to the training set, is a difficult task. In addition to requiring user- level privacy protection, a more strict protection level, location data characteristics make training such a model challenging in DP settings. The training data are sparse, i.e., the distribution of the number of location visits of an average user over the universe of POIs is long-tailed. Movement data are high-dimensional, because individual activities and their context depend on many pa- rameters, such as location, time, or the sequence of activities within a broader daily routine. In the context of privacy-preserving learning, this implies high data sensitivity, requiring a large amount of noise to be introduced. These challenges are amplified by the fact that neural network layers that yield optimal results when learning without privacy constraints do not necessarily yield optimal results when trained in a privacy-preserving manner [93]. Existing solutions fall short of properly capturing skewness inherent to sparse location datasets, and as a result yield poor accuracy. 1.3 Summaryofthesiswork In this work, we propose a paradigm shift towards learned representations of data. We learn powerful machine learning (ML) models that exploit patterns within location datasets to provide more accurate location services than the conventional DP-preserving noise mechanisms. 
We focus on private learning of mobility patterns within a centralized architecture (depicted as offline publishing in Figure 1.2), where a trusted curator collects data from mobile users and builds a 6 DP-compliant ML model. It is important to distinguish this from the federated learning setting (typically adopting in the online publishing setting of Figure 1.2), where individual users engage in a collaborative protocol to build the model in distributed fashion, without the need for a trusted centralized component. 1.3.1 Onspatialrangecountquery To address this query, we focus on countering the sources of error that plague existing approaches (i.e., noise and uniformity error) through learning, We design a neural database system in Chapter 3, calledSpatialNeuralHistograms(SNH), that models spatial data such that density features are preserved, even when DP-compliant noise is added. As such, learning can be used to combat data modelling errors, also present in DP setting. Nonetheless, due to the impact of DP noise on the process of learning, creating learned differentially private data representations is non- trivial. In order to be accurate the learned models must capture the intrinsic properties of location datasets, which exhibit both high skewness, as well as strong correlation among regions with similar designations. For instance, two busy city areas (e.g., a stadium and a street bar area) will exhibit similar density patterns, while the regions in between may be sparse. These busy areas may also be correlated, since people are likely to congregate at bars after they see a game at the stadium. Models with strong representational power in the continuous domain are necessary to learn such patterns. SNH is a neural histogram system to model specifically spatial datasets such that important density and correlation features present in the data are preserved, even when DP-compliant noise is added. SNH employs a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. We also devise a 7 framework for effective system parameter tuning on top of public data, which helps practitioners set important system parameters while avoiding privacy leakages. 1.3.2 Onspatio-temporalrangecountquery,hotspotdiscovery,andPOI visitforecasting To address such queries, we propose a framework for continuous release of location data with high accuracy and strong protection guarantees using user-level differential privacy (DP). We propose Variational AutoEncoder-based Density Release (VDR) in Chapter 4, a novel sys- tem specifically designed to address accurate, DP-compliant release of spatio-temporal datasets. A key intuition behind VDR is the observation that noisy spatio-temporal data histograms are similar to a sequence of images, with spatial patterns in location data akin to visual patterns in images. This observation allows one to leverage a vast amount of work on image pattern recog- nition and apply it to spatio-temporal data releases. VDR first sanitizes density information by adding DP-compliant noise, then uses a novel neural network-based approach to improve accu- racy by performing an advanced post-processing denoising step based on convolutional neural networks (CNN). CNNs are able to capture subtle patterns specific to location datasets. We uti- lize variational auto-encoders (VAE) to capture data patterns without fitting to the noise. 
We also employ multi-resolution learning through a data augmentation step that captures location data patterns at multiple granularities, thus improving accuracy for a broad range of query extents. Recall that in the case of trajectory data, each individual contributes multiple reports, which is a significant complicating factor, as it requires a shift to user-level privacy [33], which in turn increases the noise level that must be added to achieve protection. We devise a comprehensive 8 strategy to reduce sensitivity of user-level privacy through sampling. We introduce a novel ap- proach to user-level sampling that reduces sensitivity by bounding the number of location reports from each user; the bound is chosen to provide a good trade-off between keeping sensitivity low and preserving density information across time. To counter-balance the effect of sampling, we de- sign a novel private statistical estimator which scales up query results to preserve accuracy. This permits us to control the sensitivity in user-level privacy without significantly affecting accuracy. 1.3.3 OnNext-POIprediction Next-location prediction is a fundamental and valuable task in location-centric applications. Since we must exploit user-user correlations for effective next-POI recommendation, we resort to train- ing private ML models with the gradient perturbation technique DPSGD [2], wherein the DP noise is added during training, as apposed to before training in the case of SNH and VDR. We propose,PrivateLocationPrediction(PLP) in Chapter 5, a technique that can accurately per- form learning on trajectory data even in the presence of DP noise. We briefly show that specific model architectures and data handling processes during DP-compliant training can significantly boost learning accuracy by keeping under tight control the amount of noise required to meet the privacy constraint. The central idea behind our approach is the use of the skip-gram model [85, 84]. One important property of skip-grams is that they handle well sparse data. At the same time, the use of skip-grams for trajectory data increases the dimensionality of intermediate lay- ers in the neural network. This creates a difficult challenge in the context of privacy-preserving learning, because it increases data sensitivity, and requires a large amount of noise to be intro- duced, therefore decreasing accuracy. To address this challenge, we capitalize on the negative 9 sampling (NS) technique that can be used in conjunction with skip-grams. NS turns out to be extremely valuable in private gradient descent computation, because it helps reduce the gradient update norms, and thus boosts the ratio of the useful signal compared to the noise introduced by differential privacy. In addition, we introduce a data grouping mechanism that makes learning more effective by combining multiple users into a single bucket, and then training the model per bucket. Group- ing has a dual effect: on the positive side, it increases the information diversity in each bucket, improving learning outcomes; on the negative side, it heightens the adverse effect of the intro- duced Gaussian noise. Overall, this enables marked improvements in the running times while maintaining the same level of accuracy. 1.4 ThesisStatement By addressing key challenges, we demonstrate that: Differentially private learned models can supportaccurate location services such as range count query, hotspot discovery, POI visits fore- casts, and next-POI prediction. 
1.5 ThesisOutline The structure of the thesis is organized as follows: Chapter 2 introduces the related techniques on privacy-preserving (i) histogram publishing with domain partitioning, (ii) machine learning (including empirial risk minimization and deep learning), (iii) data distribution release (includ- ing private synthesis), and (iv) location recommender systems. In Chapter 3, we propose Spatial Neural Histograms (SNH), a novel neural database for privately answering range count queries 10 for snapshopt data release. To extend this work for spatio-temporal data release, we propose VAE-based Density Release (VDR) in Chapter 4. VDR denoises data and in addition to enabling spatio-temporal RCQs, allows accurate hotspot discovery and POI visits forecasting. Moreover, VDR uses a DP-compliant statistical estimator to derive the count of location reports that must be preserved to obtain a good sensitivity-accuracy trade-off when preserving user-level privacy. In Chapter 5 we propose Private Location Prediction (PLP), an approach for differentially-private next-location prediction using the skip-gram model. PLP focuses on reducing the norms of gra- dient updates of the high-dimensional internal neural network layers. It employs a data grouping technique that can improve the signal-to-noise ratio and allow for effective private learning. Fi- nally, we summarize our contributions and discuss potential future work in Chapter 6. 11 Chapter2 RelatedWork 2.1 Onpublishingspatialandspatio-temporalhistograms Answering RCQs. The task of answering RCQs is well explored in the literature. Classes of algorithms can be categorized based on data dimensionality and their data-dependent or inde- pendent nature. However there is no single dominant algorithm for all domains[51]. In the one dimensional case, the data-independent Hierarchical method [52] uses a strategy consisting of hierarchically structured range queries typically arranged as a tree. Similar methods (e.g., HB [97]) differ in their approach to determining the tree’s branching factor and allocating appro- priate budget to each of its levels. Data-dependent techniques, on the other hand, exploit the redundancy in real-world datasets to boost the accuracy of histograms. The main idea is to first lossily compress the data. For example, EFPA [5] applies the Discrete Fourier Transform whereas DAWA [69] uses dynamic programming to compute the least cost partitioning. The compressed data is then sanitized, for example, directly with Laplace noise [5] or with a greedy algorithm that tunes the privacy budget to an expected workload [69]). 12 While some approaches such as DAWA and HB extend to 2D naturally, others specialize to answer spatial range queries. Uniform Grid (UG) [96] partitions the domain into a m× m grid and releases a noisy count for each cell. The value of m is chosen in a data-dependent way, based on dataset cardinality. Adaptive Grid (AG) [96] builds a two-level hierarchy: the top-level partitioning utilizes a granularity coarser than UG. For each bucket of the top-level partition, a second partition is chosen in a data-adaptive way, using a finer granularity for regions with a larger count. QuadTree [29] first generates a quadtree, and then employs the Laplace mechanism to inject noise into the point count of each node. Range-count queries are answered via a top- down traversal of the tree. 
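To make the flavor of these hierarchical approaches concrete, the sketch below builds a fixed-depth quadtree of Laplace-perturbed counts and answers a range query by top-down traversal, applying the uniformity assumption at partially covered leaves. It is a simplified illustration under our own naming and budget split, not the exact algorithm of QuadTree [29] or of any other method cited here.

```python
import numpy as np

def build_noisy_quadtree(points, region, depth, eps_per_level, rng=None):
    """Noisy counts over a recursive 4-way split of region = (x0, y0, x1, y1)."""
    rng = rng or np.random.default_rng()
    x0, y0, x1, y1 = region
    inside = [(x, y) for (x, y) in points if x0 <= x < x1 and y0 <= y < y1]
    node = {"region": region,
            "count": len(inside) + rng.laplace(scale=1.0 / eps_per_level),
            "children": []}
    if depth > 1:
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        for sub in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                    (x0, ym, xm, y1), (xm, ym, x1, y1)]:
            node["children"].append(
                build_noisy_quadtree(inside, sub, depth - 1, eps_per_level, rng))
    return node

def range_count(node, q):
    """Estimate the count inside query rectangle q = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = node["region"]
    ix0, iy0 = max(rx0, q[0]), max(ry0, q[1])
    ix1, iy1 = min(rx1, q[2]), min(ry1, q[3])
    if ix0 >= ix1 or iy0 >= iy1:                    # no overlap
        return 0.0
    if (ix0, iy0, ix1, iy1) == node["region"]:      # node fully inside query
        return node["count"]
    if not node["children"]:                        # partial overlap at a leaf:
        frac = ((ix1 - ix0) * (iy1 - iy0)) / ((rx1 - rx0) * (ry1 - ry0))
        return node["count"] * frac                 # uniformity assumption
    return sum(range_count(child, q) for child in node["children"])
```

Since the nodes of one level partition the domain, parallel composition applies within a level and sequential composition across levels, so a tree of depth d built with eps_per_level per level consumes d · eps_per_level of the total budget.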
Privtree [140] is another hierarchical method that allows variable node depth in the indexing tree (as opposed to fixed tree heights in AG, QuadTree and HB). It utilizes the Sparse-Vector Technique [75] to determine a cell’s density prior to splitting the node. DAWA [69] operates in two steps. In the first it detects approximately uniform regions in the 1D histogram and then compresses the domain by using dynamic programming to compute the least cost partition that optimizes over the average DP noise and the counts in each cell. Once the partition is found, it uses the remainder of its privacy budget to to further optimize its query performance based on a set of strategy queries that closely resemble the query issuer’s workload. The cost of this optimization tends to be very high for large scale datasets. The case of high-dimensional data was addressed by [78, 123, 139]. The most accurate al- gorithm in this class is High-Dimensional Matrix Mechanism (HDMM) [78] which represents queries and data as vectors, and uses optimization and inference techniques to answer RCQs. An inference step solves an unconstrained ordinary least squares problem to minimize theL 2 error on the input workload of linear queries. DPCube [123] searches for dense ‘subcubes’ of the dat- acube representation to release privately. A part of the privacy budget is used to obtain noisy 13 counts using LPM over a straightforward partitioning, which is then improved to a standard kd- tree. Fresh noisy counts for the partitions are obtaining with the remaining budget and a final inference step resolves inconsistencies between the two sets of counts, and improves accuracy. Lastly, PrivBayes [139] is a mechanism that privately learns a Bayesian network over the data that generates a synthetic dataset which can consistently answer workload queries. Due to the use of sampling to estimate data distribution, it is a poor fit for skewed spatial datasets. Most similar to our work is PGM [79], which utilizes Probabilistic Graphical Models to measure a com- pact representation of the data distribution, while minimizing a loss function. Data projections over user-specified subgroups of attributes are sanitized and used to learn the model parameters. PGM is best used in the inference stage of privacy mechanisms (such as HDMM and PrivBayes) that can already capture a good model of the data. DP statistics from trajectories. Longitudinal release of individual location updates increase risk of attack [122, 19], and requires more stringent privacy settings, e.g., user-level privacy. The work in [4] models disjoint regions of the space as separate 1-d time series. However, this limits supported query types, and cannot answer range or hotspot queries. Moreover, the granularity used is very coarse. The prefix-tree approach from [26] was the first to tackle this problem, and built a hierarchical structure that indexes all possible trajectories of users in a discrete location space. Each tree node contains the DP-sanitized count of trajectories that start with a prefix equal to that node’s path from the tree root. The follow-up work in [25] improved accuracy by tabulating noisy counts of all possible sets of n-grams that may belong to a trajectory, not necessarily from its start. However, this approach is still unable to preserve accuracy for n-grams longer than 4-5. Finally, the work in [112] operates a transformation of the trajectory data to a Fourier transform space, and performs sanitization on the Fourier coefficients. 
This approach’s 14 accuracy depends a lot on the type of queries performed, and also fails to preserve data utility beyond several time snapshots. Privacy-preservinglearningforRCQs. A learned model can leak information about the data it was trained on [109, 56]. Recent efforts have developed differentially private versions of ML algorithms, e.g., empirical risk minimization [23, 61] and deep neural networks [107, 2]. Based on the familiar sensitivity method proposed by Dwork et al.[33], Output Perturbation [121] perturbs the output answers after the training procedure. The variance of the added noise corresponds to the maximum distance of the output parameters for any two neighboring input datasets. Objec- tive perturbation [23, 61] perturbs the objective function by adding a random regularization term and releases the minima of the perturbed objective. The first two approaches ensure privacy for algorithms that optimize a convex objective such as ERM, while gradient perturbation guarantees differential privacy even for non-convex objectives, e.g., in deep neural networks. Our approach is different in that we sanitize the training data before learning. Furthermore, the work of [2] achieves (ε,δ )-DP [91, 38, 3], a weaker privacy guarantee. Private parameter tuning. Determining the system parameters of a private data representa- tion must also be DP-compliant. Several approaches utilize the data themselves to tune system parameters such as depth of a hierarchical structure (e.g., in QuadTree or HB) or spatial partition size (e.g. k-d trees), without privacy consideration [52]. In particular, UG [96] models the grid granularity asm = p nε/c, where the authors tunec = 10 empirically on sensitive datasets, in breach of differential privacy. Moreover, this results in the value of c overfitting their structures to test data, and result in poor utility on new datasets [52]. The generalization ability of system parameters is also poor when parameter selection is performed arbitrarily, independent of data (as show by Hay et. al. [52] in their DP-compliant adoption of HB and QuadTree). Using public 15 datasets to tune system parameters is a better strategy [23]. Our strategy to determine a good cell width for a differentially-private grid is similar to that in UG [96]. However, our proposed strategy for parameter selection vastly improves generalization ability over UG [96] by exploit- ing additional dataset features and their non-linear relationships. Finally, only when end-to-end privacy systems are necessitated, a part of the total privacy budget is portioned to be used for parameter tuning [24, 23, 40]. PrivateDatadistributionRelease. PrivBayes [139] is a mechanism that privately learns a Bayesian network over the data, and then returns a matrix used for fitting the parameters of the Bayes net. This can be used to then generate a synthetic dataset which can consistently answer workload queries. Budget allocation is equally split between learning the Bayesian network structure and learning its parameters. Multiplicative-Weights Exponential Mechanism (MWEM) [49] maintains an approximating dis- tribution over the data domain, scaled by the number of records. It updates this distribution by posing a workload of linear queries (e.g., RCQs), finding poorly answered ones, and using the multiplicative update rule to revise its estimates. AHP [141] seeks to group a histogram’s adja- cent bins with close counts to trade for smaller noise. It utilizes LPM, and sets noisy counts below a threshold to zero. 
The counts are then sorted and clustered using a global clustering scheme to form a partition. Noise reduction techniques. Most deep-learning based denoising methods [60, 92, 67] rely on many pairs of clean/noisy images. Denoising autoencoders attempt to learn original data distributions that have been corrupted according to some noise distribution, (e.g., by maximizing the log probability of the clean input, given a noisy input). Recent work in [65] trains a model from noisy/noisy image pairs, by extracting noisy versions of the same image repeatedly. Such a 16 training process is not viable under DP since it would require additional privacy budget for each noisy extraction. Some mild noise from images can also be removed in an unsupervised fashion [143, 98]. No approach studied denoising in the presence of DP. 2.2 Onlocationrecommendation The problems of location recommendation and prediction have received significant attention in the last decade. Recommending a location to visit to a user necessitates modeling human mobility for the sequential prediction task. Markov Chain (MC) based methods, Matrix Factorization (MF) techniques, and Neural Network models (NN) are the schemes of choice for this objective. MC- based methods utilize a per-user transition matrix comprised of location-location transition prob- abilities computed from the historical record of check-ins [136]. The m th -order Markov chains emit the probability of the user visiting the next location based on the latestm visited locations. Private location recommendation over Markov Chains is studied in [138]. Aggregate counts of check-ins in discretized regions are published as differentially private statistics. However, due to the sparsity in check-in behavior and the general-purpose privacy mechanisms, their method can only extend to coarse spatial decompositions (e.g., grids having larger than 5km 2 cells). Factoriz- ing Personalized Markov Chains (FPMC) [101] extend MC by factorizing this transition matrix for the collaborative filtering task. Matrices containing implicit user feedback on locations can also be exploited for location recommendation via weighted matrix factorization [70]. Private Matrix Factorization has been explored in [108, 74], but we are not aware of any proposal for their ap- plication to the problem we are considering. Neural Networks have become a powerful tool in recommender applications due to their flexibility, expressive power and non-linearity. Recurrent 17 Neural Networks (RNN) can model sequential data effectively, especially language sentences [86]. Recurrent nets have also been adapted for location sequences [142, 72]. However, RNNs assume that temporal dependency changes monotonically with the position in a sequence. This is often a poor assumption in sparse location data. As a result, the state-of-art [73, 41, 21, 127] employs the skip-gram model [85] to learn the distributional context of users check-in behavior. Extensions incorporate temporal [73, 41, 131], textual [21] and other contextual features [127]. However, none of these studies provide any privacy features, which is the crux of our work. DifferentialPrivacy(DP)andNeuralNetworks . A recent focus in the differential privacy literature is to reason about cumulative privacy loss over multiple private computations given the values ofε used in each individual computation [37, 18, 117, 87]. 
A fundamental tool used in this task is privacy amplification via sampling [11], wherein the privacy guarantees of a private mechanism are amplified when the mechanism is applied to a small random subsample of records from a given dataset. Abadi et. al. [2] provide an amplification result for the Gaussian output per- turbation mechanisms under Poisson subsampling. Their technique, called moments accountant, is based on the moment generating function of the privacy loss random variable. Other privacy definitions that lend themselves to tighter composition include Rényi Differential Privacy [87] and zero-Concentrated Differential Privacy [18], and their application to private learning with data subsampling ([117],[66] respectively). However, these privacy models are relatively new and the distinctions in privacy guarantees at the user-end remain to be investigated. In practice, (ε,δ )-differential privacy is the de-facto privacy standard [45, 82]. Location Privacy for recommendation systems We overview literature that focuses on preventing the location based service provider (the adversary) from inferring a mobile user’s lo- cation in the online setting. Spatialk-anonymity (SKA) [46] generalizes the specific position of 18 the querying user to a region that encloses at leastk users. The resulting anonymity set bounds the adversary’s probability of identifying the query user to at most1/k. However, this syntactic notion of privacy can be easily circumvented when the data aresparse, i.e., the distribution of the number of location visits of an average user over the universe of POIs is long-tailed. Moreover, check-ins in sparse regions are especially vulnerable to an adversary with background knowledge, significantly increasing the probability that de-anonymization succeeds [90]. Another source of leakage is when the querying user moves, disconnecting himself from the anonymity set. DP can be used in the context of publishing statistics over datasets of locations or trajectories collected from mobile users. The Local Differential Privacy paradigm is well suited for this purpose, and its application to location data is explored in [99]. The Randomized Response mechanism is used to report, in addition to users actual locations, a large number of erroneous locations. Recom- mendation models that utilize these statistics can at best leverage spatial proximity queries [103] or apply to coarse spatial decompositions [99], and are incapable of cross-user learning such as in the case of the skip-gram model. Lastly, adapting the powerful guarantees of DP to protect- ing exact location coordinates, Geo-indistinguishability (GeoInd) [7] relaxes the DP definition to the euclidean space. It is the privacy framework of choice for obfuscating user check-ins in the absence of a trusted data curator. Note that, SKA and GeoInd rely on obfuscating individual location records that make up the larger dataset, making them suitable only for applications that utilize spatial proximity queries (e.g., a user that sends noisy coordinates to obtain points of interest in her vicinity). Utilizing these methods to publish data for training ML models is not viable, since adding noise to the coordinates wipes out any contextual information on the POI visited (beginning with the POI 19 identifier). Moreover, since the same user may have numerous check-in records in a longitudi- nal location dataset, data publishing with the common techniques suffers from serious privacy leakages. 
User-level correlations (e.g., multiple check-ins of a user that are closely related) severely increase the possibility of de-anonymization in the case of SKA. Likewise, in the case of GeoInd, the cumulative privacy loss variable calculated via a standard composition theorem exceeds reasonable privacy levels.

Chapter 3: A Neural Database for Differentially Private Spatial Range Queries

[Figure 3.1: Spatial Neural Histogram System. Mobile users send location updates to a trusted data aggregator; Stage 1 (Data Collection) applies an ε-DP mechanism, Stage 2 (Training) performs data augmentation and trains the neural networks, with ParamSelect tuned on public datasets; the trained model is published across the trust barrier for research and business use.]

In this chapter, we build a DP-compliant representation that can answer spatial range queries accurately. Typically, a DP-compliant representation of a spatial dataset is created by partitioning the data domain into bins, and then publishing a histogram with the noisy count of points that fall within each bin. Domain partitioning is commonly adopted [69, 96, 140, 52, 123, 29], e.g., uniform and adaptive grids [96] or hierarchical partitioning [140, 29]. At query time, the noisy histogram is used to compute answers, by considering the counts in all bins that overlap the query. When a query partially overlaps with a bin, the uniformity assumption is used to estimate what fraction of the bin's count should be added to the answer. Since DP mechanisms release only the (noisy) count for each bin, it is assumed that data points are distributed uniformly within the partition,
The learning process has to be carefully crafted to the unique properties of spatial data, or accuracy will deteriorate. 22 We propose Spatial Neural Histograms (SNH), a neural network system specifically designed to answer differentially private spatial range queries. SNH models range queries as a function approximation task, where we learn a function approximator that takes as input a spatial range and outputs the number of points that fall within that range. Training SNH consists of two stages (Figure 3.1): the first perturbs training query answers according to DP, while the second trains neural networks from noisy answers. The first stage is called data collection. It prepares a differentially private training set for our model while ensuring low sensitivity, such that the signal-to-noise ratio is good. However, due to the privacy constraints imposed by DP, we can only collect a limited amount of training data. Thus, in the second stage, we synthesize more training samples based on the collected data to boost learning accuracy, in a step called data augmentation. Then, we employ a supervised learning training process with a carefully selected set of training samples comprising of spatial ranges and their answers. SNH learns from training queries at varying granularity and placement to capture subtle correlations present within the data. Finally, an extensive private parameter tuning process (ParamSelect) is performed using publicly available data, without the need to consume valuable privacy budget. The fully trained SNH can then be released publicly and only requires a single forward pass to answer a query, making it highly efficient at runtime. SNH is able to learn complex density variation patterns that are specific to spatial datasets, and reduces the negative impact of noise and uniformity assumption when answering range queries, significantly boosting accuracy. Use of machine learning when answering test queries (i.e., at runtime) is beneficial because, through learning, SNH combines evidence from multiple training queries over distinct regions. In fact, gradient computation during training can be seen as a novel means of aggregating in- formation across the space. We show that neural networks can learn the underlying patterns in 23 location data from imprecise observations (e.g., observations collected with noise and uniformity error), use those patterns to answer queries accurately and thereby mitigate noise and uniformity errors. In contrast, existing approaches are limited to usingimpreciselocal information only (i.e., within a single bin). When the noise introduced by differential privacy or the error caused by the uniformity assumption are large for a particular bin, the answer to queries evaluated using that bin will be inaccurate. As our experiments show, SNH outperforms all the state-of-the-art solu- tions: PrivTree [140], Uniform Grid (UG) [96], Adaptive Grid (AG) [96] and Data and Workload aware algorithm (DAWA) [69]. Contributionsandorganization. In this work, we • Formulate the problem of answering spatial range count queries as a function approxima- tion task (Sec. 3.1); • Propose a novel system that leverages neural networks to represent spatial datasets while accurately capturing location-specific density and correlation patterns (Sec. 3.2, 3.3); • Introduce a comprehensive framework for tuning system parameters on public data (Sec. 
3.4); and • Conduct an extensive experimental evaluation on a broad array of public and private real- world location datasets with heterogeneous properties and show that SNH outperforms all the state-of-the-art solutions (Sec. 3.5). 24 3.1 Preliminaries 3.1.1 DifferentialPrivacy ε-differential privacy [36] provides a rigorous privacy framework with formal protection guar- antees. Given privacy budget parameter ε ∈ (0,+∞), a randomized mechanismM satisfies ε-differential privacy iff for all datasets D and D ′ , where D ′ can be obtained from D by either adding or removing one tuple, and for allE⊆ Range(M) Pr[M(D)∈E]≤ e ε Pr[M(D ′ )∈E] (3.1) Pr[M(D)∈ E] denotes the probability of mechanismM outputting an outcome in the setE for a database D and Range(M) is the co-domain ofM. M hides the presence of an individual in the data, since the difference in probability of any set of outcomes obtained on two datasets differing in a single tuple never exceeds e ε . The protection provided by DP is stronger when ε approaches0. The sensitivity of a function (e.g., a query) f, denoted by Z f , is the maximum amount the value off can change when adding or removing a single individual’s records from the data. The ε-DP guarantee can be achieved by adding random noise derived from the Laplace distribution Lap(Z f /ε). For a query f : D → R, the Laplace mechanismM returns f(D) + Lap(Z f /ε), where Lap(Z f /ε) is a sample drawn from the probability density function Lap(x|(Z f /ε)) = (ε/2Z f )exp(−| x|ε/Z f ) [36]. Thecomposability property of DP helps quantify the amount of pri- vacy attained when multiple functions are evaluated on the data. Specifically, when mechanisms M 1 ,M 2 with privacy budgetsε 1 ,ε 2 are applied in succession on overlapping data partitions, the 25 Table 3.1: Summary of Notations in SNH Notation Definition ε DP Privacy Budget Q,Q W Query distribution and workload query set Q D ,Y D Data collection query set and its answers Q A ,Y A Augmented query set and its answers R,k Set and number of query sizes for training l,u Lower and upper bound on query sizes f(q), ( ˆ f(q;θ )) Count of records inq calculated fromD (estimated fromθ ) ¯ f(q) f(q)+Lap(1/ε) ρ ,C Grid granularity, Set of bottom-left corners of grid cells ψ Smoothing factor in relative error Φ ,ϕ ParamSelect Model, Dataset features D,D T ,D I All public datasets, ParamSelect training and inference datasets π α (D,ε) Function denoting best value of system parameterα for datasetD and budgetε ˆ π α (D,ε) Empirical estimate ofπ α (D,ε) sequential composition property [36] states that the budget consumption is (ε 1 +ε 2 ). Conversely, whenM 1 ,M 2 are applied on disjoint data partitions, the parallel composition property states that the resulting budget consumption ismax(ε 1 ,ε 2 ). Thepost-processingproperty of differential privacy [36] states that given any arbitrary function h and an ε-DP mechanismM, the mecha- nismh(M) isε-DP. Lastly, we note that DP is robust to side-channel information [36], that is, the privacy guarantee on the DP-release of D is irrespective of any publicly available information about the users inD. 3.1.2 ProblemDefinition Consider a database D that covers a spatial region SR ⊆ R 2 , and contains n records each de- scribing an individual’s geo-coordinate. Given a privacy budget ε, the problem studied in this paper is to return the answer to an unbounded number of spatial range count queries (RCQs). 
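Before formalizing RCQs, we illustrate the Laplace mechanism above with a minimal sketch in Python with NumPy. This is an illustration only, not part of the SNH system: the function name and the example counts are ours, and the sensitivity of 1 reflects that adding or removing one individual's record changes a count by at most one.

import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count under epsilon-DP: add noise drawn from
    Lap(sensitivity / epsilon) to the true count."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(seed=0)
# Two disjoint regions: by parallel composition, releasing both noisy
# counts with budget epsilon each keeps the overall release epsilon-DP.
print(laplace_count(1350, epsilon=0.2, rng=rng))
print(laplace_count(47, epsilon=0.2, rng=rng))

Lower ε yields a larger noise scale 1/ε and hence stronger protection.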
An RCQ consists of a spatial range predicate and its answer is the number of records inD that 26 satisfy the range predicate. We consider spatial range queries that are axis-parallel and square- shaped, defined by their bottom-left corner c (wherec is a vector inSR), and their side lengthr. An RCQ,q, is then defined by the pair q = (c,r). We sayr is the query size andc is its location coordinate. For a database D, the answer to the RCQ q = (c,r) can be written as a function f(q)=|{p|p∈D,c[i]≤ p[i]<c[i]+r,∀i∈{0,1}}|, wherez[0] andz[1] denote the latitude and longitude of any coordinatez, respectively. We assume RCQs follow a distributionQ and for any RCQq, we measure the utility of its estimated answer,y, using the relative error metric, defined as∆( y,f(q))= |y− f(q)| max{f(q),ψ } , whereψ is a smoothing factor necessary to avoid division by zero. The typical way to solve the problem of answering an unbounded number of RCQs is to design an ε-DP mechanismM and a function ˆ f such that (1)M takes as an input the database D and outputs a differentially private representation of the data, θ ; and (2) the function ˆ f(q;θ ) takes the representationθ , together with any input queryq, and outputs an estimate off(q). In practice,M is used exactly once to generate the representation θ . Given such a representation, ˆ f(q;θ ) answers any RCQ, q, without further access to the database. For instance, in [96],M is a mechanism that outputs noisy counts of cells of a 2-dimensional grid overlaid onD. Then, to answer an RCQq, ˆ f(q;θ ) takes the noisy grid,θ , and the RCQ,q, as inputs and returns an estimate of f(q) using the grid. The objective is to designM and ˆ f such that the relative error between ˆ f(q;θ ) andf(q) is minimized, that is, to minimizeE θ ∼ M E q∼ Q [∆( ˆ f(q;θ ),f(q))]. Let ˆ f be a function approximator and define M to be a mechanism that learns its parameters. The learning objective ofM is to find a θ such that ˆ f(q;θ ) closely mimicsf(q) for different RCQs, q. The representation of the data,θ , is the set of learned parameters of a function approximator. MechanismM outputs a representationθ , and any RCQ,q, is answered by evaluating the function 27 ˆ f(q;θ ). However,M is now defined as a learning algorithm and ˆ f as a function approximator. Our problem is formally defined as follows: Problem1 Given a privacy budget ε, design a function approximator, ˆ f, (let the set of possible parameters of ˆ f beΘ ) and a learning algorithm,M, such thatM satisfies ε-DP and finds argmin θ ∈Θ E q∈Q [∆( ˆ f(q;θ ),f(q))] 3.2 SpatialNeuralHistograms(SNH) Our goal is to utilize models that can learn patterns within the data in order to answer RCQs ac- curately. We employ neural networks as the function approximator ˆ f, due to their ability to learn complex patterns effectively. Prior work [2] introduced a differentially private stochastic gradient descent (DP-SGD) approach to privately train a neural network. Thus, a seemingly straightfor- ward solution to Problem 1 is using a simple fully connected neural network and learning its parameters with DP-SGD. Sec. 3.2.1 discusses this trivial approach and outlines the limitations of using DP-SGD in our setting, which leads to poor accuracy. Next, in Sec.3.2.2, we discuss how we improve the training process to achieve good accuracy. In Sec.3.2.3 we provide an overview of our proposed Spatial Neural Histogram (SNH) solution. Table 3.1 summarizes the notations. 3.2.1 BaselineSolutionusingDP-SGD LearningSetup. 
We define ˆ f(.;θ ) to be a fully connected neural network with parameter setθ . We train the neural network so that for an RCQq, its output ˆ f(q;θ ) is similar tof(q). A training set,T , is created, consisting of(q,f(q)) pairs, whereq is the input to the neural network andf(q) 28 is the training label for the inputq (we call RCQs in the training settrainingRCQs). To create the training set, similar to [69, 78], we assume we have access to a set of workload RCQs, Q W , that resembles RCQs a query issuer would ask (e.g., are sampled fromQ or a similar distribution) and is assumed to be public. Thus, we can define our training set T to be{(q,f(q))|q ∈ Q W }. We define the training loss as L = X q∈Q W ( ˆ f(q;θ )− f(q)) 2 (3.2) In a non-private setting, a model can be learned by directly optimizing Eq. (3.2) using a gradient descent approach. The model can answer any new RCQq similar to the ground truthf(q). Incorporating Privacy. DP-SGD [2] incorporates differential privacy for training neural net- works. It modifies SGD by clipping each sample gradient to have norm at most equal to a given clippingthreshold,B, and obfuscating them with Gaussian noise. Intuitively, the clipping thresh- old,B, disallows learning more information than a set quantity from any given training sample (no matter how different it is from the rest) and the standard deviation of the Gaussian noise added is scaled withB to ensure obfuscation is proportional to the amount of information gained per sample. Specifically, in each iteration: (1) a subset, S, of the training set is sampled; (2) for each sample,s = (x,y)∈ S, the gradientg s =∇ θ ( ˆ f(x;θ )− y) 2 is computed, and clipped (i.e., trun- cated) to a maximumℓ 2 -norm ofB as ¯g s = min(∥g s ∥ 2 ,B) gs ∥gs∥ 2 ; (3) the average clipped gradient value for samples inS is obfuscated with Gaussian noise as g = X s∈S (¯g s )+N(0,σ 2 B 2 ) (3.3) 29 (4) the parameters are updated in the direction opposite tog. DP-SGDChallenges. In our problem setting, the training set is created by queryingD to obtain the training labels, and our goal is to ensure the privacy of records in D. On the other hand, DP-SGD considers the training set itself to be the dataset whose privacy needs to be secured. This changes the sensitivity analysis of DP-SGD. In our setting, to compute the sensitivity of the gradient sum P s∈S (¯g s ) in step (3) of DP-SGD, we have to consider the worst-case effect the presence or absence of a single geo-coordinate record p can have on the sum (as opposed to the worst-case effect of the presence or absence of a single training sample). Removing p can potentially affect every ¯g s for all s ∈ S, so sensitivity of the gradient sum is|2S|× B and Gaussian noise ofN(0,σ 2 4|S| 2 B 2 ) must be added to the gradient sum to achieve DP (cf. noise in step (3) above). After this adjustment, per-iteration and total privacy consumption of DP-SGD is amplified, impairing learning. We experimentally observed that, for any reasonable privacy budget, training loss does not improve at all during training due to the large added noise. 3.2.2 AdifferentlearningparadigmforRCQs Next, we introduce three design principles (P1-P3) we follow when training neural networks to answer RCQs. These principles are then used in Sec. 3.2.3 to build our solution. P1: Separationofnoiseadditionfromtraining. The main reason DP-SGD fails in our prob- lem setting is that too much noise needs to be added when calculating gradients privately. Recall that DP-SGD uses the quantityg, defined in Eq. 
(3.3), as the differentially private estimate of the gradient of the loss function. Here, we investigate the private gradient computation in more de- tails to provide an alternative method to calculate the gradient with differential privacy. Recall 30 that the goal is to obtain the gradient of the loss function,L, defined in Eq. (3.2) with respect to the model parameters. We thus differentiate L and obtain: ∇ θ L = X q∈Q W 2× ( ˆ f(q;θ ) | {z } data indep. − f(q) |{z} data dep. )×∇ ˆ f(q;θ ) | {z } data indep. (3.4) In Eq. (3.4), only f(q) accesses the database. This is because the training RCQs in Q W (i.e., the inputs to the neural network), are created independently of the database. The data dependent term requires computing private answers to f(q) for an RCQ q, hence must consume budget, while the data-independent terms can be calculated without spending any privacy budget. This decomposition of the gradient into data dependent and independent terms is possible because, different from typical machine learning settings, the differential privacy is defined with respect to the databaseD and not the training set (as discussed in Sec. 3.2.1). Instead of directly using g (Eq. (3.3)) as the differentially private estimate of the gradient (where the gradients are clipped and noise is added to the clipped gradients), we calculate a differentially private value of the training label f(q), called ¯ f(q), by adding noise to the label (define ¯ f(q) = f(q)+ Lap(1/ε)) and calculate the gradient from that. The differentially private estimate of the gradient is then g = X q∈Q W 2× ( ˆ f(q;θ )− ¯ f(q))×∇ ˆ f(q;θ ) (3.5) A crucial benefit is that ¯ f(q), does not change over successive learning iterations. That is, the differentially private value ¯ f(q) can be computed once and used for all training iterations. This 31 motivates our first design principle of separating noise addition and training. This way, training becomes a two step process: first, for all q∈ Q W , we calculate the differentially private training label ¯ f(q). We call this stepdatacollection. Then, we use a training set consisting of pairs(q, ¯ f(q)) for all q ∈ Q W for training. Since DP-compliant data measurements are obtained, all future operations that use as input these measurements are alsoε-differentially private according to the post-processing property of differential privacy [36]. Thus, the training process is done as in a non-private setting, where a conventional SGD algorithm can be applied (i.e., we need not add noise to gradients), and differential privacy is still satisfied. P2: Spatial data augmentation through partitioning. Following principle P1, privacy ac- counting is only needed when answering training queries to collect training labels. Meanwhile, in our experiments, we observed that training accurate neural networks requires a training set containing queries of different sizes (see Sec. 3.5.3.2). Such queries may overlap and, if we answer them directly from the database, sequential composition theorem would apply to account for the total privacy budget consumption. This way, the more such queries we answer, the more budget needs to be spent. Instead, to avoid spending extra privacy budget while creating more training samples with multiple query sizes, we propose spatial data augmentation through partitioning. First, we use a data collection query set,Q D , chosen such that RCQs inQ D don’t overlap (i.e., a space partition- ing). 
This ensures parallel composition can be used for privacy accounting, instead of sequential composition, which allows answering all RCQs in Q D by spending budget equal to one RCQ. Then, using the partitioningQ D , we create and answer new queries,q, of different sizes without spending any more privacy budget but by making uniformity assumption across cells inQ D that partially overlap q. Even though this approach introduces uniformity error in our training set, 32 Figure 3.2: SHN Overview it avoids adding the otherwise required large scale noise, and boosts accuracy. Thus, it allows us to optimize the uniformity/noise trade-off [96, 29] when creating our training set (we present experiments in Sec. 3.6.2 of our technical report [134] to show that data augmentation reduces error). P3: Learningatmultiplegranularities. We employ in our solution multiple models that learn at different granularities, each designed to answer RCQs of a specific size. Intuitively, it is more difficult for a model to learn patterns when both query size and locations change. Using multiple models allows each model to learn the patterns relevant to the granularity they operate on. 3.2.3 Proposedapproach: SNH Our Spatial Neural Histograms (SNH) design, illustrated in Figure 3.2, consists of three steps: (1) Data collection, (2) Model Training, and (3) Model Utilization. We provide a summary of each step below, and defer details until Sec. 3.3. Data Collection. This step partitions the space into non-overlapping RCQs that are directly answered with DP-added noise. The output of this step is a data collection query set, Q D , and a setY D which consists of the differentially private answers to RCQs in Q D . This is the only step 33 in SNH that accesses the database. In Fig. 3.2 for example, the query space is partitioned into four RCQs, and a differentially private answer is computed for each. Training. Our training process consists of two stages. First, we use spatial data augmentation to create more training samples based on Q D . An example is shown in Fig. 3.2, where an RCQ covering both the red and yellow squares is not present in the set Q D , but it is obtained by aggregating its composing sub-queries (both inQ D ). Second, the augmented training set is used to train a function approximator ˆ f that captures f well. ˆ f consists of a set of neural networks, each trained to answer different query sizes. Model Utilization. This step decides how any previously unseen RCQ can be answered using the learned function approximator, and how different neural networks are utilized to answer an RCQ. 3.3 TechnicalDetails 3.3.1 Step1: DataCollection This step creates a partitioning of the space into non-overlapping bins, and computes for each bin a differentially private answer. We opt for a simple equi-width grid of cell width ρ as our partitioning method. As illustrated in Fig. 3.3,(1) we overlay a grid on top of the data domain;(2) we calculate the true count for each cell in the grid, and(3) we add noise sampled fromLap( 1 ε ) to each cell count. We represent a cell by the coordinates of its bottom left corner,c, so that getting the count of records in each cell is an RCQ,q =(c,ρ ). LetC be the set of bottom left coordinates of all the cells in the grid. Furthermore, recall that for a query q, ¯ f(q) = f(q)+Lap( 1 ε ). 
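As a point of reference, steps (1)–(3) above can be sketched as follows. This is a simplified illustration in Python/NumPy, not the released SNH code: it assumes the coordinates have already been projected to a planar system measured in meters and stored in a NumPy array, and the helper name is ours.

import numpy as np

def collect_noisy_grid(points, extent, rho, epsilon, rng=None):
    """Data collection: overlay an equi-width grid of cell width rho, compute
    the true count per cell, then add Lap(1/epsilon) noise to every cell count.
    Cells are disjoint, so parallel composition keeps the release epsilon-DP."""
    rng = rng if rng is not None else np.random.default_rng()
    (x_min, y_min), (x_max, y_max) = extent
    nx = int(np.ceil((x_max - x_min) / rho))
    ny = int(np.ceil((y_max - y_min) / rho))
    counts = np.zeros((nx, ny))
    ix = np.clip(((points[:, 0] - x_min) // rho).astype(int), 0, nx - 1)
    iy = np.clip(((points[:, 1] - y_min) // rho).astype(int), 0, ny - 1)
    np.add.at(counts, (ix, iy), 1)                       # true cell counts
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # C: bottom-left corner of each cell, identifying the query (c, rho) in Q_D
    corners = [(x_min + i * rho, y_min + j * rho) for i in range(nx) for j in range(ny)]
    return noisy, corners

Each cell carries an independent noisy count; everything computed from these counts afterwards is post-processing and consumes no further budget.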
Thus, the data collection query set is defined as Q D = {(c,ρ ),c ∈ C}, and their answers are the set 34 Figure 3.3: Data Collection: map view (left), true cell count heatmap (middle),ε-DP heatmap with noisy counts (right) Y D ={ ¯ f(c,ρ ),c∈C}. We useY D [c] to refer to the answer for the query located atc inY D . The output of the data collection step consists of setsQ D andY D . Even though more complex partitioning structures have been used previosuly for privately answering RCQs [96, 140], we chose a simple regular grid, for two reasons. First, our focus is on a novel neural database approach to answering RCQs, which can be used in conjunction with any partitioning type – using a simple grid allows us to isolate the benefits of the neural approach. Second, using more complex structures in the data collection step may increase the impact of uniformity error, which we attempt to suppress through our approach. The neural learning step captures density variations well, and conducting more complex DP-compliant operations in the data collection step can have a negative effect on overall accuracy. In our experiments, we ob- served significant improvements in accuracy with the simple grid approach. While it may be possible to improve the accuracy of SNH by using more advanced data collection methods, we leave that study for future work. The challenge in data collection is choosing the value of ρ to minimize induced errors. We address this thoroughly in Sec. 3.4.1 and present a method to determine the best granularity of the grid. 35 Figure 3.4: Model Training: Augmented query sets of sizer 1 tor k (top) are used to learn neural network models (bottom) 3.3.2 Step2: SNHTraining Given query setQ D and its sanitized answers, we can perform any operation on this set without privacy leakage due to the post-processing property of DP. As discussed in Sec. 3.2.3, we first perform a data augmentation step usingQ D to create an augmented training setQ A . Then,Q A is used for training our function approximator. DataAugmentation is a common machine learning technique to increase the number of samples for training based on the existing (often limited) available samples [144, 64]. We propose spatial data augmentation for learning to answer RCQs. Our proposed data augmentation approach is based on our design principle P2, discussed in Sec. 3.2.2, where we motivate augmenting the train- ing set through partitioning. In the data augmentation step, we create new queries of different sizes, answer them using the partitioning, and add the answers to our training set, as detailed in the following. 36 We use the partitioning defined by Q D and corresponding answersY D to answer queries at the same locations as in Q D but of other sizes. Consider a query location c ∈ C and a query size r, r ̸= ρ . We estimate the answer for RCQ q = (c,r) as P c ′ ∈C |(c,r)∩(c ′ ,ρ )| ρ 2 × Y D [c], where |(c,r)∩(c ′ ,ρ )| is the overlapping area of RCQs (c,r) and (c ′ ,r). In this estimate, noisy counts of cells inQ D fully covered byq are added as-is (since|(c,r)∩(c ′ ,ρ )| = ρ 2 ), whereas fractional counts for partially-covered cells are estimated using the uniformity assumption. Fig. 3.4 shows how we perform data augmentation for a query (c,r 1 ) with size r 1 at location c. Also observe that, by using queries at the same locations as inQ D , the bottom-left corners of all queries in the augmented query set are aligned with the grid. We repeat this procedure for k different query sizes to generate sufficient training data. 
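The overlap-based estimate just described can be sketched as follows. Again, this is a simplified illustration: the function name is ours, and a practical implementation would only visit the cells that intersect the query rather than scanning the entire grid.

def augmented_answer(c, r, noisy_grid, rho, x_min, y_min):
    """Estimate the RCQ (c, r) from the noisy grid: cells fully covered by the
    query contribute their entire noisy count, partially covered cells
    contribute the fraction (overlap area) / rho^2 of it (uniformity
    assumption within a cell). No additional privacy budget is consumed."""
    cx, cy = c
    nx, ny = noisy_grid.shape
    estimate = 0.0
    for i in range(nx):
        for j in range(ny):
            gx, gy = x_min + i * rho, y_min + j * rho    # cell bottom-left
            dx = max(0.0, min(cx + r, gx + rho) - max(cx, gx))
            dy = max(0.0, min(cy + r, gy + rho) - max(cy, gy))
            estimate += (dx * dy) / rho**2 * noisy_grid[i, j]
    return estimate

Repeating this at every grid corner for each of the k query sizes yields the augmented training samples without touching the database again.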
To ensure coverage for all expected query sizes, we define the set of k sizes to be uniformly spaced. Specifically, assuming the test RCQs have size between l andu, we define the set R as the set of k uniformly spaced values between l and u, and we create an augmented training set for each query size in R. This procedure is shown in Alg.1. We define Q r A for r ∈ R to be the set of RCQs located at C but with query size r, that is Q r A = {(c,r),c ∈ C}, and define Y r A to be the set of the estimates for queries in Q r A obtained from Q D and Y D . The output of Alg. 1 is the augmented training set containing training samples for different query sizes. Note that, as seen in the definition above, Q r A , for anyr, only contains queries whose bottom-left corner is aligned with the grid used for data collection to minimize the use of the uniformity assumption. However, uniformity errors can still be present in our answers inY r A . We discuss in Sec. 3.3.3 how training of neural networks on top of these answers allows us to mitigate the uniformity error through learning. 37 Algorithm1 Spatial data augmentation Input: Query setQ D with answersY D ,k query sizes Output: Augmented training setQ A with labelsY A 1: R←{ l+ (u− l) k × (i+ 1 2 ),∀i,0≤ i<k} 2: forallr∈Rdo 3: Q r A ,Y r A ←∅ 4: for (c,ρ )∈Q D do 5: Q r A .append((c,r)) 6: Y r A [c]← P (c ′ ,ρ )∈Q D |(c,r)∩(c ′ ,ρ )| ρ 2 × Y D [c ′ ] 7: returnQ A ,Y A ←{ Q r A ,∀r∈R},{Y r A ,∀r∈R} Model architecture. We find that using multiple neural networks, each trained for a specific query size, performs better than using a single neural network to answer queries of all sizes. Thus, we train k different neural networks, one for each r ∈ R. Meaning that a single neural network trained for query sizer can only answer queries of sizer (we discuss in Sec. 3.3.3 how the neural networks are used to answer other query sizes), accordingly the input dimensionality of each neural network is two, i.e., lat. and lon. of the location of the query. We usek identical fully-connected neural networks (specifics of the network architecture are discussed in Sec. 3.5). Loss function and Optimization. We train each of thek neural networks independently. We denote by Q r A the training set for a neural network ˆ f(.;θ r ), trained for query size r, and we denote the resulting labels byY r A . We use a mean squared error loss function to train the model, but propose two adjustments to capitalize on the workload information available. First, note that for a query size r ∈ R, Q r A is comprised of queries at uniformly spaced intervals, which may not follow the query distributionQ. However, we can exploit properties of the workload queries, Q W to tune the model for queries fromQ. Specifically, for any (c,r) ∈ Q r A , let w (c,r) = |{q ′ ∈ Q W ,(c,r)∩q ′ ̸=∅}|, that is,w (c,r) is the number of workload queries that overlap a training query. In our loss function, we weight every query(c,r) byw (c,r) . This workload-adaptive modification 38 Figure 3.5: Model utilization: 30m query answered from 25m network (left), 90m query from 100m network (right) to the loss function emphasizes the regions that are more popular for a potential query issuer. Second, we aim at answering queries with low relative error, whereas a mean square loss puts more emphasis on absolute error. Thus, for a training query(c,r), we also weight the sample by 1/max{Y r A [c],ψ }. 
Putting these together, the loss function optimized for each neural network is X (c,r)∈Q r A w (c,r) max{Y r A [c],ψ } ( ˆ f(c,θ r )− Y r A [c]) 2 (3.6) 3.3.3 ModelUtilization To answer a new query(c,r), the model that is trained to answer queries with size most similar tor is accessed. That is, we find r ∗ =argmin r ′ ∈R |r− r ′ | and we answer the query using network ˆ f(c,θ r ∗ ). The output answer is scaled tor according to a uniformity assumption, and the scaled answer is returned, i.e.,( r r ∗ ) 2 ˆ f(c,θ r ∗ ). Fig. 3.5 shows this procedure for two different RCQs. It is important to differentiate the use of uniformity assumption before learning (i.e., in data augmentation), called uniformity assumption pre-learning, from the use of uniformity assump- tion after learning (during model utilization), called uniformity assumption post-learning. The 39 parameterk allows exploring the spectrum between the two cases. Specifically, when k is larger, we train more models and each model is trained for a different query size. For each query size, data augmentation uses uniformity assumption to generate training samples. Thus, more training samples are created using uniformity assumption. We call thisincreasing uniformity assumption pre-learning. On the other hand, since more models are trained, the output of each model will be scaled by a factor closer to one (i.e., in the above paragraph, r ∗ becomes closer to r so that ( r r ∗ ) 2 becomes closer to 1). We call this decreasing uniformity assumption post-learning. Our experimental results in Sec. 3.5.3.2 show that increasing k improves accuracy, and k should be set as large as possible so that uniformity assumption post-learning becomes negligible in prac- tice. This follows the SNH motivation (and observations in Sec. 3.5.3.4) that learning can mitigate the uniformity error. That is, the uniformity assumption should be made pre-learning so that its impact on final accuracy can be reduced through learning. 3.4 End-to-EndSystemAspects 3.4.1 SystemTuningwithParamSelect Choosing a good grid granularity,ρ , is crucial for achieving high accuracy for DP spatial data pub- lishing, and studied in previous work [96, 51]. Discretizing continuous domain geo-coordinates creates uniformity errors, and hence the granularity of the grid must be carefully tuned to com- pensate for the effect of discretization. Existing work [96, 51] makes simplifying assumptions to analytically model the impact of grid granularity on the accuracy of answering queries. However, modelling data and query specific factors is difficult and the simplifying assumptions are often not true in practice, as our experiments show (see Sec. 3.6.3 of our technical report [134]). Instead, 40 we learn a model that is able to predict an advantageous grid granularity for the specific dataset, query distribution and privacy budget. Sec. 3.4.1.1, discussesParamSelect, our approach to deter- mineρ . In Sec. 3.4.1.2 we show how to extend ParamSelect to tune other system parameters. 3.4.1.1 ParamSelectforρ The impact of grid granularity on privacy-accuracy trade-offs when answering queries is well- understood in the literature [96]. In SNH, the grid granularity in data collection phase impacts the performance as follows. On the one hand, smaller grid cells increase the resolution at which the data are collected, thereby reducing the uniformity error. Learning is also improved, due to more training samples being extracted. 
On the other hand, creating too fine grids can diminish the signal-to-noise ratio for cells with small counts, since at a given ε the magnitude of noise added to any cell count is fixed. Moreover, during data augmentation, aggregating multiple cells leads to increase in the total noise variance, since the errors of individual cells are summed. SNH is impacted by cell width in multiple ways, and determining a good cell width,ρ , is important to achieve good accuracy. Capturing an analytical dependence may not be possible, since numerous data, query and modelling factors determine the ideal cell width. If data points are concentrated in some area where the queries fall, a finer grid can more accurately answer queries for the query distribution (even though signal-to-noise ratio may be poor for parts of the space where queries are not often asked). This factor can be measured only by looking at the actual data and the distribution of queries, and would require spending privacy budget. The best value of ρ depends on the privacy budget ε, the distribution of points in D and the query distributionQ. Define δ (ρ,D,ε ) to be the error of SNH with cell width ρ and define 41 π (D,ε) = argmin ρ ∈R δ (ρ,D,ε ), that is, the function that outputs the ideal cell width. We learn a model, Φ , to approximate π (D,ε). We refer to Φ as regressor to distinguish it from the SNH model, ˆ f, discussed in Sec. 3.3. The learning process is similar to any supervised learning task, where for different dataset and privacy budget pairs, (D,ε), we use the labelπ (D,ε) to trainΦ . The input to the regressor is(D,ε) and the training objective is to get the output,Φ( D,ε), to be close to the labelπ (D,ε). Feature engineering. Learning a regressor that takes a raw database D as input is infeasible, due to the high sensitivity of learning with privacy constraints. Instead, we introduce a feature engineering step that, for the dataset D, outputs a set of features, ϕ D . Training then replaces D with ϕ D . Let the spatial region of D be SR D . First, as one of our features, we measure the skewness in the spread of individuals over SR D , since this value directly correlates with the expected error induced by using the uniformity assumption. In particular, we (1) discretizeSR D using an equi-width partitioning, (2) for each cell, calculate the probability of a point falling into a cell as the count of points in the cell normalized by total number of points inD, and (3) take the Shannon’s Entropyh D over the probabilities in the flattened grid. However, calculating h D on a private dataset violates differential privacy. Instead, we utilize publiclyavailable location datasets as an auxiliary source to approximately describe the private data distribution for the same spatial region. We posit that there exist high-level similarities in distribution of people’s locations in a city across different private and public datasets for the same spatial regions and thus, the public dataset can be used as a surrogate. LetD be the set of public datasets that we have access to, and letD I ∈D be a public dataset covering the same spatial region asD. We estimateh D for a dataset withh D I . We callD I public ParamSelect Inference dataset. 42 Second, we use data-independent features: ε, 1 n× ε and 1 √ n× ε , where the product of n× ε accounts for the fact that decreasing the scale of the input dataset and increasing epsilon have equivalent effects on the error. This is also understood as epsilon-scale exchangeability [51]. 
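The skewness feature can be computed with a few lines of NumPy, as sketched below. The histogram resolution is an illustrative assumption (the text above fixes only that an equi-width partitioning is used), and the computation runs on the public surrogate dataset rather than on D, so it consumes no privacy budget.

import numpy as np

def entropy_feature(points, extent, bins=64):
    """h_D: discretize the spatial extent with an equi-width grid, turn the
    cell counts into a probability distribution, and return the Shannon
    entropy of the flattened grid (higher entropy = more uniform spread)."""
    (x_min, y_min), (x_max, y_max) = extent
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins,
                                range=[[x_min, x_max], [y_min, y_max]])
    p = hist.ravel() / hist.sum()
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    return float(-np.sum(p * np.log(p)))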
We calculate ϕ D,ε = (n,ε, 1 nε , 1 √ nε ,h D I ) as the set of features for the dataset D without consuming any privacy budget in the process. Lastly, we remark that for regions where an auxiliary source of information is unavailable, we may still utilize the data-independent features to good effect. In our technical report [134], we show that our proposed features achieve reliable accuracy across datasets; particularly, we choseh D amongst several alternative data-dependent features for that reason. Training Sample Collection. Generating training samples for Φ is not straightforward since we do not have an analytical formulation for δ (ρ,D,ε ) and thus π (D,ε). Since the exact value of π (D,ε) is unknown, we use an empirical estimate. We run SNH with various grid granular- ities of data collection and return the grid size, ρ D,ε , for which SNH achieves the lowest error. Our experimental results in Sec. 3.5.3 show thatδ (ρ,D,ε ) is only marginally affected with small changes inρ (so evaluatingδ (ρ,D,ε ) at different values of ρ five meters apart and selecting the bestρ provides a good estimate ofπ (D,ε)). Intuitively, one expects the error in the training set to remain the same if the cell width of data collection grid changes by a few meters, since the uniformity errors induced are similar. Thus, we use this approach to obtainρ D,ε as our training label. Note that the empirically determined value of ρ D,ε is dependent on—and hence accounts for—the query distribution on which SNH error is measured. Moreover, whenD contains sensi- tive data, obtaining ρ D,ε would require spending privacy budget. Instead, we generate training records from a set of datasets,D T ⊆ D that have already been publicly released (see Sec. 3.5 for 43 Algorithm2 ParamSelect training Input: A set of public training datasetsD T ⊆ D and privacy budgetsE for training to predict a system parameterα Output: RegressorΦ α for system parameterα 1: procedureϕ (D,n,ε) 2: h D ← entropy ofD 3: return(n,ε, 1 nε , 1 √ nε ,h D ) 4: procedureTrain_ParamSelect(D T ,E) 5: T ←{ (ϕ (D,|D|,ε),ˆ π α (D,ε))|ε∈E,D∈D T } 6: Φ α ← Train regressor usingT 7: returnΦ α Algorithm3 ParamSelect usage Input: Spatial extentSR and sizen of a sensitive datasetD and privacy budgetε Output: System parameter valueα for private datasetD 1: procedureParamSelect(SR,n,ε) 2: D I ← Public dataset with spatial extentSR 3: α ← Φ α (ϕ (D I ,n,ε)) 4: returnα details of public datasets). We call datasets in D T public ParamSelect Training datasets. Put to- gether, our training set is{(ϕ D,ε ,ρ D,ε )|ε∈E,D∈D T }, whereE is the range of different privacy budgets chosen for training. PredictingGridWidthwithParamSelect. The training phase of ParamSelect builds regressor Φ using the training set described above. We observed that models from the decision tree family perform the best for this task. Once the regressor is trained, its utilization for any unseen dataset is straightforward and only requires calculating the corresponding features and evaluatingΦ . 3.4.1.2 GeneralizingParamSelecttoanysystemparameter We can easily generalize the approach in Sec. 3.4.1.1 to any system parameter. Define function π α (D,ε) that given a query distribution, outputs the best value of α for a certain database and 44 privacy budget. The goal of ParamSelect is to learn a regressor, using public datasets D T ∈D, that mimics the functionπ α (.). ParamSelect functionality is summarized in Alg. 2. 
First, during a pre-processing step, it de- fines the feature extraction function ϕ (D,n,ε), that extracts the features described in Sec. 3.4.1.1 from the public datasetD withn records, and a privacy budgetε. Second, it creates the training set{(ϕ (D,|D|,ε),ˆ π α (D,ε)),ε ∈ E,D ∈ D T }, where ˆ π α (D,ε) estimates the value of π α (D,ε) with an empirical search (i.e., by trying different values of α and selecting the one with the highest accuracy), andD T andE are different public datasets and values of privacy budget, respectively, used to collect training samples. Lastly, it trains a regressor Φ α that takes extracted features as an input and outputs a value forα . At inference stage (Alg. 3) ParamSelect uses a public datasetD I that covers the same spatial region as D, as well as size of D, n, and privacy budget ε to extract features ϕ (D I ,n,ε). The predicted system parameter value forD is thenΦ α (ϕ (D I ,n,ε)). 3.4.2 PrivacyandSecurityDiscussion Let D be a private dataset covering a spatial region SR andD be a set of public datasets. The SNH end-to-end privacy mechanismM is comprised of two parts that compose sequentially: mechanismM f , that models range count queries using the neural networks, and mechanism M Φ , that trains a regressor to determine the system parameters.M f operates overD,ε,SR and D. M Φ operates overD and SR for ParamSelect training and inference. Hence, we write the end-to-end system as the SNH mechanismM(D|ε,SR,D)=M f (D|ε,D,SR,M Φ (D,SR)). Theorem1 MechanismM(D|ε,SR,D) satisfies ε-DP. 45 Proof of Theorem 1. SNH, represented as the mechanismM is the composition of mecha- nismsM ϕ andM f . Furthermore,M f can be written as a composition of the data collection mechanism, denoted asM D , which outputs the data collection grid, and a functionh that per- forms the arbitrary transformations on this grid during data augmentation and training. That isM(D|ε,SR,D) =h(M D (D|ε,SR,M Φ (D,SR)),D).M Φ (D,SR)) is the ParamSelect mech- anism that obtains system parameters utilizing only public informationD and SR to predict the system parameters. Thus it does not access private records in D, consequently, it also does not consume privacy budget. Note that ParamSelect mechanism does use the size of the private dataset for prediction, which we assume is publicly available and, if not, an estimate can be ob- tained by spending negligible privacy budget. Next,M D is called, which creates a grid of cell widthρ , whereρ is the output ofM ϕ , on the spatial extentSR. For each cell in the grid created, it then access the database to obtain the number of records in the cell and adds noise Lap( 1 ε ) to the true count. Thus, a noisy count for each cell is obtained with ε-DP. Furthermore, since cells do not overlap parallel composition theorem of DP applies, and the computation of noisy count for all the cells is stillε-DP. Finally, the transformationh is applied to the output ofM D , which due to the post processing property of DP does not consume any privacy budget. Thus, the mechanismM, which is a composition ofM ϕ ,M D andh isε-differentially private. □ SecurityDiscussion. DP has different requirements and guarantees compared to alternative se- curity models such as encryption. With encryption, one protects thedatavalues of an individual (i.e., locations visited by a person), whereas the presence of an individual in the data is known (either a real identity or a pseudo-identity). In the context of cryptography, leaking the distribu- tion of visited locations is not permitted. 
In contrast, DP allows statistical information (including density distribution) to be released, as long as an adversary cannot pinpoint the presence of a targeted individual in the data. The purpose of SNH is to publish DP-compliant density statistics while protecting against individual presence inference. In this context, density information is actually needed by the application (e.g., identifying hotspots), and leakage of DP-sanitized density information is desired and permitted. Moreover, due to the robustness of DP to side-channel information, this privacy guarantee is independent of available public information in D.
3.5 Experimental Evaluation
Sec. 3.5.1 describes the experimental testbed. Sec. 3.5.2 evaluates SNH in comparison with state-of-the-art approaches. Sec. 3.5.3 provides an ablation study of various design choices. Sec. 3.6 of our technical report [134] contains complementary experimental results.
3.5.1 Experimental Settings
3.5.1.1 Datasets
We first describe all the datasets and then specify how they are utilized in our experiments.
Dataset Description. All datasets consist of user check-ins specified as tuples of: user identifier, latitude and longitude of check-in location, and timestamp. Our first dataset is a subset of the user check-ins collected by the SNAP project [27] from the Gowalla (GW) network. It contains 6.4 million records from 200k unique users during a time period between February 2009 and October 2010. Our second dataset, SF-CABS-S (CABS) [95], is derived from the GPS coordinates of approximately 250 taxis collected over 30 days in San Francisco. Following [51, 96], we keep only the start point of the mobility traces, for a total of 217k records. The third dataset is proprietary, obtained from Veraset [115] (VS), a data-as-a-service company that provides anonymized movement data from 10% of the cellphones in the U.S. [116]. For a single day in December 2019, there were 2.6 billion readings from 28 million distinct devices. From VS we generate the fourth dataset, called SPD-VS. We perform Stay Point Detection (SPD) [130] on the data to remove location signals when a person is moving, and to extract POI visits when a user is stationary. SPD is useful for POI services [94], and results in a data distribution consisting of user visits (i.e., fewer points on roads and more at POIs). Following [130], we consider as a location visit a region 100 meters wide where a user spends at least 30 minutes.
To simulate a realistic urban environment, we focus on check-ins from several cities in the U.S. We group cities into three categories based on their population densities [71], measured in people per square mile: low density (lower than 1000/sq mi), medium density (between 1000 and 4000/sq mi) and high density (greater than 4000/sq mi). A total of twelve cities are selected, four in each population density category, as listed in Table 3.2. For each city, we consider a large spatial region covering a 20 × 20 km² area centered at [lat, lon].
Table 3.2: Urban datasets characteristics.
Low pop. density: Fargo [46.877, -96.789], Kansas City [39.09, -94.59], Salt Lake [40.73, -111.926], Tulsa [36.153, -95.992]
Medium pop. density: Phoenix [33.448, -112.073], Los Angeles [34.02, -118.29], Houston [29.747, -95.365], Milwaukee [43.038, -87.910]
High pop. density: Miami [25.801, -80.256], Chicago [41.880, -87.70], SF [37.764, -122.43], Boston [42.360, -71.058]
From each density category we randomly select a test city (highlighted in bold in Table 3.2), while the remaining cities are used as training cities. We use the notation <city> (<dataset>) to refer to the subset of a dataset for a particular city, e.g., Milwaukee (VS) refers to the subset of VS datasets for the city of Milwaukee. Experiments on VS. Private dataset: Our experiments on Veraset can be seen as a case-study of answering RCQs on a proprietary dataset while preserving differential privacy. We evaluate 48 RCQs on the Veraset dataset for the test cities. Due to the enormous volume of data, we sample at random sets ofn check-ins, forn∈{25k,50k,100k,200k,400k} for the test cities and report the results on these datasets. AuxiliaryDatasets: For each test city in VS, we setQ W andD I to be the GW dataset from the corresponding city. GW and VS datasets are completely disjoint (they are collected almost a decade apart). The public datasetsD T are the set of all the training cities of the GW dataset. ExperimentsonGW. Private dataset: We present the results on the complete set of records for the test cities of Miami, Milwaukee and Kansas City with 27k, 32k and 54k data points, respec- tively. Auxiliary Datasets: For each test city, we setQ W andD I to be the VS counterpart dataset for that city. D T contains all the training cities in the GW dataset. None of the test cities, which are considered sensitive data, are included inD T . ExperimentsonCABS. Privatedataset: Since CABS consists of 217k records within the city of San Francisco only, we treat it as the sensitive test city for publishing. AuxiliaryDatasets: We set Q W andD I to be the GW dataset for San Francisco. D T contains all the training cities in the GW dataset. Once again, collecting auxiliary information from an entirely different dataset ensures no privacy leakage on the considered private dataset. 3.5.1.2 SNHsystemparameters We use the GW dataset to train the ParamSelect regression model. For the nine training cities and five values of privacy budget ε, we obtain 45 training samples. We utilize an AutoMLpipeline (such as [42, 126]) to find out a suitable model from among a wide range of ML algorithms. The pipelines use cross-validation to evaluate goodness-of-fit for possible algorithm and hyper- parameter combinations. The final model is an Extremely Randomized Tree (ExtraTrees) [44]. 49 ExtraTrees create an ensemble of random forests [57], where each tree is trained using the whole learning sample (rather than a bootstrap sample). The model ensembles 150 trees having a max- imum depth of 7. For other system parameters, we observed that their best value for SNH remain stable over various dataset and privacy budget combinations. Sec. 3.5.3.2 and Sec. 3.6.4 of our technical report [134] present this result for parameterk and Sec. 3.5.3.4 and Sec. 3.6.4 of our technical report [134] for the model depth. We observed no benefit in using ParamSelect to set these parameters and merely selected a value that performed well on our public datasets for the system parameter k and neural network hyper-parameters. The fully connected neural networks contain 20 layers of 80 unit each and are trained with Adam [62] optimizer with learning rate 0.001. 3.5.1.3 Otherexperimentalsettings EvaluationMetric. We construct query sets of 5,000 RCQs centered at uniformly random posi- tions. Each query has side length that varies uniformly from 25 meters to 100 meters. We evaluate the relative error for a query q as defined in Sec. 
3.1, and set smoothing factor ψ to 0.1% of the dataset cardinality n, as in [140, 29, 96].
Baselines. We evaluate our proposed SNH approach in comparison to state-of-the-art DP solutions: PrivTree [140], Uniform Grid (UG) [96], Adaptive Grid (AG) [96] and Data and Workload Aware Algorithm (DAWA) [69]. Brief summaries of each method are provided in Sec. 2. DAWA requires the input data to be represented over a discrete 1D domain, which can be obtained by applying a Hilbert transformation. To this end, we discretize the domain of each dataset into a uniform grid with 2^20 cells, following the work of [69, 140]. DAWA also uses the workload query set, Q_W, as specified in Sec. 3.5.1.1. For PrivTree, we set its fanout to 4, following [140]. We also considered Hierarchical methods in 2D (HB2D) [97, 52] and QuadTree [29], but the results were far worse than the above approaches and thus are not reported (we report the results of all the baselines in Sec. 3.6.1 of our technical report [134]). As an additional baseline, we modify STHoles [17], a non-private workload-aware algorithm, to satisfy DP. STHoles builds nested buckets in regions where the workload requires finer granularity. We incorporate differential privacy by (1) adding the required sanitization noise to the frequency counts in STHoles' buckets and (2) implementing the algorithm so that it avoids asking overlapping queries from the database to minimize the magnitude of noise added. Details of our DP-compliant adaptation of STHoles are available in Appendix 3.7 of our technical report [134] and our implementation is publicly available at [132]. Similar to DAWA and SNH, STHoles uses the workload query set, Q_W, as specified in Sec. 3.5.1.1.
Figure 3.6: Impact of privacy budget: VS, SPD-VS and CABS datasets (relative error vs. ε for (a) Kansas City (VS), (b) Milwaukee (VS), (c) Miami (VS), (d) Milwaukee (SPD VS), (e) SF (CABS); methods: SNH, AG, UG, PrivTree, DAWA, STHoles)
Figure 3.7: Impact of privacy budget: GW dataset (relative error vs. ε for (a) Kansas City (GW), (b) Milwaukee (GW), (c) Miami (GW))
Figure 3.8: Impact of data and query size (relative error vs. (a) data cardinality n and (b) query size)
Figure 3.9: Study of modeling choice (relative error vs. ε for (a) n=25,000, (b) n=100,000, (c) n=400,000; methods: SNH, PGM@ParamSelect, IDENTITY@ParamSelect)
Figure 3.10: Impact of uniformity assumption (relative error vs. ε for SNH and SNH+QS, with k=1 and k=8)
Figure 3.11: Impact of ρ and ParamSelect (relative error vs. ρ for ε=0.05, 0.2, 0.8)
Implementation. All algorithms were implemented in Python, and executed on a Linux machine with an Intel i9-9980XE CPU, 128GB RAM and an RTX2080 Ti GPU. Neural networks are implemented in JAX [16]. Given this setup, SNH took up to 20 minutes to train in our experiments, depending on the value of ρ.
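For reference, below is a minimal JAX sketch of a network with the shape described in Sec. 3.5.1.2: a 2-D query-location input, 20 fully connected layers of 80 units, and a scalar count output. The activation, the initialization, and the omission of the Adam training loop are simplifications of ours, not the released implementation.

import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    """He-style initialization for a fully connected network."""
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)
        params.append((w, jnp.zeros(d_out)))
    return params

def forward(params, x):
    """One forward pass: query location in, predicted count out."""
    for w, b in params[:-1]:
        x = jax.nn.relu(x @ w + b)
    w, b = params[-1]
    return x @ w + b

sizes = [2] + [80] * 20 + [1]          # 2-D input, 20 hidden layers of 80 units
params = init_mlp(jax.random.PRNGKey(0), sizes)
print(forward(params, jnp.array([[0.25, 0.75]])).shape)   # (1, 1)

A single forward pass through a network of this size is consistent with the microsecond-scale query latency reported next.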
The average query time of SNH is 329µs and a model takes 4 MB of space. We publicly release the source code at [133]. DefaultValues. Unless otherwise stated, we present the results on themedium population den- sity city, Milwaukee (VS), with data cardinalityn=100k. Privacy budgetε is set to0.2. 3.5.2 ComparisonwithBaselines Impact of privacy budget. Figs. 3.6 and 3.7 present the error of SNH and competitors when varying ε for test datasets VS, SPD-VS, CABS and GW. Recall that a smaller ε means stronger privacy protection. For our proprietary datasets, VS and SPD-VS, we observe that SNH outperforms the state- of-the-art by up to 50% at all privacy levels (Fig. 3.6 (a)-(d)). This shows that SNH is effective in 52 utilizing machine learning and publicly available data to improve accuracy of privately releasing proprietary datasets. Fig. 3.6 (e) and Fig. 3.7 show that SNH also outperforms for CABS and GW datasets in almost all settings, the advantage of SNH being more pronounced for smallerε values. Stricter privacy regimes are particularly important for location data, since such datasets are often released at multiple time instances with smaller privacy budget per release. Impactofdatacardinality. Fig. 3.8 (a) shows the impact of data cardinality on relative error for Milwaukee (VS). For all algorithms, the accuracy improves as data cardinality increases. This is a direct consequence of the signal-to-noise ratio improving as cell counts are less impacted by DP noise. SNH consistently outperforms competitor approaches at a wide range of data cardinality settings. Impact of query size. We evaluate the impact of query size on accuracy by considering test queries of four different sizes in Milwaukee (VS). Fig. 3.8 (b) shows that the error for all the al- gorithms increases when query size grows, with SNH outperforming the baselines at all sizes. There are two competing effects when increasing query size: on the one hand, each query is less affected by noise, since actual counts are larger; on the other hand, the error from more grid cells is aggregated in a single answer. The second effect is stronger, so the overall error steadily increases with query size. 53 2 5 10 20 40 σ 0.00 0.02 0.04 relative error (a) Noisy Obs. 2 5 10 20 40 σ 0.00 0.05 0.10 relative error (b) Obs. with Uniformity SNH ( s 1 ) SNH ( s 2 ) SNH ( s 3 ) No Learning Figure 3.12: Impact of data skewness (ε=0.2) 3.5.3 AblationStudyforSNH 3.5.3.1 Modelingchoices Recall that SNH first creates a uniform grid, with granularity decided by ParamSelect. It then performs data augmentation and learning using the data collected on top of the grid. Next, we study the importance of each component of SNH to its overall performance. We create two new baselines to show how our choice of using neural networks to learn the patterns in the data im- proves performance. The first, called IDENTITY@ParamSelect, ablates SNH, utilizing only the uniform grid created in SNH at data collection. The second baseline, called PGM@ParamSelect, employs Private Probabilistic Graph Models (PGM) [79], a learning algorithm specifically de- signed for high-dimensional categorical data. We extend PGM to 2D spatial datasets by feeding it a DP uniform grid at the granularity selected by ParamSelect. Fig. 3.9 (a) shows SNH outperforming both these baselines. 
SNH outperforming IDENTITY shows the benefit of learning, since both SNH and IDENTITY use the same grid for data collection but SNH learns neural networks using data generated from the grid, while IDENTITY directly uses the grid to answer queries. This benefit diminishes when the privacy budget and the data cardinality increase (note that bothn andε are in log scale), where a simple uniform grid chosen 54 at thecorrect granularity outperforms all existing methods (comparing Fig. 3.9 (b) with Fig. 3.6 (b) shows IDENTITY@ParamSelect outperforms the state-of-the-art forε = 0.4 and 0.8). For such ranges of privacy budget and data cardinality, ParamSelect recommends a very fine grid granu- larity. Thus, the uniformity error incurred by IDENTITY@ParamSelect becomes lower than that introduced by the modelling choices of SNH and PGM. This also shows the importance of a good granularity selection algorithm, as UG in Fig. 3.6 performs worse than IDENTITY@ParamSelect for largerε. 3.5.3.2 BalancingUniformityErrors We discuss how the use of the uniformity assumption at different stages of SNH impacts accuracy. Recall from Sec. 3.3.3 that the value ofk balances the use of the uniformity assumption pre- and post-learning. We empirically study how uniformity assumption pre- and post-learning influence SNH’s accuracy by varyingk. Furthermore, we study how removing the uniformity assumption post-learning and replacing it with a neural network affects accuracy. Specifically, we consider a variant of SNH where we train the neural networks to also take as an input the query size. Each neural network is still responsible for a particular set of query sizes,[r l ,r u ], where we use data augmentation to create query samples with different query sizes falling in [r l ,r u ]. Instead of scaling the output of the trained neural networks, now each neural network also takes the query size as an input, and thus, the answer to a query is just the forward pass of the neural network. We call this variant SNH with query size, or SNH+QS . Fig. 3.10 shows that, first, removing the uniformity assumption post-learning has almost no impact on accuracy when k is large. However, for a small value of k, it provides more stable accuracy. Note that when k = 1, SNH trains only one neural network for query size r ∗ and 55 answers the queries of size r by scaling the output of the neural network by r r ∗ . The error is expected to be lower whenρ andr ∗ have similar values, since there will be less uniformity error when performing data augmentation. This aspect is captured in Fig. 3.10, where at ε = 0.2, r ∗ and ρ are almost the same values and thus, the error is the lowest. Sec. 3.6.4 of our technical report [134] evaluates more comprehensively the impact ofk. 3.5.3.3 ParamSelectandρ Fig.3.11 shows the performance of SNH with varying cell width ρ at multiple values of ε. A coarser grid first improves accuracy by improving signal-to-noise ratio at each cell, but a grid too coarse hampers accuracy by reducing the number of samples extracted for training SNH. This creates a U-shaped trend which shifts to smaller values of ρ for larger values of ε as the lower DP noise impacts the cell counts less aggressively. The red line in Fig.3.11 labelled SNH shows the result of SNH at the granularity chosen by ParamSelect. SNH performing close to the best possible proves that ParamSelect finds an advantageous cell width for SNH. 
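For reference, the ParamSelect pipeline whose predictions are evaluated above can be condensed as follows. The regressor configuration (scikit-learn's ExtraTreesRegressor with 150 trees of maximum depth 7) matches Sec. 3.5.1.2, but the feature values, labels, and helper below are illustrative placeholders rather than measurements from our experiments, and the two ε-scale features are written in one plausible reading of the ϕ vector from Sec. 3.4.1.1.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def phi(n, eps, h_public):
    """Feature vector (n, eps, 1/(n*eps), 1/sqrt(n*eps), h_{D_I})."""
    return [n, eps, 1.0 / (n * eps), 1.0 / np.sqrt(n * eps), h_public]

# Training pairs come from *public* cities: features plus the cell width that
# an empirical search found best for that dataset/budget (placeholder values).
X_train = [phi(27_000, 0.1, 5.2), phi(54_000, 0.4, 6.0), phi(100_000, 0.8, 4.7)]
y_train = [40.0, 25.0, 20.0]          # best rho in meters (illustrative)

regressor = ExtraTreesRegressor(n_estimators=150, max_depth=7, random_state=0)
regressor.fit(X_train, y_train)

# Inference for a sensitive city uses features of the public surrogate D_I
# covering the same extent, so no privacy budget is spent on tuning rho.
rho = regressor.predict([phi(100_000, 0.2, 5.5)])[0]
print(f"predicted cell width: {rho:.1f} m")

In our actual setup the training pairs are collected from the nine public training cities at five privacy budgets, as described in Sec. 3.5.1.2.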
3.5.3.4 SNHLearningAbilityinNon-UniformDatasets We study the ability of neural networks to learn patterns from skewed datasets through imprecise observations, where imprecision is due to noise or uniformity assumption. Setup. We synthesize100k points from a Gaussian Mixture Model (GMM) [102] with 16 compo- nents. The means of the components are placed uniformly over the data space. All components are equally weighted and have the covariance matrix I × σ 2 , where I is the identity matrix. GMMs allow controlling data skewness via the parameter σ . We partition the data space into a grid of 200× 200 cells and report σ in terms of number of cells. The query set, Q, consists of 56 queries asking for the number of points inside each cell. Fig. 3.13(a) plots the true answers to this query set whenσ =7. Learning from Noisy Observations. We consider two scenarios. First, we obtain the DP an- swers, ˜ A, to the queries in Q by adding noise to the true answers. We call this algorithm No Learning. Forε = 0.05, Fig. 3.13 (b1) shows the noisy answers reported by No Learning. Com- paring Figs. 3.13 (a) and 3.13 (b1) we observe that the sanitization noise severely distorts the existing patterns in the data. Second, we train a neural network using only the noisy answers shown in Fig. 3.13 (b1), that is, the inputs to the neural network are queries in Q and training labels are the answers in ˜ A. After training, we ask the same queries, Q. The result in Fig. 3.13 (c1) shows the output of the neural network. SNH has a strong ability to recover the underlying patterns of GMMs from even highly distorted observations. Additional visualizations for several values ofε andσ can be found in Sec. 3.6.5 of our technical report [134]. Next, we compare the error in the neural network predictions to that in the noisy answers it was trained with. The latter is represented with the line labelled ‘No Learning’ in Fig. 3.12 (a) and is the error in ˜ A. Lines labeled SNH show the error of SNH at varying model sizes (s 1 , s 2 and s 3 correspond to models with depth 5, 10 and 20 and width 15, 25 and 80 respectively) on the same query set. Whenσ is large, the data is closer to being uniformly distributed and there are fewer patterns to learn, whereas when σ is small, the data becomes more skewed towards the mean of each GMM component. The results in Fig. 3.12 (a) show that when data is skewed, SNH is especially capable of extracting patterns in the data where present, utilizing them to boost accuracy. However, when data is uniform-like, SNH performs similar to ‘No Learning’ as there are few patterns to be learned. Lastly, by varying model size (liness 1 ,s 2 ands 3 ) we show that it 57 Figure 3.13: SNH learns patterns on GMM dataset of 16 components. Color shows number of data points. is beneficial to use a larger neural network for more skewed datasets. A larger network exhibits stronger representation power and hence captures the skewness better. LearningfromObservationswithUniformityError. We generate the training data by pur- posefully inducing uniformity error when answering queries in our training set, Q. We first superimpose a coarse partitioning of 20× 20 blocks over the original 200× 200 cell grid, with each block covering exactly 100 cells. To answer the queries inQ, we first obtain the true answer for each block, and then divide that value by 100 to obtain the answer for each cell within the block (assuming uniformity within the block). The result is shown in Fig. 3.13 (b2). 
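For concreteness, the training labels with induced uniformity error can be generated as in the following minimal numpy sketch; the function and variable names are our own, and the grid sizes match the setup above (a 200×200 cell grid aggregated into 20×20 blocks of 10×10 cells each).

```python
import numpy as np

def uniformity_answers(true_grid, block=10):
    """Replace each cell count with its block average, emulating the
    uniformity assumption used to label the training queries described above.
    true_grid: 2-d array of exact per-cell counts (e.g., 200 x 200)."""
    h, w = true_grid.shape
    out = np.empty_like(true_grid, dtype=float)
    for i in range(0, h, block):
        for j in range(0, w, block):
            blk = true_grid[i:i + block, j:j + block]
            out[i:i + block, j:j + block] = blk.sum() / blk.size
    return out

# Example: a 200 x 200 grid aggregated into 20 x 20 blocks of 10 x 10 cells.
grid = np.random.poisson(3.0, size=(200, 200))
labels = uniformity_answers(grid, block=10)
```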
Note that the set of queries that fall within the same block (in the 20×20 grid) all receive the same answers due to the uniformity assumption. Next, we train a neural network with the queries in Q (corresponding to the cells in the 200×200 grid). The result in Fig. 3.13 (c2) shows that the neural network smoothens the observations and brings them closer to the true answers. In Fig. 3.12 (b) we evaluate the effect of increasing skewness (i.e., decreasing σ): “No Learning” yields larger errors, whereas SNH, through learning, keeps the error steady across different skewness levels.
Figure 3.14: Milwaukee (VS), ε = 0.2, n = 100k (methods compared: SNH, AG, UG, PrivTree, DAWA, STHoles, HB, QuadTree, Uniform, Privelet, DPCube).
Figure 3.15: Replacing uniformity error with noise.
3.6 Supplementary Experimental Results
3.6.1 Comparison against all baselines
We compare our method against all existing baselines in Figure 3.14 (note the log scale). To the best of our knowledge, the figure contains all differentially private algorithms applicable to 2D location datasets. Existing methods are predominantly domain partitioning methods that utilize traditional data structures. For instance, DPCube [31] exploits a kd-tree structure, QuadTree [29] uses a full quadtree, HB [97] invokes a hierarchical tree with variable fanout, PrivTree [140] also uses a hierarchical tree but without height constraints, UG [96] is a single-level grid and AG [96] is a two-level grid. DPCube [123], in particular, is a method suited to high-dimensional data: it searches for dense ‘subcubes’ of the datacube representation to release privately. A part of the privacy budget is used to obtain noisy counts with the Laplace mechanism over a straightforward partitioning, which is then refined into a standard kd-tree. Fresh noisy counts for the partitions are obtained with the remaining budget, and a final inference step resolves inconsistencies between the two sets of counts and improves accuracy. Methods that perform worse than Uniform Grid (UG) have been omitted from the experiments in Section 3.5.
Figure 3.16: Study of ParamSelect.
Figure 3.17: Impact of k.
Figure 3.18: Impact of model depth.
3.6.2 Data Augmentation: Uniformity Error or Large-Scale Noise
In this section, we present the empirical results motivating our design principle “P2: Spatial Data Augmentation through Partitioning”. Recall that, as discussed in Sec. 3.2.2, neural networks perform best when trained with queries of different sizes (as shown by the experiments in Sec. 3.5.3.2, Figs. 3.17 and 3.10). However, queries of different sizes may overlap. Hence, due to DP constraints, answering such queries can be done either by introducing more noise or by introducing more uniformity error, as a consequence of the sequential composition property of DP (see Sec. 3.1.1). Here, we present results showing a considerable advantage in adding noise once and collecting more training data through data augmentation (thereby relying on the uniformity assumption), compared with adding more noise but avoiding the uniformity assumption. To substantiate this claim we design an experiment (in Fig.
3.15), where for any location we generate training queries with 8 different sizes, creating 8 overlapping queries per location. Lines labelled “SNH No Unif.” and “SNH” both use the same query set for training, however the answers (i.e., labels) to the training queries are generated differently. “SNH No Unif.” answers all queries directly from the 60 database records (and thus more noise is added per query due to sequential decomposition of DP, but avoids completely uniformity assumption). On the other hand, SNH as presented in the paper and (discussed in Sec. 3.2.3) first uses a grid for data collection and then answers queries based on the grid (so it incurs uniformity error, but adds less noise per query than “SNH No Unif.”). The result shows that it is better to use uniformity assumption than to increase noise, justifying its use in data augmentation. However note that the uniformity error is introduced in the training set before learning, and mitigated through learning. 3.6.3 BenefitofParamSelect ParamSelect selects the best grid granularity ρ for SNH. An existing alternative for setting the grid granularity is using the guideline of UG [96], which, by making assumptions about the query and data distribution, analytically formulates the error for using a uniform grid. It then proposes creating an m× m grid, setting m = p nε/c for a constant c empirically set to c = 10. We call SNH with grid granularity chosen this way SNH@UG. We compare this method with SNH (referred to SNH@ParamSelect to emphasize the use of ParamSelect to setρ ). We compare the error in theρ predicted by ParamSelect to that by the UG guideline. To do so, we first empirically find ρ ∗ , the cell width at which SNH achieves highest accuracy. Then we calculate the mean absolute error (MAE),|ρ − ρ ∗ |, of the suggested cell widthρ by either UG or ParamSelect. Averaged across several privacy budgets, ParamSelect achieves MAE of 3.3m while UG results in MAE of 281.3m. That is, UG recommends a cell width far from the optimal cell width. 61 Fig. 3.16 shows how cell width impacts the accuracy of SNH. We observe a significant differ- ence between SNH@UG and SNH@ParamSelect, establishing the benefits of ParamSelect. Over- all, the results of this ablation study, and the ablation study in Sec. 3.5.3.2, show that both good modelling choices and system parameter selection are imperative in order to achieve high accu- racy. 3.6.4 Systemparametersanalysis Impact of k. Fig. 3.17 shows the impact of k on the accuracy of the models. The result shows that for large values ofε, increasingk can substantially improve the performance. Fig. 3.17 also shows the need for having access to queries of multiple sizes during training, as this is required whenk >1. Impact of Model Depth. We study how the neural network architecture impacts SNH’s per- formance in Fig.3.18. Specifically, we vary the depth (i.e., the number of layers) of the network. Increasing model depth improves slightly the accuracy of SNH due to having better expressive power from deep neural networks. However, networks that are too deep quickly decrease accu- racy as the gradients during model training diminish dramatically as they are propagated back- ward through the very deep network. Furthermore, largerε values are able to benefit more from the increase in depth, as more complex patterns can be captured in the data when it is less noisy. 3.6.5 FurtherGMMVisualizations We extend the discussion of Sec. 
3.5.3.4 and visualize in various settings the ability of neural networks to reduce errors by learning from imprecise observations. We study this behavior for ε = 0.05 (i.e., the high-privacy regime) in Figures 3.19, 3.21 and 3.23, and for ε = 0.2 (i.e., the low-privacy regime) in Figures 3.20, 3.22 and 3.24, for different values of the standard deviation, σ, of the GMM components. SNH is especially capable in the low-privacy regime, and when the data are heavily skewed or non-uniform, justifying its use for location datasets that exhibit similarly skewed distributions. To conclude, given a set of imprecise observations, by fitting a neural network to all such observations simultaneously, we obtain a neural network with lower error than in the observations themselves.
Figure 3.19: SNH on GMM, ε = 0.05, σ = 14 (panels: true answers, noisy answers, SNH predictions).
Figure 3.20: SNH on GMM, ε = 0.2, σ = 14.
Figure 3.21: SNH on GMM, ε = 0.05, σ = 7.
Figure 3.22: SNH on GMM, ε = 0.2, σ = 7.
3.7 Differentially Private STHoles Implementation
We describe the general structure of STHoles histograms and the specific modifications that we make to achieve DP compliance and good utility for answering RCQs. STHoles [17] is a histogram construction technique that exploits the query workload. It generates a domain partitioning in the form of nested buckets assembled as a tree structure. In contrast to traditional domain partitioning methods, STHoles allows buckets to overlap by permitting inclusion relationships between ancestor nodes of the tree structure, i.e., some buckets can be completely included inside others. We omit the details of the histogram's construction and instead refer the reader to [17]. Our implementation is publicly available at [132].
Our DP-compliant STHoles implementation makes two adjustments to the original STHoles algorithm to allow for better accuracy when accounting for privacy. First, we allow the algorithm to use unlimited memory, so that it does not need to merge any of the buckets to reduce memory usage. This not only avoids incurring the merge penalty (discussed in the paper [17]) but also lowers the privacy budget consumption, since we can avoid calculating merge penalties that would require budget-consuming accesses to D. Second, we separate the process of calculating the frequency counts for each bucket from the process of building the nested bucket structure. That is, we first build the bucket structure based on the query workload and then calculate the frequency counts within each bucket.
This separation significantly reduces the privacy budget consumption, since it allows us to avoid asking overlapping queries of the database; as a result, the final privacy budget accounting can be done with only the parallel composition theorem. Next, we present in more detail how we build the buckets and calculate the frequency counts.
Figure 3.23: SNH on GMM, ε = 0.05, σ = 3.5.
Figure 3.24: SNH on GMM, ε = 0.2, σ = 3.5.
First, we generate the nested bucket structure using the query workload Q_W (Algorithm 4). In a modification of the original algorithm, in this step we do not calculate database-related statistics, such as the number of records in each bucket b ∈ H_SR, as that would necessitate spending scarce privacy budget. For the same reason, we also skip the step that merges buckets together based on a penalty calculated from database records. From the privacy analysis perspective, the query workload is public, and using information therein incurs no privacy leakage. Hence, Algorithm 4 does not use any privacy budget. In the second step (Algorithm 5), we generate sanitized frequency counts for the STHoles buckets in the data structure. For each bucket, we query the database for the number of records that fall within its extent, sanitizing these counts using the Laplace mechanism (see Section 3.1.1 for details of the mechanism).
Algorithm 4 STHoles Domain Partitioning
Input: Query workload Q_W for the spatial region SR
Output: Domain partitioning θ
1: procedure BuildPartitioning(Q_W, SR)
2:   H_SR ← initialize histogram with a fixed-size root bucket of spatial extent SR
3:   for all q ∈ Q_W do
4:     identify b ∈ H_SR that have q ∩ b ≠ ∅
5:     shrink candidate holes according to Sec. 4.2.1 of [17]
6:     add new holes as buckets to histogram H_SR
7:   return H_SR
Algorithm 5 DP-compliant sanitization of STHoles
Input: Private dataset D, buckets b ∈ H_SR, privacy budget ε
Output: DP-compliant STHoles model θ_STHoles
1: procedure SanitizeHistogram(H_SR, D, ε)
2:   for all b ∈ H_SR do
3:     set the frequency of b to f̄(b) (i.e., the true count + Lap(1/ε))
4:   return θ_STHoles
3.8 ParamSelect Feature Engineering and Feature Selection
We present experimental results supporting that we have carefully selected features for ParamSelect that accurately capture the privacy-utility trade-off across spatial datasets and allow for reliable system parameter estimation. Since the training data comprises public datasets D, the feature extraction process is a typical ML problem. Our feature extraction process follows two steps: (I) feature engineering, where we transform raw data into a number of features that better represent the dataset for learning our regression model, and (II) feature selection, where we select a subset of the engineered features that provides reliable accuracy across datasets.
Feature Engineering. We engineered various features according to relations (such as epsilon-scale exchangeability) well studied in the literature [51] and proposed novel features to capture the data distribution in location datasets.
While the data-independent features were straightforward, data-region-specific features posed a challenge, since they need to summarize location datasets while capturing the differences in the pattern of the originating location signal (e.g., cell phone location signals vs. user check-ins in geo-social networks) and the differences in skewness between cities (e.g., the dense sprawl of New York vs. the sparse expanses of Kansas City). We generated the following features: (1) Population density (POP_D), calculated as the number of residents per square mile (as reported by the US Census); (2) Entropy profile (h_D), which computes, over a flattened grid representation of the region, the Shannon entropy of the probabilities of the counts in each cell; (3) Average Nearest Neighbor distance (ANN_D), which averages the distance to the nearest neighbor over all users in the city; and (4) Signal-to-Noise Ratio (SNR_D), which evaluates how many cells in an overlaid grid have enough signal (in terms of the number of user counts in a cell) to not be obliterated by DP noise (the average noise is 2/ε when sampled from the distribution Lap(1/ε)).

Table 3.3: Validation set error of ParamSelect in predicting ρ
Feature set φ                              Relative error of regressor Φ on cross-validation set
φ(n)                                       0.312
φ(n, ε)                                    0.237
φ(n, ε, 1/nε, √(1/nε))                     0.193
φ(n, ε, 1/nε, √(1/nε), POP_D)              0.207
φ(n, ε, 1/nε, √(1/nε), ANN_D)              0.225
φ(n, ε, 1/nε, √(1/nε), SNR_D)              0.187
φ(n, ε, 1/nε, √(1/nε), h_D)                0.151

Feature Selection. The proposed features are filtered through a feature selection process that evaluates the accuracy achieved by candidate feature subsets across different datasets. This step finds a subset of the engineered features that helps the model generalize across datasets. The selection process is conducted on a validation set (J-K cross-validation folds [88] in our case, with J = 3 and K = 5). We utilize an iterative feature selection technique that incrementally adds features one at a time and evaluates the subset's validation performance, ignoring features that do not contribute. In Table 3.3 we report the validation performance (relative error) for each evaluated feature subset. The proposed data-region-specific feature, entropy h_D, is the most valuable for ParamSelect (relative error of 0.151). In brief, the features used in ParamSelect are highly performant. In Section 3.6.3, we show that ParamSelect, with its use of our feature extraction function, vastly improves generalization ability over the existing method for system parameter selection by exploiting the additional dataset features and, with the use of ML, their non-linear relationships.
We conclude with a discussion of potential future work pertaining to the datasets used in the ParamSelect module. Recall that data-region-specific features (such as h_D) are obtained from a proxy dataset. This comprises public-domain auxiliary information that is, at a very high level, similar to our private dataset. In our empirical evaluation we use data sources that were collected a decade apart. While not included in the evaluation, we report that static datasets, such as the positions of points of interest in a city, also perform well. Other DP-compliant public releases of location datasets, such as those from the “Facebook Data For Good” initiative, are also viable. Nevertheless, for regions where an auxiliary source of public information is unavailable, the data-independent features can be utilized to good effect (relative error of 0.193 for the feature set φ(n, ε, 1/nε, √(1/nε))).
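To illustrate how such data-region-specific features could be computed on a public proxy dataset, the sketch below implements an entropy feature and a signal-to-noise feature. The grid resolution, the exact normalizations and all names are our own illustrative assumptions, not the ParamSelect implementation.

```python
import numpy as np

def grid_counts(points, bins=100):
    """2-d histogram of an (n, 2) array of (lat, lon) points over a uniform grid."""
    h, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    return h

def entropy_feature(points, bins=100):
    """h_D-style feature: Shannon entropy of the per-cell count distribution
    (assumes a non-empty set of points)."""
    counts = grid_counts(points, bins).ravel()
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def snr_feature(points, eps, bins=100):
    """SNR_D-style feature: fraction of cells whose count exceeds the average
    DP noise magnitude 2/eps quoted above (reporting a fraction rather than a
    raw cell count is our own normalization choice)."""
    counts = grid_counts(points, bins).ravel()
    return float((counts > 2.0 / eps).mean())
```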
Chapter 4
A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy
In this chapter we focus on releasing location data with multiple snapshots, wherein individuals contribute multiple reports to the datasets. Temporality complicates location density release, since it requires moving from event-level privacy to user-level privacy.
Most work on DP-compliant publication of location data has focused on single-snapshot releases, where each individual contributes a single location report [139, 123, 78, 52, 96, 29, 140, 69, 5, 97], and event-level privacy is sufficient for protection. However, when individuals contribute multiple reports, the ability of an adversary to breach privacy increases significantly, and a shift to user-level privacy [33] is required. To protect privacy in this scenario, an increased amount of noise is needed, which often grows linearly in the number of user contributions. Only a handful of techniques [4, 112] have considered spatio-temporal location data release, and none of them is able to preserve data accuracy at any practical spatio-temporal resolution.
In the absence of approaches for high-resolution spatio-temporal data release, industry projects have used basic DP mechanisms that simply add noise to the population density information without taking into account specific dataset characteristics [6, 39, 105, 14, 59]. The amount of privacy budget spent for such data releases is often not reported, or it is excessive [14, 59, 3], thus providing insufficient protection. In addition, reports of incorrect privacy accounting in such releases [14, 59] further necessitate a thorough end-to-end study of custom DP algorithms for spatio-temporal data.
There are two key aspects that must be addressed for accurate private releases of spatio-temporal data. First, one needs to bound sensitivity (see Section 2 for a formal definition) by limiting the number of location reports from any single user, which can be achieved through sampling. However, this must be done carefully, such that data utility is not compromised. Furthermore, density information must be adjusted to account for the fact that it is calculated on a subset of the actual data. Second, the effect of the noise added by DP mechanisms must be addressed. Such mechanisms consider the worst-case scenario over all possible data distributions and query configurations and err on the safe side, adding more noise than strictly necessary and compromising accuracy. A denoising post-processing step that leverages spatio-temporal data characteristics can significantly boost accuracy, while still satisfying the privacy constraint. Recent advances in neural networks, such as variational auto-encoders (VAE), are good at capturing complex density patterns, and can enable effective denoising.
We propose VAE-based Density Release (VDR), a novel system specifically designed to provide accurate, DP-compliant release of spatio-temporal datasets. A key intuition behind VDR is the observation that noisy spatio-temporal data histograms are similar to a sequence of images, with spatial patterns in location data akin to visual patterns in images. This observation allows one to leverage a vast amount of work on image pattern recognition and apply it to spatio-temporal data releases.
VDR first sanitizes density information by adding DP-compliant noise, then uses a novel 70 neural network-based approach to improve accuracy by performing an advanced post-processing denoising step based on convolutional neural networks (CNN). CNNs are able to capture subtle patterns specific to location datasets. We utilize variational auto-encoders (VAE) to capture data patterns without fitting to the noise. We also employ multi-resolution learning through a data augmentation step that captures location data patterns at multiple granularities, thus improving accuracy for a broad range of query extents. We devise a comprehensive strategy to reduce sensitivity of user-level privacy through sam- pling. We reduce the number of input samples from any individual in a DP-compliant way. To counter-balance the effect of sampling, we design a novel private statistical estimator which scales up query results to preserve accuracy. This permits us to control the sensitivity in user-level pri- vacy without significantly affecting accuracy. Specifically, our contributions are: • We propose an end-to-end privacy-preserving system for spatio-temporal datasets that sat- isfies user-level differential privacy and preserves data accuracy; • We introduce a novel approach to user-level sampling that reduces sensitivity by bounding the number of location reports from each user; the bound is chosen to provide a good trade-off between keeping sensitivity low and preserving density information across time; • We design a novel denoising technique that uses variational auto-encoders and image fea- ture extraction concepts to accurately model patterns in spatio-temporal data; • We design a technique to offset the effects of location sampling in order to provide accurate query answers; to that extent, we employ DP-compliant statistical estimators; 71 • We perform an extensive experimental evaluation on real data which shows that the pro- posed approach is effective in preserving query accuracy under strict privacy budgets, and clearly outperforms all existing approaches. 4.1 Background Differential Privacy. ε-differential privacy [36] provides a rigorous framework with formal protection guarantees. Given privacy budget parameterε∈ (0,+∞), a randomized mechanism M satisfies ε-differential privacy iff for any sibling datasetsD andD ′ differing in a single tuple, and for allE⊆ Range(M) Pr[M(D)∈E]≤ e ε Pr[M(D ′ )∈E] (4.1) Pr[M(D)∈ E] denotes the probability of mechanismM outputting an outcome in the setE for a databaseD, and Range(M) is the co-domain ofM. MechanismM hides the presence of an individual in the data by ensuring that for any given outcome, the probability difference between any two sibling datasets being the source of that outcome is bounded by e ε . The protection provided by DP is stronger whenε approaches0. Thesensitivity of a function (e.g., a query)f, denoted by∆ f , is the maximum amount the value off can change when adding or removing a single individual’s contribution from the data. The ε-DP guarantee can be achieved by adding random noise derived from the Laplace distribution Lap(∆ f /ε). For a queryf :D→R, theLaplacemechanism (LPM)M returnsf(D)+Lap(∆ f /ε), 72 where Lap(∆ f /ε) is a sample drawn from the probability density function Lap(x|(∆ f /ε)) = (ε/2∆ f )exp(−| x|ε/∆ f ) [36]. Sensitivity is significantly influenced by whether one considers event-level or user-level pri- vacy. 
In the studied case of spatio-temporal data,f can be formulated as a vector-valued function that outputs the population count in a location histogram at every time snapshot. When an in- dividual’s data are removed, all the location reports of that individual are deleted from the data, causing changing in multiple elements off’s output (in the worst case, the change is proportional to the maximum number of location reports across all individuals). Contrast this with event-level privacy, where one considers sibling datasets to differ in a single location report. As a result, en- forcing user-level privacy causes a significant increase in sensitivity, which must be carefully controlled to prevent utility loss due to excessive noise addition. ProblemDefinition . We are given a datasetD consisting of user location reports collected over time. Each user report consists of four attributes: latitude (lat), longitude (lon), timestamp (time) and user id. The goal is to release high-resolution density information ofD for arbitrary spatial regions over time. We do this by releasing anM× M× T histogram of the data, where M determines the spatial resolution andT determines the temporal resolution. M andT are determined by the application requirements, e.g., release a histogram at 30× 30m resolution and one hour time granularity over a duration of 24 hours [39, 69]. More specifically, we design a mechanismM that takes D as an input and outputs a histogram, ˆ H, whereM preserves ε-DP. The histogram of the original (i.e., non-private) data with true counts is denoted as H. In real- world scenarios, the DP histogram is used to answer spatio-temporal queries. In this paper we focus on the following three representative statistical query types on spatio-temporal aggregate data: 73 Rangecountqueries. Given a query range, defined by minimum and maximum values (i.e., a range) for lat, lon and time, find the number of user location reports in D that satisfy this range predicate. For a queryq, we measure the utility of its estimated DP-preserving answer,y, compared to the true answer u using the relative error metric, defined as RE (y,u) = |y− u| max{u,ψ } , whereψ is a smoothing factor necessary to avoid division by zero. Nearest hot-spot queries. Given a query location q (lat, lon, time), a density threshold, ν , and a spatio-temporal extent SR (defined by a time duration and lengths of lat and lon geo- coordinates), find the closest cell to query q within extents SR that contains at least ν number of locations signals. The hotspot query may be answered using an expanding search in the 3-d histogram until a cell withinSR having at leastν points is found. If none is found, the cell with the maximum count is reported. We evaluate this query in two ways: the distance penalty is measured as the Mean Absolute Error (MAE) between the true distance (as computed on onH) and reported distance (computed on ˆ H) to the hotspot. To capture hotspot density estimation errors, we measure Regret, defined as the deviation of the reported density of the found hotspot (on noisy histogram ˆ H) from the specified threshold ν . Regret for a query is zero if the reported hotspot meets the density threshold. Forecastingquery. Given a timeseries of density counts for a 2-d region (defined with min- imum and maximum lat and lon values), and a forecasting horizon h not covered within the timeseries, predict the count of location reports for h future timesteps. 
To evaluate this query, we utilize holdout testing, which removes the last h data points of the timeseries, calculates the forecasting model fit for the remaining historical data, makes forecasts for h timesteps, and com- pares the error between the forecast points and their corresponding, held-out, data points. We report the symmetric mean absolute percentage errors (sMAPE) as sMAPE = 1 h P h t=1 |Ft− At| (At+Ft)/2 , 74 whereA t are the true counts fromH in theh timesteps andF t are theh predicted counts from a forecasting algorithm fitted to the historical data points from ˆ H. 4.2 VAE-basedDensityRelease(VDR) Given a dataset of location reports over time, we perform a careful sanitization process to cre- ate a DP-compliant spatio-temporal histogram. Any number of computations performed on the histogram after it is sanitized with DP are considered safe. The histogram may be thus publicly released and used to answer arbitrary downstream queries such as the ones discussed in Sec 4.1. Our approach takes advantage of specific characteristics of spatio-temporal datasets to provide a customized mechanism that achieves DP. 4.2.1 VDROverview Our approach consists of three steps: (1) data collection, (2) learned denoising and (3) statistical refinement . We provide a brief overview of each step below, and present their complete technical descriptions in Sections 4.2.2 - 4.2.4. DataCollection creates a noisy DP-compliant histogram of the data using the Laplace mech- anism and a sampling technique that reduces sensitivity by bounding the amount of location re- ports from each individual. It makes no assumptions on the data distribution, and simply adds i.i.d random noise to each cell, with no modeling bias being introduced in the query answers to histogram cells. LearnedDenoising learns a model that captures the spatio-temporal density patterns of the original data from the noisy histogram. Using these patterns, denoising is able to construct a 75 la t time lon ( a ) Loc a tion r epor ts o v er time ( c olor sho ws da y o f r epor t ) (b ) Sample d da tase t Sample poin ts per user ( c ) Sample d histogr am Cr e a te histogr am Add noise ( d) N oisy histogr am 0 12 3 6 9 Figure 4.1: (a) and (b): real-world complete and sampled dataset of location reports over time in Houston. (c) and (d): exact and noisy 3-d histograms created from the sampled dataset, higher brightness shows higher density. histogram that answers queries more accurately. It acts as a post-processing algorithm, hence the the DP requirement is preserved, and it makes use of variational auto-encoders to offset some of the excessive noise introduced by the Laplace mechanism. StatisticalRefinement scales the denoised histogram according to a DP-compliant statisti- cal estimator to offset the effect of sampling used to decrease sensitivity during data collection. 4.2.2 DataCollection Data collection uses a combination of sampling and noise addition to create a differentially private histogram of the data without making any modelling assumptions. In the case of spatio-temporal data, simple noise addition will lead to poor quality results, as the amount of noise needed will destroy any meaningful signal in the data. We first discuss the naive solution and its specific challenge for spatiotemporal data; subsequently, we show how sampling is able to improve the accuracy; finally we present the details of the data collection mechanism. 
We use as a running example a real-world dataset of location reports from Houston, TX, USA (see Section 3.5 for the exact details of the dataset). This dataset is shown in Fig. 4.1 (a).
DP Histogram Release. Given a dataset D with location reports from different users over time, the goal is to create a histogram of the data while preserving ε-DP. One way to do this (without making any modeling assumptions) is to first create the true histogram of the data, H, and then add independent Laplace noise, Lap(∆/ε), to each cell of the histogram, where ∆ is the sensitivity of the query counting the number of data points falling inside a cell. This sensitivity is equal to the maximum number of points, k_max, that a user contributes to the dataset. Thus, the DP histogram of the data can be written as H̄ = H + Lap(k_max/ε), where independently generated random noise is added to each cell of the histogram.
Challenge for Spatiotemporal Histograms. In spatiotemporal datasets, the number of data points contributed to the dataset by each user varies significantly across users. While most users contribute few points, some prolific users may contribute a very large number of points. In fact, the number of points a user contributes often follows a power law [89, 129]. For our running example, Fig. 4.2 shows the dataset characteristics, where the maximum number of points contributed by a user is k_max = 90,676 (in real datasets, location updates from mobile apps reporting user movements often lead to a large number of contributions, with the amount of user contributions varying due to different app utilization across users). In this dataset, at a resolution of 30 meters and 1-hour time periods, only 1 percent of the histogram cells have counts above 25. Thus, Laplace noise scaled to k_max/ε, for any reasonable value of ε, wipes out any meaningful information in all but a few cells.
Figure 4.2: Veraset Houston data statistics.
Sampling to Bound Sensitivity. Instead of using all the data points of users when creating the histogram, we sample a maximum of k points per user, for a user parameter k < k_max. Specifically, we sample a subset of points D_s ⊆ D as follows: for any user with more than k points, we sample k of their points uniformly at random; for users with at most k points, we keep all their points. This reduces the sensitivity of releasing the histogram to k, requiring that we add only noise Lap(k/ε) to each cell of the histogram. In this way, we can exploit the skewness in user contributions to the dataset: by setting k to a small value, we are able to retain most of the original data. In our running example in Fig. 4.2 (left), setting k to 128 captures nearly 25% of the data, while reducing sensitivity by a factor of roughly 700. Consequently, we bias the data distribution in order to reduce variance in query reporting. Nonetheless, sampling introduces sampling error in the answers computed over D_s. In our statistical refinement step (Sec. 4.2.4), we discuss how we can counter this source of error. We further discuss the trade-offs arising in our method based on the choice of k in Sec. 4.3.
Summary and Example. Data collection is depicted in Fig. 4.1 for our running example, where we release a noisy 50×50×50 histogram of the dataset of location reports in Houston. We sample up to k points for each user from the complete dataset D to create the sampled dataset D_s. Then, we create a 3-dimensional histogram, H_s, of D_s. Finally, we create the histogram H̄ = H_s + Lap(k/ε), so that the data collection process satisfies ε-DP. The output of the data collection step is the noisy histogram H̄; Fig. 4.1 (d) shows this histogram, where each cell corresponds to the noisy density in an 800×800 meter cell over a 4-hour time period.
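The data collection step can be summarized by the minimal sketch below (later restated as lines 1–3 of Alg. 7). The column names, the unit-cube normalization of coordinates and the helper signature are illustrative assumptions rather than the released implementation.

```python
import numpy as np
import pandas as pd

def collect_noisy_histogram(df, k, eps, M, T, rng=None):
    """Sample at most k reports per user, build an M x M x T histogram and
    add Laplace noise of scale k/eps. df is assumed to have columns user_id,
    lat, lon, time, with lat, lon and time already normalized to [0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    # Keep at most k reports per user (shuffle, then take the first k rows).
    sampled = df.sample(frac=1.0, random_state=0).groupby("user_id").head(k)
    # Exact histogram H_s of the sampled reports.
    h_s, _ = np.histogramdd(
        sampled[["lat", "lon", "time"]].to_numpy(),
        bins=(M, M, T), range=((0, 1), (0, 1), (0, 1)))
    # Noisy histogram: sensitivity is bounded by k after sampling.
    return h_s + rng.laplace(scale=k / eps, size=h_s.shape)
```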
Figure 4.3: Spatial patterns over time on histogram slices.
4.2.3 Learned Denoising
Denoising uses machine learning, specifically VAEs, to identify spatio-temporal patterns and utilizes them to improve histogram accuracy. Our main observations are: (1) spatio-temporal histograms are similar in nature to a sequence of images, thus methods from image representation learning can be applied to capture data patterns; (2) regularized representation learning can ensure the model learns a denoised representation of the data while not over-fitting noise; and (3) multi-resolution learning can capture spatio-temporal patterns at different granularities.
4.2.3.1 Design Principles
Denoising with Regularized Representation Learning. We want to derive a denoised histogram Ĥ from H̄ that is similar to H, where similarity is measured as the norm ∥H − Ĥ∥, i.e., the sum of squared differences across all cells of the histograms. To achieve this, consider a function encoder(H̄) that creates an encoding, z, of the noisy histogram, and a function decoder(z) that outputs a histogram, Ĥ, from the encoding z. Consider the problem of learning an encoding z (i.e., by learning the functions encoder(.) and decoder(.)) so that ∥Ĥ − H̄∥ is minimized, where we call ∥Ĥ − H̄∥ the reconstruction error. Our goal is to obtain an encoding z that summarizes the patterns in H̄, since such patterns will also exist in the true histogram H. To see why this is possible, observe that a constraint on z limits its representation power. For instance, by setting the dimensionality of z to be lower than that of H̄, z cannot contain as much information as H̄. Thus, a regularized encoding, z, that minimizes the reconstruction error cannot contain all the information in H̄. By learning a regularized representation, z, a model is able to capture the patterns in H̄ that best summarize the histogram. Such a summary cannot include noise information, as noise does not generalize across the histogram (it is added independently to each cell) and can only be memorized individually per cell. Thus, by properly regularizing the encoding, we can find an encoding that is denoised, i.e., contains the patterns in the data but not the noise. Subsequently, by decoding such a representation, we can obtain a denoised histogram. That is, the regularization ensures that even though we try to minimize the reconstruction error ∥Ĥ − H̄∥, we obtain a histogram such that ∥Ĥ − H∥ is smaller than ∥H̄ − H∥.
Spatial Patterns as Visual Patterns. Denoising with regularized representation learning will be beneficial only if the model is able to extract the patterns in the histogram. To facilitate this, we observe that a 3-d histogram of the data can be seen as a sequence of images, as shown in Fig. 4.3 for our running example. The left side of Fig. 4.3 plots the 3-d histogram, which represents a time series of two-dimensional histograms, one per timestamp. We call each of these 2-d histograms a slice.
On the right side of the figure, we plotted various slices, each corresponding to a different timestamp. We can see that spatial patterns in the histogram are in fact visual patterns. For instance, patterns corresponding to roads or busy areas can be seen as lines or blobs in the image. Furthermore, such patterns are consistent and repeating over time, suggesting that representation learning can be achieved effectively using techniques from image feature learning. 80 Figure 4.4: A histogram slice with varying coarseness slic e r epr esen ta tion dif f er en t r esolutions aggr ega te a t tr aining se t, noisy histogr am Figure 4.5: Training set preparation Multi-ResolutionLearningatVaryingGranularity. Spatial patterns in the data exist at various granularities of the input histogram. Fig. 4.4 illustrates several degrees of histogram res- olution, from finest (left) to coarsest (right). We observe that patterns at finer resolutions feature roads more prominently, while patterns at coarser granularities feature primarily neighbourhood densities. Furthermore, the patterns in coarser granularity histograms are less affected by noise, which allows the model to still infer spatial density. Thus, we propose to train a single model based on data configured at multiple granularities to improve denoising accuracy. 81 4.2.3.2 DenoisingwithConvolutionalVAEs Based on the above principles, we utilize convolutional VAEs to denoise the histogram ¯ H. We first provide an overview of our algorithm, then provide more details on the role of regularization in our methodology, and finally present the algorithm pseudocode. Our method consists of three stages: (1) training data preparation, (2) model training and (3) model inference. We discuss each below. Training data preparation. Recall that we are given a noisy 3-d histogram, ¯ H, where the 2-d histograms resulting from each slice contain density information for different locations. Thus, we view the 3-dimensional histogram ¯ H as a set of two dimensional histograms, where thei-th element in this set, ¯ H i , is a 2-dimensional slice corresponding to thei-th timestamp. This is shown on the left side of Fig. 4.5. Then, as shown on the right side of Fig. 4.5, to utilize multi-resolution learning, we aggregate each of the slices at various resolutions. For instance, every block of2× 2 cells in ¯ H i are aggregated to obtain a new 2-d histogram which has a coarser granularity. This aggregation is done atr different resolutions, and all the aggregated histograms are put together in a training setT . Model Training. We use a convolutional VAE to perform regularized representation learning. Specifically, an encoder, which is a CNN, takes as input each 2-d slice and outputs a represen- tation for it. Denote by encoder(.;θ e ) the network whose parameters are θ e , and let ℓ be the dimensionality of the representation output. The representation is then fed to a decoder, which is another neural network, denoted by decoder(.;θ d ), where θ d are the parameters of the de- coder. The output of the decoder is a 2-d histogram, as shown in Fig. 4.6. To simplify notation, we directly input a set of 2-d histograms to the encoder, in which case the output is also a set of 82 representations (similarly for decoder). The model is trained to minimize a reconstruction loss, which is the difference between input slice and the output slice, and a regularization loss, which ensures that the learned representation follows some regularization constraints. Model Inference. 
Each slice, ¯ H i , is passed through the convolutional VAE, first encoded and then decoded, to obtain the denoised representation for ¯ H i . This is done for all timestamps, i, which allows us to obtain a denoised 3-d histogram, ˆ H. Note that, inference is not performed on any aggregated histogram (via multi-resolution), but only on the original noisy histogram ¯ H. In other words, the output of learned denoising is a single 3-d histogram ˆ H, which is at the same resolution as the noisy input histogram ¯ H. VAEandRegularizationDetails. We discuss parts of the VAE design relevant to the prob- lem of denoising. Further details of our model can be found in Sec. 3.5. We utilize the Vector Quan- tized variant of VAE (VQ-VAE), where the encoding is forced to follow a certain discrete structure (other variants, e.g., Gaussian latent distribution [63] are also possible, but we found VQ-VAE to perform the best). A discrete set,Υ , called acodebook, ofB different encodings, Υ= {e 1 ,...,e B }, where each e i is ℓ-dimensional, is learned, and VAE training process forces the encoder to out- put an encoding that is similar to an element in the codebook. Recall that encoder(.;θ e ) is a convolutional neural network that takes as input a 2-d histogram. For an input 2-d histograms in our training set, ¯ T ,encoder( ¯ T;θ e ) provides a set of representationsz. These representations are then input to the decoder to obtain reconstructions ˆ T = decoder(z;θ d ). VQ-VAE defines a distance function between z and Υ , d(z,Υ) , that measures how similar the encodings are to the codebook. d(z,Υ) is then minimized in the training process to ensure the encoder learns representations that are similar to the codebook. We callL G (z)=d(z,Υ) the regularization loss and define L C ( ˆ T) = P | ¯ T| i=1 ∥ ¯ T i − ˆ T i ∥ 2 as the reconstruction loss, whereT i is thei-th slice in the 83 c on v olutional enc oder de c oder encodings per input, r egulariza tion loss, r e c onstruc tion loss, tr aining se t, model output, Figure 4.6: Model Training training set and ˆ T i is the output of the VQ-VAE on thei-th training slice. We then train VQ-VAE to minimize α × L G (z) +L C ( ˆ T), where parameter α is introduced to control the amount of emphasis on the regularization. ParametersB andα control how regularization benefits denoising. (1) B controls the repre- sentation power of the encoding space: the smallerB is, the less information can be captured by different encodings, as the encodings for different slices are forced to be similar. On the other hand, whenB is large, different slices are allowed to take different representations, as the code- book allows for more variability. (2)α controls how much the encoder is forced to adhere to the codebook. When α is small, the encoder can learn representations that do not follow the dis- cretized structure. It allows learning different encodings for different slices, thus memorizing the information within slices instead of learning patterns that generalize. We empirically find significant benefit in invoking both of the above regularization aspects of VQ-VAEs. Specifically, we observed worse performance when setting α to a small value, em- phasizing the need for regularizing the encoding space. We also observed worse performance whenB is too small or too large, the former because not enough information can be stored in the learned encoding, and the latter because the encoding can become noisy (due to insufficient 84 regularization). 
Furthermore, privacy budget affects the extent of regularization required. For in- stance, when privacy budget is large, noise of smaller magnitude is added and thus finer grained patterns are preserved in the data. By increasingB in such a scenario, the model can learn more detailed features of the data. On the other hand, for smaller privacy budgets,B needs to be small so that the model does not overfit the noise. We further investigate these trade-offs in Sec. 3.5. Complete Denoising Algorithm. The complete denoising process is shown in Alg. 6. Lines 1-5 show how the training set is augmented with histograms at varying granularities. Lines 6-7 create a CNN as an encoder and a Transposed CNN as the decoder. The model is trained in Lines 8-16, where at Line 9 the encoder outputs encodings of the histograms in the training set and the encodings are then decoded by the decoder to reconstruct the histograms in Line 10. The model is then optimized with stochastic gradient descent to minimize the recon- struction loss and the regularization loss. Finally, after convergence, a forward pass of the model yields the denoised histogram. We have kept the discussion of convolutional VAEs at a high level and only provided details for ideas that pertain to the problem of denoising, without discussing in-depth the technical details in VAE and VQ-VAE design such as commitment and alignment loss [119], the reparametrization trick [125] and their relation to regularization. We provide implementation specific details of VQ-VAE in Sec. 3.5. 4.2.4 StatisticalRefinement Given that the values in the denoised histogram are based on the sampled dataset, they will be an underestimation of the true counts. In this section, we study how the values can be scaled 85 to accurately represent the true counts. We first discuss how differential privacy complicates this process of statistical refinement, then present notations and assumptions in our method and finally discuss the statistical refinement step. 4.2.4.1 EstimationwithDifferentialPrivacy Recall that we sampled a dataset D s from the true dataset D, and created a noisy histogram ¯ H from the sampled set. We retained up to k points per user, hence the size of D s is smaller than D. Thus, the number of data points that fall inside the histogram created based on D s will be an underestimation of the true number of data points. To adjust the observed answers based on sampled data points we need to scale them, so that they accurately represent the true numbers. However, DP affects how this scaling can be done. NoisyObservations. Scaling the values in ¯ H scales both the added noise and the observed values, thus amplifying the random noise. In other words, by scaling the observed values, we reduce the bias in our estimation (i.e., account for underestimation), but this scaling increases the variance in our estimation because the random noise gets amplified. Thus, in the case of sampling with differential privacy, it is important for our method to account for both bias and variance in the estimation. Private Sampling Procedure. The sampling procedure is data dependent, and its specific details may be unknown, due to privacy requirements. Therefore, we aim to derive a refinement approach that is agnostic to the sampling performed during data collection. For instance, the probability of sampling a point in a particular cell does not only depend on the total number of points in that cell, but also on which user they belong to. 
If all users in a cell have exactly one point in D, then all the points in that cell will be preserved, and thus the number of points in that cell in D_s will be the same as the number of points in the corresponding cell in D. However, if users in a cell have more than k points, then the number of points in that cell in D_s is less than in D. Due to differential privacy, information about the number of points per user in a cell can only be obtained by spending privacy budget, which is undesirable.
Algorithm 6 Learned Denoising
Input: A set of noisy 2-dimensional histograms, H̄
Output: A set of denoised 2-dimensional histograms, Ĥ
1: T̄ ← H̄
2: for j ← 2 to r do
3:   for H̄_i in H̄ do
4:     H̄_i^j ← histogram obtained by aggregating j×j blocks of H̄_i
5:     T̄ ← T̄ ∪ H̄_i^j
6: encoder(.; θ_e) ← CNN encoder with parameters θ_e
7: decoder(.; θ_d) ← transposed-CNN decoder with parameters θ_d
8: repeat
9:   z ← encoder(T̄; θ_e)
10:  T̂ ← decoder(z; θ_d)
11:  L_C(T̂) ← Σ_i ∥T̄_i − T̂_i∥²
12:  L_G(z) ← d(z, Υ)
13:  L ← α × L_G(z) + L_C(T̂)
14:  θ ← θ_e ∪ θ_d
15:  update θ in direction −∇_θ L
16: until convergence
17: return decoder(encoder(H̄; θ_e); θ_d)
4.2.4.2 Estimation Algorithm
Taking into account the above observations, we use mean squared error minimization to decide how the answer should be scaled, which accounts for both the bias and the variance, and thus ensures that if the noise is too severe it is not amplified. Moreover, rather than spending privacy budget to estimate the sampling procedure, we make simplifying assumptions to create a tractable sampling model that can be analyzed mathematically. In the remainder of this section, we first describe our sampling model and then show how mean squared error minimization can be used to decide how the observed noisy answers should be scaled to accurately represent the true data.
Notation and Modeling Assumptions. Let N = |D| be the total number of data points and n = |D_s| be the observed number of data points after sampling. We make simplifying assumptions about the sampling procedure for the purpose of our analysis. Specifically, we consider the case where the n points are sampled independently and uniformly at random. Let X_i^c be the indicator random variable equal to 1 if the i-th point falls in a cell c. Furthermore, let µ_c be the proportion of data points in the complete dataset that are in cell c, so that N × µ_c is the total number of data points in cell c. We assume that the i-th point is sampled uniformly at random across all data points, so that P(X_i^c = 1) = µ_c.
Algorithm. Our goal is to design an estimator of N × µ_c for all cells c in the histogram. Our estimator needs to be accurate, but at the same time has to preserve differential privacy. We consider the estimator θ_c = γ (Σ_i X_i^c + Lap(k/ε)). θ_c obtains a differentially private estimate of the observed number of points in cell c and scales it by a parameter γ. We find the parameter γ by minimizing the mean squared error of our estimator θ_c. To do so, we first calculate the bias and variance of the estimator:
Bias(θ_c) = E[θ_c − Nµ_c] = µ_c(γn − N)
Var(θ_c) = γ²(nµ_c(1 − µ_c) + 2k²ε⁻²)
Thus, given the mean squared error of an estimator, MSE(θ_c) = Bias(θ_c)² + Var(θ_c), we obtain
MSE(θ_c) = γ²(nµ_c(1 − µ_c) + 2k²ε⁻²) + µ_c²(γn − N)².
Algorithm 7 VDR algorithm
Input: A dataset D, privacy budget ε, spatial (M) and temporal (T) discretization granularity, sampling parameter k, and refinement factor C
Output: Differentially private 3-d histogram Ĥ of D
1: D_s ← sample k points per user in D
2: H_s ← M × M × T histogram of D_s
3: H̄ ← H_s + Lap(k/ε)
4: Ĥ ← Denoise(H̄)
5: return γ_C × Ĥ
Next, we find the γ value that minimizes the error across all cells. Let m = M × M × T be the number of cells in the histogram. We minimize Σ_{c=1}^{m} MSE(θ_c) by taking its derivative with respect to γ and setting it to zero. We obtain that
γ = nNC / (2mk²ε⁻² + (1 − C)n + Cn²)    (4.2)
minimizes Σ_{c=1}^{m} MSE(θ_c), where C = Σ_{c=1}^{m} µ_c² is a data-dependent constant. It remains to determine the value of C, but doing so on the private data itself may require spending privacy budget. However, due to inherent properties of location datasets, in practice C can be treated as a system parameter and set, in a data-independent manner that does not consume valuable privacy budget, to a fixed value that works well. We further discuss how C can be set in Sec. 4.3.3.
4.3 System Design and Analysis
4.3.1 Privacy Analysis
Alg. 7 shows our proposed end-to-end algorithm. Lines 1–3 correspond to the data collection step, line 4 calls Alg. 6 to perform learned denoising, and line 5 uses the value of γ_C calculated in Eq. 4.2 to scale the results (we write γ_C to make explicit the dependence on the factor C). Alg. 7 only accesses the data in the data collection step. Thus, lines 4–5 can be seen as a post-processing step and do not consume any privacy budget. Data collection is ε-DP, and thus the entire VDR provides ε-DP, as stated by the following:
Theorem 2. Algorithm 7 is ε-DP.
The proof is in our extended technical report [9].
4.3.2 VDR Design Choices
Real-world use of spatio-temporal data extends beyond the simple range count queries that are commonly studied and optimized for by existing approaches to location data release, which are typically partitioning-based [96, 96, 137, 97, 139, 141, 49]. Other common query types, such as forecasting POI visits or finding hotspots, are sensitive to the biases that such approaches introduce, causing them to perform poorly. VDR's approach of denoising a histogram created by the Laplace mechanism (LPM) offers significant benefits across different spatio-temporal queries by avoiding such biases.
Forecasting Queries. Forecasting methods are often robust to random noise present in real data, and some even explicitly incorporate its effects in their models. Thus, a DP mechanism that only introduces random noise, such as LPM, can perform well, whereas mechanisms that approximate the density of regions by cleverly grouping and partitioning the domain introduce additional bias and obliterate trends and seasonal effects present in the timeseries.
Hotspot Queries. For hotspot discovery, if a DP mechanism underestimates counts in the region of the hotspot, then it will incur a distance penalty due to not having found the correct spot, and may in addition incur a large regret, up to the maximum of the density threshold. This happens if an approach creates coarse partitions of the data, thus underestimating the density of ‘hot’ peaks. Selectively creating finer partitions can improve the result, since some ‘hot’ peaks may be preserved. Nonetheless, modelling errors in deciding where to create fine partitions can cause underestimation in some regions, resulting in a large distance penalty. On the other hand, a bias-free approach such as LPM performs better since it is not affected by the systematic reduction in data utility that partitioning approaches incur.
Range Count Queries.
Answering larger query ranges over LPM requires aggregating more histogram cells, each contributing additional error to the answer. Therefore, VDR is specifically designed to denoise (reduce the variance of) a bias-free mechanism, smoothing out the noise by exploiting the inductive bias that spatial patterns exist in location datasets. In this way, it can improve forecasts significantly by preserving timeseries-specific factors and discover hotspots that likely meet the threshold, while not sacrificing the quality of results for range count queries.

4.3.3 System Parameter Selection

We discuss the impact of the system parameters k, C and the collection granularity on the performance of the system, and provide guidelines on how they should be set in practice.

4.3.3.1 Refinement Factor and Sampling Parameter

Recall that in our data collection step (Alg. 7, line 1), we sample up to k points per user, and in the statistical refinement step we scale our result by a factor γ_C (Alg. 7, line 5), which depends on the refinement factor C. Both parameters, as discussed below, depend on data skewness as well as on the distribution of user contributions. However, due to DP, measuring data-dependent properties requires spending privacy budget, which is scarce. Next, we discuss potential trade-offs in the values of these system parameters, and heuristics to set each.

Sampling parameter, k. For accuracy, the sampled dataset should retain the density characteristics of the original dataset. After scaling with our statistical refinement step, the obtained query answers should be close to the original counts. In our real-world datasets, the true data size, N, plays an important role in the interplay between true data characteristics and sampled ones. Specifically, when N increases, most cells in the true 3-d histogram, H, remain empty or retain small values due to data sparsity, while the number of reported locations in dense cells increases. This results in a more skewed true dataset. Thus, for the sampled dataset D_s to capture this skewness, we need a larger number of samples; otherwise our estimation will have a very large variance. We conclude that the value of k should grow with the data size. Our experiments in Section 4.4.3 corroborate this heuristic, showing that the growth ratio λ, defined as k*/N, where k* is the best possible sampling rate, stays almost constant across datasets of various sizes. In fact, we observe that this value remains constant across different cities, suggesting that due to the similarity in skewness inherent to location datasets, we can set the value of k to be a constant fraction of N. Details of these observations are presented in Sec. 3.5.

Refinement Factor, C. Recall that C determines how query answers are scaled to obtain the final histogram. Our theoretical model suggests that C = Σ_c µ_c², but we do not have access to µ_c due to differential privacy, so we treat C as a system parameter. Note that C depends on the data distribution. For instance, if the data are uniformly distributed, µ_c will be equal to (N/m)/N = 1/m, so that Σ_c µ_c² = m × 1/m² = 1/m. On the other hand, if all the N points are in a single cell, then Σ_c µ_c² = 1. For location datasets, we expect similar skewness across the space, and as a result, similar values of C should perform well across datasets. Our results in Sec. 4.4.4 confirm that the same value of C can be used with distinct datasets and sampling rates. We suggest setting the value of C to one that performs well on a public dataset.
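To make the statistical refinement step concrete, the following is a minimal sketch (in Python/NumPy, the language of our implementation) of computing γ_C from Eq. 4.2 once C has been fixed as a system parameter; the numeric values are hypothetical and shown only for illustration.

import numpy as np

def refinement_scale(n, N, m, k, eps, C):
    """gamma_C from Eq. 4.2.

    n   : number of points in the sampled dataset D_s
    N   : number of points in the full dataset D
    m   : number of histogram cells (M * M * T)
    k   : per-user contribution bound (the Laplace noise has scale k / eps)
    eps : privacy budget
    C   : refinement factor, treated as a system parameter
    """
    return (n * N * C) / (2 * m * k**2 * eps**-2 + (1 - C) * n + C * n**2)

# Hypothetical values for illustration only.
gamma_C = refinement_scale(n=4e6, N=8e7, m=576 * 576 * 96, k=32, eps=0.2, C=5e-5)
denoised_hist = np.zeros((576, 576, 96))   # stand-in for the output of Alg. 6
refined_hist = gamma_C * denoised_hist     # line 5 of Alg. 7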
Having C as a system parameter is advantageous because it allows for the correction of errors that have been introduced by our theoretical modeling. For instance, our analysis in Sec. 4.2.4.2 does not take into account the impact of denoising, which we expect to be consistent across datasets. By setting C as a system parameter, we can avoid adverse impacts of such modeling errors in practice. Moreover, since the modeling error is consistent across datasets, the same value of C can be set for all datasets.

4.3.3.2 Discretization Granularity

We assume throughout our discussion that a 3-d histogram of a predefined granularity is required. This is often the case, since the choice of discretization tends to be domain specific. In the case of location datasets, high-resolution density maps are preferred [39], and VDR is the first attempt at releasing high-resolution spatio-temporal density counts, where we choose to release counts at the same granularity as industry data release projects, e.g., [39]. This has typically been challenging since publicly available location datasets are small, and thus not suitable for studying at high resolution. The primary concern is that a fine-grained histogram will have small true count values per cell, and since the scale of the DP-added noise is proportional to the sensitivity and not to the actual data counts, the resulting signal-to-noise ratio will be low, compromising accuracy. On the other hand, if a coarse discretization is used, it induces an implicit bias in the results, since fine-resolution queries need to estimate answers by assuming a uniform distribution of points within each coarse cell, and will thus be inaccurate.

4.3.4 Data Release over Time

So far we have considered the release of a static dataset D. In practice, spatiotemporal data is released over time, with new data coming in continuously. In such a setting, privacy budget is often allocated per time period, e.g., a budget of ε_i would be allocated for the i-th week (ε_i is typically set to go to zero so that Σ_i ε_i is bounded). Thus, the release consists of a sequence of datasets D_1, D_2, ..., where each D_i covers a fixed period of time. Let τ denote the duration covered by each D_i, which we call the release duration. To use VDR in this setting, Alg. 7 can be called for every release, where in the i-th release the input dataset is D_i and the privacy budget is ε_i. However, an important characteristic of VDR is that the model does not need to be retrained for every data release. That is, rather than retraining the model in the learned denoising step for every release, the model, once trained, is still able to denoise the input histograms of later releases. We verify this empirically in our experiments. This also shows that our model learns recurring patterns from the data, which generalize well to unseen data points.

4.4 Experimental Evaluation

Sec. 4.4.1 describes the experimental testbed. Sec. 4.4.2 compares VDR with state-of-the-art approaches. Sec. 4.4.3 ablates design choices. Sec. 4.4.4 presents a case study of a user-level density release of spatiotemporal data, and the effect of statistical refinement on answer quality. Sec. 4.4.5 showcases VDR's effectiveness on non-uniform datasets.

4.4.1 Experimental Settings

Dataset Description. All datasets consist of user check-ins specified as tuples of: user identifier, latitude and longitude of the check-in location, and timestamp.
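As a concrete illustration of how such check-in tuples are turned into the M × M × T histogram used throughout this chapter, the sketch below assumes a known bounding box for the studied urban area; the function and variable names are ours and not part of any released code.

import numpy as np

def build_histogram(checkins, bbox, t_range, M=576, T=96):
    """Discretize (lat, lon, unix_time) check-ins into an M x M x T count histogram.

    checkins : array of shape (num_records, 3) with columns [lat, lon, unix_time]
    bbox     : (lat_min, lat_max, lon_min, lon_max) of the studied area
    t_range  : (t_min, t_max) covering the studied time period
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    t_min, t_max = t_range
    lat_idx = np.clip(((checkins[:, 0] - lat_min) / (lat_max - lat_min) * M).astype(int), 0, M - 1)
    lon_idx = np.clip(((checkins[:, 1] - lon_min) / (lon_max - lon_min) * M).astype(int), 0, M - 1)
    t_idx = np.clip(((checkins[:, 2] - t_min) / (t_max - t_min) * T).astype(int), 0, T - 1)
    hist = np.zeros((M, M, T))
    np.add.at(hist, (lat_idx, lon_idx, t_idx), 1)  # one count per record in its cell
    return hist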
Our primary dataset is proprietary, obtained from Veraset [115] (VS), a data-as-a-service company that collects anonymized movement data from 10% of the cellphones in the U.S. [116]. For a single day in Jan 2020, there were 2.4 billion readings from 27.2 million distinct users. Where applicable, we also present our results on public datasets. Public datasets typically contain sporadic check-ins made over a relatively long period of time, as opposed to the real longitudinal trajectories of users that the proprietary dataset offers. Our first public dataset, from the Foursquare geo-social network (4SQ) [128], was collected during a time period of 22 months from April 2012 to 2014. It contains 22M check-ins by 114k users at 3.8M POIs. Our second public dataset is a subset of the user check-ins in the US collected from the Gowalla (GW) network by the SNAP project [27]. It contains 6.4 million records from 196k unique users during a time period between February 2009 and October 2010. Our third public dataset is the San Francisco taxi dataset (CABS_SF) [95], derived from the GPS coordinates of approximately 500 taxis collected over 24 days in May 2008.

To simulate a realistic environment of a city and its suburbs, we consider urban areas in the US covering 20 km². For the Veraset data, we select cities based on their population density [71]. In particular, we use Salt Lake City (VS_SL) in Utah as a low-density city (41M points from 600k users), Los Angeles (VS_LA) in California as a medium-density city (80M points from 852k users), and Houston (VS_HT) in Texas as a high-density city (221M points from 1.28M users). For all primary datasets, we discretize the temporal domain to 3 hours, giving a total of T = 96 slices for the 12-day period from Jan 7 to 19, 2020. For each of the secondary datasets, we discretize the temporal dimension such that each slice covers the duration of one month, for a total of T = 24 slices. From the Foursquare dataset we consider Tokyo, Japan (4SQ_TKY). The subset contains a total of 755k location updates from 8k unique users. From the Gowalla dataset we select San Francisco (GW_SF), with 568k location updates from 14k users, and New York (GW_NY), with 520k location updates from 16k users. For the CABS dataset, following [51, 96], we keep only the start and end points of the mobility traces of the 500 taxis, for a total of 846k records.

Parameter Settings. Following recommendations in industry [39] and the research literature [69, 140], we partition the space into a 576 × 576 (i.e., M = 576) grid to obtain 30m × 30m cells. As described above, the temporal granularity is set specifically for each dataset. The default value of the privacy budget ε is set to 0.2.

Query Evaluation Metrics. Evaluation metrics are defined in Sec. 4.1. To evaluate the range count query metric, we construct query sets of 5,000 RCQs centered at the cells of randomly selected data records. Each query has a side length that varies uniformly from 30 meters to 120 meters. To set the smoothing factor ψ, we extend the recommended value of 0.1% of the cardinality n of the spatial dataset for snapshot queries [140, 29, 96] to 0.1% of the cardinality n/T of the average slice of the spatio-temporal dataset. When comparing multiple datasets with each other, the smallest smoothing constant among them is used to remain consistent.

We evaluate the forecasting query only on subsets of the Veraset data, since the other datasets do not contain timeseries of sufficient length.
We generate 100 forecastable timeseries by sampling positions of POIs in the city, extracting count data for between 8 and 12 days (random lengths), and selecting only those series that satisfy an autocorrelation seasonality test [77] (90% confidence) at a seasonal period of 8 (i.e., daily seasonality, given the 3-hour temporal discretization of the 24-hour period). To make forecasts, we use the winning algorithm of the M3 forecasting competition [77], Theta [10], which is a variant of the Simple Exponential Smoother. We use the data of all but the last day to fit the forecaster and evaluate its predictions for the last day (i.e., a forecasting horizon of h = 8). We report the sMAPE error as defined in Sec. 4.1.

Lastly, the hotspot query is evaluated at a specified threshold ν. The query, originating from the cells of 1,000 randomly selected data points, is answered using an expanding search within a 3-d spatial region SR with lengths of 5 km in the latitude and longitude dimensions and no bound in the time dimension.

Implementation. All algorithms were implemented in Python and executed on Linux machines with an Intel i9-9980XE CPU, 128GB RAM and an RTX3090 GPU. Neural networks are implemented in JAX [16]. Given this setup, VDR took up to 50 minutes to train for 12 days of the Veraset Houston data. The inference time of VDR is less than 1 ms and the model takes 9 MB of space. We publicly release the source code at [8].

Model Training. For Multi-Resolution Learning we augment the training set at r = 3 granularities chosen at equal spacing between the minimum (30m) and maximum (120m) query ranges to be evaluated. To train the model, we utilize the Adam [62] optimizer with Exponential Moving Average (EMA) updates [100]. The encoder and decoder architecture is based on a ResNet [111]. The model takes in batches the slices of the histogram, each of size 576 × 576, and passes them through the ConvNet encoder θ_e and decoder θ_d. The EMA version trains much faster than the non-EMA version, especially when using Multi-Resolution Learning. In all our experiments, we utilize hyperparameters consistent with those utilized in previous work [114, 100], i.e., ℓ = 64, B = 128 and a batch size of b = 8.

4.4.2 Comparison with Baselines

Baselines. We use as benchmarks Uniform Grid (UG) [96], Adaptive Grid (AG) [96], HB_striped [137, 97], PrivBayes [139], AHP [141] and MWEM [49]. We utilize Ektelo [137, 50], an operator-based framework for implementing privacy algorithms. To extend approaches originally designed to support range queries over two-dimensional (spatial-only) data to the 3-d case, we partition the temporal domain into parallel 'stripes'/slices for each fixed value of the remaining dimensions, so that the measurements are essentially the 2D histograms resulting from each slice. For example, HB_striped [137] performs on each slice the HB algorithm [97], which builds an optimized hierarchical set of queries. We similarly implement Uniform Grid (UG) [96] and Adaptive Grid (AG) [96]. We use as-is algorithms that are designed to extend to the multi-dimensional setting, such as PrivBayes [139], AHP [141] and MWEM [49]. Lastly, we also considered QuadTree [29], but the results were far worse than those of the above approaches and thus are not reported. We were unable to run the DAWA [69] algorithm directly on such a large domain due to memory and computational constraints: DAWA is designed for 1D inputs and extends at best to the 2-d setting using a Hilbert-curve-based domain reduction.

Privacy Model.
Since none of the existing baselines consider user-level privacy, to allow for a fair comparison with the baselines, we disable the user-level features of VDR and present experiments with event-level privacy. That is, for the results in this section, VDR does not perform sampling or statistical refinement and we assume each data record belongs to a separate user, hence the privacy protection offered degrades to event-level. We also note that since the public datasets have a very limited number of points (orders of magnitude fewer than Veraset), user-level privacy evaluation on them is not feasible. Thus, presenting event-level results here also allows for the reproducibility of our results on public datasets. We thoroughly evaluate the application of user-level privacy using Veraset data in Sec. 4.4.4.

[Figure 4.7: Impact of privacy budget on range count query (RCQ) accuracy; relative error vs. ε for 4SQ_TKY, CABS_SF, GW_NY, VS_HT, VS_LA and VS_SL.]
[Figure 4.8: Impact of query size on RCQ accuracy (GW_NY and VS_HT).]

4.4.2.1 Range Count Query

Figure 4.7 presents the error of VDR and the compared approaches when varying ε. Recall that a smaller ε means stronger privacy protection (hence a larger noise magnitude). Unsurprisingly, the error reduces as the noise magnitude decreases, with VDR consistently outperforming all competitor approaches. This shows that VDR is effective in capturing the spatial patterns present in real-world location datasets and in smoothing excessive noise by exploiting such information. We also evaluate the impact of query size on the accuracy of range count queries by considering test queries of three different sizes for the datasets GW_NY and VS_HT. Figures 4.8(a) and 4.8(b) show that the error for all the algorithms increases when the query size grows. Answering larger query ranges requires aggregating more grid cells, each contributing additional error to the answer.

[Figure 4.9: Impact of privacy budget on forecast query error (sMAPE) for VS_HT, VS_LA and VS_SL.]

4.4.2.2 Forecasting Query

Figs. 4.9(a)-(c) show results on the Veraset data subsets VS_HT, VS_LA and VS_SL. Due to noise, all DP mechanisms produce worse forecasts (compared to 'No Noise'), as a direct result of the poor fitting of the Theta forecasting algorithm. DP mechanisms that only introduce random noise, such as LPM, can perform well, whereas those that infer the density of regions by cleverly grouping and partitioning the domain introduce a bias and obliterate the trends and seasonal effects present in the timeseries. VDR, with its ability to smooth out the noise, improves these forecasts significantly by preserving timeseries-specific factors.
Lastly, in some instances, such as for ε = 0.05 in Fig. 4.9(c), UG can perform well by learning an average value for the timeseries period and making naive forecasts that predict the last period's actuals as the next period's value, without establishing causal factors.

[Figure 4.10: Impact of privacy budget on the hotspot query for the 4SQ Tokyo dataset at various thresholds (MAE and Regret at thresholds 10, 20 and 40).]
[Figure 4.11: Impact of privacy budget on hotspot query accuracy (MAE and Regret for GW_NY, VS_HT and VS_LA).]

4.4.2.3 Hotspot Query

Figure 4.10 reports the Mean Absolute Error (MAE) and Regret metrics of the hotspot query for the Foursquare Tokyo dataset at density thresholds of 10, 20 and 40. Fig. 4.11 reports the accuracy for hotspot queries on various Veraset subsets at the fixed density threshold of ν = 20. Mechanisms that directly model the data distribution, such as MWEM, AHP and PrivBayes, tend to underestimate density globally, and incur a large MAE and regret penalty, up to the maximum of the density threshold. To the same effect, UG, due to its coarse partitioning of the data domain, underestimates the 'hot' peaks that the query searches for, also experiencing both a large MAE and regret. AG improves these estimates to some extent by building a finer domain partitioning in the lower level of its hierarchy, and while it may not locate the closest hotspot (high MAE), it still finds one that meets the density threshold (lower regret). LPM fares well for this query as it is not affected by the biases that partitioning approaches bring about. VDR further improves on LPM in both metrics. In all instances, VDR finds an effective balance between the MAE and regret error metrics.

[Figure 4.12: Impact of varying learning period.] [Figure 4.13: Out-of-sample denoising.] [Figure 4.14: Learning period analysis.]

Our results show that VDR clearly outperforms the existing state-of-the-art mechanisms. In particular, by starting with an unbiased estimate of the density counts and denoising them, our approach has clear advantages when used for answering range count queries, finding hotspots or forecasting POI visit counts.
In the rest of this section, we no longer consider competitor approaches, and focus solely on analyzing the parameters of our system.

[Figure 4.15: Multi-Resolution Learning.] [Figure 4.16: Model regularization analysis.]

4.4.3 System Analysis

4.4.3.1 Modeling choices

Effect of model regularization on performance. We evaluate the effectiveness of the Variational AutoEncoder in denoising DP histograms. Recall that, by training a lower-dimensional representation of the data, we wish to learn patterns without overfitting to the noise. In Fig. 4.16 we evaluate two methods: 'VDR' is our approach of using a VAE as a regularization-enabled AutoEncoder, while 'R. focus' simulates an AutoEncoder by over-emphasizing the reconstruction loss (i.e., by setting α to a very small value). We consider both public and proprietary datasets while varying the bottleneck size. We notice that a small bottleneck performs poorly due to having limited representation power to learn the input data. When increasing the bottleneck, we see polar effects in the presence and absence of regularization. In the case of 'R. Focus', the model quickly overfits to the noise, decreasing accuracy. In contrast, if the AutoEncoder is sufficiently regularized, accuracy remains good even for large models due to the learning of generalizable patterns, emphasizing the need for regularizing the encoding space.

Effect of learning period. Fig. 4.14 shows the accuracy of denoising when we train the VAE with a varying number of slices. When the number of learning slices is one, we have in essence a snapshot dataset in 2D. As we add more slices to training, the learning is stabilized and the learned patterns help achieve better denoising performance on the entire dataset.

Effect of Multi-Resolution Learning on accuracy. Fig. 4.15 shows that across all datasets, by augmenting the training set with coarser-granularity histograms, we learn a model that can answer queries more accurately. This is also due to the smoothing effect of the ConvNet, as learned information from one slice helps denoise another within the same dataset.

4.4.3.2 Data release over time

We study the effectiveness of VDR when releasing data over time. Specifically, we measure how often VDR needs to be retrained when new data arrive. For this experiment we utilize the Veraset Houston data for a period of 24 days, with each slice representing a one-hour time period. We consider a training period, T_b to T_e, where T_b is the beginning of the training period and T_e is the end, and a testing period that starts at T_e and ends at T_t. We refer to the period T_b to T_e as in-sample and to T_e to T_t as out-of-sample. We evaluate the performance of the model in two scenarios. In Figure 4.12, we test the denoising performance of VDR on an out-of-sample period of 3.5 days (84 slices). VDR in-sample and out-of-sample show the accuracy of VDR on the in-sample and out-of-sample periods, respectively. The training data period is varied by moving T_b forward in time while keeping T_e and T_t the same (so the training period ranges from 420 to 12 slices).
We see that the performance first improves when the training period is up to 3.5 days, as having more data helps the model denoise via better generalizability. But as even more data is used in the training, the specificity of the patterns is reduced, and hence accuracy suffers.

[Figure 4.17: Sampling Error and Noise Error at varying sampling rate.] [Figure 4.18: Effect of scaling factor g on SE and SE+NE.]

In the second setting (Fig. 4.13), we train VDR over slices of 'TP' (Training Period) number of days, and use the trained model to denoise. For in-sample testing, the Transfer Period refers to the accuracy of VDR on the training data itself. We see that the model accuracy is mostly unaffected. For out-of-sample testing, the Transfer Period refers to the number of days from T_e to T_t. We increase T_t so that the transfer period ranges from 1 day to 9 days. We see that performance degrades when denoising out-of-sample periods far from the training period. We recommend retraining the VAE model every couple of days.

4.4.4 User-level privacy and statistical refinement

We consider several Veraset data subsets, and set ε = 6, analogous to [6, 22]. Recall that, in order to release data without consuming excessive privacy budget, we bound the maximum number of contributions of a user to k, consequently having access to only the sampled subset D_s to learn our model. Note that in Figs. 4.17, 4.18, 4.19 and 4.20 the relative error metric is evaluated w.r.t. the true data D, while the query computations are performed over D_s.

Bounding user contribution. In Fig. 4.17, we evaluate the accuracy of answering range count queries when varying the sampling rate k. Sampling error (denoted SE) measures the error induced purely by bounding the contribution of each user to k. As expected, SE decreases as the sampled subset D_s comes closer to representing the true dataset D (also refer to Fig. 4.2). We also report the error of the DP-preserving answer using LPM. Denoted by SE+NE, this answer contains errors induced by DP-compliant noise in addition to those due to sampling. As k increases, the benefits from increased sampling dominate the total effect (see k = 2 to 16); however, the noise due to the increased sensitivity of the query eventually grows faster than the rate at which the sampling error decreases, causing a sharp increase in the error of the reported answer (see k = 32 to 256). We discuss how to set the ideal value of k at the end of this section.

Analyzing the effects of brute-force debiasing. Recall that the answer to the range count query reported on D_s can be scaled by N/n to potentially debias the result. However, since the data is skewed and noise has been added, such a scaling can affect the results negatively. We vary the degree of scaling as g × N/n for values of g from 0.1 to 1. Figure 4.18 shows that for the sampling-induced error SE, scaling the answer can be useful. However, if there is in addition noise error (SE+NE), then upscaling also amplifies the noise in the reported counts and almost always yields poor outputs.
Therefore, to utilize any form of scaling it is important to capture the model of the data accurately, such as by denoising with VDR.

Statistical refinement. The refinement constant C determines the degree of scaling that is applied to the query answer. For example, at C = 1, γ approaches N/n, which is equivalent to basic scaling. In Figure 4.19, we evaluate the accuracy of VDR while varying C at various degrees of sampling k. Remarkably, among all settings the lowest error is achieved at C = 5e-5, substantiating our claim that a fixed value of C is sufficient to refine answers.

[Figure 4.19: Varying refinement constant C for VDR.] [Figure 4.20: Benefit of VDR refinement with C = 5e-5.]

In Fig. 4.20 we compare the accuracy of VDR (at C = 5e-5) with the approach that reports the answer computed on D_s as-is. In all datasets, there is a clear benefit to using statistical refinement, improving the error by up to 40% in the case of VS_HT, a high-density city.

We conclude with recommendations on how to set k, continuing our discussion from Sec. 4.3.3. We determine the value of the growth ratio λ empirically from the data of another city (assumed public for the sake of privacy accounting), where we find λ = 2.5e-7 to give the best accuracy. For this value our heuristic (k = λN) recommends setting k = 10 for VS_SL, k = 20 for VS_LA and k = 53 for VS_HT. As we see in Fig. 4.20, these values of k achieve close to the best accuracy for their corresponding cities. This suggests that due to the similarity in skewness inherent to location datasets, we can set the value of k to be a constant fraction of N.

4.4.5 Learning Ability on Non-Uniform Datasets

Setup. We synthesize 2M points from a Gaussian Mixture Model (GMM) [102] with 50 components positioned at random in the 3-d integer lattice Z³ of size 9 × 9 × 9. All components are equally weighted and have the covariance matrix I × σ², where I is the identity matrix. To control the variation around each mean value, we adjust the parameter σ. The synthesized data is partitioned into a 3D histogram of 100 × 100 × 100 cells. We train and denoise with VDR on the 100 slices. We report σ in terms of the number of such cells, with a smaller variance implying a data spread tighter around the mean of each GMM component, thus mimicking the skewed data distributions typically present in spatio-temporal location datasets.

[Figure 4.21: Gaussian Mixture Model visualization at σ varying from 3 (top), 5, 7, 9 (bottom).]

Figure 4.21 visualizes a single slice of this dataset with its true values (left), noisy data collections (middle) and denoised reconstructions (right). VDR has a strong ability to recover the underlying patterns of GMMs even from highly distorted observations.
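The synthetic data generation described above can be sketched as follows (NumPy only); we assume for illustration that the 9 × 9 × 9 lattice is scaled to span the 100-cell domain and that σ is measured in cells—details such as the seed and the exact lattice placement may differ from our actual generator.

import numpy as np

def synthesize_gmm(num_points=2_000_000, num_components=50, lattice=9, sigma=3, grid=100):
    """Draw points from an equally weighted GMM with covariance sigma^2 * I, whose means
    sit on random positions of a lattice scaled to the [0, grid) domain, then bin the
    points into a grid x grid x grid 3-d histogram."""
    rng = np.random.default_rng(0)
    means = rng.integers(0, lattice, size=(num_components, 3)) * (grid / lattice)
    assignment = rng.integers(0, num_components, size=num_points)   # equal component weights
    points = means[assignment] + rng.normal(0.0, sigma, size=(num_points, 3))
    points = np.clip(points, 0, grid - 1e-9)
    hist, _ = np.histogramdd(points, bins=(grid,) * 3, range=[(0, grid)] * 3)
    return hist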
Moreover, Figure 4.22 (ε = 0.2) shows that as we increase the variance σ² of the GMM components, the model performance suffers, since at a large variance (such as the ones depicted in Figure 4.21 for σ = 5, 7) the data is more uniformly distributed and lacks the spatial patterns typically exhibited by location datasets.

[Figure 4.22: Varying σ².] [Figure 4.23: Varying bottleneck.]

Lastly, we evaluate the effect of varying the bottleneck size of the VAE on the learning ability of VDR. Figure 4.23 shows that, for a given privacy budget, a larger bottleneck is required to capture more skewed datasets. Moreover, when the data is skewed (compare the lines for ε = 0.1 and ε = 0.8 at σ = 1), less DP noise in the data collection step helps further emphasize the data spread, thus benefiting from a larger model capacity to learn precisely such patterns.

Chapter 5
Differentially-Private Next-Location Prediction with Neural Networks

In this chapter we focus on the task of next-POI prediction. Given historical trajectories, several approaches exploit recent results in neural networks to produce state-of-the-art POI recommender systems [21, 127, 73]. Subsequently, these models can be (i) placed on user devices to improve the quality of location-centric services; (ii) shared with business affiliates interested in expanding their customer base; or (iii) offered in a Machine-Learning-as-a-Service (MLaaS) infrastructure to produce business-critical outcomes and actionable insights (e.g., traffic optimization). Figure 5.1 illustrates these cases.

We extend the training of sophisticated ML models to also include DP protections, because even though individual trajectory data are not disclosed directly, the model itself retains significant amounts of specific movement details, which in turn may leak sensitive information about an individual's health status, political orientation, entertainment preferences, etc. The problem is exacerbated by the use of neural networks, which have the tendency to overfit the data, leading to unintended memorization of rare sequences which act as quasi-identifiers of their owners [30, 20]. Hence, significant privacy risks arise if individual location data are used in the learning process without any protection.

[Figure 5.1: System Model — mobile users send location updates to service providers (search, ride sharing, machine learning, networking), who publish models and use them for business insights, targeted advertisement and next-POI recommendation.]

The research literature has identified several fundamental privacy threats that arise when performing machine learning on large collections of individuals' data. One such attack is membership inference [109, 53], where an adversary who has access to the model and some information about a targeted individual can learn whether the target's data was used to train the model. Another attack, called model inversion [120], makes it possible to infer sensitive points in a trajectory (e.g., a user's favorite bar) from non-sensitive ones (e.g., a user's office). Within the MLaaS setting—where a third party is allowed to only query the model—this implies extracting the training data using only the model's predictions [43].
Iterative procedures such as stochastic gradient descent (SGD) [15] are often used in training deep learning models. Due to the repeated accesses to the data, they raise additional challenges when employing existing privacy techniques. In order to prevent the inference of private information from the training data, recent approaches rely on the powerful differential privacy (DP) model [32]. Sequential querying using differentially private mechanisms degrades the overall privacy level. The recent work in [2] provides a tight-bound analysis of the composition of the Gaussian mechanism for differential privacy under iterative training procedures, enabling the utility of a deep learning model to remain high [82] while preventing the exposure of the training data [56, 13]. While integrating differential privacy techniques into training procedures like stochastic gradient descent is relatively straightforward, computing a tight bound of the privacy loss over multiple iterations is extremely challenging.

The seminal work in [2] provided record-level privacy for a simple feed-forward neural network trained in a centralized manner. The approach provides protection only when each individual contributes a single data item (e.g., a single trajectory). When an individual may contribute multiple data items, a stricter protection level is required, called user-level privacy. McMahan et al. [82] showed that one can achieve user-level privacy protection in a federated setting for simple learning tasks. However, ensuring good utility of the trained model for datasets with various characteristics remains a challenge. McMahan et al. [82] remove skewness from their inputs by pruning each user's data to a threshold, thus discounting the problems of training neural models on inherently sparse location datasets, which usually have density around 0.1% [129]. Existing work on privacy-preserving deep learning either assumes large and dense datasets, or is evaluated only on dummy datasets [45] that are replicated to a desired size using techniques such as [81]. Such techniques overlook the difficulty of training models on smaller or sparse datasets, which often prevents models from converging [83]. Moreover, they require extensive hyperparameter tuning to achieve good accuracy, and the rough guidelines offered to tune these parameters [80] do not extend to more complex neural architectures, or to datasets different from those used in their work.

We propose a technique that can accurately perform learning on trajectory data. Specifically, we focus on next-location prediction, which is a fundamental and valuable task in location-centric applications. The central idea behind our approach is the use of the skip-gram model [85, 84]. One important property of skip-grams is that they handle sparse data well. At the same time, the use of skip-grams for trajectory data increases the dimensionality of the intermediate layers in the neural network. This creates a difficult challenge in the context of privacy-preserving learning, because it increases data sensitivity and requires a large amount of noise to be introduced, therefore decreasing accuracy.

To address this challenge, we capitalize on the negative sampling (NS) technique that can be used in conjunction with skip-grams. NS turns out to be extremely valuable in private gradient descent computation, because it helps reduce the gradient update norms, and thus boosts the ratio of the useful signal compared to the noise introduced by differential privacy.
In addition, we introduce a data grouping mechanism that makes learning more effective by combining multiple users into a single bucket, and then training the model per bucket. Grouping has a dual effect: on the positive side, it increases the information diversity in each bucket, improving learning outcomes; on the negative side, it heightens the adverse effect of the introduced Gaussian noise. We study this trade-off closely, and investigate the effect of grouping factors in practice. Our specific contributions are:

1. We propose a private learning technique for sparse location data using skip-grams in conjunction with DP-SGD. To our knowledge, this is the first approach to combine skip-grams with DP to build a private ML model. Although our analysis and evaluation focus on location data, we believe that DP-compliant skip-grams can also benefit other scenarios that involve sparse data.

2. We address the high-dimensionality challenge introduced by skip-grams through the careful use of negative sampling, which helps reduce the norm of gradient descent updates, and as a result preserves a good signal-to-noise ratio when perturbing gradients according to the Gaussian mechanism of DP. In addition, we group together data from multiple users into buckets, and run the ML process with each bucket as input. By increasing the diversity of the ML input, we are able to significantly boost learning accuracy.

3. We perform an extensive experimental evaluation on real-world location check-in data. Our results demonstrate that training a differentially private skip-gram for next-location recommendation clearly outperforms existing approaches for DP-compliant learning. We also perform a thorough empirical exploration of the system parameters to understand in depth the behavior of the proposed learning model. Our findings show that DP-compliant skip-grams are a powerful and robust approach for location data, and some of the trends that we uncovered can also extend to other types of sparse data, beyond locations.

The rest of this chapter is organized as follows: we provide background information in Section 5.1. Section 5.2 introduces the system architecture, followed by the details of our private location recommendation technique in Section 5.3. We perform an extensive experimental evaluation in Section 5.4.

5.1 Background

5.1.1 Differential Privacy

Differential Privacy (DP) [36] represents the de-facto standard in protecting individual data. It provides a rigorous mathematical framework with formal protection guarantees, and is the model of choice when releasing aggregate results derived from sensitive data. The types of analyses supported by DP range from simple count or sum queries to the training of machine learning models. A popular DP flavor that is frequently used in gradient descent due to its refined composition theorems is (ε, δ)-differential privacy. Given non-negative numbers (ε, δ), a randomized algorithm M satisfies (ε, δ)-differential privacy iff for all datasets D and D′ differing in at most one element, and for all E ⊆ Range(M), the following holds:

Pr[M(D) ∈ E] ≤ e^ε · Pr[M(D′) ∈ E] + δ    (5.1)

The amount of protection provided by DP increases as ε and δ approach 0. Dwork et al. [36] recommend setting δ to be smaller than 1/n for a dataset of cardinality n. The parameter ε is called the privacy budget. Datasets D and D′ that differ in a single element are said to be neighboring, or sibling.
When the adjacency between the datasets is defined with respect to a single data record, the DP formulation provides record-level privacy guarantees. The amount of protection can be extended to account for cases when a single individual contributes multiple data records. In this case, the sibling relationship is defined by allowing D and D′ to differ only in the records provided by a single individual. This is a stronger privacy guarantee, called user-level privacy.

To achieve (ε, δ)-DP, the result obtained by evaluating a function (e.g., a query) f on the input data must be perturbed by adding noise sampled from a random variable Z. The amount of noise required to ensure that the mechanism M(D) = f(D) + Z satisfies a given privacy guarantee depends on how sensitive the function f is to changes in the input, and on the specific distribution chosen for Z. The Gaussian mechanism (GM) [35] is tuned to the sensitivity S_f computed according to the global ℓ2-norm as

S_f = sup_{D ≃ D′} ∥f(D) − f(D′)∥_2

for every pair of sibling datasets D, D′. GM adds zero-mean Gaussian noise calibrated to the function's sensitivity as follows:

Theorem 3 For a query f : D → R, a mechanism M that returns f(D) + Z, where Z ∼ N(0, σ²S_f²), guarantees (ε, δ)-DP if σ²ε² ≥ 2 ln(1.25/δ) and ε ∈ [0, 1] (see [36] for the proof).

The composability property of DP helps evaluate the effect on privacy when multiple functions are applied to the data (e.g., multiple computation steps). Each step is said to consume a certain amount of privacy budget, and the way the budget is allocated across multiple steps can significantly influence data utility.

5.1.2 Neural Networks

Modern machine learning (ML) models leverage the vast expressive power of artificial neural networks to dramatically improve learning capabilities. Convolutional networks have shown exceptional performance in processing images and video [64]. Recurrent networks can effectively model sequential data such as text, speech and DNA sequences [28, 58]. A neural network is composed of one or more interconnected multilayer stacks, most of which compute non-linear input-output mappings. These layers transform the representation at one level (starting with the raw input) into a representation at a higher, more abstract level. The key to improving inference accuracy with a neural net is to continually adjust its internal parameters.

Stochastic gradient descent (SGD) is the canonical optimization algorithm for training a wide range of ML models, including neural networks. It is an iterative procedure which performs parameter updates for each training example x_i and label y_i. Learning the parameters of a neural network is a nonlinear optimization problem. At each iteration, a batch of data is randomly sampled from the training set. The error between the model's prediction and the training labels, also called the loss, is computed after each iteration. The loss is then differentiated with respect to the model's parameters, where the derivatives (or gradients) capture each parameter's contribution to the error. A back-propagation step distributes this error back through the network to change the internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Each internal parameter of the model θ is brought closer to predicting the correct label as follows:

θ = θ − η · ∇_θ J(θ; x^(i); y^(i))

where η is the learning rate hyper-parameter and J is the loss function.
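Purely as an illustration of this update rule (not of our actual training code), the following toy example performs per-example SGD steps on a one-parameter least-squares loss:

# Toy illustration of theta <- theta - eta * grad J(theta; x_i, y_i)
# for J(theta; x, y) = (x * theta - y)^2.
def sgd_step(theta, x, y, eta=0.05):
    grad = 2 * x * (x * theta - y)   # derivative of the loss w.r.t. theta
    return theta - eta * grad

theta = 0.0
for x_i, y_i in [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (1.5, 3.0)]:
    theta = sgd_step(theta, x_i, y_i)
print(theta)  # drifts toward the slope (about 2) underlying the toy data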
Iteratively recomputing gradients and applying them to update the model's parameters is referred to as descent, and this operation is performed until the model's performance is satisfactory.

5.1.3 Differentially Private SGD (DP-SGD)

Introduced in [1], DP-SGD integrates (ε, δ)-DP with neural networks. It modifies traditional SGD in that, after calculating the changes to its internal parameters, it obfuscates the gradient values with noise sampled from the Gaussian distribution. DP-SGD averages together multiple gradient updates induced by training-data examples, clips (i.e., truncates) each gradient update to a specified maximum ℓ2-norm, and adds Gaussian random noise to their averaged value. Clipping each gradient bounds the influence of each training-data example on the model. Accordingly, the sensitivity of the average query can be adjusted as desired, and due to the added noise tuned to the sensitivity of the query, differential privacy is ensured in each iteration. Typically, repeatedly executing a query results in sharp degradation of the privacy protection, as more information is leaked by multiple usages of private iterations. The moments accountant technique [1] computes the privacy loss resulting from the composition of Gaussian mechanisms under random sampling. It tracks the moments of the privacy loss variable in each step of the descent, and provides a much tighter upper bound on privacy budget consumption than the standard composition theorem [36].

5.2 System Architecture

In Section 5.2.1 we define the problem statement. We outline the learning model architecture in Section 5.2.2 and show how it is utilized in Section 5.2.3. Table 5.1 summarizes the notations used throughout this chapter.

5.2.1 Problem Statement

Data Representation. The input to our learning model consists of check-in data from a set of N users U = {u_1, u_2, ..., u_N}. The set of L check-in locations (e.g., points of interest) is denoted as P = {l_1, l_2, ..., l_L}. Each user u ∈ U has a historical record of check-ins denoted as U_u = {c_1, c_2, ...}, where each element c_i is a triplet ⟨u, l, t⟩ comprised of user identifier, location and time.

Table 5.1: Summary of Notations in PLP
Notation | Definition
U, P | Sets of users and check-in locations, respectively
N, L | Cardinalities of sets U and P, respectively
U_u | Historical record of user u's check-ins
dim | Dimension of location embedding space
β, η | Batch size and learning rate, respectively
q | User sampling probability per step
m | Expected user sample size per step
ε, δ | Privacy parameters of Gaussian mechanism
σ | Noise scale
λ | Data grouping factor
H | Set of training buckets
C | Per-layer clipping norm

Learning Objective. The objective of our model is to predict the location that a given user u will check into next, given a time-ordered sequence of previous check-ins of the user. The past check-ins can represent the user's current trajectory or the user's entire check-in history. For each scenario, we describe the usage of the model in Section 5.2.3. In an initial step, we employ an unsupervised learning method, specifically the skip-gram model [85], to learn the latent distributional context [106] of user movements over the set P of possible check-in locations. A latent representation of every location in a reduced-dimension vector space is the intermediate output. Next, we determine for each user u the inclination to visit a particular location l by measuring how similar l is, in the latent vector space, to the locations previously visited by u.
[Figure 5.2: Architecture of the location-recommendation model — a one-hot input vector of size L is mapped through the embedding matrix W (L × dim) to a dim-dimensional hidden vector h, and then through the context matrix W′ (dim × L) and bias vector B′ (1 × L) to a softmax output layer over the L locations.]

5.2.2 Learning Model

The skip-gram negative sampling (SGNS) model [85, 84] was initially proposed to learn word embeddings from sentences. However, several recent efforts [73, 21, 127] show that the model is also appropriate for location recommendation tasks. Specifically, the model is used to learn location embeddings from user movement sequences, where each location corresponds to a word, and a user's check-in history to a sentence. Given the set of check-ins of a user, we treat the consecutively visited locations as a trajectory that reflects her visit patterns. A data pre-processing step is required to make the data format compatible with the input of a neural network: every location in P is tokenized to a word in a vocabulary of size L = |P|. Given a target location check-in c, a symmetric window of win context locations to the left and win to the right is used to output multiple pairs of target and context locations as training samples. The assumption is that if a model can distinguish actual pairs of target and context words from random noise, then good location vectors will be learned.

Figure 5.2 illustrates the neural network used in our solution. The model parameters consist of three tensors θ = {W, W′, B′} and two hyper-parameters representing the embedding dimension dim and the number of negative samples drawn, neg. Consider a target-context location pair (l_x, l_y). First, both locations are one-hot encoded into binary vectors x and y of size L. The multiplication of x with the embedding matrix W produces the embedding vector for the input location l_x (i.e., the i-th row of matrix W): W × x represents the mapping of input location l_x to a vector h in a dim-dimensional space. Next, for each positive sample (i.e., true target/context pair), neg negative samples are drawn. The context location vector y, along with the negative samples, is passed through a different weight matrix W′ and bias vector B′. Finally, a sampled softmax loss function is applied to calculate the prediction error. At a high level (we refer the reader to [104] for a detailed look), the parameters are modified such that the input word (and the corresponding embedding) is tugged closer to its neighbors (i.e., paired context locations), and tugged away from the negative samples. As a result, during back-propagation, only neg + 1 vectors in W or W′ are updated instead of the entire matrices. In the original work [85, 84], negative sampling was devised to improve computational efficiency, as updating the entire model in each iteration can be quite costly. In private learning, it also plays an important role in controlling the adverse effects of noisy training.

We remark here that techniques such as Noise Contrastive Estimation [47] and Negative Sampling use a non-uniform distribution for drawing the samples—for example, by decreasing the sampling weight for the frequent classes—whereas we use a sampled softmax function with a uniform sampling distribution.
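The symmetric-window pre-processing described above can be summarized by the following minimal sketch (the function name and the tokenized example sequence are illustrative only):

def generate_pairs(trajectory, win=2):
    """Turn one user's time-ordered, tokenized location sequence into
    (target, context) training pairs using a symmetric window of `win`
    locations on each side of the target."""
    pairs = []
    for i, target in enumerate(trajectory):
        lo, hi = max(0, i - win), min(len(trajectory), i + win + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, trajectory[j]))
    return pairs

# Example: a short check-in sequence over tokenized location ids.
print(generate_pairs([5, 17, 3, 42, 8], win=2))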
Using a uniform sampling distribution is a necessity for preserving privacy, since estimating the frequency distribution of locations from user-submitted data would cause privacy leakage. Lastly, the embedding vectors are normalized to unit length for efficient use in downstream applications. On top of improving performance [118, 68], normalizing the vectors assists similarity calculation by making cosine similarity and dot-product equivalent. We detail the privacy-preserving learning model in Section 5.3. In the remainder of this section, we show how the model, once computed in a privacy-preserving fashion, can be utilized.

5.2.3 Model Utilization

We provide an overview of how our proposed privacy-preserving next-location prediction model is utilized. Once our privacy-preserving learning technique is executed, the resulting model can be shared with consumers, since the users who contributed the data used in the training are protected according to the semantic model of DP. While the utilization of our model is orthogonal to our proposal, we include it in this section in order to provide a complete, end-to-end description of our solution's functionality.

A typical use of our model is for a mobile user to download* it to her device, provide her location history as input, and receive a next-location recommendation. Alternatively, a service provider who already has the locations of its subscribers will perform the same process to provide a next-location suggestion to a customer. We emphasize that the model utilization itself does not pose any privacy issues. In both cases above, neither the input nor the output of the model is shared, so there is no privacy concern. The only time we need to be concerned about privacy is when training the model, since a large amount of trajectories from numerous users is required for that task.

*To reduce communication costs, only the embedding matrix is deployed.

Consider a user who has recent check-ins ζ from a relatively short time period (e.g., the last few hours). This set of locations forms the basis for recommending to the user the next location to visit. The normalized embedding matrix W in the fully-trained model encodes the latent feature vector of every location. For each location check-in l_i ∈ ζ, the embedding vectors w(l_i) are extracted and stacked on top of each other. More precisely, to obtain the embedding vector w(l_i), the binary vector of l_i is multiplied with W (similar to the first step of the training process); this is equivalent to extracting the dim-dimensional row corresponding to location l_i. Then, the average of the stacked vectors is computed across each dimension to produce a representation F(ζ) of the recent check-ins of the user. Finally, cosine similarity scores are computed as the dot-product of the vector F(ζ) with the embedding vector of each location in the universe P. We rank all locations by their scores and select the top-K locations as potential recommendations for the user.

In the case when the user has no recent check-ins, the representation F(·) can be computed over her movement profile comprising her historical check-ins. Other methods include training an additional model to learn latent feature vectors of each user from her preferences and locations visited. As in [41, 127], such a user feature representation can be used to determine her inclination to visit a particular location.
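A minimal sketch of the scoring step for a user with recent check-ins ζ, assuming a trained, row-normalized embedding matrix W (all names are illustrative; ranking by dot product with unit-length rows yields the same order as cosine similarity):

import numpy as np

def recommend_top_k(W, recent_checkins, K=10):
    """Rank all L locations by similarity to the averaged embedding of the
    user's recent check-ins. W has shape (L, dim) with unit-length rows."""
    F = W[recent_checkins].mean(axis=0)   # F(zeta): average of the stacked embeddings
    scores = W @ F                        # dot product of every location embedding with F(zeta)
    return np.argsort(-scores)[:K]        # indices of the top-K locations

# Example with a small random, row-normalized embedding matrix.
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 64))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(recommend_top_k(W, recent_checkins=[3, 17, 256], K=5))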
However, modeling each user with such a personalized representation [41, 127], while at the same time preserving user-level privacy, is a fundamentally harder problem (in terms of both system design and privacy framework), and is left as future work.

When the model is deployed at an untrusted location-based service provider (LBS), additional privacy concerns must be addressed. In this case, the mobile user must protect the set ζ (or F(ζ)) locally. Techniques such as geo-indistinguishability [7] can be applied to protect the check-in history. For example, the check-in coordinates can be obfuscated to prevent adversaries from pinpointing the user to a certain location with high probability. Addressing these vulnerabilities in the MLaaS setting is orthogonal to the scope of this work.

5.3 Private Location Prediction (PLP)

Section 5.3.1 presents in detail our proposed approach for private next-location prediction. Section 5.3.2 provides a privacy analysis of our solution.

5.3.1 Private Location Prediction (PLP)

PLP is a customized solution to location recommendation. It learns latent factor representations of locations while controlling the influence of each user's trajectory data on the training process. Bounding the contribution of a single data record in the SGD computation has been proposed in previous work [2, 110]. We make several extensions and contribute data grouping techniques to boost model performance. Even while combining the data of multiple users, we guarantee user-level privacy (as in [82, 45]). By grouping the data records of multiple users, we benefit from cross-user learning to improve model performance. Algorithm 8 depicts this learning process. The model hyperparameters batch size β, learning rate η and loss function J are related to gradient descent optimization, whereas the hyperparameters grouping factor λ, sampling probability q, gradient clipping norm bound C, noise scale σ and privacy parameters ε, δ are introduced to create an efficient and privacy-preserving system. We briefly describe each component in isolation before coupling them together to illustrate the big picture.

[Figure 5.3: Data sampling and grouping — users are subsampled with probability q (here q = 0.66, so E[|U_sample|] = m), and the sampled users are grouped into training buckets H with grouping factor λ = 2, e.g., H_1 = {U1, U2} and H_2 = {U4, U6}.]

User Sampling. Given a sampling probability q = m/N, each element of the user set is subjected to an independent Bernoulli trial which determines whether the element becomes part of the sample. As a consequence, the size of the sampled set of users U_sample is equal to m only in expectation. This is a necessary step for correctly accounting for the privacy loss via the moments accountant [2].

Data Grouping. Data grouping is essentially a pre-processing technique that significantly boosts model performance. It has a dual purpose. The first is to reduce the effects of the skewness and sparsity inherent to location data, where the frequency of user check-ins at locations follows Zipf's law [27]. The second is to provide cross-user learning to smooth the updates in the model parameters produced by the function in lines 15-22. The underlying intuition is simple: to ensure good performance of the context model, each update of a training step must contribute to the final result. By combining the profiles of multiple users we also reduce minor observation errors that may be produced by specific data points in a user's profile.
Our data grouping technique agglomerates the data of multiple users into buckets H. Given a grouping factor λ, users (and their entire data) are randomly assigned to buckets such that each bucket contains λ users. This operation is encapsulated in the groupData(·) function in line 6. As a separate method, we also tried equal-frequency grouping, where a global pass over the record count of each user is used to produce buckets such that each contains approximately the same number of records (while ensuring that the data records of each user are not split into multiple buckets). However, we noticed no statistically significant benefit in model accuracy from equal-frequency grouping compared to random grouping. Accordingly, we use the latter in the rest of the work.

Figure 5.3 illustrates the data sampling and grouping process (corresponding to lines 5-6) for a sampling probability of 0.66 and λ = 2. Grouped data in each bucket is organized as a single array for processing by gradient descent optimization. Recall from Section 5.2.2 that a symmetric moving window is applied to create training examples after the array is read by the generateBatches() function (in line 17). A number β of target-context location pairs are placed in each batch.

In brief, at each step of PLP, we sample a random subset of users (line 5), combine the data of multiple users into buckets (line 6), compute a gradient update with bounded ℓ_2 norm from each bucket (lines 7-8), add noise to the sum of the clipped gradients (line 9), take their approximate average, and finally update the model by adding this approximation (line 10). Alongside, a privacy ledger is maintained to keep track of the privacy budget spent in each iteration by recording the values of σ and C (lines 3 and 11). This tracker has the added benefit of allowing privacy accounting at any step of the training process. Given a value of δ and the recorded ledger, the moments accountant can compute the overall privacy cost in terms of ε. This functionality is provided by the cumulative_budget_spent() function in line 12, which implements the moments accountant from [2].

Algorithm 8 Algorithm for Private Location Prediction with user-level privacy.
Input: loss function J(θ), grouping factor λ, learning rate η, sampling probability q = m/N, gradient norm bound C, batch size β, privacy parameters ε, δ
 1: procedure TrainPrivateLocationEmbedding
 2:   Initialize: model θ_0 = {W, W′, B′},
 3:   privacy accounting ledger A(δ, q)
 4:   for each step t = 1, ... do
 5:     U_sample ← a random sample of m_t users
 6:     Initialize buckets H ← groupData(U_sample, λ)
 7:     for each data bucket d_h ∈ H do
 8:       ḡ_h ← ModelUpdateFromBucket(θ_t, d_h)
 9:     ĝ_t = (1/|H|) ( Σ_{h∈H} ḡ_h + N(0, σ²C²I) )      ▷ Noise.
10:     θ_{t+1} = θ_t + ĝ_t                               ▷ Model update.
11:     A.track_budget(C, σ)
12:     if A.cumulative_budget_spent() ≥ ε then
13:       return θ_{t−1}
14:
15: function ModelUpdateFromBucket(θ_t, d_h)
16:   Φ ← θ_t
17:   B ← generateBatches(d_h, β)
18:   for each b ∈ B do
19:     Φ ← Φ − η (1/|b|) Σ_{(x_i, y_i)∈b} ∇_Φ J(Φ, x_i, y_i)
20:   g_h = Φ − θ_t
21:   ḡ_h = g_h / max(1, ∥g_h∥_2 / C)                     ▷ Gradient norm clipping.
22:   return ḡ_h
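For readers who prefer runnable code over pseudocode, the fragment below is a toy sketch of one step of Algorithm 8 with the model flattened into a single parameter vector. It glosses over batching and the per-tensor structure of θ, and grad_J is a hypothetical per-example gradient function standing in for the skip-gram loss gradient; it is not the actual implementation.

```python
import numpy as np

def plp_step(theta, buckets, grad_J, eta, C, sigma, rng):
    clipped = []
    for bucket in buckets:                    # each bucket holds several users' examples
        phi = theta.copy()
        for x, y in bucket:                   # simplified local SGD pass (lines 15-19)
            phi -= eta * grad_J(phi, x, y)
        g = phi - theta                       # the bucket's model delta (line 20)
        clipped.append(g / max(1.0, np.linalg.norm(g) / C))   # clip to l2 norm C (line 21)
    # Gaussian sum query plus fixed-denominator average, then model update (lines 9-10).
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(0.0, sigma * C, size=theta.shape)
    return theta + noisy_sum / len(buckets)
```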
Privacy Mechanism. The gradient values computed in line 20 do not have an a-priori bound. This complicates the application of the Gaussian Mechanism (GM), which is generally tuned to the sensitivity of the performed query. In this particular use case, we employ a Gaussian sum query in line 9, the results of which are then averaged using a fixed-denominator estimator. To bound the sensitivity of this query, a maximum sensitivity of C is enforced on every gradient computed on bucket h as follows (equivalent to line 21):

∥ḡ_h∥_2 = ∥g_h∥_2  if ∥g_h∥_2 ≤ C,    and    ∥ḡ_h∥_2 = C  if ∥g_h∥_2 > C.

Gradient clipping places a strict limit on the maximum contribution, in terms of its ℓ_2 norm, of the gradient computed on a bucket. Formally, ∥ḡ_h∥_2 ≤ C. The sensitivity of the scaled gradient updates with respect to the summing operation is thus upper bounded by C. Finally, dividing the GM's output by the number of buckets |H| yields an approximation of the true average of the buckets' updates.

We note that increasing the number of users in each bucket increases the valuable information in each gradient update. At the same time, the noise introduced by the Gaussian mechanism is scaled to the sensitivity of each bucket's update (i.e., C). If too few buckets are utilized, this distortion may exceed a limit, meaning that too much of the information output by the summing operation is destroyed by the added noise. This will impede any learning progress. We treat the grouping factor λ as a hyper-parameter and tune it.

In a multi-layer neural network such as the one described in our work, each tensor can be set to a different clipping threshold. However, we employ the per-layer clipping approach of [80], where given an overall clipping magnitude C, each tensor is clipped to C/√|θ|. In the skip-gram model, θ_0 = {W, W′, B′}, hence |θ| = 3, so we clip the ℓ_2 norm of each tensor to C/√3. However, the effect of clipping on the three tensors is rather different due to the difference in their dimensionality. Context matrix W′ is clipped to the same degree as bias vector B′, despite the fact that they have dimensions (L × dim) and (1 × L), respectively. While the dimensionality of the embedding matrix W is (L × dim), only a fraction of the weights (proportional to neg, instead of L) are considered for clipping, due to the sampling of neg negative examples in the sampled softmax function. Simply put, ∥W∥_2 is proportional to neg and, when carefully tuned, the clipping parameter is large enough that nearly all updates are smaller than the clip value of C/√|θ|, improving the signal-to-noise ratio over iterative computations. We discuss the effect of this parameter in controlling the distortion of Gaussian noise in Section 5.4.
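A minimal sketch of this per-layer clipping rule is shown below, assuming the model update is given as a list of NumPy tensors (here, the three skip-gram tensors); the function name is illustrative.

```python
import numpy as np

def clip_per_layer(update_tensors, C):
    # With |theta| tensors, each tensor is clipped to C / sqrt(|theta|), so the l2 norm
    # of the concatenated update is at most C.
    bound = C / np.sqrt(len(update_tensors))
    return [g / max(1.0, np.linalg.norm(g) / bound) for g in update_tensors]

# Example with the three tensors {W, W', B'}: each is clipped to C / sqrt(3).
# dW_c, dWp_c, dBp_c = clip_per_layer([dW, dWp, dBp], C=0.5)
```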
5.3.2 Privacy Analysis

Recall that our proposed system provides user-level differential privacy to individuals who contribute their check-in history to the training data. This ensures that all individuals are protected, regardless of how much data they contribute (i.e., even if the length of the check-in history varies significantly across users). Let U_k denote the data of a single user. The sensitivity of the Gaussian Sum Query (GSQ) function w.r.t. neighboring datasets that differ in the records of a single user is defined as

S_GSQ = max_{U_sample, U_k} ∥GSQ(U_sample ∪ U_k) − GSQ(U_sample)∥_2

In Algorithm 8, GSQ is executed over the bucket gradients, which complicates the analysis of the privacy properties of the algorithm. We consider two distinct scenarios, in which a user's data may be assigned to: (i) exactly one bucket; or (ii) more than one bucket. We define ω as the data split factor, meaning that a user's data may be placed in at most ω buckets.

Case 1 [ω = 1]. This represents the scenario where multiple (up to λ) users' data may be present in a single bucket, but a single user's data may be allocated to at most one bucket. Figure 5.4(a) depicts this case, which is assumed by default in Algorithm 8. This is a sufficient condition to ensure that the per-user contribution to a bucket's gradient update is tightly bounded. Formally, there exists a unique d_h ∈ H s.t. U_k ⊆ d_h. In addition, when the ℓ_2 norm of the gradient ∥ḡ_h∥_2 computed on a data bucket d_h is upper-bounded by the clipping factor C, we get

S_GSQ ≤ max_{H, d_h} ∥GSQ(H ∪ d_h) − GSQ(H)∥_2 ≤ C

An informal proof that this approach satisfies (ε, δ)-DP is as follows. The sensitivity of the Gaussian sum query GSQ = Σ_{h∈H} ḡ_h is bounded as S_GSQ ≤ C if for all buckets we have ∥ḡ_h∥_2 ≤ C. By extension, if a sampled user (and her location visits) can be assigned to exactly one bucket, sibling datasets that differ in the data of a single user can change the output of GSQ by at most C. Therefore, Gaussian noise drawn from N(0, σ²C²I) guarantees user-level (ε, δ)-DP.

Case 2 [ω > 1]. If the data of a single user is split over multiple buckets, then it is possible that, even after scaling the bucket gradients to C, the sensitivity of the Gaussian sum query is no longer C w.r.t. user-neighboring datasets. Figure 5.4(b) illustrates an example with ω = 2. A similar split strategy (proposed in [81]) is used in the empirical evaluation of [45], wherein a small dataset is scaled up to amplify privacy accounting. However, the authors fail to regulate their noise scale to reflect the altered data sensitivity, or alternatively to recompute the achieved privacy guarantee. We show that when the data of a user U_k is split across multiple buckets, the sensitivity of the query increases by a factor of ω. Assuming that |H| ≤ |U_sample|, we can write

ω = max_{U_k ∈ U_sample} |{d_h : d_h ∈ H and d_h ∩ U_k ≠ ∅}|

meaning that the data of a user can influence the gradients of at most ω buckets. Accordingly, if for all buckets ∥ḡ_h∥_2 ≤ C, a single user can change the output of GSQ by at most ωC. Therefore, to guarantee user-level (ε, δ)-DP, Gaussian noise must be drawn from N(0, σ²ω²C²I).

Figure 5.4: Sensitivity of the Gaussian sum query over the users in U_sample = {U1, U2, U4, U6}: (a) ω = 1, λ = 2, a single user's data is placed in exactly one bucket (H_1 = {U1, U4}, H_2 = {U2, U6}); (b) ω = 2, λ = 1, a single user's data is split across two buckets (H_1 = {U1, U4}, H_2 = {U2, U1}, H_3 = {U4, U6}, H_4 = {U2, U6}). Since gradients computed over the generated buckets H_1, ..., H_4 are bounded by C, a user can contribute at most 2C to the computed sum.

We remark here that values of ω > 1 produced no positive effect in our evaluation. We experimented with ω = 2 by splitting a user's data into exactly two random buckets. We found that the signal-to-noise ratio is adversely affected, since the marginally improved signal from the split data is offset by the now quadrupled (proportional to ω²) noise variance. In the rest of the work, we set ω to 1.
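The noise calibration implied by this analysis can be made explicit with a small sketch: the per-user sensitivity of the sum query is ωC, so the per-coordinate noise scale must grow linearly in ω (and its variance quadratically). The function below is illustrative only and assumes the bucket updates have already been clipped to norm C.

```python
import numpy as np

def noisy_average(bucket_updates, sigma, C, omega=1, seed=0):
    rng = np.random.default_rng(seed)
    total = np.sum(bucket_updates, axis=0)
    # User-level sensitivity of the sum is omega * C, so the noise is drawn from
    # N(0, (sigma * omega * C)^2) per coordinate; omega = 2 quadruples the variance.
    noise = rng.normal(0.0, sigma * omega * C, size=total.shape)
    return (total + noise) / len(bucket_updates)   # fixed-denominator average
```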
5.4 Experiments

Section 5.4.1 provides the details of the experimental testbed. Section 5.4.2 focuses on the evaluation of the proposed technique in comparison with the state-of-the-art DP-SGD approach. In Section 5.4.3 we evaluate in detail our approach when varying system parameters, and provide insights into hyper-parameter tuning.

5.4.1 Experimental Settings

Dataset. We use a real dataset collected from the operation of a prominent geo-social network, namely Foursquare [128]. The data consist of a set of user check-ins. Every check-in is described by a record comprising a user identifier, the latitude and longitude of the check-in, and the identifier of the POI location. In order to simulate a realistic environment of a city and its suburbs, we focus on check-ins within a single urban area, namely Tokyo, Japan. In particular, we consider a large geographical region covering a 35 × 25 km² area bounded to the South and North by latitudes 35.554 and 35.759, and to the West and East by longitudes 139.496 and 139.905. We filter out the users with fewer than ten check-ins, as well as the locations visited by fewer than two users (such filtering is commonly performed in the location recommendation literature [131, 70]). The remainder of the data contains a total of 739,828 check-ins from 4,602 unique users over 5,069 locations during a time period of 22 months, from April 2012 to January 2014.

Implementation. All algorithms were implemented in Python on a Ubuntu Linux 18.04 LTS operating system. The experiments were executed on an Intel Xeon Platinum 8164 CPU with 64GB RAM. All data and intermediate structures (e.g., neural network parameters, gradients) are stored in main memory. The proposed neural model is built using Google's TensorFlow library [1]. To account for the privacy budget consumption of the complex iterative mechanism used in learning, we use the privacy accounting method from [117], which allows for a tight composition of privacy-preserving steps. At each step of the computation, we calculate the (ε, δ) tuple from moment bounds, according to the moments accountant procedure introduced in [80].

Evaluation Metric. To evaluate the performance of location recommendation, we adopt the "leave-one-out" approach, which has been widely used in the recommender systems literature [54, 124, 127, 73, 21, 41]. This metric simulates the behavior of a user looking for the next location to visit. Given a time-ordered user check-in sequence, recommendation models utilize the first (t − 1) location visits as input and predict the t-th location as the recommended location. The recommendation quality is measured by Hit-Rate (HR). HR@k is a recall-based metric, measuring whether the test location is in the top-k locations of the recommendation list. The outcome of the evaluation is binary: 1 if the test location is included in the output set of the recommender, and 0 otherwise. In the rest of the section, we use the terms prediction accuracy and HR@k interchangeably.

Model Training. Our testing and validation sets consist of location visits of users who are not part of the training set. Since we do not train models to learn user-specific representations (as in [73, 127, 21]), this is an accurate representation of real-life model utilization at a user's device. Validation and testing sets are created in a similar fashion. First, a randomly selected set of 100 users and their corresponding check-ins are removed from the dataset. From these, time-ordered sequences of trajectories are generated. Each individual trajectory does not exceed a total duration of six hours (following the work in [72, 21]). The remaining 4,402 users and their check-ins represent the training dataset for learning the parameters of the proposed model.
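A minimal sketch of the leave-one-out evaluation described above follows; it assumes each test sequence is a time-ordered list of location identifiers and that recommend(history, k) returns the model's top-k candidate locations (both names are illustrative).

```python
def hit_rate_at_k(test_sequences, recommend, k=10):
    hits = 0
    for seq in test_sequences:
        # The first t-1 visits are the input; the t-th visit is the held-out target.
        history, target = seq[:-1], seq[-1]
        hits += int(target in recommend(history, k))
    # HR@k: fraction of test sequences whose true next location appears in the top-k list.
    return hits / len(test_sequences)
```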
To train the model, we utilize Adam [62], a widely adopted optimization algorithm for deep neural network models that has specific properties to mitigate disadvantages of traditional SGD, such as its difficulty in escaping from saddle points, or the extensive tuning of its learning rate parameter. We implement the optimizer in a differentially private manner by tracking an exponential moving average of the noisy gradient and the squared noisy gradient, as illustrated in [48]. We found that tuning the initial learning rate and decay scheme for Adam only affects the learning in the very first few steps. Typically, Adam does not require extensive tuning [62], and a learning rate between 0.001 and 0.1 is most often appropriate. In our experiments, we found that a learning rate value η ∈ [0.02, 0.07] produces similar results, so we set it to 0.06 for all our runs.

Parameter Settings. We select the training hyper-parameters of the skip-gram model via a cross-validation grid search. Figure 5.5 depicts the validation accuracy over 200 data epochs using the non-private learning approach. We plot the validation Hit-Rate for k = 5, 10 and 20 candidates, respectively. We look for those models that reach the highest accuracy. The embedding dimension dim is set to 50. While a larger number of hidden units allows more predictive power, the accuracy improvement reaches a plateau when the embedding dimension is in the range [50, 150]. In non-private training it is preferable to use more units, whereas for private learning a larger model increases the sensitivity of the gradient. We keep our model at the lower end of the dim range to keep the number of internal parameters of the model low. The batch size is set to b = 32, and the context window parameter win = 2 (for a total window size of 5). These parameters are also consistent with those utilized in previous work [127, 21]. Varying the number of negative examples sampled (denoted by neg) marginally affects the non-private model, whereas with private learning we find that it directly controls the sensitivity of the private sum query (in Section 5.4.3 we show experiments on how to tune it). The default value for negative samples is neg = 16.

Figure 5.5: Non-private model hyperparameter tuning (validation HR@5, HR@10 and HR@20 when varying embedding dimension dim, skip window win, batch size b, and negative samples neg).

For the privacy parameters, we fix the value of δ = 2 × 10⁻⁴ < 1/N, as recommended in previous work on differentially private machine learning [82, 2]. For a given value of δ, the privacy budget ε affects the number of steps we can train before we exceed that budget threshold. We set the default values of the hyper-parameters to q = 0.06, σ = 2.5, C = 0.5, λ = 4 (please see Table 5.1 for a summary of notations). Recall that the sampling ratio of each lot is q = m/N, so each epoch consists of 1/q steps.

Figure 5.6: Non-private model performance (training loss and validation/test HR@5, HR@10 and HR@20 over 250 data epochs).
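Returning to the optimizer discussed at the start of this subsection, the following is a rough sketch, in the spirit of [48], of an Adam-style update fed with the privately aggregated update of Algorithm 8; it is not the exact implementation, and the default constants are only illustrative.

```python
import numpy as np

def dp_adam_step(theta, update, m, v, t, eta=0.06, beta1=0.9, beta2=0.999, eps=1e-8):
    # 'update' is the noisy, averaged model delta of Algorithm 8 (already a descent
    # direction), so its smoothed estimate is added to the parameters.
    m = beta1 * m + (1 - beta1) * update        # moving average of the noisy update
    v = beta2 * v + (1 - beta2) * update**2     # moving average of its square
    m_hat = m / (1 - beta1**t)                  # bias corrections
    v_hat = v / (1 - beta2**t)
    return theta + eta * m_hat / (np.sqrt(v_hat) + eps), m, v
```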
5.4.2 Comparison with Baseline

We evaluate the performance of our proposed approach in comparison with two baselines: (i) a non-private learning approach using SGD, and (ii) the state-of-the-art user-level DP-SGD approach from [2, 82].

First, we evaluate the non-private location prediction model described in Section 5.2.2. Figure 5.6 illustrates the validation and testing Hit-Rate at k = 5, 10 and 20. The model generalizes well to the test set, and there appears to be no evidence of overfitting up to 250 data epochs. The presented results are competitive with existing approaches in [73, 127], suggesting that the model hyper-parameters are suitable to capture the underlying semantics of mobility patterns. The best testing accuracy of the non-private model for the HR@10 setting is 29.5%. Throughout our evaluation we found that, when the model is trained in a differentially private manner, there is only a small difference between the model's accuracy on the training and the test sets. This is consistent with both the theoretical argument that differentially private training generalizes well [34, 12] and the empirical evidence in previous studies [2, 82]. For brevity of presentation, in the rest of this section we only show HR@10 evaluation results (similar trends were recorded for HR@5 and HR@20).

Figure 5.7: PLP vs. DP-SGD: varying privacy budget ε.
Figure 5.8: PLP vs. DP-SGD: varying sampling ratio q.
Figure 5.9: Running time, varying grouping factor λ.

Next, we evaluate our proposed Private Location Prediction (PLP) approach in comparison with DP-SGD [2], which is summarized in Section 5.1. We adapt the model to work on user-partitioned data, so that it guarantees user-level privacy. The improvements of PLP over DP-SGD passed the paired t-test with significance value p < 0.01.

Figure 5.7 plots the prediction accuracy of the privately trained models for varying levels of privacy ε. For each ε value, we consider two settings each for sampling probability q = 0.06 (upper left) and q = 0.10 (bottom right). We set σ = 1.5. We compare PLP against the baseline DP-SGD for two values of the grouping factor λ. As expected, a general trend we observed is that providing more privacy budget allows the models to train to a higher accuracy. However, for the baseline approach, the convergence of the model is thwarted because the model update computed on the data of a single user contributes a limited signal, which is often offset by the introduced Gaussian noise. On the other hand, the results show that by incorporating data grouping in its design, PLP is able to ameliorate the data sparsity problem inherent to location datasets. The gain is more pronounced when the grouping factor increases (i.e., higher λ).

Next, we measure the effect of sampling probability q on accuracy. From the theoretical model [2], we know that q directly affects the amount of privacy budget utilized in each iteration (q is also called the "privacy amplification" factor). A lower sampling rate includes less data in each iteration, hence the amount of budget consumed in each step is decreased. Our results in Figure 5.8 confirm this trend. We vary the rate of user sampling q from 4% to 12%. For all runs, we fixed the budget allowance at ε = 2. For a higher sampling probability, the privacy budget is consumed faster, hence the count of total training steps is smaller, leading to lower accuracy. Our proposed PLP method clearly outperforms DP-SGD, whose accuracy drops sharply with an increase in q.
We note that, due to the proposed grouping strategy, PLP is more robust to changes in the sampling rate, as its accuracy degrades gracefully. In general, a larger bucket cardinality leads to better accuracy, except for the lowest considered sampling rate, where the small fraction of records included in the computation at each step prevents buckets from reaching a significant diversity in their composition.

Finally, we provide a result on the runtime improvements offered by PLP. The y-axis in Figure 5.9 depicts the multiplicative factor by which PLP is faster than DP-SGD. We show results for two values of q, and for each we present the runtime with two values of the noise scale. Linearly scaling the grouping factor has two opposing effects: on the one hand, fewer buckets implies that equally few bucket gradients need to be computed and averaged; on the other hand, as each bucket gets assigned more users, it takes longer to compute each bucket gradient. When fewer users are sampled (i.e., q = 0.06), the latter effect begins to dominate for λ > 5, whereas for λ ∈ [2, 5] the computational efficiency scales from 1.6× to 2.5×. In the setting where the sampling rate is higher, at q = 0.10, the runtime improvements scale monotonically, to over 4.8× for λ = 6. These results are consistently observed even with a different number of total iterations (as a larger σ allows more iterations).

In summary, our results so far show that PLP clearly outperforms the existing state-of-the-art DP-SGD. Furthermore, its accuracy was observed to reach values as high as 24%, which is quite reasonable compared to the maximum of 29.5% reached by the non-private learning approach. In the rest of the evaluation, we no longer consider DP-SGD, and we focus on tuning the parameters of the proposed PLP technique.

5.4.3 Hyper-parameter Tuning

The objective of tuning the model hyper-parameters is to obtain a good balance between accuracy and the computational overhead of learning. We focus on the following parameters, which we observed throughout the experiments to have a significant influence: grouping factor λ, noise scale σ, the magnitude of the ℓ_2 clipping norm, and the number of negative samples neg.

Figure 5.10: Effect of varying λ.

Effect of Grouping factor λ. Figure 5.10 shows the influence on accuracy of the grouping factor λ. We consider two distinct settings each of sampling parameter q and noise scale σ (for a total of four lines in the graph). To limit sensitivity, we clip the gradient norm of each tensor to a maximum ℓ_2 norm of 0.5. Choosing the grouping factor must balance two conflicting objectives: on the one hand, assigning the data of multiple users to the same bucket improves the signal in each bucket gradient, by improving the data diversity within the bucket. On the other hand, the Gaussian noise must be scaled to the sensitivity of a bucket gradient, and a larger bucket size results in fewer buckets, which in turn increases the effect of the added noise. Our results confirm this trade-off: initially, when λ increases there is a pronounced increase in accuracy. After a certain point, the accuracy levels off, and reaches a plateau around the value of λ = 5. When the grouping factor is increased further (not shown in the graph), the accuracy starts decreasing, because there is no significant gain in per-bucket diversity, whereas the relative noise-to-signal ratio keeps increasing.
Effect of Noise Scale σ. The noise scale parameter σ directly controls the noise added in each step. A larger σ leads to more noise, but at the same time it decreases the budget consumption per step, which in turn allows the execution of more learning steps. Figure 5.11 depicts the model accuracy for varying settings of the noise scale. The results presented correspond to two settings each of sampling rate and privacy budget (for a total of four lines). We observe that for the lower range of σ values the accuracy is rather poor, especially for smaller settings of the privacy budget ε. This is explained by the fact that too little noise is added per step, and the privacy consumption per step is high. As a result, only a small number of steps can be executed before the privacy budget is exhausted, leading to insufficient learning. For larger ε settings, the effect is less pronounced, because there is sufficient budget to allow more steps, even when the noise scale is low. Conversely, a larger σ allows more steps to be executed, so the best accuracy is obtained for the largest σ = 3.0 setting. However, we also note that the accuracy levels off towards that setting. For larger σ values (not shown in the graph), we observed that the noise magnitude is too high, and even if the budget is exhausted slowly, the training loss in each learning step is excessively high, preventing the model from converging and leading to very low accuracy. We conclude that the choice of noise scale must be carefully considered relative to the total privacy budget, such that a sufficient number of steps is allowed to execute, while at the same time the loss function value per step is not excessive.

The total number of executed steps also influences the computational overhead of learning. If execution time is a concern, one may want to reduce the number of steps by reducing σ, in an attempt to accelerate the learning (intuitively, since less noise is added at each step, the model will converge faster). This approach is still subject to ensuring that a sufficient number of steps is executed, as neural networks need to perform several complete passes over the dataset.

Figure 5.11: Effect of varying σ.

Effect of Clipping norm C. We vary the clipping bound of every tensor in the model θ_0 = {W, W′, B′}. The value C represents the magnitude of per-tensor clipping, which is set to be equal for every tensor in the model. Clipping more aggressively decreases sensitivity, which in turn leads to a lower privacy budget consumption per step, and allows additional learning iterations to be executed. Conversely, setting the threshold too low also limits the amount of learning that the model can achieve per step. Figure 5.12 plots the obtained results for several combinations of sampling probability and grouping factor. We observe that, for the range of values considered, the decrease in sensitivity has a more pronounced impact, and as a result the smaller clipping bounds lead to better accuracy. Of course, one cannot set the clipping bound arbitrarily low, as that will significantly curtail learning. Another factor to consider is the nature of the data, and the effect on gradient values.
If the norm of the resulting tensors following gradient computation is high, then a low clipping threshold will destroy the information and prevent learning. In our case, we were able to keep the gradient norm low by using negative sampling, which in turn allowed us to obtain good accuracy for that setting. In cases where this is not possible, it is recommended to increase the clipping threshold value.

Figure 5.12: Effect of varying the ℓ_2 clipping norm.
Figure 5.13: Effect of varying neg.

Effect of Negative samples neg. In our final experiment, we investigate the effect on accuracy of negative sampling, which is an important factor in the training success of a skip-gram model. We plot the model accuracy for various values of negative sampling in Figure 5.13. The number of negative samples neg controls the total fraction of weights that are updated for each training sample, and as a side effect it helps keep the gradient norm low. We can observe a clear inverted-U-shaped dependency, with accuracy reaching a maximum at neg = 16. The observed trend is the result of two conflicting factors: if the number of negative samples is too low, training is slowed down, due to the fact that only a small part of the layers is updated per step. Conversely, if too many samples are drawn, then the correspondingly many parameters that need to be updated lead to a large norm. Gradient clipping then has an aggressive effect, and as a result, the amount of information that can be learned in each update is obliterated by the noise.

Chapter 6

Conclusions

In this work, we show that learned representations of data can outperform conventional DP mechanisms. We learn powerful machine learning (ML) models that exploit patterns within location datasets to provide more accurate location services.

For population-density maps, we proposed SNH: a novel method for answering range count queries on location datasets while preserving differential privacy. To address the shortcomings of existing methods (i.e., over-reliance on the uniformity assumption and noisy local information when answering queries), SNH utilizes the power of neural networks to learn patterns from location datasets. We proposed a two-stage learning process: first, noisy training data is collected from the database while preserving differential privacy; second, models are trained using this sanitized dataset, after a data augmentation step. In addition, we devised effective machine learning strategies for tuning system parameters using only public data. Our results show SNH outperforms the state-of-the-art on a broad set of input data with diverse characteristics. In future work, we plan to extend SNH to releasing high-dimensional user trajectory datasets.

When releasing multiple snapshots, we proposed VDR, a technique for accurate DP-compliant release of spatio-temporal histograms that uses a combination of sampling to reduce sensitivity, VAE-based learning to counter the effect of DP-added noise, and statistical estimators to offset the effect of sampling. The resulting approach captures spatio-temporal data patterns well, and significantly outperforms existing approaches. In future work, we plan to extend our work by creating DP-compliant synthetic datasets based on spatio-temporal histograms.
This is more challenging, since it needs to take into account any type of downstream processing that may be performed. One direction to achieve this goal is to sample from the compressed latent space conditioned on the time-of-day, and to train a conditional image generation model such as PixelCNN [113] over the latent pixel values.

Lastly, we proposed PLP, a new approach for differentially private next-location prediction using the skip-gram model. To the best of our knowledge, ours is the first technique that deploys DP-SGD for skip-grams. We made use of negative sampling to reduce the norms of gradient updates when dealing with high-dimensional internal neural network layers, and provided a data grouping technique that can improve the signal-to-noise ratio and allows for effective private learning. Our extensive experiments show that the proposed technique outperforms the state-of-the-art, and they also provide insights into how to tune system parameters. Although our results focus on location data, we believe that our findings can be extended to other types of sparse data. In future work, we plan to test the viability of our approach for other learning tasks. Finally, we will study more sophisticated data grouping approaches that make informed decisions on which users to place together in the same bucket. Since such decisions are data dependent, a careful trade-off must be considered between the budget consumed performing the grouping and the remaining budget used for learning, such that prediction accuracy is maximized.

Bibliography

[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “Tensorflow: A system for large-scale machine learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2016, pp. 265–283.

[2] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. “Deep learning with differential privacy”. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM. 2016, pp. 308–318.

[3] John M. Abowd. “The U.S. Census Bureau Adopts Differential Privacy”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery, Data Mining. KDD ’18. London, United Kingdom, 2018, p. 2867.

[4] Gergely Acs and Claude Castelluccia. “A case study: Privacy preserving release of spatio-temporal density in paris”. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014, pp. 1679–1688.

[5] Gergely Acs, Claude Castelluccia, and Rui Chen. “Differentially private histogram publishing through lossy compression”. In: 2012 IEEE 12th International Conference on Data Mining. IEEE. 2012, pp. 1–10.

[6] Ahmet Aktay, Shailesh Bavadekar, Gwen Cossoul, John Davis, Damien Desfontaines, Alex Fabrikant, Evgeniy Gabrilovich, Krishna Gadepalli, Bryant Gipson, Miguel Guevara, et al. “Google COVID-19 community mobility reports: anonymization process description (version 1.1)”. In: arXiv preprint arXiv:2004.04145 (2020).

[7] Miguel E Andrés, Nicolás E Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. “Geo-indistinguishability: Differential privacy for location-based systems”. In: ACM CCS. 2013.

[8] Anonymous. VDR Implementation. https://anonymous.4open.science/r/paper_code-C76E. 2022.

[9] Anonymous. VDR Technical Report. https://anonymous.4open.science/r/paper_code-C76E/vdr_technical_report.pdf. 2022.
[10] Vassilis Assimakopoulos and Konstantinos Nikolopoulos. “The theta model: a decomposition approach to forecasting”. In: International journal of forecasting 16.4 (2000), pp. 521–530. [11] Borja Balle, Gilles Barthe, and Marco Gaboardi. “Privacy amplification by subsampling: Tight analyses via couplings and divergences”. In: Advances in Neural Information Processing Systems. 2018, pp. 6277–6287. [12] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. “Algorithmic stability for adaptive data analysis”. In: Proceedings of ACM Symposium on Theory of Computing. 2016, pp. 1046–1059. [13] Raef Bassily, Adam Smith, and Abhradeep Thakurta. “Private empirical risk minimization: Efficient algorithms and tight error bounds”. In: IEEE Symposium on Foundations of Computer Science. 2014, pp. 464–473. [14] Aleix Bassolas, Hugo Barbosa-Filho, Brian Dickinson, Xerxes Dotiwalla, Paul Eastham, Riccardo Gallotti, Gourab Ghoshal, Bryant Gipson, Surendra A Hazarie, Henry Kautz, et al. “Hierarchical organization of urban mobility and its connection with city livability”. In: Nature communications 10.1 (2019), pp. 1–10. [15] Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In: Proceedings of COMPSTAT’2010. 2010, pp. 177–186. [16] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs. Version 0.2.5. 2018.url: http://github.com/google/jax. [17] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. “STHoles: A Multidimensional Workload-Aware Histogram”. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: Association for Computing Machinery, 2001, pp. 211–222. [18] Mark Bun and Thomas Steinke. “Concentrated differential privacy: Simplifications, extensions, and lower bounds”. In: Theory of Cryptography Conference. Springer. 2016, pp. 635–658. [19] Yang Cao, Masatoshi Yoshikawa, Yonghui Xiao, and Li Xiong. “Quantifying differential privacy under temporal correlations”. In:2017IEEE33rdInternationalConferenceonData Engineering (ICDE). IEEE. 2017, pp. 821–832. 146 [20] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. “The secret sharer: Measuring unintended neural network memorization & extracting secrets”. In: arXiv preprint arXiv:1802.08232 (2018). [21] Buru Chang, Yonggyu Park, Donghyeon Park, Seongsoon Kim, and Jaewoo Kang. “Content-Aware Hierarchical Point-of-Interest Embedding Model for Successive POI Recommendation.” In: IJCAI. 2018, pp. 3301–3307. [22] Serina Chang, Emma Pierson, Pang Wei Koh, Jaline Gerardin, Beth Redbird, David Grusky, and Jure Leskovec. “Mobility network models of COVID-19 explain inequities and inform reopening”. In: Nature 589.7840 (2021), pp. 82–87. [23] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. “Differentially private empirical risk minimization.” In: Journal of Machine Learning Research 12.3 (2011). [24] Kamalika Chaudhuri and Staal A Vinterbo. “A stability-based validation procedure for differentially private machine learning”. In: Advances in Neural Information Processing Systems 26 (2013), pp. 2652–2660. [25] Rui Chen, Gergely Ács, and Claude Castelluccia. “Differentially private sequential data publication via variable-length n-grams”. 
In: the ACM Conference on Computer and Communications Security, CCS’12, Raleigh, NC, USA, October 16-18, 2012. Ed. by Ting Yu, George Danezis, and Virgil D. Gligor. ACM, 2012, pp. 638–649.doi: 10.1145/2382196.2382263. [26] Rui Chen, Benjamin C. M. Fung, Bipin C. Desai, and Nériah M. Sossou. “Differentially private transit data publication: a case study on the montreal transportation system”. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,KDD’12,Beijing,China,August12-16,2012. Ed. by Qiang Yang, Deepak Agarwal, and Jian Pei. ACM, 2012, pp. 213–221.doi: 10.1145/2339530.2339564. [27] Eunjoon Cho, Seth A Myers, and Jure Leskovec. “Friendship and mobility: user movement in location-based social networks”. In: Proc. of ACM SIGKDD Conf. on Knowledge discovery and data mining. 2011, pp. 1082–1090. [28] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. “Natural language processing (almost) from scratch”. In: Journal of machine learning research 12.Aug (2011), pp. 2493–2537. [29] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. “Differentially private spatial decompositions”. In: 2012 IEEE 28th International Conference on Data Engineering. IEEE. 2012, pp. 20–31. [30] Yves-Alexandre De Montjoye, César A Hidalgo, Michel Verleysen, and Vincent D Blondel. “Unique in the crowd: The privacy bounds of human mobility”. In: Scientific reports 3 (2013), p. 1376. 147 [31] Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. “Differentially private data cubes: optimizing noise sources and consistency”. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. 2011, pp. 217–228. [32] Cynthia Dwork. “Differential Privacy: A Survey of Results”. In: Theory and Applications of Models of Computation. 2008, pp. 1–19.isbn: 978-3-540-79228-4. [33] Cynthia Dwork. “Differential privacy: A survey of results”. In: International conference on theory and applications of models of computation. Springer. 2008, pp. 1–19. [34] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toni Pitassi, Omer Reingold, and Aaron Roth. “Generalization in adaptive data analysis and holdout reuse”. In: Advances in Neural Information Processing Systems. 2015, pp. 2350–2358. [35] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity in private data analysis”. In: Theory of cryptography conference. Springer. 2006, pp. 265–284. [36] Cynthia Dwork, Aaron Roth, et al. “The algorithmic foundations of differential privacy”. In: Foundations and Trends® in Theoretical Computer Science 9.3–4 (2014), pp. 211–407. [37] Cynthia Dwork and Guy N Rothblum. “Concentrated differential privacy”. In: arXiv preprint arXiv:1603.01887 (2016). [38] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. “Rappor: Randomized aggregatable privacy-preserving ordinal response”. In: Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. 2014, pp. 1054–1067. [39] Facebook Data for Good: High Resolution Density Maps. 2022.url: https://dataforgood.facebook.com/dfg/tools/high-resolution-population-density-maps (visited on 04/11/2022). [40] Maryam Fanaeepour and Benjamin I. P. Rubinstein. “Histogramming Privately Ever After: Differentially-Private Data-Dependent Error Bound Optimisation”. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE). 2018, pp. 1204–1207.doi: 10.1109/ICDE.2018.00111. [41] Shanshan Feng, Gao Cong, Bo An, and Yeow Meng Chee. 
“Poi2vec: Geographical latent representation for predicting future visitors”. In: Thirty-First AAAI Conference on Artificial Intelligence . 2017. [42] Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. “Auto-sklearn 2.0: The next generation”. In: arXiv preprint arXiv:2007.04074 24 (2020). 148 [43] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures”. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM. 2015, pp. 1322–1333. [44] Pierre Geurts, Damien Ernst, and Louis Wehenkel. “Extremely randomized trees”. In: Machine learning 63.1 (2006), pp. 3–42. [45] Robin C Geyer, Tassilo Klein, and Moin Nabi. “Differentially private federated learning: A client level perspective”. In: arXiv preprint arXiv:1712.07557 (2017). [46] Marco Gruteser and Dirk Grunwald. “Anonymous usage of location-based services through spatial and temporal cloaking”. In: Proceedings of the 1st international conference on Mobile systems, applications and services. ACM. 2003, pp. 31–42. [47] Michael U Gutmann and Aapo Hyvärinen. “Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics”. In: Journal of Machine Learning Research 13.Feb (2012), pp. 307–361. [48] Roan Gylberth, Risman Adnan, Setiadi Yazid, and T Basaruddin. “Differentially private optimization algorithms for deep neural networks”. In: 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS). IEEE. 2017, pp. 387–394. [49] Moritz Hardt, Katrina Ligett, and Frank McSherry. “A simple and practical algorithm for differentially private data release”. In: Advances in neural information processing systems 25 (2012). [50] Michael Hay. Ektelo. https://github.com/ektelo/ektelo. 2022. [51] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang. “Principled evaluation of differentially private algorithms using dpbench”. In: Proceedings of the 2016 International Conference on Management of Data. 2016, pp. 139–154. [52] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. “Boosting the accuracy of differentially-private histograms through consistency”. In: arXiv preprint arXiv:0904.0942 (2009). [53] Jamie Hayes, Luca Melis, George Danezis, and Emiliano De Cristofaro. “LOGAN: Membership inference attacks against generative models”. In: Proceedings on Privacy Enhancing Technologies 2019.1 (2019), pp. 133–152. [54] Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. “NAIS: Neural attentive item similarity model for recommendation”. In: IEEE Transactions on Knowledge and Data Engineering 30.12 (2018), pp. 2354–2366. 149 [55] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. “DeepDB: Learn from Data, not from Queries!” In: Proceedings of the VLDB Endowment 13.7 (2019). [56] Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. “Deep models under the GAN: information leakage from collaborative deep learning”. In: Proc. of ACM Conf. on Computer and Communications Security. 2017, pp. 603–618. [57] Tin Kam Ho. “Random decision forests”. In:Proceedingsof3rdinternationalconferenceon document analysis and recognition. Vol. 1. IEEE. 1995, pp. 278–282. [58] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural computation 9.8 (1997), pp. 1735–1780. 
[59] Florimond Houssiau, Luc Rocher, and Yves-Alexandre de Montjoye. “On the difficulty of achieving Differential Privacy in practice: user-level guarantees in aggregate location data”. In: Nature communications 13.1 (2022), pp. 1–3. [60] Daniel Im Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. “Denoising criterion for variational auto-encoding framework”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 31. 1. 2017. [61] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. “Private convex empirical risk minimization and high-dimensional regression”. In: Conference on Learning Theory. JMLR Workshop and Conference Proceedings. 2012, pp. 25–1. [62] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014). [63] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint arXiv:1312.6114 (2013). [64] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105. [65] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. “Noise2void-learning denoising from single noisy images”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2129–2137. [66] Jaewoo Lee and Daniel Kifer. “Concentrated differentially private gradient descent with adaptive per-iteration privacy budget”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM. 2018, pp. 1656–1665. 150 [67] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. “Noise2Noise: Learning image restoration without clean data”. In: arXiv preprint arXiv:1803.04189 (2018). [68] Omer Levy, Yoav Goldberg, and Ido Dagan. “Improving distributional similarity with lessons learned from word embeddings”. In: Transactions of the Association for Computational Linguistics 3 (2015), pp. 211–225. [69] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. “A Data- and Workload-Aware Algorithm for Range Queries under Differential Privacy”. In: Proc. VLDB Endow. 7.5 (Jan. 2014), pp. 341–352.issn: 2150-8097. [70] Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, and Yong Rui. “GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation”. In: Proc. of ACM SIGKDD international conference on Knowledge discovery and data mining. 2014, pp. 831–840. [71] List of United States cities by population. https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population. Accessed July 2021. 2021. [72] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. “Predicting the next location: A recurrent model with spatial and temporal contexts”. In: Thirtieth AAAI Conference on Artificial Intelligence . 2016. [73] Xin Liu, Yong Liu, and Xiaoli Li. “Exploring the Context of Locations for Personalized Location Recommendations.” In: IJCAI. 2016, pp. 1188–1194. [74] Ziqi Liu, Yu-Xiang Wang, and Alexander Smola. “Fast differentially private matrix factorization”. In: Proceedings of the 9th ACM Conference on Recommender Systems. ACM. 2015, pp. 171–178. [75] Min Lyu, Dong Su, and Ninghui Li. “Understanding the Sparse Vector Technique for Differential Privacy”. In: Proc. VLDB Endow. 10.6 (Feb. 2017), pp. 637–648.issn: 2150-8097. [76] Qingzhi Ma and Peter Triantafillou. “Dbest: Revisiting approximate query processing engines with machine learning models”. 
In: Proceedings of the 2019 International Conference on Management of Data. 2019, pp. 1553–1570. [77] Spyros Makridakis and Michele Hibon. “The M3-Competition: results, conclusions and implications”. In: International journal of forecasting 16.4 (2000), pp. 451–476. [78] Ryan McKenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala. “Optimizing error of high-dimensional statistical queries under differential privacy”. In: arXiv preprint arXiv:1808.03537 (2018). 151 [79] Ryan McKenna, Daniel Sheldon, and Gerome Miklau. “Graphical-model based estimation and inference for differential privacy”. In: International Conference on Machine Learning. PMLR. 2019, pp. 4435–4444. [80] H Brendan McMahan and Galen Andrew. “A General Approach to Adding Differential Privacy to Iterative Training Procedures”. In: arXiv preprint arXiv:1812.06210 (2018). [81] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. “Communication-efficient learning of deep networks from decentralized data”. In: arXiv preprint arXiv:1602.05629 (2016). [82] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. “Learning differentially private language models without losing accuracy”. In: ICLR (2018). [83] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. “Exploiting unintended feature leakage in collaborative learning”. In: IEEE Symposium on Security and Privacy. 2019. [84] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient estimation of word representations in vector space”. In: arXiv preprint arXiv:1301.3781 (2013). [85] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed representations of words and phrases and their compositionality”. In: Advances in neural information processing systems. 2013, pp. 3111–3119. [86] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černock ` y, and Sanjeev Khudanpur. “Recurrent neural network based language model”. In: Eleventh annual conference of the international speech communication association. 2010. [87] Ilya Mironov. “Rényi differential privacy”. In: 2017 IEEE 30th Computer Security Foundations Symposium (CSF). IEEE. 2017, pp. 263–275. [88] Henry B Moss, David S Leslie, and Paul Rayson. “Using JK fold cross validation to reduce variance when tuning NLP models”. In: arXiv preprint arXiv:1806.07139 (2018). [89] Lev Muchnik, Sen Pei, Lucas C Parra, Saulo DS Reis, José S Andrade Jr, Shlomo Havlin, and Hernán A Makse. “Origins of power-law degree distribution in the heterogeneity of human activity in social networks”. In: Scientific reports 3.1 (2013), pp. 1–8. [90] A. Narayanan and V. Shmatikov. “Robust De-anonymization of Large Sparse Datasets”. In: 2008 IEEE Symposium on Security and Privacy (sp 2008). 2008. [91] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. “Smooth sensitivity and sampling in private data analysis”. In: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. 2007, pp. 75–84. 152 [92] Tongyao Pang, Huan Zheng, Yuhui Quan, and Hui Ji. “Recorrupted-to-Recorrupted: Unsupervised Deep Learning for Image Denoising”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021, pp. 2043–2052. [93] Nicolas Papernot, Steve Chien, Shuang Song, Abhradeep Thakurta, and Ulfar Erlingsson. “Making the shoe fit: Architectures, initializations, and tuning for learning with privacy”. In: (2019). [94] Rafael Pérez-Torres, César Torres-Huitzil, and Hiram Galeana-Zapién. 
“Full on-device stay points detection in smartphones for location-based mobile applications”. In: Sensors 16.10 (2016), p. 1693. [95] Michal Piorkowski, Natasa Sarafijanovic-Djukic, and Matthias Grossglauser. CRAWDAD data set epfl/mobility (v. 2009-02-24) . 2009. [96] Wahbeh Qardaji, Weining Yang, and Ninghui Li. “Differentially private grids for geospatial data”. In: 2013 IEEE 29th international conference on data engineering (ICDE). IEEE. 2013, pp. 757–768. [97] Wahbeh Qardaji, Weining Yang, and Ninghui Li. “Understanding hierarchical methods for differentially private histograms”. In: Proceedings of the VLDB Endowment 6.14 (2013), pp. 1954–1965. [98] Yuhui Quan, Mingqin Chen, Tongyao Pang, and Hui Ji. “Self2self with dropout: Learning self-supervised denoising from single image”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020, pp. 1890–1898. [99] Daniele Quercia, Ilias Leontiadis, Liam McNamara, Cecilia Mascolo, and Jon Crowcroft. “Spotme if you can: Randomized responses for location obfuscation on mobile phones”. In: 2011 31st International Conference on Distributed Computing Systems. IEEE. 2011, pp. 363–372. [100] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. “Generating diverse high-fidelity images with vq-vae-2”. In: Advances in neural information processing systems 32 (2019). [101] Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. “Factorizing personalized markov chains for next-basket recommendation”. In: Proc. of Intl. Conf. on World Wide Web. 2010, pp. 811–820. [102] Douglas A Reynolds. “Gaussian mixture models.” In: Encyclopedia of biometrics 741 (2009), pp. 659–663. [103] Daniele Riboni and Claudio Bettini. “Differentially-private release of check-in data for venue recommendation”. In: 2014 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE. 2014, pp. 190–198. 153 [104] Xin Rong. “word2vec parameter learning explained”. In: arXiv preprint arXiv:1411.2738 (2014). [105] SafegraphWeeklyPatterns. https://docs.safegraph.com/docs/weekly-patterns. Accessed: 2022-04-11. [106] Magnus Sahlgren. “The distributional hypothesis”. In: Italian Journal of Disability Studies 20 (2008), pp. 33–53. [107] Adam Sealfon and Jonathan Ullman. “Efficiently Estimating Erdos-Renyi Graphs with Node Differential Privacy”. In: Journal of Privacy and Confidentiality 11.1 (2021). [108] Hyejin Shin, Sungwook Kim, Junbum Shin, and Xiaokui Xiao. “Privacy enhanced matrix factorization for recommendation with local differential privacy”. In: IEEE Transactions on Knowledge and Data Engineering 30.9 (2018), pp. 1770–1782. [109] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership inference attacks against machine learning models”. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE. 2017, pp. 3–18. [110] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. “Stochastic gradient descent with differentially private updates”. In: 2013 IEEE Global Conference on Signal and Information Processing. IEEE. 2013, pp. 245–248. [111] Sasha Targ, Diogo Almeida, and Kevin Lyman. “Resnet in resnet: Generalizing residual architectures”. In: arXiv preprint arXiv:1603.08029 (2016). [112] Hien To, Gabriel Ghinita, Liyue Fan, and Cyrus Shahabi. “Differentially Private Location Protection for Worker Datasets in Spatial Crowdsourcing”. In: IEEE Trans. Mob. Comput. 16.4 (2017), pp. 934–949.doi: 10.1109/TMC.2016.2586058. 
[113] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. “Conditional image generation with pixelcnn decoders”. In: Advances in neural information processing systems 29 (2016). [114] Aaron Van Den Oord, Oriol Vinyals, et al. “Neural discrete representation learning”. In: Advances in neural information processing systems 30 (2017). [115] Veraset. https://www.veraset.com/about-veraset. Accessed: 2021-05-10. 2021. [116] Veraset Movement Data for the OCONUS. https://datarade.ai/data-products/veraset-movement-data-for-the-oconus-the- largest-deepest-and-broadest-available-movement-dataset-veraset. Accessed: 2021-07-20. 2021. 154 [117] Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. “Subsampled Renyi Differential Privacy and Analytical Moments Accountant”. In: arXiv preprint arXiv:1808.00087 (2018). [118] Benjamin J Wilson and Adriaan MJ Schakel. “Controlled experiments for word embeddings”. In: arXiv preprint arXiv:1510.02675 (2015). [119] Hanwei Wu and Markus Flierl. “Vector quantization-based regularization for autoencoders”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 34. 04. 2020, pp. 6380–6387. [120] Xi Wu, Matthew Fredrikson, Somesh Jha, and Jeffrey F Naughton. “A methodology for formalizing model-inversion attacks”. In: 2016 IEEE 29th Computer Security Foundations Symposium (CSF). IEEE. 2016, pp. 355–370. [121] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. “Bolt-on differential privacy for scalable stochastic gradient descent-based analytics”. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, pp. 1307–1322. [122] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. “Differential privacy via wavelet transforms”. In: IEEE Transactions on knowledge and data engineering 23.8 (2010), pp. 1200–1214. [123] Yonghui Xiao, Li Xiong, Liyue Fan, and Slawomir Goryczka. “Dpcube: differentially private histogram release through multidimensional partitioning”. In: arXiv preprint arXiv:1202.5358 (2012). [124] Xin Xin, Xiangnan He, Yongfeng Zhang, Yongdong Zhang, and Joemon Jose. “Relational Collaborative Filtering: Modeling Multiple Item Relations for Recommendation”. In: arXiv preprint arXiv:1904.12796 (2019). [125] Ming Xu, Matias Quiroz, Robert Kohn, and Scott A Sisson. “Variance reduction properties of the reparameterization trick”. In: The 22nd International Conference on Artificial Intelligence and Statistics . PMLR. 2019, pp. 2711–2720. [126] Anatoly Yakovlev, Hesam Fathi Moghadam, Ali Moharrer, Jingxiao Cai, Nikan Chavoshi, Venkatanathan Varadarajan, Sandeep R Agrawal, Sam Idicula, Tomas Karnagel, Sanjay Jinturkar, et al. “Oracle automl: a fast and predictive automl pipeline”. In: Proceedings of the VLDB Endowment 13.12 (2020), pp. 3166–3180. [127] Carl Yang, Lanxiao Bai, Chao Zhang, Quan Yuan, and Jiawei Han. “Bridging collaborative filtering and semi-supervised learning: A neural approach for poi recommendation”. In: Proc. of ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining. 2017, pp. 1245–1254. 155 [128] Dingqi Yang, Bingqing Qu, Jie Yang, and Philippe Cudre-Mauroux. “Revisiting User Mobility and Social Relationships in LBSNs: A Hypergraph Embedding Approach”. In: Proc. of Intl. Conf. on World Wide Web. 2019. [129] Dingqi Yang, Daqing Zhang, Longbiao Chen, and Bingqing Qu. “Nationtelescope: Monitoring and visualizing large-scale collective behavior in lbsns”. In: Journal of Network and Computer Applications 55 (2015), pp. 170–180. 
[130] Yang Ye, Yu Zheng, Yukun Chen, Jianhua Feng, and Xing Xie. “Mining individual life pattern based on location history”. In: 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware. IEEE. 2009, pp. 1–10.
[131] Quan Yuan, Gao Cong, Zongyang Ma, Aixin Sun, and Nadia Magnenat Thalmann. “Time-aware point-of-interest recommendation”. In: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2013, pp. 363–372.
[132] Sepanta Zeighami, Ritesh Ahuja, Gabriel Ghinita, and Cyrus Shahabi. Private STHoles Implementation. https://github.com/szeighami/stholes. 2021.
[133] Sepanta Zeighami, Ritesh Ahuja, Gabriel Ghinita, and Cyrus Shahabi. SNH Implementation. https://github.com/szeighami/snh. 2021.
[134] Sepanta Zeighami, Ritesh Ahuja, Gabriel Ghinita, and Cyrus Shahabi. SNH Technical Report. https://infolab.usc.edu/DocsDemos/snh.pdf. 2021.
[135] Sepanta Zeighami and Cyrus Shahabi. “NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond”. In: arXiv preprint arXiv:2107.04922 (2021).
[136] Chao Zhang, Keyang Zhang, Quan Yuan, Luming Zhang, Tim Hanratty, and Jiawei Han. “GMove: Group-level mobility modeling using geo-tagged social media”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 1305–1314.
[137] Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. “Ektelo: A framework for defining differentially-private computations”. In: Proceedings of the 2018 International Conference on Management of Data. 2018, pp. 115–130.
[138] Jia Dong Zhang, Gabriel Ghinita, and Chi Yin Chow. “Differentially private location recommendations in geosocial networks”. In: 2014 IEEE 15th International Conference on Mobile Data Management. Vol. 1. IEEE. 2014, pp. 59–68.
[139] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. “PrivBayes: Private data release via Bayesian networks”. In: ACM Transactions on Database Systems (TODS) 42.4 (2017), pp. 1–41.
[140] Jun Zhang, Xiaokui Xiao, and Xing Xie. “PrivTree: A differentially private algorithm for hierarchical decompositions”. In: Proceedings of the 2016 International Conference on Management of Data. 2016, pp. 155–170.
[141] Xiaojian Zhang, Rui Chen, Jianliang Xu, Xiaofeng Meng, and Yingtao Xie. “Towards accurate histogram publication under differential privacy”. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM. 2014, pp. 587–595.
[142] Shenglin Zhao, Tong Zhao, Irwin King, and Michael R. Lyu. “Geo-Teaser: Geo-temporal sequential embedding rank for point-of-interest recommendation”. In: Proc. of Intl. Conf. on World Wide Web. 2017, pp. 153–162.
[143] Dihan Zheng, Sia Huat Tan, Xiaowen Zhang, Zuoqiang Shi, Kaisheng Ma, and Chenglong Bao. “An Unsupervised Deep Learning Approach for Real-World Image Denoising”. In: International Conference on Learning Representations. 2020.
[144] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. “Random erasing data augmentation”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 07. 2020, pp. 13001–13008.
Conceptually similar
Efficient crowd-based visual learning for edge devices
Privacy-aware geo-marketplaces
Mechanisms for co-location privacy
Location-based spatial queries in mobile environments
Practice-inspired trust models and mechanisms for differential privacy
Privacy in location-based applications: going beyond K-anonymity, cloaking and anonymizers
Generalized optimal location planning
Responsible AI in spatio-temporal data processing
Modeling intermittently connected vehicular networks
Ensuring query integrity for spatial data in the cloud
Modeling and predicting with spatial-temporal social networks
Spatiotemporal traffic forecasting in road networks
Building straggler-resilient and private machine learning systems in the cloud
Human appearance analysis and synthesis using deep learning
Striking the balance: optimizing privacy, utility, and complexity in private machine learning
Differentially private and fair optimization for machine learning: tight error bounds and efficient algorithms
Learning controllable data generation for scalable model training
Location privacy in spatial crowdsourcing
Heterogeneous federated learning
Deep learning for subsurface characterization and forecasting
Asset Metadata
Creator: Ahuja, Ritesh (author)
Core Title: Differentially private learned models for location services
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2022-08
Publication Date: 11/21/2022
Defense Date: 04/29/2022
Publisher: University of Southern California. Libraries (digital)
Tag: differential privacy, learned models, location datasets, location-based networks, machine learning, OAI-PMH Harvest, privacy
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Shahabi, Cyrus (committee chair), Korolova, Aleksandra (committee member), Krishnamachari, Bhaskar (committee member)
Creator Email: riteshah@usc.edu, riteshahuja13@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111336132
Unique identifier: UC111336132
Identifier: etd-AhujaRites-10723.pdf (filename)
Legacy Identifier: etd-AhujaRites-10723
Document Type: Thesis
Rights: Ahuja, Ritesh
Internet Media Type: application/pdf
Type: texts
Source: 20220527-usctheses-batch-944 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu