Responsible AI in Spatio-Temporal Data Processing
by
Sina Shaham
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2024
Dedication
I would like to dedicate this dissertation to Professor Gabriel Ghinita, whose tremendous
help and unwavering support have been instrumental throughout my Ph.D. journey. His
profound knowledge and exceptional professionalism have not only enriched my learning
experience but have also profoundly influenced my academic and personal growth.
Acknowledgements
In the realm of knowledge, where thoughts take flight, I pen these words with gratitude
bright. To powers unseen, whose presence divine, guided my steps, through dark and shine.
A force unknown, yet felt in the heart, in life’s every phase, played its part.
With a heart full of gratitude, I extend my deepest appreciation to my advisor, Professor
Bhaskar Krishnamachari. His approach to mentorship, his endless patience, and his life
philosophy have not only been exemplary but also a significant source of inspiration and
aspiration for me. I am truly fortunate to have had such an exceptional guide in my life.
Furthermore, my sincere thanks go to Professor Cyrus Shahabi for enabling me to be a
part of USC and for his unwavering support throughout this journey. His profound
knowledge and the unique ability to be both a friend and an advisor have been invaluable
to my growth and success.
I am also very grateful to my father, Rahman Shaham, a dedicated medical doctor who
devoted his life to serving the less fortunate in the poorest regions of my hometown. And to
my mother, a compassionate psychiatrist, who has tirelessly worked to aid those battling
drug addiction and mental health issues in the suburbs of Tehran. Their selfless dedication
to helping others has been a constant source of inspiration and has shaped my values and
aspirations.
Lastly, I extend my gratitude to all my coauthors from both the academic and industrial
sectors, including Prof. Matthew E. Kahn, Prof. John Krumm, Prof. Pubudu N Pathirana,
Prof. Yao-Yi Chiang, Dr. Sara Abdali, Dr. Charith Peris, Dr. Dinh C Nguyen, Haowen
Lin, Arash Hajisafi, and Minh K Quan. Their contributions and teamwork were
indispensable to this journey’s success.
Sina Shaham
University of Southern California
March 2024
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Privacy
  2.1 Preliminaries
  2.2 Differential Privacy
Chapter 3: Fairness
  3.1 Preliminaries
  3.2 Bias
  3.3 Philosophies of Fairness in Context
  3.4 Notions and Definitions
Chapter 4: State-of-the-art
  4.1 Privacy
    4.1.1 DP Publication of Location Histograms
    4.1.2 DP Publication of Time Series
  4.2 Fairness
    4.2.1 Fairness in ML
    4.2.2 Fairness in Spatial Domain
Chapter 5: HTF: Homogeneous Tree Framework for Differentially-Private Release of Location Data
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Problem Formulation
  5.3 Homogeneous-Tree Framework
  5.4 Technical Approach
    5.4.1 Homogeneity-based Partitioning
    5.4.2 HTF Index Structure Construction
    5.4.3 Leaf Node Count Perturbation
  5.5 Experimental Evaluation
    5.5.1 Experimental Setup
    5.5.2 HTF vs Data Dependent Algorithms
    5.5.3 HTF vs Grid-based Algorithms
    5.5.4 HTF vs Data Independent Algorithms
    5.5.5 Additional Benchmarks
Chapter 6: Fair Spatial Indexing: A Paradigm for Group Spatial Fairness
  6.1 Introduction
  6.2 Preliminaries
    6.2.1 System Architecture
    6.2.2 Fairness Metric
    6.2.3 Problem Formulation
    6.2.4 Evaluation Metrics
  6.3 Spatial Fairness through Indexing
    6.3.1 Fair KD-tree
    6.3.2 Iterative Fair KD-tree
    6.3.3 Multi-Objective Fair KD-tree
  6.4 Experimental Evaluation
    6.4.1 Experimental Setup
    6.4.2 Evidence for Disparity in Geospatial ML
    6.4.3 Mitigation Algorithms
    6.4.4 Performance of Multi-Objective Approach
    6.4.5 Synthetic Data Results
    6.4.6 Multi-Objective Performance Evaluation
Chapter 7: Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops
  7.1 Introduction
  7.2 Preliminaries
    7.2.1 Problem Statement
    7.2.2 Trajectory Modeling with OD Matrices
  7.3 Data-Independent Approaches
    7.3.1 Extended Uniform Grid (EUG)
    7.3.2 Entropy-based Partitioning (EBP)
  7.4 Data-Dependent Approaches
    7.4.1 Overview
    7.4.2 DAF-Entropy
    7.4.3 DAF-Homogeneity
    7.4.4 Budget Allocation
  7.5 Experimental Evaluation
    7.5.1 Experimental Setup
    7.5.2 Results on Synthetic Datasets
    7.5.3 Results on Real-World Datasets
Chapter 8: Privacy-Preserving Publication of Electricity Datasets
  8.1 Introduction
  8.2 Preliminaries
    8.2.1 System Model
    8.2.2 Differential Privacy Principles
  8.3 Problem Statement
    8.3.1 Time-Series Representation
    8.3.2 Problem Formulation
    8.3.3 A Simple Strategy
  8.4 STPT Algorithm
    8.4.1 Overview
    8.4.2 Pattern Recognition
    8.4.3 Sanitization Algorithm
  8.5 Experimental Evaluation
    8.5.1 Experimental Setup
    8.5.2 Performance Evaluation
    8.5.3 Detailed Evaluation
Chapter 9: Fairness-Aware Emergency Demand Response Program for Electricity
  9.1 Introduction
  9.2 Preliminaries
    9.2.1 Notation
    9.2.2 Message Passing Framework
  9.3 Proposed Scheme
    9.3.1 Program Overview
    9.3.2 Pricing
    9.3.3 Rate Hikes
  9.4 Implementation Framework
    9.4.1 Pattern Recognition GNN
    9.4.2 Household Selection GNN
  9.5 Fairness-Aware Candidate Selection
    9.5.1 Fairness Metric & Evaluation
    9.5.2 Proposed Approach
  9.6 Experimental Evaluation
    9.6.1 Datasets
    9.6.2 Experimental Setup
    9.6.3 Performance Analysis of ILB
    9.6.4 Evaluating the Integrated Framework and ILB
    9.6.5 Fairness-Aware Selection of Participants
Chapter 10: Conclusion
  10.1 Summary
  10.2 Future Directions
    10.2.1 Privacy and Fairness in Large Language Models
    10.2.2 Fairness Through Privacy
    10.2.3 Fair Privacy Protection
    10.2.4 Going Beyond Race and Gender
    10.2.5 Other Applications
Bibliography
List of Tables
6.1 Summary of Notations.
7.1 Summary of Notations.
7.2 Summary of Compared Approaches.
7.3 Running Time of Algorithms (Seconds), 2D, ϵ = 0.1.
8.1 Summary of Notations.
8.2 Electricity Consumption Data Summary.
9.1 Summary of Notations.
9.2 Accuracy of ILB Candidate Selection Framework.
List of Figures
1.1 Categorization of Chapters Based on Data Complexity, Trustworthiness Pillar and Data Type.
3.1 Computation of Discrimination Based on Different Metrics.
5.1 System Model for Private Location Histograms.
5.2 Example of HTF Partitioning. Dashed Rectangles Represent the Query.
5.3 Comparison with Data Dependent Algorithms, Los Angeles Dataset.
5.4 Comparison to Grid-based Algorithms, Los Angeles Dataset.
5.5 Comparison to Data Independent Algorithms, Los Angeles Dataset.
5.6 Mixed Workloads, ϵtot = 0.1, All Datasets.
6.1 An Example of the Miscalibration Problem with Respect to Neighborhoods.
6.2 Overview of the Proposed Mitigation Techniques.
6.3 Overview of Fair KD-tree Algorithm.
6.4 Overview of Iterative Fair KD-tree Algorithm.
6.5 Aggregation in Multi-Objective Fair KD-tree.
6.6 Evidence of Model Disparity on Geospatial Neighborhoods.
6.7 Performance Evaluation with Respect to ENCE.
6.8 Performance Evaluation with Respect to Other Indicators.
6.9 Impact of Features on Decision-making.
6.10 Performance Evaluation of Multi-objective Algorithm.
6.11 Performance Evaluation on Synthetic Data.
6.12 Multi-objective Performance Evaluation.
7.1 System Model for Private Frequency Matrices.
7.2 Capturing Trajectory Data Using OD Matrices.
7.3 Intuition Behind DAF Sanitization Approaches.
7.4 Synthetic Dataset Results, Gaussian Distribution, Random Shape and Size Queries.
7.5 Synthetic Dataset Results, Zipf Distribution, Random Shape and Size Queries, ϵtot = 0.1.
7.6 Population Histograms in 2D for Real Datasets.
7.7 Population Histograms in 2D on Real Datasets, no Baselines.
7.8 Origin-Destination Matrices in 4D, Real Datasets.
8.1 System Model.
8.2 Consumption Matrix in Different Stages.
8.3 Workflow of Proposed Approach.
8.4 Total Weekly Consumption per Week Day of Datasets.
8.5 Assessment of Algorithm Performance Across Varied Datasets and Query Types.
8.6 Detailed Analysis of STPT.
9.1 Pattern Recognition Graph.
9.2 Household Selection Graph.
9.3 Household Similarity in Five Counties: Higher Values Show Greater Similarity.
9.4 Average Hourly Consumption and Standard Deviation Error of Counties.
9.5 Evaluation of Total Responsiveness Cost.
9.6 Evaluation of Total Demand Reduction.
9.7 Responsiveness Performance of ILB.
9.8 Rate Hike on Non-participants in ILB.
9.9 Pareto Analysis of Algorithms.
Abstract
The fusion of Algorithms and Machine Learning (ML), broadly categorized as Artificial
Intelligence (AI), has been a forefront drive of innovation in the recent decade, propelling
humanity to unprecedented heights in science and technology. The typical process of an AI
model or algorithm begins with gathering and utilizing user data, analyzing and learning
patterns, and ultimately producing valuable and actionable insights for users. A key source
of information used in these models is the spatial and temporal information of users,
including the geographical whereabouts, user trajectories, health-related data such as
individuals’ heartbeats and patient visits over time, shopping habits, and the energy
consumption of users. Such data enables AI models to offer a wide range of applications,
including sophisticated decision-making, understanding user behavior, personalized
advertising, predicting movements, and aiding in everyday tasks, thereby improving daily
life and enhancing the development of smart city infrastructure. Nevertheless, the immense
benefits of AI also bring significant responsibilities and have raised societal concerns about
their trustworthiness and accountability. Questions arise about user data control, fairness in
treating users and communities, consideration of long-standing issues like gender and racial
equality, and the reflection of societal values in the design and implementation of AI models.
This dissertation explores the following thesis: It is feasible to design AI/ML algorithms
that process and publish spatio-temporal datasets, such as location trajectories and
electricity time series, in a way that not only enhances privacy and fairness but also does
so without incurring a significant loss of utility.
In this work, we methodically explore the design and development of algorithms aimed
at enhancing privacy and fairness in handling spatio-temporal data. After providing the
required background and an overview of the state-of-the-art, our first step involves
introducing an algorithm dedicated to publishing location histograms, a typical process
used by the US Census Bureau. This algorithm is designed with a focus on Differential
Privacy (DP) to ensure privacy in the gathering and dissemination of user statistics.
Subsequently, we delve into the challenge of fairness in the processing of spatial user data.
We highlight the existing biases in ML models related to the geospatial neighborhoods of
users. By developing a new algorithm, our goal is to mitigate this unfairness. The proposed
approach is designed to reconfigure the geospatial neighborhoods of users, altering their
representation in a way that seeks to enable more equitable treatment of their data in ML
models.
In the next phase, we incorporate the temporal dimension into the spatial data, thereby
increasing its complexity, and concentrate on the ethical implications in processing
spatio-temporal datasets. Specifically, we focus on user energy consumption and develop an
algorithm to demonstrate how user privacy can be maintained while sharing their energy
usage data with third parties and untrusted entities. Following this, we propose an
incentive-based program aimed at balancing electricity demand, taking into account
socio-economic family attributes and ensuring fair treatment. The dissertation then shifts
its focus to the publication of spatio-temporal trajectories of users. We explore how such
data can be responsibly released for industrial applications and data analysis in a sanitized,
privacy-preserving manner. Extensive numerical evaluations are conducted at every step of
the dissertation, involving experiments with both real-world and synthetic datasets. These
evaluations are diverse, covering various aspects such as query types, geospatial regions,
data distribution, and encompassing a wide range of evaluation metrics and performance
analyses. Through these comprehensive evaluations, the dissertation not only demonstrates
the progress made over previous works but also sheds light on potential areas for future
studies, particularly in the realm of responsible handling of complex spatio-temporal data.
Chapter 1
Introduction
Machine Learning (ML) and Algorithms under the umbrella of Artificial Intelligence (AI)
have brought significant advancements to society, revolutionizing how we interact with
technology and data. These technologies excel in analyzing and interpreting complex
patterns, making them indispensable tools in various sectors. In healthcare, AI-driven
diagnostics improve patient outcomes by identifying diseases early and accurately. In the
realm of transportation, AI algorithms optimize traffic flow, reducing congestion and
environmental impact. Furthermore, AI and ML have substantial implications in enhancing
personalized user experiences in digital platforms, tailoring content and services to
individual preferences and behaviors. This personalized approach is particularly evident in
the context of spatial and temporal data recording of users, where they enable the precise
tracking and analysis of user movements and interactions over time and space. Such
capabilities not only enhance user experience but also open paths for improved urban
planning, targeted marketing, and more efficient resource allocation, ultimately contributing
to smarter, more responsive services in a connected world.
Trustworthiness. The initial belief that more data would result in better
decision-making in the world of AI was quickly shattered as it became clear that accurate
algorithms alone are not enough to make responsible decisions [37, 24, 86]. The significance
of trustworthiness in AI can be explored by making an analogy with the stages of human
development. Consider a child who inherits characteristics from their parents – this is akin
to the initial model selection in AI, taking into account the mathematical limitations
inherent in the chosen structure. As the child matures, they absorb crucial knowledge
during their formative years - comparable to an AI model being trained with carefully
selected datasets. The child’s interaction with their socio-economic environment shapes
their behavior and choices, just as an AI model’s responses are influenced by the dataset it
interacts with and the feedback it receives. The child experiences both opportunities and
limitations in society, just as an AI model’s functionality is affected by the boundaries of its
mathematical design and the quality of its datasets. Ultimately, the child matures into an
individual whose ethical decisions impact those around them, much like an AI model that
must make responsible decisions affecting real-world outcomes. To develop a trustworthy AI
pipeline, each element in the learning cycle, like each stage in a child’s life, carries shared
responsibility. In this research, we focus on understanding and integrating the two main
pillars of trustworthy AI, i.e., Privacy and Fairness, on Spatial and Temporal data
generated by communities.
Privacy. Information privacy refers to an individual’s right to maintain a certain level
of control over how their personal data is gathered and utilized [133]. Take, for instance, a
photo shared by an individual on social media platforms, intended solely for communication
and social interaction. Even basic data mining techniques can extract sensitive information
from this image, which could be exploited by malicious attackers. Aspects like ornaments,
background, and facial features may inadvertently disclose the individual’s religion,
geographical location, gender, race, or other sensitive details. This raises the question of
how much control users have over their own data. Expanding this concept to the vast
quantities of data utilized in modern AI models highlights the importance of privacy for
ensuring trustworthy AI. Incidents like the 2020 Facebook scandal [137] and the Edward
Snowden revelations [31] underscore the critical nature of user data privacy in the context
of AI.
Fairness. From a different perspective, while privacy deals with the extent of control
over data, fairness aims to ensure that the revealed user information is handled fairly and
equitably. The philosophical notions of fairness have existed for centuries [6]; however, with
the rapid growth of ML, algorithmic fairness and its application to AI and society have
emerged as some of the most critical challenges of the decade. Unfortunately, models
intended to intelligently avoid errors and biases in decision-making have themselves become
sources of bias and discrimination within society. Various forms of unfairness in AI have
raised concerns, including racial biases in criminal justice systems [7] and disparities in
employment [103] and loan approval processes [68]. The entire life-cycle of an AI model –
encompassing input data, modeling, evaluation, and feedback – is vulnerable to both
external and inherent biases, leading to unjust outcomes. Compounding the issue is the
tendency of the pipeline’s life-cycle to amplify biases due to oversimplification and
assumptions made throughout the process. Moreover, unlike the concept of privacy, for
which there are well-defined and accepted metrics, the large number of varied and often
conflicting definitions of fairness presents a significant challenge in establishing trustworthy
AI systems.
Responsible AI & Communities. Combining the immense advantages that AI offers
to society with the risks and challenges it poses, particularly concerning privacy and
fairness, a critical question emerges: How can communities harness these benefits while
ensuring their privacy is protected and they are treated fairly? This dissertation aims to
explore and provide answers to this significant question.
The chapters, based on their application, are divided into two primary categories:
Movement Datasets and Electricity Datasets. Movement data pertains to the spatial
information of users, either captured as stationary points or recorded over time in
trajectories, forming longitudinal datasets. Examples that inspire this category include:
• Population histograms, which serve as a crucial source of information for
decision-making in both political and economic spheres. For instance, in the United
States, the Census Bureau is dedicated to collecting and disseminating such
information. A significant challenge that arises is how to publish this data privately
without compromising the identities of individuals, to enable further processing and
analysis of the information.
• Origin-destination matrices, which illustrate the movement of users between
geographically tagged areas, are instrumental for planning and decision-making
purposes. For example, the placement of charging stations for electric vehicles is often
informed by this data. A pressing question that emerges is how to share such data
with the public and third parties in a manner that safeguards the privacy of
individuals.
The energy consumption data, specifically electricity data, represents another significant
dataset examined in this dissertation, given its critical role in society. Two critical
applications highlighting the necessity of such datasets are outlined in the following:
• Electricity time series data are essential for forecasting future electricity demand at
various scales, from individual households to entire regions. Accurate demand
forecasts enable utility companies to plan generation and distribution effectively,
ensuring that supply meets the fluctuating demand efficiently. This is crucial for
optimizing operations, reducing energy wastage, and ensuring reliability of the power
grid. Advanced machine learning models and statistical techniques are often applied
to historical time series data to predict peak times, seasonal trends, and overall
demand patterns.
Figure 1.1: Categorization of Chapters Based on Data Complexity, Trustworthiness Pillar
and Data Type.
• Time series data are instrumental in the real-time management of electricity loads
across the power grid. By analyzing patterns of electricity use over time, utility
providers can implement strategies for load balancing, which is the process of
adjusting or distributing energy load in a way that maintains the stability of the
power system. This is particularly important for integrating renewable energy sources,
which can be variable and less predictable than traditional sources. Effective load
balancing helps in reducing the reliance on peaking power plants, which are expensive
and have a higher environmental impact, thereby facilitating more sustainable and
cost-effective energy management.
The contextual structure of the chapters is demonstrated in Figure 1.1. The dissertation
initially concentrates on data of lower complexity, considering only the spatial information
of users. Chapter 5 is dedicated to exploring the privacy concerns related to such data,
while Chapter 6 examines the fairness in handling this data. The latter part of the
dissertation expands the focus to include both the temporal and spatial dimensions of data.
Chapters 7 and 8 are geared towards tackling privacy issues associated with processing this
comprehensive data, and Chapter 9 is aimed at addressing fairness concerns in the context
of spatio-temporal data.
The structure of the document is organized as follows: Chapter 2 and Chapter 3 provide
an overview of privacy and fairness in the context of spatio-temporal data processing.
Chapter 4 offers a detailed review of relevant literature, pinpointing the gaps that this work
aims to fill. In Chapter 5, an algorithm for the privacy-preserving publication of location
histograms is discussed. Chapter 6 details a strategy for incorporating fairness in the use of
spatial data within machine learning models. Chapter 7 focuses on the privacy-preserving
publication of trajectories. Chapter 8 focuses on a privacy-preserving technique for
releasing spatio-temporal data, particularly in relation to electricity datasets, while
Chapter 9 introduces a fairness-aware method for the management of electricity, especially
in times of crisis. Chapter 10 concludes the dissertation and provides future directions.
In the following, we summarize the contributions made by key technical chapters, the
questions they address, and their organization.
Chapter 5 introduces an innovative approach for releasing population histograms. The
chapter starts with a formulation of the problem and a description of the system model.
Following this, the proposed method is detailed. The chapter concludes with an extensive
review and comparison with existing state-of-the-art methodologies. This research has been
published as a conference paper [119] and later as a journal article [117].
Chapter 6 presents a new method aimed at ensuring fair representation of households
in ML models. The primary objective is to maximize the utility of the spatial model within
the ML framework while safeguarding against any unfair treatment of individuals from
diverse neighborhoods. The chapter begins with a detailed problem formulation and system
model description, then proceeds to describe the proposed methodology and its
experimental evaluation. The outcomes of this novel approach have been published in [123].
Chapter 7 concentrates on formulating techniques for privacy-preserving publication of
multi-dimensional origin-destination matrices, a critical aspect in contexts such as
COVID-19 spread analysis. The chapter initiates by identifying the current shortcomings
and privacy concerns inherent in OD matrices, then advances to propose several strategies
for their privacy-preserving release while ensuring data utility. Chapter 7 concludes with a
detailed experimental assessment, employing a mix of real-world and synthetic datasets.
Chapter 8 introduces a new strategy for the privacy-preserving release of electricity
time series data. The chapter commences with the formulation of the problem and an
outline of the system model. It then progresses to explain how time series data can be
represented as spatio-temporal histograms and delves into the proposed algorithm designed
for the private publication of these histograms.
Chapter 9 is dedicated to a fairness-aware method for managing electricity demand
and supply during crisis situations. The chapter initiates with the formulation of the
problem and a description of the system model. It then introduces the proposed approach,
emphasizing how neighborhoods can be represented in a fair manner. The chapter
concludes with an experimental evaluation and a comparative analysis of the method.
Chapter 2
Privacy
This section is centered on Privacy [125, 78, 1, 79, 113, 114, 130, 115, 116, 159, 110, 122], a
fundamental component of Responsible AI. It explores essential ideas necessary for grasping
the concept of privacy, paying special attention to a privacy metric known as Differential
Privacy, and the Laplace mechanism, a principal technique for implementing differential
privacy.
2.1 Preliminaries
ML has transformed various industries, including healthcare [92, 111],
telecommunications [128, 112, 147], transportation [138, 127], and finance [34, 82]. The
capacity of these algorithms to process vast quantities of data, identify patterns, generate
predictions, and deliver precise and efficient recommendations is remarkable. However, the
use of personal data in ML models has given rise to significant privacy concerns. A primary
concern is the potential misuse of sensitive personal information [93], such as names,
addresses, social security numbers, and medical records. If not adequately safeguarded,
such data could lead to identity theft, financial fraud, and other adverse consequences. The
issue of data privacy was starkly highlighted in 2018 when Cambridge Analytica illicitly
harvested data from millions of Facebook users [58]. This data was subsequently used to
craft effective political advertisements during the 2016 US presidential election [11],
sparking widespread concern over the use of personal information in political campaigns.
To address these privacy concerns, researchers are developing privacy-preserving
algorithms that secure sensitive data while allowing accurate and efficient model creation.
Differential privacy (DP) is one such technique, which adds noise to the data to prevent
individual records from being identified [36]. Another technology that allows data to be
processed without being decrypted is homomorphic encryption [3], which ensures that
sensitive information is never divulged. While some research have focused on specific
privacy concerns connected to certain types of ML learning methodologies, such as
supervised, unsupervised, semi-supervised, and RL, a comprehensive evaluation of all
potential privacy hazards and remedies is still lacking. Furthermore, there may be privacy
issues that are specific to specific domains or applications that necessitate a more
concentrated analysis. Consequently, it is necessary to continue exploring privacy concerns
in these ML techniques to guarantee that all potential dangers are properly detected and
addressed. This will ensure that ML technologies are developed and deployed in a way that
respects individuals’ privacy rights while also minimizing the possible damages associated
with these technologies.
2.2 Differential Privacy
DP is a privacy protection method commonly used in different stages of the ML pipeline to
enhance the privacy of individuals. In this subsection, we will examine the concepts and
definitions, common DP mechanisms, and applications of DP in various ML techniques.
Notions and Definitions
Definition 1 ((ϵ, δ)-Differential Privacy [36]). A randomized algorithm $\mathcal{M}$ is said to be
$(\epsilon, \delta)$-differentially private if, for any two datasets $D_1$ and $D_2$ that differ in only one data
point, and any subset $S$ of the range of $\mathcal{M}$, the following holds:

$$\Pr[\mathcal{M}(D_1) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D_2) \in S] + \delta, \tag{2.1}$$
where ϵ and δ are privacy parameters, and S is any subset of the range of M. This
inequality ensures that the probability of observing a certain output of M on a dataset D1
is almost the same as the probability of observing the same output on a dataset D2 that
differs in only one data point, with the exception of a small amount of random noise
controlled by ϵ and δ. The parameter ϵ controls the strength of the privacy guarantee, with
lower values providing stronger privacy protection, while δ is a parameter that accounts for
the probability that the privacy guarantee is violated due to the randomness introduced by
the algorithm.
Definition 2 (L1-Sensitivity [39]). L1-sensitivity is a measure of how much the output of a
function changes when a single data point is added or removed from a dataset. It is defined
as the maximum absolute difference between the output of the function on two adjacent
datasets that differ in only one data point. Formally, given a function $f : \mathcal{D} \to \mathbb{R}^n$ that
maps datasets in domain $\mathcal{D}$ to vectors in $\mathbb{R}^n$, the L1-sensitivity of $f$ is defined as:

$$\Delta f = \max_{d \in \mathcal{D},\, d' \sim d} \left\| f(d) - f(d') \right\|_1, \tag{2.2}$$

where $d'$ is the neighboring dataset that differs from $d$ by a single data point, and $\|\cdot\|_1$
denotes the L1-norm. Intuitively, L1-sensitivity captures the largest change that can occur
in the output of f due to the presence or absence of a single data point. It is a fundamental
parameter in DP, as it determines the amount of noise that needs to be added to the output
of f to achieve a desired level of privacy protection.
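To ground the definition, consider a histogram count query over disjoint bins: adding or removing one record changes exactly one bin count by exactly 1, so $\Delta f = 1$. The following minimal Python sketch (our own illustration, not part of the original text) verifies this on one toy pair of neighboring datasets:

```python
import numpy as np

# Toy check of Eq. (2.2) for a histogram count query. Removing one
# record changes exactly one bin by exactly 1, so the L1 distance
# between neighboring outputs is 1; the maximum over all neighboring
# pairs (the L1-sensitivity) is therefore also 1.
bins = np.arange(0, 5)                  # bin edges 0..4 -> 4 disjoint bins
d = np.array([0.5, 1.5, 1.7, 3.2])      # a toy dataset of 4 records
d_prime = d[:-1]                        # neighboring dataset: one record removed

f_d, _ = np.histogram(d, bins=bins)
f_d_prime, _ = np.histogram(d_prime, bins=bins)

print(np.abs(f_d - f_d_prime).sum())    # -> 1
```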
Laplace Mechanism. Common methods for achieving DP include adding controlled
noise to data to protect individual privacy while aiming to maintain data utility. This
dissertation primarily examines the Laplace Mechanism, a widely used technique for
attaining DP.
In more detail, the Laplace mechanism [59] is a method for achieving DP by adding
random noise to the output of a query in a way that satisfies DP guarantees. Specifically,
given a function $f : \mathcal{D} \to \mathbb{R}$ that we want to compute on a dataset $D$, the Laplace
mechanism adds random noise to $f(D)$ according to the following formula:

$$f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right), \tag{2.3}$$

where $\mathrm{Lap}(\Delta f/\epsilon)$ is a random variable drawn from the Laplace distribution with mean 0
and scale parameter $\Delta f/\epsilon$, where $\Delta f$ is the sensitivity of the function $f$ and $\epsilon$ is the
privacy parameter that controls the amount of noise added. More formally, the Laplace
distribution is defined as:

$$\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b} \exp\!\left(-\frac{|x - \mu|}{b}\right), \tag{2.4}$$

where $\mu$ is the mean and $b$ is the scale parameter. In the case of the Laplace mechanism,
the mean is 0 and the scale parameter is $\Delta f/\epsilon$, so the Laplace distribution becomes:

$$\mathrm{Lap}(x \mid 0, \Delta f/\epsilon) = \frac{1}{2(\Delta f/\epsilon)} \exp\!\left(-\frac{|x|}{\Delta f/\epsilon}\right). \tag{2.5}$$

Adding Laplace noise to the output of $f(D)$ in this way ensures that the output is
differentially private with parameter $\epsilon$. The amount of noise added is proportional to the
sensitivity of the function $f$, with higher sensitivities resulting in more noise being added to
the output.
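As a concrete illustration, a minimal sketch of the mechanism is given below (our own example, assuming a count query with $\Delta f = 1$; NumPy's Laplace sampler draws the noise):

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with Laplace noise of scale sensitivity/epsilon (Eq. 2.3)."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A count query has L1-sensitivity 1: adding or removing one individual
# changes the count by at most 1, so the noise scale is 1/epsilon.
private_count = laplace_mechanism(true_value=1000, sensitivity=1.0, epsilon=0.1)
print(private_count)  # e.g., 1012.7 -- smaller epsilon means noisier output
```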
Although the Laplace mechanism is commonly used to achieve DP in ML, it has several
limitations [67, 46]. The amount of noise added to the data depends on the sensitivity of
the function being computed, which can be significant for some functions. This can lead to
a considerable loss of output accuracy, making it difficult to obtain meaningful results.
Moreover, the Laplace mechanism assumes that the data is continuous and unbounded,
which may not always hold for all datasets. Furthermore, the Laplace distribution employed
in the mechanism may not be optimal, as it assumes that the noise added to the data is
symmetric, which may not be the case in reality. Nevertheless, the Laplace mechanism
remains one of the most widely used techniques for achieving DP. For example, methods
have been devised in [104] and [121] for releasing counts on specific types of data, such as
time series. The authors of [154], [118], and [119] concentrate on releasing histograms, while
the authors of [150] and [29] present ways of reducing the worst-case error of a specified set
of count queries.
Chapter 3
Fairness
This section highlights Fairness, another critical aspect of responsible AI. Throughout the
chapter, it details and illustrates important concepts associated with fairness, such as bias,
the various philosophies of fairness, and fairness metrics.
3.1 Preliminaries
The concept of fairness in society has been a recurring study subject throughout history [6].
Although early discussions were mainly philosophical, the rise of data and ML in the past
decade has attracted tremendous attention to fairness in algorithms. As opposed to the
initial perception of models and algorithms being trustworthy, it was soon realized that
they could lead to severely unjust decisions, especially affecting individuals from
disadvantaged groups. Perhaps the most significant such discovery was revealed in an
article published by ProPublica in 2016, highlighting the importance of algorithmic fairness.
The article focuses on software named COMPAS [148], designed to determine the risk of a
person committing another crime and assist US judges in making release decisions. The
investigation found that COMPAS was biased against African Americans as it had a higher
rate of false positives for this group compared to Caucasians. This and numerous other
examples indicate the necessity to quantify and mitigate unfairness-related issues in ML.
3.2 Bias
The term "bias" in ML has a distinct meaning that is different from the typical
understanding of the term in social and news contexts [21]. Bias is seen as the root cause of
unfairness and is often tied to a specific term that indicates where in the process the data
is being distorted. Over time, many different types of bias have been introduced in the
literature, some of which are subcategories of others, leading to confusion in properly
defining each one.
3.3 Philosophies of Fairness in Context
In the domain of work and employment, principles of fairness and non-discrimination guide
the relationships among employees, employers, and labor unions. Two core fairness
principles, often identified as ‘Disparate Impact’ and ‘Disparate Treatment’, are observed in
this context. Disparate Treatment [162] acknowledges that unjust behaviors towards
individuals due to their protected attributes, such as race, are unacceptable. An instance
reflecting this principle in action could be prohibiting the exclusive skill examination of job
applicants based on their ethnic group affiliation. Disparate impact [107] pertains to
practices that inadvertently disadvantage a protected group, even though the policies
implemented by organizations appear neutral on the surface. This principle recognizes that
discrimination is not always direct, and it can affect individuals and groups in indirect ways.
A classic example includes policies that, while appearing neutral, disproportionally impact
members of a protected group in a negative manner [109].
In ML, fairness principles are implemented in various ways to uphold the
aforementioned principles for sensitive attributes. Notably, different organizations provide
guidelines on what constitutes sensitive attributes. The most commonly protected features
include race, gender, religion, and national origin.
3.4 Notions and Definitions
As opposed to privacy, where at least for statistical databases, there is a consensus on DP,
there does not exist such an agreement on a common notion for fairness. One suggested
guideline is to select the notion based on the underlying application. This section reviews
some of the most widely adopted fairness notions for supervised learning. In this context,
we use the terms notion and definition interchangeably. Also, deviation from a fairness notion is
referred to as the discrimination level. Discrimination is usually measured as the absolute
value of the difference in a given metric across groups. Moreover, we denote the set of
sensitive attributes by $A$, all observed attributes by $X$, latent (unobserved) attributes by $U$,
the true label to be predicted by $Y$, and finally, the predictor by $\hat{Y}$.
We begin our discussion of fairness notions with statistical parity, one of the primary
group-level fairness notions.
Definition 3. (Demographic or Statistical Parity [37, 19]). A predictor $\hat{Y}$ satisfies
demographic parity if:

$$P(\hat{Y} = 1 \mid A = 0) = P(\hat{Y} = 1 \mid A = 1). \tag{3.1}$$
Statistical parity dictates that regardless of an individual’s group, they should have an
equal chance of being assigned to a positive class. Figure 3.1 exemplifies statistical parity.
Consider two groups of male and female job applicants and an ML model that decides
whether a person should proceed for further evaluation in their application. Here, the
likelihood of moving ahead with male and female applicants is 5/10 and 7/10, respectively.
Hence, discrimination based on statistical parity is 20%. The notion of equal opportunity,
presented next, takes a step further and requires an equal true positive rate across groups.
Definition 4. (Equal Opportunity [37]). A predictor $\hat{Y}$ satisfies equal opportunity
with respect to protected attribute $A$ and outcome $Y$ if $\hat{Y}$ and $A$ are independent
conditional on $Y = 1$:

$$P(\hat{Y} = 1 \mid A = 0, Y = 1) = P(\hat{Y} = 1 \mid A = 1, Y = 1). \tag{3.2}$$
Figure 3.1: Computation of Discrimination Based on Different Metrics.
That means the true positive rate should be the same for both groups. Going back to
the example in Figure 3.1, the true positive rate for males and females is 3/6 and 5/7,
leading to a discrimination level of 21.4% based on equal opportunity. The next notion,
equalized odds, is an even stricter fairness notion, requiring equal true and false
positive rates across groups.
Definition 5. (Equalized Odds [37]). A predictor $\hat{Y}$ satisfies equalized odds with respect
to protected attribute $A$ and outcome $Y$ if:

$$P(\hat{Y} = 1 \mid A = 0, Y = y) = P(\hat{Y} = 1 \mid A = 1, Y = y), \quad y \in \{0, 1\}. \tag{3.3}$$
In the example, discrimination based on equalized odds is 38.1%, computed as the sum of
the true positive rate gap (21.4%) and the false positive rate gap ($|2/4 - 2/3| \approx 16.7\%$). As
can be seen, imposing stricter fairness notions intuitively results in a higher measured level
of discrimination.
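The worked example of Figure 3.1 can be reproduced in a few lines of Python. The sketch below is our own illustration: the counts are read off the example above, and the equalized-odds discrimination is taken as the sum of the true and false positive rate gaps, which matches the 38.1% figure quoted in the text:

```python
# Figure 3.1 example: 10 male and 10 female applicants. Males: 5 advance,
# of whom 3 of 6 qualified (TP) and 2 of 4 unqualified (FP). Females:
# 7 advance, of whom 5 of 7 qualified and 2 of 3 unqualified.
adv_m, adv_f = 5 / 10, 7 / 10   # rates of advancing to further evaluation
tpr_m, tpr_f = 3 / 6, 5 / 7     # true positive rates
fpr_m, fpr_f = 2 / 4, 2 / 3     # false positive rates

stat_parity = abs(adv_m - adv_f)              # 0.200 -> 20%
equal_opp = abs(tpr_m - tpr_f)                # 0.214 -> 21.4%
eq_odds = equal_opp + abs(fpr_m - fpr_f)      # 0.381 -> 38.1%

print(f"{stat_parity:.1%}  {equal_opp:.1%}  {eq_odds:.1%}")
```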
Definition 6 introduces the concept of calibration, which is a crucial idea borrowed from
ML. This notion ensures that the confidence scores produced by the model can be
interpreted as probabilities and is considered a group-level fairness notion.
Definition 6. (Calibration [49, 100]). An ML model is said to be calibrated if it produces
calibrated confidence scores. Formally, the outcome score $R$ is said to be calibrated if for all
scores $r$ in the support of $R$ the following holds:

$$P(Y = 1 \mid R = r) = r. \tag{3.4}$$
Calibration ensures that the set of all instances assigned a score value r has an r
fraction of positive instances among them. Note that the metric is defined on a group level,
and it does not mean that an individual who has a score of r corresponds to r probability of
a positive outcome. For example, given 10 people who are assigned a confidence score of 0.7,
in a well-calibrated model, we expect to have 7 individuals with positive labels among them.
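A simple empirical calibration check, sketched below under our own assumptions (binned scores and a finite sample; the function name is ours), compares the mean score in each bin with the observed fraction of positive labels; the two coincide for a perfectly calibrated model:

```python
import numpy as np

def calibration_gaps(scores, labels, n_bins=10):
    """Per-bin |mean score - fraction of positives|; zero (up to sampling
    noise) for a model calibrated in the sense of Eq. (3.4)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (scores >= lo) & (scores < hi)
        if in_bin.any():
            gaps[(round(lo, 1), round(hi, 1))] = abs(
                scores[in_bin].mean() - labels[in_bin].mean()
            )
    return gaps

# Ten individuals scored 0.7: a calibrated model should see ~7 positives.
scores = np.full(10, 0.7)
labels = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(calibration_gaps(scores, labels))  # {(0.7, 0.8): 0.0}
```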
So far, the fairness definitions discussed were all focused on group-level fairness. In the
following, two of the common notions to achieve fairness at an individual level are presented.
Definition 7. (Counterfactual Fairness [69]). Given a causal model $(U, V, F)$, where $U$,
$V$, and $F$ represent the set of latent (unobserved) background variables, the set of
observable variables, and a set of functions defining the mapping $U \cup V \to V$, respectively,
a predictor $\hat{Y}$ is considered counterfactually fair if, under any context $X = x$ and $A = a$,
the following equation holds:

$$P(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a) = P(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a). \tag{3.5}$$

This holds for all $y$ and for any value $a'$ attainable by $A$. Here, $A$, $X$, and $\hat{Y}$ represent
the set of sensitive attributes, the remaining attributes, and the decision output, respectively. In
other words, the model's predictions for a person should not change in a counterfactual
world in which the person's sensitive features are different.
Definition 8. (Individual Fairness by Dwork et al. [37]). For a mechanism $M$ mapping $u$
in the input space $\mathcal{X}$ to a value $y$ in the output space $\mathcal{Y}$, individual fairness is satisfied when
for any $u, v \in \mathcal{X}$:

$$d_{\mathcal{X}}(u, v) \ge d_{\mathcal{Y}}(M(u), M(v)), \tag{3.6}$$

where $d_{\mathcal{X}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ and $d_{\mathcal{Y}} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$.
To illustrate, let's consider the scenario of classification, where the classifier's predictor
$\hat{Y}$ serves as the mapping mechanism. In the context of individual fairness, the fundamental
idea is that two individuals who are alike in relevant ways should receive comparable
outcomes. To operationalize this concept, we rely on two crucial distance metrics: (1) a
similarity distance metric $d_{\mathcal{X}}$ that gauges how similar two individuals are to each other, and
(2) a distance metric $d_{\mathcal{Y}}$ that quantifies the disparity between the distributions of outcomes.
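As a minimal sketch (our own illustration, with hypothetical distance metrics), the Lipschitz-style condition of Eq. (3.6) can be audited directly on pairs of individuals:

```python
from itertools import combinations

def individually_fair(M, individuals, d_X, d_Y):
    """Check Eq. (3.6) on all pairs: similar inputs must receive
    similarly distributed outputs."""
    return all(
        d_X(u, v) >= d_Y(M(u), M(v))
        for u, v in combinations(individuals, 2)
    )

# Hypothetical example: individuals are single scores in [0, 1], the
# mechanism halves them, and both metrics are absolute differences.
M = lambda x: 0.5 * x
d = lambda a, b: abs(a - b)
print(individually_fair(M, [0.1, 0.4, 0.9], d, d))  # True: M is 1-Lipschitz
```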
Chapter 4
State-of-the-art
This chapter examines the current research and cutting-edge methods, pointing out the
existing gaps and the advancements achieved.¹

¹This chapter is based on the publication in [126].
4.1 Privacy
4.1.1 DP Publication of Location Histograms
Frequency Matrices (FMs), also known as Location Histograms when they are based on
geospatial data, are types of matrices where each element represents a statistic related to a
distinct area on a map, and are predominantly utilized as a tool for decision-making and
planning. Prior works on private publication of frequency matrices can be classified into
three categories: data-independent, partially data-dependent, and data-dependent
algorithms. The algorithms in the first category are independent of the underlying dataset.
The partially data-dependent algorithms are the category of algorithms where some generic
information, such as the total count of the matrix, is used to generate the private FM, but
no consideration is made for the data distribution. The algorithms in the last category
take the distribution of data points into consideration to improve the utility. The
rationale behind such a categorization is the fact that every step in the generation of DP
data must be done in a DP way, including processes such as partitioning the matrix based on
the underlying data distribution and statistics. Most algorithms are developed to address
only the publication of 1D and 2D FMs.
In the category of data-independent approaches, two baseline algorithms that stand out
are called singular and identity. The singular algorithm [55] considers the frequency matrix
as a single partition and adds Laplace noise to the total count. The queries are answered
based on the sanitized total count only, considering the assumption of data uniformity. The
identity algorithm [39], on the other hand, adds Laplace noise to each entry of the frequency
matrix. The number of partitions in this algorithm is equal to the total number of entries.
Another approach, referred to as the Privlet algorithm [150], enhances the performance of
the identity algorithm by transforming the frequency matrix based on wavelets and by
adding noise in the new domain. Then, the algorithm converts back to the noisy matrix and
releases the DP counts. The authors in [29] build a quadtree on top of the FM: a tree-based
spatial indexing structure that splits each node into four equal quarters, regardless of data
placement. The so-called binning or partitioning of space without observing the histograms
is studied in [28]. The authors consider the amount of overlap between bins and propose an
algorithm called 'varywidth' that provides improved performance in terms of the trade-off
between the spatial precision and the accumulated variance over differentially private
queries. The use of summaries for private publication of histograms is explored in [30]. The
authors show it is possible to reduce the two-step approach of generating private summaries,
in which first the private histogram is generated and then the summaries are released, to a
one-step approach. The one-step method spares both the data owner and the data user
from a large computational overhead.
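A minimal sketch of the two baselines is given below (our own illustration, assuming each user contributes to a single cell, so both the total count and each individual entry have sensitivity 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def singular(fm, epsilon):
    """Singular baseline [55]: perturb only the total count and answer
    queries under a uniformity assumption (one partition)."""
    noisy_total = fm.sum() + rng.laplace(0, 1.0 / epsilon)
    return np.full(fm.shape, noisy_total / fm.size)

def identity(fm, epsilon):
    """Identity baseline [39]: perturb every cell independently (one
    partition per entry of the frequency matrix)."""
    return fm + rng.laplace(0, 1.0 / epsilon, size=fm.shape)

fm = np.array([[120.0, 3.0], [0.0, 40.0]])  # a toy 2x2 frequency matrix
print(singular(fm, epsilon=0.5))
print(identity(fm, epsilon=0.5))
```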
In contrast to the data independent algorithms, data dependent approaches exploit the
distribution of data in the FM to answer queries with higher accuracy. General purpose
mechanisms [75, 74] and their workload-aware counterpart DAWA [73] operate over a
discrete 1D domain; however, they can be applied to the 2D domain by dimensional
reduction transformations such as Hilbert curves [55]. Unfortunately, dimensionality
reduction can prevent range queries from being answered accurately, and also increases
computational complexity. This significantly limits their practicality, particularly for
higher-dimensional data. Data-aware tree-based algorithms such as k-d-trees [152] allocate
a portion of the budget to partitioning, and generate split points based on density. Hybrid
approaches between data-independent and data-dependent algorithms have also been
proposed, e.g., UG and AG [102]. We refer to these approaches as partially data-dependent.
Only the sanitized total count of the FM is used in the partitioning process. The UG
algorithm and its extension [102] sanitize the total count of the FM and use it to alter the
granularity of the FM such that the utility of the published private FM is improved. The MKM
approach proposed in [71] provides an alternative formula to partition the FM considering its
dimensionality. As is the case in UG, the formula only takes as input the total count of the
frequency matrix and determines the granularity of FM based on the sanitized total count.
In some cases, such approaches have been shown to provide superior performance to more
complex methods [55].
There is also prior work regarding the storage, processing, and compression of
histograms, but without considerations for privacy. The authors in [64] focus on lowering
the computational complexity of matrix multiplication and storage. The proposed approach
generates an execution plan for the multiplication of dense and sparse matrices. A cost
model is also proposed to understand the sparsity of matrices and the estimation of density.
The execution plan aims to minimize the overall cost. An adaptive tile matrix
representation is proposed in [65] for large matrix multiplication. An operator called
ATMULT, capable of shared-memory parallel matrix multiplication, is proposed
for dynamic tile-granular optimizations based on density estimation.
work in [33] studies the problem of density estimation for higher dimensional histograms.
The main idea is to estimate the distribution of data for a given set of samples. The
algorithm provides near-optimal sample complexity, i.e., close to the information-theoretic
limit, and runs in polynomial time.
Despite attempts to enhance data utility while maintaining DP, existing methods often
result in significant information loss, especially when the data distribution is sparse.
Chapter 7 and Chapter 5 are dedicated to improving data utility and bridging this divide.
4.1.2 DP Publication of Time Series
The existing body of work on DP publication of time series falls into two primary
classifications: data transformation and correlation analysis. In the former category, the
main strategy involves converting the data into an alternative domain that exhibits lower
sensitivity or provides a condensed representation of the time series. After sanitization in
this new domain, an inverse function is used to revert the data back to its original form for
publication. Notable methods in this category include the Fourier transformation [104, 72]
and the discrete Haar wavelet transform [81]. The latter category focuses on enhancing the
utility of DP time series publications through improved leverage of inter-data correlations.
This includes the concept of pufferfish privacy, which employs a Bayesian Network to model
data correlations as discussed in [135], the use of the Kalman Filter to reduce utility loss as
explored in [45], and the adoption of a first-order autoregressive process for correlation
modeling as presented in [161].
Approaches for the publication of higher-dimensional FMs have also been used to
publish time series. One of the well-known algorithms in this class is the High-Dimensional
Matrix Mechanism (HDMM) [84], which represents queries and data as vectors and uses
sophisticated optimization and inference techniques to answer them. DPCube [151] searches
for dense sub-cubes to release privately. Some of the privacy budget is used to obtain noisy
counts over regular partitioning, which is then refined to a standard KD-tree. Fresh noisy
counts for the partitions are obtained with the remaining budget, and a final inference step
resolves inconsistencies between the two sets of counts.
The techniques for publishing and using time series data under DP are still at an early
stage, largely because of the substantial privacy budget needed to address temporal
correlations. Chapter 8 tackles this problem by incorporating spatial information to
enhance the utility of DP time series.
4.2 Fairness
4.2.1 Fairness in ML
There exist two broad categories of fairness notions [86, 22]: individual fairness and group
fairness. In group fairness, individuals are divided into groups according to a protected
attribute, and a decision is said to be fair if it leads to a desired statistical measure across
groups. Some prominent group fairness metrics are calibration [100], statistical
parity [69, 37], equalized odds [53], treatment equality [12], and test fairness [27]. Individual
fairness notions focus on treating similar individuals the same way. Similarity may be
defined with respect to a particular task [37, 61].
Unfairness mitigation techniques can be categorized into three broad groups:
pre-processing, in-processing, and post-processing. Pre-processing algorithms achieve
fairness by focusing on the classifier’s input data. Some well-known techniques include
suppression of sensitive attributes, change of labels, reweighting, representation learning,
and sampling [62]. In-processing techniques achieve fairness during training by adding new
terms to the loss function [63] or including more constraints in the optimization.
Post-processing techniques sacrifice the utility of output confidence scores and align them
with the fairness objective [99].
4.2.2 Fairness in Spatial Domain
Fairness and justice concepts in geographical social studies have been a subject of
research since as early as the 1990s [54]. With the rise of ML and its influence on
decision-making with geospatial data, this issue has gained increased importance.
Neighborhoods or individual locations frequently serve as decision-making factors in entities
such as government agencies and banks. This context can lead to unfairness in a variety of
tasks, such as mortgage lending [70], job recruitment [44], school admissions [10], and crime
risk prediction [145].
A case study on American Census datasets by Ghodsi et al. [50] underlines the importance
of context for fairness, illustrating how spatial distribution can impact a model's
fairness-related performance. According to [145], recidivism prediction models built with
data from one location often underperform when applied to another location. The study
in [124] explores individual spatial fairness within two contexts: (i) distance-based fairness,
motivated by nearest neighbor semantics, such as selecting drivers in ride-sharing apps, and
(ii) zone-based fairness, where abrupt changes in neighborhood boundaries can bias the
classifier’s outcome. The work in [132] formulates the issue of fairness-aware range queries,
defining a fair query as one most similar to the user’s own query. Studies in [57, 153]
consider crop monitoring in palm oil plantations, aiming to incorporate a fairness criterion
primarily based on the F1 score during training. Additionally, the authors propose SPAD
(space as a distribution), a method to formulate the spatial fairness of learning models in
continuous domains. The authors in [108] define spatial fairness as the statistical
independence of outcomes from locations and propose an approach to audit spatial fairness.
The auditing is conducted by exploring the distribution of outcomes inside and outside of a
given region and how similar they are. Weydemann et al. [149] measure fairness in
next-location recommendation systems in which the historical movement pattern of users is
utilized to make future location recommendations. The proposed framework by the authors
first captures the probability that the location recommender suggests locations based on
race groups and then aims to adjust the distribution for fairer outcomes.
The authors in [106] propose a loss function designed for individual fairness in social
media and location-based advertising. Pujol et al. [101] expose the disparate impact of
differential privacy on various neighborhoods. There have been numerous attempts to apply
fairness notions to clustering data points in Cartesian space. The notion described in [66]
views clustering as fair if the average distance to points in its own cluster is not larger than
the average distance to any other cluster’s points. The authors in [83] concentrate on
defining individual fairness for k-median and k-means algorithms, suggesting clustering is
individually fair if each point expects to have a cluster center within a certain radius.
The concept and definition of fairness in spatio-temporal data remain a major challenge
that has been overlooked in existing literature, primarily due to its inherent complexities.
In Chapter 9 and Chapter 6, we aim to address this oversight and introduce mechanisms
designed to enhance fairness in the processing of spatio-temporal data.
Chapter 5
HTF: Homogeneous Tree Framework for Differentially-Private
Release of Location Data
Location-based mobile applications are widespread across various fields, including
transportation, urban development, and healthcare. Important use cases for location data
rely on statistical queries, e.g., identifying hotspots where users work and travel. Such
queries can be answered efficiently by building histograms. However, precise histograms can
expose sensitive details about individual users. DP is a mature and widely-adopted
protection model, but most approaches for DP-compliant histograms work in a
data-independent fashion, leading to poor accuracy. In this chapter, we identify density
homogeneity as a main factor driving the accuracy of DP-compliant histograms, and we
build a data structure that splits the space such that data density is homogeneous within
each resulting partition. We show through extensive experiments on large-scale real-world
data that the proposed approach achieves superior accuracy compared to existing
approaches. This chapter is based on the publications in [119] and [117].¹
5.1 Introduction
Statistical analysis of location data, typically collected by mobile apps, helps researchers
and practitioners understand patterns within the data, which in turn can be used in various
domains such as transportation, urban planning and public health. At the same time,
significant privacy concerns arise when locations are directly accessed. Sensitive details
about individuals, such as political or religious affiliations, alternative lifestyle habits, etc.,
can be derived from users’ whereabouts. Therefore, it is essential to account for user
privacy and protect location data.
Differential privacy (DP) [39] is a well-established protection model for statistical data
processing. DP allows answering aggregate queries (e.g., count, sum) while hiding the
presence of any specific individual within the data. In other words, the query results do not
permit an adversary to infer with significant probability whether a certain individual’s
record is present in the dataset or not. DP achieves protection by injecting random noise in
the query results according to well-established rules. It is a powerful semantic model
¹ Shaham, S., Ghinita, G., Ahuja, R., Krumm, J. and Shahabi, C., 2021, November. HTF: homogeneous
tree framework for differentially private release of location data. In Proceedings of the 29th International
Conference on Advances in Geographic Information Systems (pp. 184-194).
Shaham, S., Ghinita, G., Ahuja, R., Krumm, J. and Shahabi, C., 2023. HTF: Homogeneous Tree Framework
for Differentially Private Release of Large Geospatial Datasets with Self-tuning Structure Height. ACM
Transactions on Spatial Algorithms and Systems, 9(4), pp.1-30.
adopted by both government entities (e.g., Census Bureau) as well as major industry
players.
In the location domain, existing DP-based approaches build a spatial index structure,
and perturb index node counts using random noise. Subsequently, queries are answered
based on the noisy node counts. Building DP-compliant index structures has several
benefits: first, querying indexes is a natural approach for most existing spatial processing
techniques; second, using an index helps quantify and limit the amount of disclosure, which
becomes infeasible if one allows arbitrary queries on top of the exact data; third, query
efficiency is improved. Due to large amounts of background knowledge data available to
adversaries (e.g., public maps, satellite imagery), information leakage may occur both from
query answers, as well as from the properties of the indexing structure itself. To deal with
the structure leakage, initial approaches used data-independent index structures, such as
quad-trees, binary space partitioning trees, or uniform grids (UG). No structural leakage
occurred, and the protection techniques focused on improving the signal-to-noise ratio in
query answers. However, such techniques perform poorly when the data distribution is
skewed.
More recently, data-dependent approaches emerged, such as adaptive grids (AG), or
KD-tree based approaches [102, 60]. AG overcomes the rigidity of UG by providing a
two-level grid, where the first level has fixed granularity, and the second uses a granularity
derived from the coarse results obtained from the first level. While it achieves improvement,
it is still a rather blunt tool to account for high variability in dataset density, which is quite
typical of real-life datasets. Other more sophisticated approaches tried to build KD-trees or
R-trees in a DP-compliant way, but to do so they used DP mechanisms such as the
exponential mechanism (EM) (discussed in Section 2) which are difficult to tune and may
introduce significant errors in the data. In fact, the work in [102] shows that
data-dependent structures based on EM fail to outperform AG.
Our proposed Homogeneous Tree Framework (HTF) addresses the problem of
DP-compliant location protection using a data-dependent approach that focuses on building
index structures with homogeneous intra-node density. Our key observation is that density
homogeneity is the main factor influencing the signal-to-noise ratio for DP-compliant
spatial queries. Rather than using complex mechanisms like EM which have high sensitivity,
we derive theoretical results that can directly link index structure construction with
intra-node data density based on the lower-sensitivity Laplace mechanism. This novel
approach allows us to build effective index structures capable of delivering low query error
without excessive consumption of privacy budget. HTF is custom-tailored for capturing
areas of homogeneous density in the dataset, which leads to significant accuracy gains. Our
specific contributions are:
• We identify data homogeneity as the main factor influencing query accuracy in
DP-compliant spatial data structures;
Figure 5.1: System Model for Private Location Histograms.
• We propose a custom technique for homogeneity-driven DP-compliant space
partitioning based on the Laplace mechanism, and we perform an in-depth analysis of
its sensitivity;
• We derive effective DP budget allocation strategies to balance the noise added during
the building of the structure with that used for releasing query answers;
• We propose a set of heuristics to automatically tune data structure parameters based
on data properties, with the objective of minimizing overall error in query answering;
• We perform an extensive empirical evaluation showing that HTF outperforms existing
state-of-the-art on real and synthetic datasets under a broad range of privacy
requirements.
5.2 Preliminaries
Private publication of location histograms follows the two-party system model shown in
Fig. 5.1. The data owner/curator first builds an exact histogram with the distribution of
locations on the map. Non-trusted users/analysts are interested in learning the population
distribution over different areas, and perform statistical queries. The goal of the curator is
to publish the location histogram without the privacy of any individual being compromised.
To this end, the exact histogram undergoes a sanitizing process according to DP to
generate a private histogram. In our proposed method, a tree-based algorithm is applied for
protection, and the tree’s nodes, representing a private histogram of the map, are released
to the public. Analysts/researchers ask unlimited count queries that are answered from the
private histogram. Furthermore, they may download the whole private histogram, and the
protection method remains strong enough to protect the identity of individuals in the
database.
5.2.1 Problem Formulation
Consider a two-dimensional location dataset D discretized to an arbitrarily fine N × M
grid. Each point is represented by its corresponding rectangular cell in the grid. We study
the problem of releasing DP-compliant histograms to answer count queries as accurately as
possible. Cell counts are modeled via an N × M frequency matrix, in which the entry in the
i-th row and j-th column represents the number of records located inside cell (i, j) of the grid.
A DP histogram is generated based on a non-overlapping partitioning of the frequency
matrix by applying methods to preserve ϵ-DP. The DP histogram consists of the boundary
of partitions and their noisy counts, where each partition consists of a group of cells.
Let us denote the total count of a partition with q cells by c and its noisy count by c̄.
There are two sources of error in answering a query. The first is referred to as noise error,
which is due to the Laplace noise added to the partition count. The second source of error is
referred to as uniformity error and arises when a query has partial overlap with a partition.
An assumption of uniformity is made within the partition, and the answer per cell is
calculated as c̄/q.
For example, consider the 3 × 3 grid shown in Fig. 5.2a, where each count represents the
number of data points in the corresponding cell. The cells are grouped in four partitions C1,
C2, C3, and C4, containing 0, 12, 4, and 2 data points, respectively. Independent noise
with the same magnitude is added to each partition’s count denoted by n1, n2, n3, and n4,
and released to the public as a DP histogram. The result of the query shown by the dashed
rectangle can be calculated as (12 + n2)/4 + (2 + n4)/2.
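To make the two error sources concrete, here is a minimal Python sketch (ours, not from the publication; the per-partition cell counts for C1 and C3 are assumptions for illustration) that releases the four partition counts with the Laplace mechanism and answers the dashed query under the uniformity assumption:

import numpy as np

rng = np.random.default_rng(0)
eps = 0.5  # illustrative budget for the count release

# Partition counts from Fig. 5.2a and their number of cells q; the cell
# counts of C1 and C3 are assumptions for illustration
counts = {"C1": 0, "C2": 12, "C3": 4, "C4": 2}
q = {"C1": 2, "C2": 4, "C3": 1, "C4": 2}

# Noise error: Laplace perturbation of each partition count (sensitivity 1)
noisy = {p: c + rng.laplace(scale=1.0 / eps) for p, c in counts.items()}

# Uniformity error: the dashed query covers 1 of C2's 4 cells and
# 1 of C4's 2 cells, so each partition contributes noisy_count / q per cell
answer = noisy["C2"] / q["C2"] + noisy["C4"] / q["C4"]
print(f"noisy query answer: {answer:.2f}")   # true answer is 12/4 + 2/2 = 4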
Problem 1. Generate a DP histogram of dataset D such that the mean relative error (MRE)
over queries is minimized, where for a query q with true count c and noisy count c̄ the
relative error is calculated as

$$\mathrm{RE}(q) = \frac{|c - \bar{c}|}{c} \times 100 \qquad (5.1)$$
Figure 5.2: Example of HTF Partitioning (Panels (a) and (b)). Dashed Rectangles Represent the Query.
In the past, several approaches have been developed for Problem 1. Still, current
solutions have poor accuracy, which limits their practicality. Some methods tend to perform
better when applied to specific datasets (e.g., uniform) and quite poorly when applied to
others. Limitations of existing work have been thoroughly evaluated in [55].
5.3 Homogeneous-Tree Framework
Our proposed approach relies on two key observations to reduce the noise error and
uniformity error. To address noise error, one needs to carefully calibrate the sensitivity of
each operation performed, in order to reduce the magnitude of required noise. We achieve
this objective by carefully controlling the depth of the indexing structure. To control the
impact of uniformity error, we guide our structure-construction algorithm such that each
resulting partition (i.e., internal node or leaf node) has a homogeneous data distribution
within its boundaries.
Homogeneity ensures that uniformity error is minimized, since a query that does not
perfectly align with the boundaries of an internal/leaf node is answered by scaling the count
within that node in proportion with the overlap between the query and the node. None of
the existing works on DP-compliant data structures has directly addressed homogeneity.
Furthermore, conventional spatial indexing structures (designed for non-private data access)
are typically designed to optimize criteria other than homogeneity (e.g., reduce node area or
perimeter, control the data count balance across nodes). As a result, existing approaches
that use such structures underperform when DP-compliant noise is present.
We propose a Homogeneous-Tree Framework (HTF) which builds a customized spatial
index structure specifically designed for DP-compliant releases. We address directly aspects
such as selection of structure height, a homogeneity-driven node split strategy, and careful
tuning of privacy budget for each structure level. Our proposed data structure shares
similarities with KD-trees, due to the specific requirements of DP: namely, (1) nodes should
not overlap, since that would result in increased sensitivity, and (2) the leaf set should cover
the entire data domain, such that an adversary cannot exclude specific areas by inspecting
node boundaries. However, as shown in previous work, using KD-trees directly for DP
releases leads to poor accuracy [29, 55].
Similar to KD-trees, HTF divides a node into two sub-regions across a split dimension,
which is alternated through successive levels of the tree. The root node covers the whole
dataspace. Figure 5.2b provides an example of a non-private simplified version of the
proposed HTF construction applied on a 3 × 3 grid (frequency matrix). HTF consists of
three steps:
(A) Space partitioning aims to find an enhanced partitioning of the map such that the
accuracy of the private histogram is maximized. HTF performs heuristic partitioning based
on a homogeneity metric we define. At every split, we choose the coordinate that results in
the highest value of the homogeneity metric. For example, in the running example
(Fig. 5.2b) node B1 is split into C1 and C2, which are homogeneous partitions. However,
the metric evaluation is not straightforward in the private case, as metric computation for
each candidate split point consumes privacy budget. We use the Laplace mechanism to
determine an advantageous split point without consuming large amounts of budget. As part
of HTF, a search mechanism is used to select plausible candidates for evaluation and find a
near-optimal split position. The total privacy budget allocated for the private partitioning
is denoted by ϵprt.
(B) Data sanitization starts by traversing the tree generated in the partitioning step. At
each node, a certain amount of budget is used to perturb the node count using the Laplace
mechanism. Based on the sanitized count, HTF evaluates the stop condition (i.e., whether
to follow the downstream path from that node or release it as is), which is an important
aspect in building private data structures. The private evaluation of stop conditions enables
HTF to avoid over- or under-partitioning of the space, and preserve good accuracy.
Revisiting the example in Fig. 5.2b, suppose that we do not want to further partition the
space when the number of data points in a node is less than 7. Once HTF reaches node B2,
the actual node count (6) is noise-perturbed. The sanitized count may fall below 7,
leading to pruning at B2 and stopping further partitioning.
Finally, the tree’s leaf set (i.e., sanitized count of each leaf node) is released to the public.
The total budget used for data sanitization is denoted by ϵdata.
(C) Height estimation is another key HTF step. Tree height is an important
factor in improving accuracy, as it influences the budget allocated at each index level. HTF
dedicates a relatively small amount of budget (ϵheight) to determine an appropriate height.
The total budget consumption of HTF (ϵtot) is the sum of the budgets used in each of the
three steps:

$$\epsilon_{tot} = \epsilon_{prt} + \epsilon_{data} + \epsilon_{height} \qquad (5.2)$$
The DP composition rules in the case of HTF apply as follows:
• Sequential composition: the sum of the budgets used for node splits along every tree
path adds up to the total budget available for partitioning.
• Parallel composition: the budget allocated for partitioning nodes at the same level is
independent, since nodes at the same level have non-overlapping extents.
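As a toy numerical illustration of these two rules (ours; all budget values are assumed), the accounting for a binary tree of height h can be checked as follows:

# Toy budget accounting for a binary partitioning tree (illustrative values)
eps_tot, eps_height = 0.5, 1e-4
h = 8                       # tree height
eps_prt_level = 5e-4        # split budget spent at each level

# Sequential composition: split budgets along one root-leaf path add up
eps_prt = eps_prt_level * h

# Parallel composition: the 2**i nodes at level i cover disjoint regions,
# so they all reuse the same eps_prt_level rather than adding to the total
eps_data = eps_tot - eps_prt - eps_height     # remainder for counts, Eq. (5.2)
assert abs(eps_prt + eps_data + eps_height - eps_tot) < 1e-12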
5.4 Technical Approach
Section 5.4.1 introduces the split objective function used in HTF, and provides its
sensitivity analysis. Section 5.4.2 focuses on HTF index structure construction.
Section 5.4.3 presents the data perturbation algorithm used to protect leaf node counts.
5.4.1 Homogeneity-based Partitioning
Previous approaches that used KD-tree variations for DP-compliant indexes preserved the
original split heuristics of the KD-tree: namely, node splits were performed on either median
or average values of the enclosed data points. To preserve DP, the split positions were
computed using the exponential mechanism (Section 2) which computes a merit function
for each candidate split. However, such an approach results in poor query accuracy [55].
We propose homogeneity as the key factor for guiding splits in the HTF index structure.
This decision is based on the observation that if all data points are uniformly distributed
within a node, then the uniformity error that results when intersecting that node with the
query range is minimized. At each index node split, we aim to obtain two new nodes with a
high degree of intra-node density homogeneity. Of course, since the decision is
data-dependent, the split point must be computed in a DP-compliant fashion.
For a given node of the tree, suppose that the corresponding partition covers U × V
cells of the N × M grid (i.e., frequency matrix), in which the count of data points located
in its i-th row and j-th column is denoted by c_ij. Without loss of generality, we discuss the
partitioning method w.r.t. the horizontal axis (i.e., rows). The aim is to find an index k
which groups rows 1 to k into one node and rows k + 1 to U into another, such that
homogeneity is maximized within each of the resulting nodes (we also refer to the resulting
nodes as clusters). We emphasize that the input grid abstraction is used in order to obtain
a finite set of candidate split points. This is different from alternate approaches that use
grids to obtain DP-compliant releases. Furthermore, the frequency matrix can be arbitrarily
fine-grained, so discretization does not impose a significant constraint.
The proposed split objective function is formally defined as:

$$o_k = \sum_{i=1}^{k} \sum_{j=1}^{V} |c_{ij} - \mu_1| + \sum_{i=k+1}^{U} \sum_{j=1}^{V} |c_{ij} - \mu_2|, \qquad (5.3)$$

where

$$\mu_1 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{V} c_{ij}}{k \times V}, \qquad \mu_2 = \frac{\sum_{i=k+1}^{U} \sum_{j=1}^{V} c_{ij}}{(U - k) \times V}. \qquad (5.4)$$

The optimal index k* minimizes the objective function:

$$k^{*} = \arg\min_{k}\ o_k \qquad (5.5)$$
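As a concrete illustration of Eqs. (5.3)-(5.5), the following sketch (ours, not from the publication) evaluates the non-private objective for every horizontal split candidate of a small frequency matrix; the zero-objective split separates the top row, matching option (i) in the example below:

import numpy as np

def split_objective(F, k):
    # o_k per Eq. (5.3): absolute deviations from each cluster's mean
    top, bottom = F[:k], F[k:]
    return np.abs(top - top.mean()).sum() + np.abs(bottom - bottom.mean()).sum()

# Rows of node B1 in Fig. 5.2b: [0,0] on top of two [3,3] rows
F = np.array([[0.0, 0.0], [3.0, 3.0], [3.0, 3.0]])
objectives = {k: split_objective(F, k) for k in range(1, F.shape[0])}
print(objectives)                              # {1: 0.0, 2: 6.0}
k_star = min(objectives, key=objectives.get)   # Eq. (5.5)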
Consider the example in Figure 5.2b and the partitioning conducted for node B1. There
exist three possible ways to split the rows of the frequency matrix: (i) separate the top row
of cells, resulting in clusters {[0,0]} and {[3,3],[3,3]} and yielding an objective value of zero
in Eq. (5.3); (ii) separate the bottom row of cells, resulting in clusters {[0,0],[3,3]} and
{[3,3]} and yielding an objective value of 6; or (iii) perform no division, yielding an
objective value of 8. Therefore, the proposed algorithm selects the first option (k* = 2),
generating two nodes C1 and C2.
Note that the value of k* is not private, since individual location data were used in the
process of calculating the optimal index. Hence, a DP mechanism is required to preserve
privacy, and we need to assess the sensitivity of the split selection, driven by the maximum
change in the objective function that can occur when adding or removing a single data point.
The sensitivity calculation is not trivial, since a single data point can cause the optimal
split to shift to a position far from the non-private value. Another challenge is that the
exponential mechanism, commonly used in the literature to select candidates from a set based
on a cost function, tends to have high sensitivity, resulting in low accuracy.
5.4.1.1 Baseline Split Point Selection
We propose a DP-compliant, homogeneity-driven split point selection technique based on
the Laplace mechanism. As before, consider the U × V frequency matrix of a given node and a
horizontal-dimension split. Denote by o_k the objective function for split coordinate k
among the U candidates. There are U possible outputs O = (o_1, o_2, ..., o_U), one for each
split candidate. In a non-private setting, the index corresponding to the minimum o_i value
would be chosen as the optimal division. To preserve DP, given that the partitioning budget
per computation is ϵ″prt, we add independent Laplace noise to each o_i, and then select the
optimal value among all noisy outputs:

$$\bar{O} = (\bar{o}_1, \bar{o}_2, \ldots, \bar{o}_U) = O + \mathrm{Lap}(2/\epsilon''_{prt}), \qquad (5.6)$$

where Lap(2/ϵ″prt) denotes a tuple of U independent samples of Laplace noise. Note that
since the grid is fixed, enumerating split candidates as cell coordinates is data-independent,
hence it does not incur disclosure risk. The Laplace noise added to each o_i is calibrated
according to a sensitivity of 2, as proved in Theorem 1:
Theorem 1 (Sensitivity of Partitioning). The sensitivity of the cost function o_k for any given
horizontal or vertical index k is 2.

Proof. In the calculation of the objective function for a given index k, adding or removing an
individual data point affects only one cell and the corresponding cluster. The objective
function for split point k can be written as

$$o_k = \sum_{i=1}^{k} \sum_{j=1}^{V} |c_{ij} - \mu_1| + \sum_{i=k+1}^{U} \sum_{j=1}^{V} |c_{ij} - \mu_2|. \qquad (5.7)$$

The modified objective function value following the addition of a single record to an arbitrary
cell c_xy can be represented as

$$o'_k = \sum_{i=1}^{k} \sum_{j=1}^{V} |c'_{ij} - \mu'_1| + \sum_{i=k+1}^{U} \sum_{j=1}^{V} |c'_{ij} - \mu'_2|. \qquad (5.8)$$

Without loss of generality, assume that the additional record is located in the first
cluster, which results in µ′_1 = µ_1 + 1/(kV), µ′_2 = µ_2, and c′_ij being equal to c_ij for all
possible i and j except for c_xy, where we have c′_xy = c_xy + 1. Therefore, the sensitivity of
the objective function is bounded by 2:

$$\Delta o_k = |o_k - o'_k| \le \frac{2(kV - 1)}{kV} \le 2 \qquad (5.9)$$

Eq. (5.9) is derived using the reverse triangle inequality:

$$\Big|\,|c_{ij} - \mu_1 - \tfrac{1}{kV}| - |c_{ij} - \mu_1|\,\Big| \le \frac{1}{kV} \qquad \forall \{ij \mid i \in \{1, \ldots, k\} \wedge j \in \{1, \ldots, V\},\ ij \ne xy\} \qquad (5.10)$$

and

$$\Big|\,|c_{xy} + 1 - \mu_1 - \tfrac{1}{kV}| - |c_{xy} - \mu_1|\,\Big| \le \frac{kV - 1}{kV}. \qquad (5.11)$$

Similarly, the sensitivity upper bound corresponding to an individual record's removal can
be shown to be 2. ∎
We refer to the above approach as the baseline approach. One challenge with the
baseline is that the calculation of noise is performed separately for each candidate split
point, and since each computation depends on all data points within the parent node, the
budget consumption adds up according to sequential composition. This means that the
calculation of each individual split candidate o_i may receive only 1/U of the budget
available for that level. For large values of U, the privacy budget per computation becomes
too small, decreasing accuracy. This leads to an interesting trade-off between the number of
split point candidates evaluated and the accuracy of the entire release. On one hand,
increasing the number of candidates leads to a higher likelihood of including the optimal
split coordinate in the set O; on the other hand, more noise is added to each candidate's
objective function output, leading to the selection of a sub-optimal candidate. Next, we
propose an optimization which finds a good compromise between the number of candidates
and the privacy budget per candidate.
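For concreteness, a minimal sketch of the baseline selection (our own illustration, not code from the publication; the Poisson-distributed frequency matrix is an arbitrary stand-in) is shown below. With the interior candidates sharing the level budget, the per-candidate noise is large, which is exactly the weakness the next subsection addresses:

import numpy as np

rng = np.random.default_rng(7)

def split_objective(F, k):
    # o_k per Eq. (5.3), horizontal split into rows [0, k) and [k, U)
    top, bottom = F[:k], F[k:]
    return np.abs(top - top.mean()).sum() + np.abs(bottom - bottom.mean()).sum()

def baseline_private_split(F, eps_level):
    # Sequential composition: the U - 1 interior candidates share eps_level,
    # and each noisy objective uses sensitivity 2 (Theorem 1)
    U = F.shape[0]
    eps_pc = eps_level / (U - 1)
    noisy = [split_objective(F, k) + rng.laplace(scale=2.0 / eps_pc)
             for k in range(1, U)]
    return int(np.argmin(noisy)) + 1

F = rng.poisson(5, size=(64, 32)).astype(float)
print("chosen split row:", baseline_private_split(F, eps_level=0.05))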
5.4.1.2 Optimized Split Point Selection
We propose an optimization that aims to minimize the number of split point candidate
evaluations required, and searches for a local minimum rather than the global one.
Algorithm 1 outlines the approach for a single split step along the y-axis (i.e., row split).
Inputs to Algorithm 1 include (i) the frequency matrix F_{U×V} of the parent node, (ii) the
total budget allocated for partitioning per level of the tree, ϵ′prt, and (iii) a variable T
which bounds the maximum number of objective function computations, a key factor
indicating the extent of the search, and thus the budget per operation. The proposed
approach is essentially a search tree, determining the candidate split that minimizes the
objective function's output. The search starts from a wide range of candidates and narrows
down within each interval until reaching a local minimum, similar to a binary search.

Let {l, . . . , r} represent the index range where the search is conducted, initially set to
the first and last possible index of the input frequency matrix. At every iteration of the
main loop, the search interval is divided into four equal-length sub-intervals, delimited by
three inner points and the two boundary points. The inner points are referred to as split
indices. The objective function is calculated for each of these candidates and perturbed
using Laplace noise to satisfy DP. The split corresponding to the minimum value is chosen as
the center of the next search interval, and its immediate 'before' and 'after' split positions
are assigned as the updated search boundaries l and r. Hence, in every iteration, two new
computations of the objective function are performed, except for the first run, which has a
single computation. Therefore, the total number of private evaluations sums to (2T + 1),
each perturbed with a privacy budget of ϵ″prt = ϵ′prt/(2T + 1).
5.4.2 HTF Index Structure Construction
Our proposed HTF index structure is built in accordance with the split point selection
algorithm introduced in Section 5.4.1. The HTF construction pseudocode is presented in
Algorithm 2. Each node stores the rectangular spatial extent of the node (node.region), its
children (node.left and node.right), real data count (node.count), noisy count
(node.ncount), and the node’s height in the tree.
Algorithm 1 Near-optimal Split Point Estimator
1: function GetSplitPoint(F_{U×V}, ϵ′prt, axis, T)
2:   ϵ″prt ← ϵ′prt/(2T + 1)
3:   l ← 1, r ← (axis = 0) ? V : U        ▷ axis = 0 means x-split
4:   k ← l + ⌊(r − l)/2⌋
5:   Compute o_k along axis according to Eq. (5.3)
6:   ō_k ← o_k + Lap(2/ϵ″prt)
7:   while l ≤ r and T > 0 do
8:     k1 ← l + ⌊(k − l)/2⌋
9:     k2 ← k + ⌊(r − k)/2⌋
10:    Compute o_{k1} and o_{k2} along axis according to Eq. (5.3)
11:    (ō_{k1}, ō_{k2}) ← (o_{k1}, o_{k2}) + Lap(2/ϵ″prt)
12:    MinOutput ← min(ō_{k1}, ō_k, ō_{k2})
13:    if MinOutput = ō_k then
14:      l ← k1, r ← k2
15:    else if MinOutput = ō_{k1} then
16:      r ← k, k ← k1, ō_k ← ō_{k1}
17:    else
18:      l ← k, k ← k2, ō_k ← ō_{k2}
19:    end if
20:    T ← T − 1
21:  end while
22:  return k
23: end function
The root of the tree represents the entire data domain (the N × M frequency matrix) and
its height is denoted by h. Deciding the height of the tree is a challenging task: a large
height will result in a smaller amount of privacy budget per level, whereas a small one does
not provide sufficient granularity at the leaf level, decreasing query precision. We estimate
an advantageous height value using a small amount of budget (ϵheight) to perturb the total
number of data records based on the Laplace mechanism:

$$\overline{|D|} = |D| + \mathrm{Lap}(1/\epsilon_{height}). \qquad (5.12)$$
Next, we set the height to:

$$h = \log_2\left(\frac{\overline{|D|}\,\epsilon_{tot}}{10}\right) \qquad (5.13)$$

The formula is motivated by the work in [102]. The authors show that when data are
uniformly distributed in space, using a grid with the lower granularity of

$$\sqrt{\frac{|D|\epsilon_{tot}}{c_0}} \times \sqrt{\frac{|D|\epsilon_{tot}}{c_0}}$$

improves the mean relative error, where the value of the constant c_0 is set to 10
experimentally. We emphasize that the approach does not indicate that the number of
leaves of the tree is |D|ϵtot/c_0; this number is merely used as an estimator of the tree's
height. This estimation is formally characterized in [55] and referred to as the scale-epsilon
exchangeability property. The intuition is that the error due to decreasing the amount of
budget used for the estimation is offset by having a larger number of data points in the
entire dataset.
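A small sketch of this height heuristic (ours; Eqs. (5.12)-(5.13) with c_0 = 10 and a floor to obtain an integer height):

import numpy as np

rng = np.random.default_rng(1)

def estimate_height(n_records, eps_tot, eps_height):
    # Eq. (5.12): sanitize the total record count (sensitivity 1)
    noisy_total = n_records + rng.laplace(scale=1.0 / eps_height)
    # Eq. (5.13): h = log2(noisy_total * eps_tot / 10), floored
    return int(np.floor(np.log2(max(noisy_total, 2.0) * eps_tot / 10.0)))

for eps in (0.1, 0.3, 0.5):
    print(eps, estimate_height(3_500_000, eps, eps_height=1e-4))
# yields 15, 16, 17 -- matching the heights reported in Section 5.5.4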
The last input to the algorithm is the budget allocated per level of the partitioning tree.
We use uniform budget allocation between levels, denoted as ϵ′prt = ϵprt/h.
Starting from the root node, the proposed algorithm recursively creates two child nodes
and decreases the height by one. This is done by splitting the underlying area of the node
into two hyperplanes based on Algorithm 1. The division is done on the y dimension if the
current height is an even number and on the x dimension otherwise. The algorithm
Algorithm 2 DP space partitioning
1: function PrivatePartitioning(node, ϵ′prt)
2:   if node.height = 0 then
3:     return node
4:   end if
5:   axis ← node.height mod 2
6:   node.count ← sum(node.region)
7:   OptIdx ← GetSplitPoint(node.FreqMatrix, ϵ′prt, axis, T)
8:   leftChild, rightChild ← split node on OptIdx   ▷ each child has height node.height − 1
9:   node.leftChild ← PrivatePartitioning(leftChild, ϵ′prt)
10:  node.rightChild ← PrivatePartitioning(rightChild, ϵ′prt)
11: end function
continues until reaching the minimum height of zero, or to a point where no further
splitting is possible.
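Putting the construction together, the following compact mock-up (our simplification: it is non-private, replacing Algorithm 1's noisy search with an exact argmin, and omitting stop conditions) shows the recursive, axis-alternating bisection:

import numpy as np

def split_objective(F, k, axis):
    # o_k per Eq. (5.3) along the given axis (0 = rows, 1 = columns)
    a, b = (F[:k], F[k:]) if axis == 0 else (F[:, :k], F[:, k:])
    return np.abs(a - a.mean()).sum() + np.abs(b - b.mean()).sum()

def build_htf(F, height):
    # Recursive, non-private bisection; y-dimension split on even heights
    node = {"count": int(F.sum()), "shape": F.shape, "children": None}
    axis = 0 if height % 2 == 0 else 1
    n = F.shape[axis]
    if height == 0 or n < 2:
        return node
    k = min(range(1, n), key=lambda s: split_objective(F, s, axis))
    left, right = (F[:k], F[k:]) if axis == 0 else (F[:, :k], F[:, k:])
    node["children"] = (build_htf(left, height - 1), build_htf(right, height - 1))
    return node

F = np.random.default_rng(2).poisson(3, size=(16, 16)).astype(float)
root = build_htf(F, height=4)
print(root["count"], root["shape"])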
5.4.3 Leaf Node Count Perturbation
Once the HTF structure is completed, the final step of our algorithm is to release
DP-compliant counts for index nodes, so that answers to queries can be reconstructed from
the noisy counts. The budget consumed so far adds up to ϵheight + ϵprt, where ϵheight was
used to estimate the tree height and ϵprt to generate the private partitioning tree.
The data perturbation step uses the remaining ϵdata amount of budget and releases node
counts according to the Laplace mechanism.
One can choose various strategies to release index node counts. At one extreme, one can
simply release a noisy count for each index node; in this case, the budget must be shared
across nodes on the same path (sequential composition), and can be re-used across different
paths (parallel composition). This approach has the advantage of simplicity, and may do
well when queries exhibit large variance in size – it is well-understood that when perturbing
large counts, the relative error is much lower, since the Laplace noise magnitude only
depends on sensitivity, and not the actual count.
However, in practice, queries tend to be more localized, and one may want to allocate
more budget to the lower levels of the structure, where the actual counts are smaller, thus
decreasing relative error. In fact, as another extreme, one can concentrate the entire ϵdata
on the leaf level. However, doing so can also decrease accuracy, since some leaf nodes have
very small real counts.
Our approach takes a middle ground, where the available ϵdata is spent to (i) determine
which nodes to publish and (ii) ensure sufficient budget remains for the noisy counts.
Specifically, we publish only leaf nodes, but these are not the same leaves returned by the
structure construction algorithm. Instead, we perform an additional pruning step which
uses the noisy counts of internal nodes to determine a stop condition, i.e., the level at which
a node count is likely to be small enough that a further recursion along that path is not
helpful to obtain good accuracy. Effectively, we perform pruning of the tree using a small
fraction of the data budget, and then split the remaining budget among the non-pruned
nodes along a path. This helps decrease the effective height of the tree across each path,
and hence the resulting budget per level increases.
Next, we present in detail our approach that contributes two main ideas: (i) how to
determine smart stop (or pruning) conditions based on noisy internal node counts, and (ii)
how to allocate perturbation budget across shortened paths.
The proposed technique is summarized in Algorithm 3: it takes as inputs the root node
of the tree generated in the data partitioning step; the remaining budget allocated for the
perturbation of data (ϵdata); a tracker of accumulated budget (ϵaccu); a stop condition
predicate denoted by cond; and the nominal tree height h as computed in Section 5.4.2.
Similar to prior work [29], we use a geometric progression budget allocation strategy, but we
enhance it to avoid wasting budget on unnecessarily long paths. The intuition behind this
strategy is to assign more budget to the nodes located in the lower levels of the tree, since
their actual counts are lower, and hence larger added noise impacts the relative error
disproportionately high. Conversely, at the higher levels of the tree, where actual counts are
much higher, the effect of the noise is negligible.
Eq. (5.14) formulates this goal as a convex optimization problem:

$$\min_{\epsilon_0, \ldots, \epsilon_h}\ \sum_{i=0}^{h} \frac{2^{h-i}}{\epsilon_i^2} \qquad (5.14)$$

subject to

$$\sum_{i=0}^{h} \epsilon_i = \epsilon, \qquad (5.15)$$

$$\epsilon_i > 0 \quad \forall i = 0 \ldots h. \qquad (5.16)$$

Writing the Karush-Kuhn-Tucker (KKT) [13] conditions, the optimal allocation of budget can
be calculated as:
$$\mathcal{L}(\epsilon_0, \ldots, \epsilon_h, \lambda) = \sum_{i=0}^{h} \frac{2^{h-i}}{\epsilon_i^2} + \lambda\Big(\sum_{i=0}^{h} \epsilon_i - \epsilon\Big) \qquad (5.17)$$

$$\Rightarrow\ \frac{\partial \mathcal{L}}{\partial \epsilon_i} = -\frac{2^{h-i+1}}{\epsilon_i^3} + \lambda = 0 \qquad (5.18)$$

$$\Rightarrow\ \epsilon_i = \Big(\frac{2^{h-i+1}}{\lambda}\Big)^{1/3}, \qquad (5.19)$$

and substituting the ϵ_i's in the constraint of the problem, the optimal budget at the i-th
level is derived as

$$\epsilon_i = \frac{2^{(h-i)/3}\,\epsilon\,(2^{1/3} - 1)}{2^{(h+1)/3} - 1}. \qquad (5.20)$$
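The closed form of Eq. (5.20) is easy to sanity-check numerically, as in the following sketch (ours; the budget and height values are illustrative):

import numpy as np

def geometric_budgets(eps_data, h):
    # Per-level budgets from Eq. (5.20); i is the node height (leaves are i = 0)
    i = np.arange(h + 1)
    return (2 ** ((h - i) / 3)) * eps_data * (2 ** (1 / 3) - 1) / (2 ** ((h + 1) / 3) - 1)

b = geometric_budgets(eps_data=0.4, h=10)
assert np.isclose(b.sum(), 0.4)   # constraint (5.15) holds
print(b.round(4))                 # the largest share goes to height 0 (the leaves)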
The algorithm starts the traversal from the partitioning tree's root and recursively visits
the descendant nodes. Once a new node is visited, the first step is to use the node's height
to determine the allocated budget (ϵ′data) based on the geometric progression. Recall that
nodes on the same level follow parallel composition of the budget, as their underlying
areas in space do not overlap. Additionally, the algorithm keeps track of the amount of
budget used so far on the tree, optimizing the budget in later stages. Next, the computed
value of ϵ′data is used to perturb node.count by adding Laplace noise, resulting in the
noisy count node.ncount.

The stop condition we use takes into account the noisy count of the current internal
node (i.e., count threshold) and the spatial extent of the internal node (i.e.,
extent threshold). If neither threshold is met for the current node, the algorithm
Algorithm 3 DP data perturbation
1: function Perturber(node, ϵdata, ϵaccu, cond, h)
2:   i ← node.height
3:   ϵ′data ← 2^{(h−i)/3} ϵdata (2^{1/3} − 1)/(2^{(h+1)/3} − 1)   ▷ Eq. (5.20)
4:   ϵaccu ← ϵaccu + ϵ′data
5:   node.ncount ← node.count + Lap(1/ϵ′data)
6:   if node.ncount ≤ cond then
7:     ϵremain ← ϵdata − ϵaccu
8:     node.ncount ← node.count + Lap(1/ϵremain)
9:     node.leftChild ← node.rightChild ← null
10:  else
11:    Perturber(node.leftChild, ϵdata, ϵaccu, cond, h)
12:    Perturber(node.rightChild, ϵdata, ϵaccu, cond, h)
13:  end if
14: end function
recursively visits the node’s children; otherwise, the algorithm prunes the tree considering
that the current node should be a leaf node. In the latter case, the algorithm subtracts the
accumulated budget used so far on that path from the root, and uses the entire remaining
budget available to perturb the count. This significantly improves the utility, as geometric
allocation tends to save most of the budget for the lower levels of the tree. Revisiting the
example in Figure 5.2b, suppose that the stop condition is to prune when the underlying
area consists of less than four cells. During the data perturbation process, the node B2 is
turned into a leaf node due to its low number of cells. At this point, the node’s children are
removed, and its noisy count is determined based on the remaining budget available on the
lower levels of the tree.
5.5 Experimental Evaluation
5.5.1 Experimental Setup
We evaluate HTF on both real and synthetic datasets:
Los Angeles Dataset. This is a subset of the Veraset dataset [144]², including location
measurements of cell phones within the city of Los Angeles. In particular, we consider a
large geographical region covering a 70 × 70 km² area centered at latitude 34.05223 and
longitude -118.24368. The selected data yield a frequency matrix with 3.5 million data
points covering the time period of Jan 1-7, 2020.
Synthetic dataset. We generate locations according to a Gaussian distribution as follows:
a cluster center, denoted by (xc, yc), is selected uniformly at random. Next, coordinates for
each data point x and y are drawn from a Gaussian distribution with the mean of xc and yc,
respectively. We model three sparsity levels by using three standard deviation (σ) settings
for Gaussian variables: low (σ = 20), medium (σ = 50), and high (σ = 100) sparsity.
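A sketch of this generator (our reconstruction of the described procedure; the number of points and the map extent are illustrative assumptions):

import numpy as np

def gaussian_dataset(n_points, sigma, grid=1024, seed=0):
    # Frequency matrix of points drawn around a uniformly random cluster center
    rng = np.random.default_rng(seed)
    xc, yc = rng.uniform(0, grid, size=2)
    x = np.clip(rng.normal(xc, sigma, n_points), 0, grid - 1).astype(int)
    y = np.clip(rng.normal(yc, sigma, n_points), 0, grid - 1).astype(int)
    F = np.zeros((grid, grid), dtype=np.int64)
    np.add.at(F, (y, x), 1)   # bin points into the grid x grid frequency matrix
    return F

F_sparse = gaussian_dataset(100_000, sigma=100)   # "high sparsity" setting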
We discretize the space to a 1024 × 1024 frequency matrix. We use as performance
metric the mean relative error (MRE) for range queries. Similar to prior work [55, 158, 150,
102], we consider a smoothing factor of 20 for the relative error, to deal with cases when the
true count for a query is zero (i.e., relative error is not defined). Each experimental run
consists of 2,000 random rectangular queries with centers selected uniformly at random. We
vary the query size over regions covering {2%, 6%, 10%} of the dataspace.
² Veraset is a data-as-a-service company that provides anonymized population movement data collected
through location measurement signals of cell phones across the USA.
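The query workload and error metric can be sketched as follows (our code; we assume the common convention of dividing by max(true count, 20) to implement the smoothing factor, and a cell-level noisy matrix as input, whereas tree-based methods would answer queries from their leaf counts):

import numpy as np

def mre(F_true, F_noisy, n_queries=2000, smoothing=20.0, seed=3):
    # Mean relative error of random rectangular range queries, per Eq. (5.1)
    rng = np.random.default_rng(seed)
    N, M = F_true.shape
    errs = []
    for _ in range(n_queries):
        # Random rectangle: uniformly placed corner, random extent
        h, w = rng.integers(1, N // 2), rng.integers(1, M // 2)
        r, c = rng.integers(0, N - h), rng.integers(0, M - w)
        true = F_true[r:r + h, c:c + w].sum()
        noisy = F_noisy[r:r + h, c:c + w].sum()
        errs.append(abs(true - noisy) / max(true, smoothing) * 100)
    return float(np.mean(errs))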
Figure 5.3: Comparison with Data Dependent Algorithms, Los Angeles Dataset. Panels (a)-(c): random shape and size queries with ϵtot = 0.1, 0.3, 0.5; panels (d)-(f): random square queries of size 2%, 6%, and 10% with ϵtot = 0.1.
Figure 5.4: Comparison to Grid-based Algorithms, Los Angeles Dataset. Panels (a)-(c): random shape and size queries with stop counts 10, 50, and 100; panels (d)-(f): random square queries of size 2%, 6%, and 10%.
5.5.2 HTF vs Data Dependent Algorithms
Data-dependent algorithms aim to exploit users’ distribution to provide an enhanced
partitioning of the map. The state-of-the-art data-dependent approach for DP-compliant
location publication is the KD-tree technique from [29]. The KD-tree algorithm generates
the partitioning tree by splitting on median values, which are determined using the
exponential mechanism. We have also included the smoothing post-processing step from
[102, 56] which resolves inconsistencies within the structure (e.g., using the fact that counts
in a hierarchy should sum to the count of an ancestor).
Fig. 5.3 presents the comparison of the HTF algorithm with KD-tree approaches,
namely: (i) geometric budget allocation with smoothing and post-processing, labelled
KdTree (geo); (ii) uniform budget allocation with smoothing and post-processing, labelled
KdTree (uniform); (iii) the HTF algorithm with the partitioning budget per level set to
ϵ′prt = 5E−4; and (iv) the HTF algorithm with ϵ′prt = 1E−3. Recall that ϵ′prt denotes the
budget per level of partitioning; therefore, given the tree's height h, the remaining budget
for perturbation is derived as ϵdata = ϵtot − ϵ′prt × h − ϵheight. The value of ϵheight in the
experiments is set to 1E−4, HTF's T value is set to 3, and the stop condition thresholds are
set to no fewer than 5 cells or 100 data points. Moreover, for the KD-tree algorithm, 15% of
the total budget is allocated to the partitioning.
In Figs. 5.3a, 5.3b, and 5.3c, the MRE performance is compared for different values of
ϵtot over a workload of uniformly located queries with random shape and size. HTF clearly
outperforms KD-tree for all height settings. Looking at the MRE performance, the KD-tree
algorithm follows a parabolic shape commonly occurring in tree-based algorithms, meaning
that the MRE performance reaches its best values at a particular height, and further
partitioning of the space increases the error. This is caused by excessive partitioning in
low-density areas. HTF, on the other hand, is applying stop conditions to avoid the adverse
effects of over-partitioning, and is able to estimate the optimal height beforehand.
Figs. 5.3d, 5.3e, and 5.3f, show the results when varying query size (for square shape
queries). HTF outperforms the KD-tree algorithm significantly.
5.5.3 HTF vs Grid-based Algorithms
Grid-based approaches are mostly data-independent, and they partition the space using
regular grids. The uniform grid (UG) approach uses a single-layer fixed size grid, whereas
its successor adaptive grid (AG) method considers two layers of regular grids: the first layer
is similar to UG, whereas the second uses a small amount of data-dependent information
(i.e., noisy query results on the first layer) to tune the second layer grid granularity.
Fig. 5.4 presents the comparison of HTF with AG and UG. For HTF, we consider
several stop condition thresholds. HTF consistently outperforms grid-based approaches,
especially when the total privacy budget is lower (i.e., more stringent privacy requirements).
The impact of the stop count condition on HTF depends on the underlying distribution of
data points. For the Los Angeles dataset, MRE is relatively larger for small values of ϵtot.
The performance improves and reaches its near-optimal values around stop count 50, and
Figure 5.5: Comparison to Data Independent Algorithms, Los Angeles Dataset. Panels (a)-(c): random shape and size queries with ϵtot = 0.1, 0.3, 0.5; panels (d)-(f): random square queries of size 2%, 6%, and 10%.
ultimately worsens when the stop count becomes larger. This matches our expectation as,
on one hand, small values of stop count result in over-partitioning, and on the other hand,
when the stop count is too large, the partitioning tree cannot reach the ideal heights,
resulting in high MRE values.
Figures 5.4d, 5.4e, and 5.4f, show the obtained results for varying query sizes (2, 6, and
10% of the data domain). HTF outperforms both AG and UG. Note that all three
algorithms are adaptive and change their partitioning according to the number of data
points. Therefore, in low privacy regimes, the structure of algorithms may cause
fluctuations in the accuracy. However, as the privacy budget grows, the algorithms reach
their maximum partitioning limit, and increasing the budget always results in lower MRE.
5.5.4 HTF vs Data Independent Algorithms
The most prominent DP-compliant data-independent technique [29] uses QuadTrees. The
technique recursively partitions the space into four equal-size quadrants. Two budget
allocation strategies used in [29] are geometric budget allocation (geo) and
uniform budget allocation. For a fair comparison, we have also included the smoothing
post-processing step from [102, 151].
Fig. 5.5 presents the comparison results. Figures 5.5a, 5.5b, 5.5c are generated using
random shape and size queries, and several different height settings. Note that the fanout
of QuadTrees is double the fanout of HTF, so the height represented by 2k in the figures
corresponds to the implementation of QuadTree with the height of k. The error of the
QuadTree approach is large for small heights, then improves to its optimal value, and rises
again significantly as height further increases, due to over-partitioning. Similar to KD-trees,
no systematic method has been developed for QuadTree to determine optimal height,
whereas the HTF height selection heuristic yields levels 15, 16, and 17 for the allocated
privacy budget of 0.1, 0.3, and 0.5, respectively. HTF outperforms QuadTree for all settings
of ϵtot. Figures 5.5d, 5.5e, and 5.5f show the accuracy of HTF and QuadTree for
square-shaped randomly placed queries of varying size. HTF outperforms Quadtree in all
cases.
5.5.5 Additional Benchmarks
Figure 5.6: Mixed Workloads, ϵtot = 0.1, All Datasets.
To further validate HTF performance, we run experiments on the Los Angeles dataset
as well as six synthetic datasets generated using Gaussian distribution. Fig. 5.6 presents the
comparison of all algorithms on a randomly generated query workload with a privacy budget
of ϵtot = 0.1. Three additional algorithms are used as comparison benchmarks: (i) the
Singular algorithm, which preserves differential privacy by adding independent Laplace
noise to each entry of the frequency matrix; (ii) the Uniform algorithm, in which Laplace
noise is added to the total count of the grid under the assumption that data are uniformly
distributed within the grid; and (iii) the Privlet [150] algorithm based on wavelet
transformations.
Fig. 5.6 shows that HTF consistently outperforms existing approaches. For the denser
datasets (σ = 20), the gain compared to approaches designed for uniform data (e.g.,
UG, AG, QuadTrees) is lower. As data sparsity grows, the difference in accuracy between
HTF and the benchmarks increases. HTF performs best on a relative basis for lower privacy
budgets, i.e., more stringent privacy requirements.
Chapter 6
Fair Spatial Indexing: A Paradigm for Group Spatial Fairness
Machine learning (ML) is becoming increasingly integral to decision-making processes that
have direct impacts on individuals, such as approving loans or screening job candidates.
Significant concerns arise that, without special provisions, individuals from under-privileged
backgrounds may not get equitable access to services and opportunities. Existing research
studies fairness with respect to protected attributes such as gender, race or income, but the
impact of location data on fairness has been largely overlooked. With the widespread
adoption of mobile apps, geospatial attributes are increasingly used in ML, and their
potential to introduce unfair bias is significant, given their high correlation with protected
attributes. In this chapter, we propose techniques to mitigate location bias in machine
learning. Specifically, we consider the issue of miscalibration when dealing with geospatial
attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm
that accounts for fairness. Our KD-tree inspired approach significantly improves fairness
67
while maintaining high learning accuracy, as shown by extensive experimental results on
real data. This chapter is based on the publication in [123]1
.
6.1 Introduction
Recent advances in machine learning (ML) led to its adoption in numerous decision-making
tasks that directly affect individuals, such as loan evaluation or job application screening.
Several studies [98, 12, 88] pointed out that ML techniques may introduce bias with respect
to protected attributes such as race, gender, age or income. The last years witnessed the
introduction of fairness models and techniques that aim to ensure all individuals are treated
equitably, focusing especially on conventional protected attributes (like race or gender).
However, the impact of geospatial attributes on fairness has not been extensively studied,
even though location information is being increasingly used in decision-making for novel
tasks, such as recommendations, advertising or ride-sharing. Conventional applications may
also often rely on location data, e.g., allocation of local government resources, or crime
prediction by law enforcement using geographical features. For example, the Chicago Police
Department releases monthly crime datasets [26] and classifies neighborhoods based on
their crime risk level. Subsequently, the risk level is used to determine vehicle and house
insurance premiums, which are increased to reflect the risk level, and in turn, result in
additional financial hardship for individuals from under-privileged groups.
¹ Shaham, S., Ghinita, G. and Shahabi, C., 2023. Fair Spatial Indexing: A Paradigm for Group Spatial
Fairness. arXiv preprint arXiv:2302.02306.
Fairness for geospatial data is a challenging problem, due to two main factors: (i) data
are more complex than conventional protected attributes such as gender or race, which are
categorical and have only a few possible values; and (ii) the correlation between locations
and protected attributes may be difficult to capture accurately, thus leading to
hard-to-detect biases.
We consider the case of group fairness [38], which ensures no significant difference in
outcomes occurs across distinct population groups. In our setting, groups are defined with
respect to geospatial regions. The data domain is partitioned into disjoint regions, and each
of them represents a group. All individuals whose locations belong to a certain region are
assigned to the corresponding group. In practice, a spatial group can correspond to a zip
code, a neighborhood, or a set of city blocks. Our objective is to devise fair geospatial
partitioning algorithms, which can handle the needs of applications that require different
levels of granularity in terms of location reporting. Spatial indexing [146, 41, 157] is a
common approach used for partitioning, and numerous techniques have been proposed that
partition the data domain according to varying criteria, such as area, perimeter, data point
count, etc. We build upon existing spatial indexing techniques, and adapt the partition
criteria to account for the specific goals of fairness. By carefully combining geospatial and
fairness criteria in the partitioning strategies, one can obtain spatial fairness while still
preserving the useful spatial properties of indexing structures (e.g., fine-level clustering of
the data).
Specifically, we consider a set of partitioning criteria that combines physical proximity
and calibration error. Calibration is an essential concept in classification tasks which
quantifies the quality of a classifier. Consider a binary classification task, such as a loan
approval process. Calibration measures the difference between the observed and predicted
probabilities of any given point being labeled in the positive class. If one partitions the data
according to some protected attribute, then the expectation would be that the probability
should be the same across all resulting groups (e.g., people from different neighborhoods should
have an equal chance, on aggregate, to be approved for a loan). If the expected and actual
probabilities are different, that represents a good indication of unfair treatment.
Our proposed approach builds a hierarchical spatial index structure by using a
composite split metric, consisting of both geospatial criteria (e.g., compact area) and
miscalibration error. In doing so, it allows ML applications to benefit from granular
geospatial information, while at the same time ensuring that no significant bias is present in
the learning process.
Our specific contributions include:
• We identify and formulate the problem of spatial group fairness, an important concept
which ensures that geospatial information can be used reliably in a classification task,
without introducing, intentionally or not, biases against individuals from
underprivileged groups;
Figure 6.1: An Example of the Miscalibration Problem with Respect to Neighborhoods: (a) Neighborhoods Partitioning; (b) Generation of Classifier Scores; (c) Calibration of Neighborhoods.
• We propose a new metric to quantify unfairness with respect to geospatial boundaries,
called Expected Neighborhood Calibration Error (ENCE);
• We propose a technique for fair spatial indexing that builds on KD-trees and considers
both geospatial and fairness criteria, by lowering miscalibration and reducing ENCE;
• We perform an extensive experimental evaluation on real datasets, showing that the
proposed approach is effective in enforcing spatial group fairness while maintaining
data utility for classification tasks.
6.2 Preliminaries
6.2.1 System Architecture
We consider a binary classification task $T$ over a dataset $\mathcal{D}$ of individuals $u_1, ..., u_{|\mathcal{D}|}$. The feature set recorded for $u_i$ is denoted by $x_i \in \mathbb{R}^l$, and its corresponding label by $y_i \in \{0, 1\}$. Each record consists of $l$ features, including an attribute called neighborhood, which captures an individual's location, and is the main focus of our approach. The sets of all input data and labels are denoted by $\mathcal{X}$ and $\mathcal{Y}$, respectively. A classifier $h(\cdot)$ is trained over the input data resulting in $h(\mathcal{X}) = (\hat{\mathcal{Y}}, \mathcal{S})$, where $\hat{\mathcal{Y}} = \{\hat{y}_1, ..., \hat{y}_{|\mathcal{D}|}\}$ is the set of predicted labels ($\hat{y}_i \in \{0, 1\}$) and $\mathcal{S} = \{s_1, ..., s_{|\mathcal{D}|}\}$ is the set of confidence scores ($s_i \in [0, 1]$) for each label.
The dataset’s neighborhood feature indicates the individual’s spatial group. We assume
the spatial data domain is split into a set of partitions of arbitrary granularity. Without loss
of generality, we consider a U × V grid overlaid on the map. The grid is selected such that
its resolution captures adequate spatial accuracy as required by application needs. A set of
neighborhoods is a non-overlapping partitioning of the map that covers the entire space,
with the $i$-th neighborhood denoted by $N_i$, and the set of neighborhoods denoted by $\mathcal{N}$.
Figure 6.1 illustrates the system overview. Figure 6.1a shows the map divided into 4
non-overlapping partitions N = {N1, N2, N3, N4}. The neighborhood is recorded for each
individual u1, ..., u11 together with other features, and a classifier is trained over the data.
The classifier's output is a confidence score for each entry, which is turned into a class label by applying a threshold.
6.2.2 Fairness Metric
Our primary focus is to achieve spatial group fairness using as metric the concept of
calibration [90, 100], described in the following.
In classification tasks, it is desirable to have scores indicating the probability that a test
data record belongs to a certain class. Probability scores are especially important in
ranking problems, where top candidates are selected based on relative quantitative
performance. Unfortunately, it is not guaranteed that confidence scores generated by a classifier can be interpreted as probabilities. Consider a binary classifier that indicates an individual's chance of committing a crime after their release from jail (recidivism). If two individuals $u_1$ and $u_2$ get confidence scores 0.4 and 0.8, this cannot be directly interpreted as the likelihood of committing a crime being twice as high for $u_2$ as for $u_1$. Model calibration aims to alleviate precisely this shortcoming.
Definition 9. (Calibration). An ML model is said to be calibrated if it produces calibrated confidence scores. Formally, outcome score $R$ is calibrated if for all scores $r$ in the support of $R$ it holds that
$$P(y = 1 \mid R = r) = r \qquad (6.1)$$
This condition means that the set of all instances assigned a score value $r$ contains an $r$ fraction of positive instances. The metric is a group-level metric. Suppose there exist 10 people who have been assigned a confidence score of 0.7. In a well-calibrated model, we expect 7 individuals with positive labels among them. Thus, at the group level the probability of a positive label is 0.7, but this does not mean that every individual in the group has this exact chance of receiving a positive label.
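As an illustration of this group-level property, the following Python sketch (a toy example with illustrative variable names, not part of our implementation) buckets confidence scores and compares each bucket's mean score against its observed fraction of positives:

import numpy as np

def calibration_by_score(scores, labels, n_bins=10):
    # Estimate P(y=1 | R=r) per score bucket; in a well-calibrated
    # model the mean score and the positive rate of a bucket match.
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.any():
            rows.append((scores[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return rows  # list of (mean score, positive rate, bucket size)

# Ten individuals scored 0.7, seven of them positive: calibrated.
print(calibration_by_score([0.7] * 10, [1] * 7 + [0] * 3))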
To measure the amount of miscalibration for the whole model or for an output interval, the ratio of two key quantities needs to be calculated: the expected value of the confidence scores and the true fraction of positive labels. Abiding by the convention in [90], we use functions $o(\cdot)$ and $e(\cdot)$ to return the true fraction of positive instances and the expected value of confidence scores, respectively. For example, the calibration of the model in Figure 6.1b is computed as:
$$\frac{e(h)}{o(h)} = \frac{(\sum_{u \in \mathcal{D}} s_u)/|\mathcal{D}|}{(\sum_{u \in \mathcal{D}} y_u)/|\mathcal{D}|} = \frac{5.2/11}{7/11} \approx .742 \qquad (6.2)$$
Perfect calibration is achieved when this ratio is equal to one. Ratios above or below one indicate miscalibration. Another way to measure the calibration error is via the absolute difference between the two quantities, denoted by $|e(h) - o(h)|$, with the ideal value being zero. In this work, the second method is utilized, as it eliminates the division-by-zero problem that may arise in neighborhoods with low populations.
6.2.3 Problem Formulation
Even when a model is overall well-calibrated, it can still lead to unfair treatment of
individuals from different neighborhoods. In order to achieve spatial group fairness, we
must have a well-calibrated model with respect to all neighborhoods. The existence of
calibration error in a neighborhood can result in classifier bias and lead to systematic
unfairness against individuals from that neighborhood (in Section 6.4, we support this
claim with real data measurements).
Definition 10. (Calibration for Neighborhoods). Given neighborhood set $\mathcal{N} = \{N_1, ..., N_t\}$, we say that score $R$ is calibrated in neighborhood $N_i$ if for all scores $r$ in the support of $R$ it holds that
$$P(y = 1 \mid R = r, N = N_i) = r, \quad \forall i \in [1, t] \qquad (6.3)$$
The following quantities can be used to measure the amount of miscalibration with respect to neighborhood $N_i$:
$$\frac{e(h \mid N = N_i)}{o(h \mid N = N_i)} \quad \text{or} \quad |e(h \mid N = N_i) - o(h \mid N = N_i)| \qquad (6.4)$$
Going back to the example in Figure 6.1c, the calibration values for neighborhoods $N_1$ to $N_4$ are visualized on a plot. Neighborhood $N_4$ is well-calibrated, whereas the others suffer from miscalibration.
Problem 2. Given $m$ binary classification tasks $T_1, T_2, ..., T_m$, we seek to partition the space into contiguous, non-overlapping neighborhoods such that for each decision-making task, the trained model is well-calibrated for all neighborhoods.
Figure 6.2: Overview of the Proposed Mitigation Techniques.
6.2.4 Evaluation Metrics
A commonly used metric to evaluate the calibration of a model is Expected Calibration
Error (ECE) [51]. The goal of ECE is to understand the validity of output confidence
scores. However, our focus is on identifying the calibration error imposed on different
neighborhoods. Therefore, we extend ECE and propose the Expected Neighborhood
Calibration Error (ENCE) that captures the calibration performance over all
neighborhoods.
Definition 11. (Expected Neighborhood Calibration Error). Given $t$ non-overlapping geospatial regions $\mathcal{N} = \{N_1, ..., N_t\}$ and a classifier $h$ trained over data located in these neighborhoods, the ENCE metric is calculated as:
$$ENCE = \sum_{i=1}^{t} \frac{|N_i|}{|\mathcal{D}|} \, |o(N_i) - e(N_i)| \qquad (6.5)$$
where $o(N_i)$ and $e(N_i)$ return the true fraction of positive instances and the expected value of confidence scores for instances in $N_i$, respectively2.
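For concreteness, ENCE can be computed in a few lines of Python; the sketch below is a direct transcription of Equation (6.5) with illustrative array names:

import numpy as np

def ence(scores, labels, neighborhoods):
    # scores, labels, neighborhoods: per-individual confidence score s_u,
    # true label y_u, and neighborhood id, each of length |D|.
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, float)
    neighborhoods = np.asarray(neighborhoods)
    total = 0.0
    for nb in np.unique(neighborhoods):
        mask = neighborhoods == nb
        e_i = scores[mask].mean()   # e(N_i): expected confidence score
        o_i = labels[mask].mean()   # o(N_i): true fraction of positives
        total += mask.sum() / len(scores) * abs(o_i - e_i)
    return total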
Table 6.1: Summary of Notations.

Symbol                                              Description
$l$                                                 Number of features
$\mathcal{D} = \{u_1, ..., u_{|\mathcal{D}|}\}$     Dataset of individuals
$(x_i, y_i)$                                        (Feature set, true label) for $u_i$
$\mathcal{D} = [\mathcal{X}, \mathcal{Y}]$          Dataset with user features and labels
$\hat{\mathcal{Y}} = \{\hat{y}_1, ..., \hat{y}_{|\mathcal{D}|}\}$   Set of predicted labels
$\mathcal{S} = \{s_1, ..., s_{|\mathcal{D}|}\}$     Set of confidence scores
$\mathcal{N} = \{N_1, ..., N_t\}$                   Set of neighborhoods
$U \times V$                                        Base grid resolution
$T$                                                 Binary classification task
$m$                                                 Number of binary classification tasks
$t$                                                 Number of neighborhoods
$th$                                                Tree height

2Symbol $|\cdot|$ denotes absolute value.
Figure 6.3: Overview of Fair KD-tree Algorithm. (a) Initial execution of classifier; (b) re-districting the map based on fairness objective; (c) training classifier based on re-districted neighborhoods.
6.3 Spatial Fairness through Indexing
We introduce several algorithms that achieve spatial group fairness by constructing spatial
index structures in a way that takes into account fairness considerations when performing
data domain splits. We choose KD-trees as a starting point for our solutions, due to their
ability to adapt to data density, and their property of covering the entire data domain (as
opposed to structures like R-trees that may leave gaps within the domain).
Figure 6.2 provides an overview of the proposed solution. Our input consists of a base
grid with an arbitrarily-fine granularity overlaid on the map, the attributes/features of
individuals in the data, and their classification labels. The attribute set includes individual
location, represented as the grid cell enclosing the individual. We propose a suite of three
alternative algorithms for fairness, which are applied in the pre-processing phase of the ML
pipeline and lead to the generation of new neighborhood boundaries. Once spatial
partitioning is completed, the location attribute of each individual is updated, and
classification is performed again.
The proposed algorithms are:
• Fair KD-tree is our primary algorithm and it re-districts spatial neighborhoods based
on an initial classification of data over a base grid. Fair KD-tree can be applied to a
single classification task.
• Iterative Fair KD-tree improves upon Fair KD-tree by refining the initial ML scores at
every height of the index structure. It incurs higher computational complexity but
provides improved fairness.
• Multi-Objective Fair KD-tree enables Fair KD-trees for multiple classification tasks. It
leads to the generation of neighborhoods that fairly represent spatial groups for
multiple objectives.
Next, we prove an important result that applies to all proposed algorithms, which states that any non-overlapping partitioning of the location domain has a weighted average calibration error greater than or equal to the overall calibration error of the model. The proofs of all theorems are provided in [123].
Theorem 2. For a given model h and a complete non-overlapping partitioning of the space
N = {N1, N2, ..., Nt}, ENCE is lower-bounded by the overall calibration of the model.
A broader statement can also be proven, showing that further partitioning leads to poorer
ENCE performance.
Theorem 3. Consider a binary classifier $h$ and two complete non-overlapping partitionings of the space, $\mathcal{N}_1$ and $\mathcal{N}_2$. If $\mathcal{N}_2$ is a sub-partitioning of $\mathcal{N}_1$, then:
$$ENCE(\mathcal{N}_1) \le ENCE(\mathcal{N}_2) \qquad (6.6)$$
Neighborhood set $\mathcal{N}_2$ is a sub-partitioning of $\mathcal{N}_1$ if for every $N_i \in \mathcal{N}_1$, there exists a set of neighborhoods in $\mathcal{N}_2$ whose union is $N_i$.
6.3.1 Fair KD-tree
We build a KD-tree index that partitions the space into non-overlapping regions according
to a split metric that takes into account the miscalibration metric within the regions
resulting after each split. Figure 6.3 illustrates this approach, which consists of three steps.
Algorithm 4 presents the pseudocode of the approach.
Step 1. The base grid is used as input, where the location of each individual is
represented by the identifier of their enclosing grid cell. This location attribute, alongside
Algorithm 4 Fair KD-tree
Input: Grid (U × V), Features (X), Labels (Y), Height (th).
Output: New neighborhoods and updated feature set
1: function FairKDtree(N, X, Y, S, th)
2:   if th = 0 then
3:     N ← N + N        ▷ append leaf N to the global neighborhood set
4:     return True
5:   end if
6:   axis ← th mod 2
7:   Lk*, Rk* ← SplitNeighborhood(N, Y, S, axis)
8:   Run FairKDtree(Lk*, X, Y, S, th − 1)
9:   Run FairKDtree(Rk*, X, Y, S, th − 1)
10: end function
11: N1 ← Grid
12: Global N ← {}
13: Set all neighborhoods in X to N1
14: Scores (S) ← Train ML model on X and Y
15: Neighborhoods (N) ← Run FairKDtree(N1, X, Y, S, th)
16: Update neighborhoods in X
17: return N, X
other features, is used as input to an ML classifier h for training. The classifier’s output is a
set of confidence scores S, as illustrated in Figure 6.3a. Once confidence scores are
generated, the true fraction of positive instances and the expected value of predicted
confidence scores of the model with respect to neighborhoods can be calculated as follows:
$$e(h \mid N = N_i) = \frac{1}{|N_i|} \sum_{u \in N_i} s_u \quad \forall i \in [1, t] \qquad (6.7)$$
$$o(h \mid N = N_i) = \frac{1}{|N_i|} \sum_{u \in N_i} y_u \quad \forall i \in [1, t] \qquad (6.8)$$
where t is the number of neighborhoods.
Algorithm 5 Split Neighborhood
Input: Neighborhood (N), Confidence Scores (S), Labels (Y), Axis.
Output: Non-overlapping split of N into two neighborhoods
1: function SplitNeighborhood(N, S, Y, axis)
2:   if axis = 1 then
3:     N ← Transpose of N
4:   end if
5:   U′ × V′ ← Dimensions of N
6:   for k = 1...U′ do
7:     Lk ← Neighborhoods in rows 1...k
8:     Rk ← Neighborhoods in rows k + 1...U′
9:     zk ← Compute Equation (6.9) for Lk and Rk
10:  end for
11:  k* ← arg mink zk
12:  return Lk*, Rk*
13: end function
Step 2. This step performs the actual partitioning, by customizing the KD-tree split algorithm with a novel objective function. KD-trees are binary trees in which a region is split into two parts, typically according to the median coordinate value across one of the dimensions (latitude or longitude). Instead, we select the division index that reduces the fairness metric, i.e., ENCE miscalibration. Confidence scores and labels resulting from the previous training step are used as input for the split point decision. For a given tree node, assume the corresponding partition covers $U' \times V'$ cells of the entire $U \times V$ grid. Without loss of generality, we consider partitioning on the horizontal axis (i.e., row-wise). The aim is to find an index $k$ which groups rows 1 to $k$ into one node and rows $k + 1$ to $U'$ into another, such that the fairness objective is minimized (among all possible index split
Algorithm 6 Iterative Fair KD-tree
Input: Grid (U × V), Features (X), Labels (Y), Height (th).
Output: New neighborhoods and updated feature set
1: N1 ← Grid
2: Set all neighborhoods in X to N1
3: N ← {N1}
4: while th > 0 do
5:   Scores (S) ← Train ML model on X and Y
6:   Nnew ← {}
7:   for Ni in N do
8:     L, R ← SplitNeighborhood(Ni, S, Y, th mod 2)
9:     Nnew ← Nnew + L, R
10:  end for
11:  N ← Nnew
12:  Update neighborhoods in X based on N
13:  th ← th − 1
14: end while
15: return N, X
positions). Let $L_k$ and $R_k$ denote the left and right regions generated by splitting at index $k$. The fairness objective for index $k$ is:
$$z_k = \big|\, |L_k| \times |o(L_k) - e(L_k)| \;-\; |R_k| \times |o(R_k) - e(R_k)| \,\big| \qquad (6.9)$$
In the above equation, $|L_k|$ and $|R_k|$ denote the number of data entries in the left and right regions, respectively. The intuition behind the objective function is to reduce the model miscalibration difference as we heuristically move forward. Two key points about the above function are: (i) the calibration is formulated as a difference (rather than a ratio) due to the possibility of a zero denominator, and (ii) the calibration values are weighted by the cardinalities of their corresponding regions. The optimal index $k^*$ is selected as:
$$k^* = \arg\min_k z_k \qquad (6.10)$$
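A simplified Python sketch of this split search is shown below; it assumes per-cell aggregates over a 2D grid (the array names are illustrative) and returns the row index minimizing Equation (6.9):

import numpy as np

def best_row_split(counts, score_sums, label_sums):
    # counts[i, j], score_sums[i, j], label_sums[i, j]: number of
    # individuals in grid cell (i, j), the sum of their confidence
    # scores, and the sum of their true labels.
    def weighted_miscal(rows):
        n = counts[rows].sum()
        if n == 0:
            return 0.0
        e = score_sums[rows].sum() / n      # e(region)
        o = label_sums[rows].sum() / n      # o(region)
        return n * abs(o - e)               # |region| * miscalibration
    U = counts.shape[0]
    z = [abs(weighted_miscal(slice(0, k)) - weighted_miscal(slice(k, U)))
         for k in range(1, U)]              # Eq. (6.9) for each k
    return int(np.argmin(z)) + 1            # k* of Eq. (6.10)

Transposing the three arrays yields the column-wise (vertical) split, mirroring line 3 of Algorithm 5.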
Step 3. On completion of the fair KD-tree algorithm, the index leaf set provides a
non-overlapping partitioning of the map. In the algorithm’s final step, the neighborhood of
each individual in the dataset is updated according to the leaf set and used for training.
The pseudocode for the Fair KD-tree method is illustrated in Algorithms 4 and 5. The
SplitNeighborhood function in the latter identifies the split point based on the fairness goal,
and it is invoked multiple times within Algorithm 4. In Algorithm 4, lines 11 to 14 outline
the algorithm’s initial training stage, as detailed previously in Step 1. The starting grid is
determined as N1 in line 11, and the model undergoes training in line 14. The recursive
split procedure is initiated in line 15. Upon reaching a leaf node, the neighborhood is stored
in line 3. If not, further divisions are made, focusing on the fairness target.
Theorem 4. For a given dataset $\mathcal{D}$, the required number of neighborhoods $t$, and the model $h$, the computational complexity of Fair KD-tree is $O(|\mathcal{D}| \times \lceil \log t \rceil) + O(h)$.
6.3.2 Iterative Fair KD-tree
One drawback of the Fair KD-tree algorithm is its sensitivity to the initial execution of the
model, which uses the baseline grid to generate confidence scores. Even though the space is
Figure 6.4: Overview of Iterative Fair KD-tree Algorithm.
recursively partitioned following the initial steps, the scores are not re-computed until the
index construction is finalized. The iterative fair KD-tree addresses this limitation by
re-training the model and computing updated confidence scores after each split (i.e., at each
level of the tree). A refined version of the ML scores is used at every height of the tree, leading to a fairer redistricting of the map.
Similar to the Fair KD-tree algorithm, the baseline grid is initially used, and all grid
cells are considered to be in the same neighborhood (i.e., a single spatial group covering the
entire domain). The algorithm runs for th iterations, with the root node corresponding to the initial point (entire domain). As opposed to the Fair KD-tree algorithm, which follows Depth First Search (DFS) recursion, the Iterative Fair KD-tree algorithm is based on Breadth First Search (BFS) traversal. Therefore, all nodes at height $i - 1$ are completed before moving forward to height $i$. Suppose we are at the $i$-th level of the tree, and all nodes at that level have been generated. Note that the set of nodes at the same height represents a non-overlapping partitioning of the grid. The algorithm continues by updating the neighborhoods at height $i$ based on the $(i-1)$-level
Figure 6.5: Aggregation in Multi-Objective Fair KD-tree.
partitioning. Then, the updated dataset is used to train a new model, thus updating
confidence scores for each individual.
Algorithm 6 presents the Iterative Fair KD-tree algorithm. Let N denote the set of all
neighborhoods at level i of the tree. For each neighborhood Ni ∈ N , Iterative Fair KD-tree
splits the region Ni by calling the SplitNeighborhood function in Algorithm 5. The split is
done on the x-axis if i is even and on the y-axis otherwise.
The algorithm provides a more effective way of determining a fair neighborhood partitioning by re-training the model at every tree level, but incurs higher computational complexity.
Theorem 5. For a given dataset $\mathcal{D}$, the required number of neighborhoods $t$, and the model $h$, the computational complexity of Iterative Fair KD-tree is $O(|\mathcal{D}| \times \lceil \log t \rceil) + \lceil \log t \rceil \times O(h)$.
6.3.3 Multi-Objective Fair KD-tree
So far, we focused on achieving a fair representation of space given a single classification
task. In practice, applications may dictate multiple classification objectives. For example, a
set of neighborhoods that are fairly represented in a city budget allocation task may not
necessarily result in a fair representation of a map for deriving car insurance premia. Next,
we show how Fair KD-tree can be extended to incorporate multi-objective decision-making
tasks. We devise an alternative method to compute the initial scores in Line 9 of Algorithm 5, which can then be called as part of Fair KD-tree in Algorithm 4. A separate classifier is trained for each task to incorporate all classification objectives. Let $h_i$ be the $i$-th classifier trained over $\mathcal{D}$ with label set $\mathcal{Y}_i$ representing task $T_i$. The output of the classifier is denoted by $\mathcal{S}_i = \{s^i_1, ..., s^i_{|\mathcal{D}|}\}$, where in $s^i_j$ the superscript identifies the set $\mathcal{S}_i$ and the subscript indicates individual $u_j$. Once confidence scores for all models are generated, an auxiliary vector is constructed as follows:
$$v_i = \begin{bmatrix} s^i_1 - y^i_1 \\ s^i_2 - y^i_2 \\ \vdots \\ s^i_{|\mathcal{D}|} - y^i_{|\mathcal{D}|} \end{bmatrix}, \quad \forall i \in [1...m] \qquad (6.11)$$
To facilitate task prioritization, hyper-parameters $\alpha_1, ..., \alpha_m$ are introduced such that $\sum_{i=1}^{m} \alpha_i = 1$ and $0 \le \alpha_i \le 1$. Coefficient $\alpha_i$ indicates the significance of classification task $T_i$. The complete vector used for computing the partitioning is then calculated as
$$v_{tot} = \sum_{i=1}^{m} \alpha_i v_i = \begin{bmatrix} \sum_{i=1}^{m} \alpha_i (s^i_1 - y^i_1) \\ \sum_{i=1}^{m} \alpha_i (s^i_2 - y^i_2) \\ \vdots \\ \sum_{i=1}^{m} \alpha_i (s^i_{|\mathcal{D}|} - y^i_{|\mathcal{D}|}) \end{bmatrix} \qquad (6.12)$$
In the above formulation, each row corresponds to a unique individual and captures its role in all classification tasks. Let $v_{tot}[u_i]$ denote the entry corresponding to $u_i$ in $v_{tot}$. Then the classification objective function in Eq. (6.9) is replaced by:
$$z_k = \big|\, |L_k| \times \big| \textstyle\sum_{u_i \in L_k} v_{tot}[u_i] \big| \;-\; |R_k| \times \big| \textstyle\sum_{u_i \in R_k} v_{tot}[u_i] \big| \,\big| \qquad (6.13)$$
and the optimal split point is selected as
$$k^* = \arg\min_k z_k \qquad (6.14)$$
Vector aggregation is illustrated in Figure 6.5.
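A short Python sketch of this aggregation follows (array shapes are assumptions for illustration: S and Y are m x |D| matrices of scores and labels):

import numpy as np

def aggregate_vtot(S, Y, alpha):
    # S[i, j], Y[i, j]: confidence score and true label of individual
    # u_j under task T_i; alpha[i]: task weight, with sum(alpha) = 1.
    S, Y, alpha = np.asarray(S, float), np.asarray(Y, float), np.asarray(alpha, float)
    assert np.isclose(alpha.sum(), 1.0)
    return alpha @ (S - Y)                  # v_tot of Eq. (6.12), one entry per u_j

def multi_objective_z(v_tot, left, right):
    # Objective of Eq. (6.13) for a candidate split; left/right are
    # index arrays of the individuals falling in each region.
    return abs(len(left) * abs(v_tot[left].sum())
               - len(right) * abs(v_tot[right].sum()))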
Theorem 6. For a given dataset $\mathcal{D}$, the required number of neighborhoods $t$, and $m$ classification tasks modelled by $h_1, ..., h_m$, the computational complexity of Multi-Objective Fair KD-tree is $O(|\mathcal{D}| \times \lceil \log t \rceil) + \sum_{i=1}^{m} O(h_i)$.
6.4 Experimental Evaluation
6.4.1 Experimental Setup
We use two real-world datasets provided by EdGap [142] with 1153 and 966 data records
respectively, containing socio-economic features (e.g., household income and family
structure) of US high school students in Los Angeles, CA and Houston, Texas. Consistent
with [48], we use two features, the average American College Testing (ACT) score and the percentage of family employment, as indicators to generate classification labels. The
geospatial coordinates of schools are derived by linking their identification number to data
provided by the National Center for Education Statistics [91].
We evaluate the performance of our proposed approaches (Fair KD-tree, Iterative Fair
KD-tree, and multi-objective Fair KD-tree) in comparison with four benchmarks: (i)
Figure 6.6: Evidence of Model Disparity on Geospatial Neighborhoods. Panels (a), (b): EdGap (Los Angeles); panels (c), (d): EdGap (Houston).
Median KD-tree, the standard method for KD-tree partitioning; (ii) Reweighting over grid, an adaptation of the re-weighting approach used in [62] and deployed in geospatial tools such as IBM AI Fairness 360; (iii) Zipcode partitioning; and (iv) the SPAD (space as a distribution) method proposed in [153], designed to improve spatial fairness by minimizing statistical discrepancies tied to partitioning and scaling in a continuous space. The core
idea of SPAD is to introduce fairness via a referee at the start of each training epoch. This
involves adjusting the learning rate for different data sample partitions. Intuitively, a
partition that exceeds performance expectations will receive a reduced learning rate, while
Figure 6.7: Performance Evaluation with Respect to ENCE. Panels (a)-(c): Los Angeles with Logistic Regression, Decision Tree, and Naive Bayes; panels (d)-(f): Houston with the same classifiers.
those underperforming will be allocated higher rates. All experiments are implemented in
Python and executed on a 3.40GHz core-i7 Intel processor with 16GB RAM.
6.4.2 Evidence for Disparity in Geospatial ML
First, we perform a set of experiments to measure the amount of bias that occurs when
performing ML on geospatial datasets without any mitigating factors. Figure 6.6 captures
the existing disparity with respect to widely accepted metrics of calibration error and ECE
with 15 bins. We use the ratio representation of calibration in which a closer value to 1
represents higher calibration levels. Two logistic regression models are trained over
neighborhoods in Los Angeles and Houston areas. The labels are generated by setting a
threshold of 22 on the average ACT performance of students in schools. The overall
Figure 6.8: Performance Evaluation with Respect to Other Indicators. Panels (a)-(c): model accuracy, training miscalibration, and test miscalibration for Los Angeles; panels (d)-(f): the same indicators for Houston.
performance of the models in terms of (training, test) calibration is (1.005, 1.033) in Los Angeles and (0.999, 0.958) in Houston. Both training and test calibration are close to 1 overall, which in a naive interpretation would indicate all schools are treated fairly. However, this is not the case when computing the same metrics on a per-neighborhood basis. Figure 6.6 shows the miscalibration error for the top 10 most populated
Figure 6.9: Impact of Features on Decision-making. Panels (a)-(c): Median KD-tree, Fair KD-tree, and Iterative Fair KD-tree for Los Angeles; panels (d)-(f): the same algorithms for Houston.
zip codes. Despite the model's acceptable outcomes overall, many individual neighborhoods suffer from severe calibration errors, leading to unfair outcomes in the most populated regions, which are often home to under-privileged communities.
6.4.3 Mitigation Algorithms
6.4.3.1 Evaluation w.r.t. ENCE Metric.
ENCE is our primary evaluation metric that captures the amount of calibration error over
neighborhoods. Recall that Fair KD-tree and its extension Iterative Fair KD-tree can work
for any given classification ML model. We apply algorithms for Logistic Regression,
Decision Tree, and Naive Bayes classifiers to ensure diversity in models. We focus on
student ACT performance, following the prior work in [48], by setting the threshold to 22 for label generation. Figure 6.7 provides the results in Los Angeles and Houston on the EdGap dataset. The x-axis denotes the tree height used in the algorithm; a larger height indicates a finer-grained partitioning. The y-axis is log-scale.
Figure 6.7 demonstrates that both Fair KD-tree and Iterative Fair KD-tree outperform
benchmarks by a significant margin. The improvement percentage increases as the number of neighborhoods increases, which is an advantage of our techniques, since finer spatial granularity is beneficial for most analysis tasks. The intuition behind this trend lies in the overall calibration of the model: given that the trained model is well-calibrated overall, dividing the space into a smaller number of neighborhoods is expected to achieve a calibration error closer to that of the overall model. This result is consistent with Theorems 2 and 3, which state that ENCE is lower-bounded by the overall model calibration and grows with further partitioning. Iterative Fair KD-tree behaves better, as confidence scores are updated at every tree level. The improvement achieved compared to Fair KD-tree comes at the expense of higher computational complexity. On
Figure 6.10: Performance Evaluation of Multi-objective Algorithm. Panels (a)-(d): tree heights 4, 6, 8, and 10 for Los Angeles; panels (e)-(h): tree heights 4, 6, 8, and 10 for Houston.
average, Fair KD-tree is roughly 45% faster than the iterative version in terms of running time: with 10 levels, Fair KD-tree takes 102 seconds, versus 189 seconds for Iterative Fair KD-tree.
6.4.3.2 Evaluation w.r.t. other Indicators.
In Figure 6.8 we evaluate fairness with respect to three other key indicators: model
accuracy, training miscalibration, and test miscalibration. We focus on logistic regression, one of the most widely adopted classification models, to discuss the performance. The
accuracy of all algorithms follows a similar pattern and increases at higher tree heights.
This is expected, as more geospatial information can be extracted at finer granularities.
Figure 6.8b shows training miscalibration calculated for the overall model (a lower value
of calibration error indicates better performance). Our proposed algorithms have
comparable calibration errors to benchmarks, even though their fairness is far superior. Out
of all benchmarks, SPAD is observed to have comparable or slightly better performance
than our approach, but only at coarse granularities, when the space is partitioned according
to a low-height structure. However, at coarse granularity, little information is provided to the data recipient (e.g., in practice, it is of interest to make decisions at city-block granularity, whereas zipcode-scale granularity is too coarse). For finer-grained partitioning (i.e., larger height values), Fair KD-tree and Iterative Fair KD-tree outperform the benchmarks.
To better understand the underlying performance trends, Figure 6.9 provides the heatmap for the tree-based algorithms over 10 different tree heights. The contribution of each feature to decision-making is captured using a different color code. One observation is that the model shifts focus to different features based on the height.
Such sudden changes can impact the generated confidence scores and, subsequently, the
overall calibration of the model. As an example, consider the median KD-tree algorithm at
the height of 8 in Los Angeles (Figure 6.8b): there is a sudden drop in training calibration,
which can be explained by looking at the corresponding heat map in Figure 6.9a. At the
Figure 6.11: Performance Evaluation on Synthetic Data.
height of 8, the influential features for decision-making differ from those at heights 4, 6, and 10, leading to the fluctuation in the model calibration.
6.4.4 Performance of multi-objective approach.
When multi-objective criteria are used, we need a methodology to unify the geospatial
boundaries generated by each task. Our proposed multi-objective fair partitioning
predicated on Fair KD-trees addresses exactly this problem. In our experiments, we use ACT scores and the employment percentage of families as the two objectives for partitioning. These features are separated from the training dataset in the
pre-processing phase and are used to generate labels. The threshold for ACT is selected as
before (22), and the threshold for label generation based on family employment is set to 10
percent.
Figure 6.12: Multi-objective Performance Evaluation.
Figure 6.10 presents the results of the Multi-Objective Fair KD-tree (to simplify chart
notation, we use the 'Fair KD-tree' label). We choose an α value of 0.5 to give equal weight to both objectives. We emphasize that the output of the Multi-Objective Fair KD-tree is a single non-overlapping partitioning of the space representing neighborhoods.
neighborhoods are generated, we show the performance with respect to each objective
function, i.e., ACT and employment. The first row of the figure shows the performance for
varying tree heights in Los Angeles, and the second row corresponds to Houston. The
proposed algorithm improves fairness for both objective functions. The margin of
improvement increases as the height of the tree increases.
6.4.5 Synthetic Data Results
We compare the studied algorithms using synthetic datasets, with the primary focus of
assessing their performance on larger data cardinality. The results are illustrated in
Figure 6.11. Synthetic data were generated with sizes of 1k, 10k, 50k, and 100k using Python's scikit-learn library to create a classification task encompassing 5 features, and users were
distributed across the Los Angeles map. The findings validate the earlier performance
assessment using real-world data, highlighting the superior performance of both the
Iterative Fair KD-tree and Fair KD-tree algorithms.
6.4.6 Multi-Objective Performance Evaluation
Figure 6.12 evaluates the Fair KD-tree’s effectiveness in a multi-objective setting using
synthetic data. We use three target features labeled as 'Obj1', 'Obj2', and 'Obj3'. The
multi-objective fair KD-tree is used to generate a single unified map for all three, and the
resulting performance is evaluated. The outcomes corroborate the performance analysis
using real-world data, highlighting the improved fairness outcomes achieved by the Fair
KD-tree algorithm.
Chapter 7
Differentially-Private Publication of Origin-Destination
Matrices with Intermediate Stops
Conventional origin-destination (OD) matrices record the count of trips between pairs of
start and end locations, and have been extensively used in transportation, traffic planning,
etc. More recently, due to use case scenarios such as COVID-19 pandemic spread modeling,
it is increasingly important to also record intermediate points along an individual’s path,
rather than only the trip start and end points. This can be achieved by using a
multi-dimensional frequency matrix over a data space partitioning at the desired level of
granularity. However, serious privacy constraints occur when releasing OD matrix data, and
especially when adding multiple intermediate points, which makes individual trajectories
more distinguishable to an attacker. To address this threat, this chapter proposes a
technique for privacy-preserving publication of multi-dimensional OD matrices that
achieves DP, the de-facto standard in private data release. We propose a family of
approaches that factor in important data properties such as data density and homogeneity
in order to build OD matrices that provide provable protection guarantees while preserving
query accuracy. Extensive experiments on real and synthetic datasets show that the
proposed approaches clearly outperform existing state-of-the-art. This chapter is based on
the publication in [121]1
.
7.1 Introduction
Origin-destination (OD) matrices have been extensively used to characterize the demand for
transportation between pairs of start and end trip points. Using OD matrices, one can
provision appropriate capacity for a transportation infrastructure by determining the demand (or trip frequency) for each source-destination pair. However, novel applications require a greater level of detail, for which conventional OD matrices are insufficient, due to the
fact that they have a 2D structure, and intermediate points along a trajectory cannot be
captured. Consider, for instance, the study of COVID-19 spread patterns in the ongoing
pandemic, where an analyst needs to determine not only the end points of a trajectory, but
also the intermediate points that a certain individual has visited, and where possible
exposure to the virus occurred. In this case, it is necessary to record several distinct points
across a trajectory, which leads to an increase in the dimensionality of OD matrices. We
denote such enhanced data structures as OD matrices with intermediate stops.
1Shaham, S., Ghinita, G. and Shahabi, C., 2022, March. Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops. In Proceedings of the 25th International Conference on Extending Database Technology (EDBT).
While such detailed OD matrices capture additional information, they also pose a more
serious privacy threat for the individuals included in the data, since the finer level of
granularity of trajectory representation allows an adversary to pinpoint a user with better
accuracy. For instance, there may be a large number of users that travel between a
suburban neighborhood and the city center. However, when intermediate stops are also
included, e.g., a specific type of store that sells ethnic products, a gym specializing in a
certain type of yoga, and a fertility clinic, there are far fewer individuals who follow such a
path (and sometimes, perhaps just one individual), which may lead to serious privacy
breaches related to that individual’s gender, race and lifestyle details. It is thus essential to
protect the privacy of individuals whose trajectories are aggregated to build detailed OD
matrices, and differential privacy (DP) [39] is the model of choice to achieve an appropriate
level of protection.
Specifically, DP bounds the ability of an adversary such that s/he cannot determine
with significant probability whether the trajectory data of a target individual is present in
the released OD matrix or not. The OD matrix with intermediate stop points is equivalent
to a multi-dimensional frequency matrix, in which an element represents the number of
individuals who took a trip that includes that specific sequence of start, intermediate and
end points. According to DP, carefully calibrated noise is added to each count to bound the
identification probability of any single individual.
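For reference, count perturbation under the Laplace mechanism takes a single line; the snippet below is a generic sketch, not the exact implementation evaluated later in this chapter:

import numpy as np

rng = np.random.default_rng()

def laplace_sanitize(count, epsilon, sensitivity=1.0):
    # Release count + Lap(sensitivity/epsilon); with non-overlapping
    # matrix entries (each individual contributes one trip), sensitivity = 1.
    return count + rng.laplace(scale=sensitivity / epsilon)

noisy_count = laplace_sanitize(42, epsilon=0.5)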
Several approaches tackled the problem of protecting frequency matrices for location
data, but they do have serious limitations. For instance, solutions for DP-compliant
location data histograms [102, 29, 160, 158] build data-independent structures that do not
adapt well to data density, and assume a fixed dimensionality of the indexing structure,
typically 2D only. As we show in Section 7.5, they do not handle skewed datasets well, and skewed distributions are the most typical ones in the case of geospatial data. Another category of
approaches attempts to capture trajectories using prefix trees or n-grams [4, 25], but those
approaches transform cells in the data domain into a sequence of abstract string labels, and
lose the proximity semantics that are so important when querying location-based data.
We propose a novel technique for sanitization of OD matrices with intermediate stops
such that location proximity semantics are preserved, and at the same time the
characteristics of the data are carefully factored in to boost query accuracy. Specifically, we
build custom data structures that tune important characteristics like index fan-out and
split points to account for data properties. This way, we are able to achieve superior
accuracy while at the same time enforcing the strong protection guarantees of DP.
Our specific contributions are:
• We identify important properties of indexing data structures that have a high impact
on query accuracy when representing location frequency matrices;
• We design customized approaches that are guided by intrinsic data properties and
automatically tune structure parameters, such as fan-out, split points and index
height;
• We perform a detailed analysis of the obtained data structures that allows us to
allocate differential privacy budget in a manner that is conducive to preserving as
much data accuracy as possible under a given privacy constraint;
• We perform an extensive experimental evaluation on both real and synthetic datasets
which shows that our proposed techniques adapt well to data characteristics and
outperform existing state-of-the-art in terms of query accuracy.
7.2 Preliminaries
We assume the two-party system model shown in Fig. 7.1: a trusted data curator/owner
collects the frequency matrix directly from individuals and sanitizes the data. Untrusted
data analysts are interested in querying the private frequency matrix.
Let F1 × F2 × ... × Fd be a d-dimensional array representing a frequency matrix F. Each
entry fi ∈ F is a number denoting a frequency or count. For example, a two-dimensional
frequency matrix can model a map with each entry indicating the number of individuals
located in a particular area. The frequency matrix corresponds to a d-dimensional finite
space hyper-rectangle, or d-orthotope. According to the differential privacy model, a
protection mechanism adds to each matrix element noise from a carefully selected random
distribution to prevent an adversary from learning with significant probability whether a
particular individual’s data was used or not when creating the matrix.
Figure 7.1: System Model for Private Frequency Matrices.
7.2.1 Problem Statement
Starting with an input frequency matrix, we create a set of non-overlapping partitions of
the matrix and then publish a set of noisy counts for each of these partitions, according to
the Laplace mechanism. The sanitized, DP-compliant frequency matrix consists of the
boundaries of all partitions and their noisy counts. Since partitions are non-overlapping, we
keep sensitivity low (i.e., 1). We refer to each input cell in the original frequency matrix as
an entry, hence a partition is a group of matrix entries. Analysts (i.e., users of the sanitized
matrix) ask multi-dimensional range queries.
Definition 12. (Range Query) A range query on the frequency matrix F is a d-orthotope with dimensions denoted as $d_1 \times d_2 \times ... \times d_d$, where $d_i$ represents a continuous interval in dimension $i$.
Table 7.1: Summary of Notations.

Symbol              Description
$F$                 Frequency matrix
$F_i$               Dimension cardinality
$N$                 Total count of $F$
$\overline{N}$      Sanitized total count of $F$
$m$                 Partitioning constant
$s$                 Sensitivity
$\epsilon_{tot}$    Total privacy budget
$\epsilon_{prt}$    Partitioning budget
$\epsilon_{data}$   Data perturbation budget
$H(F)$              Entropy of $F$
$Lap(s/\epsilon)$   Laplace noise with sensitivity $s$ and budget $\epsilon$
For example, consider the $3 \times 2 \times 3$ frequency matrix shown in Fig. 7.1. The generation of partitions is referred to as partitioning, and the addition of noise to total sums is referred to as sanitization. The example shows a sample partitioning of the matrix generating three partitions $P_1$, $P_2$ and $P_3$ with total counts of 2, 4 and 12, respectively. In a simplified setup, sanitization consists of adding Laplace noise to the partitions' total counts and answering queries based on the resulting private frequency matrix. Moreover, a uniformity assumption [29] is made within each partition to answer queries with varying shapes and sizes. For example, if the sanitized counts are $2 + n_1$, $4 + n_2$, and $12 + n_3$, where $n_i$ denotes the Laplace noise added for sanitization, and an analyst asks a query including two cells whose borders are shown in bold red color, the answer is $\frac{12 + n_3}{6} + \frac{2 + n_1}{4}$.
Suppose that the total count of a partition entailing $q$ entries is $p$, and its noisy count is denoted by $\overline{p}$. One can see that there are two sources of error when answering a query. The first type of error is referred to as noise error, and is due to the Laplace noise added to the partition counts. The second source of error is referred to as uniformity error. The uniformity error arises because the assumption of uniformity is made within each partition, so that the noisy count of a cell in the partition is calculated as $\overline{p}/q$.
To evaluate the accuracy of query results, we use the mean relative error (MRE). For a query $q$ with true count $p$ and noisy count $\overline{p}$, MRE is calculated as
$$MRE(q) = \frac{|p - \overline{p}|}{p} \times 100 \qquad (7.1)$$
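The sketch below ties the two error sources together: it answers a range query from sanitized partitions under the uniformity assumption and reports the MRE of Equation (7.1). The partition representation and cell coordinates are illustrative, not taken from the actual datasets:

def answer_query(partitions, query_cells):
    # partitions: list of (cells, noisy_count), where cells is the set
    # of matrix entries in the partition; each covered cell contributes
    # noisy_count / |cells| (uniformity assumption).
    est = 0.0
    for cells, noisy_count in partitions:
        overlap = len(cells & query_cells)
        if overlap:
            est += noisy_count * overlap / len(cells)
    return est

def mre(true_count, noisy_answer):
    # Mean relative error of a single query, Eq. (7.1).
    return abs(true_count - noisy_answer) / true_count * 100

# Mimicking the Fig. 7.1 example: one queried cell in P1 (4 entries,
# noisy count 2 + n1) and one in P3 (6 entries, noisy count 12 + n3).
P1 = ({(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}, 2.3)
P3 = ({(2, y, z) for y in range(2) for z in range(3)}, 11.6)
print(answer_query([P1, P3], {(0, 0, 0), (2, 0, 0)}))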
Problem 3. Given a frequency matrix F, generate an ϵ-differentially private version of F
such that the expected value of relative error (MRE) is minimized.
In the design of methods for the publication of private frequency matrices, we make
extensive use of entropy to understand the amount of information contained in the
frequency matrix and the effect that partitioning has on information loss.
Definition 13. (Entropy) Given a frequency matrix F and a set of partitions $P = \{P_1, P_2, ..., P_n\}$ with total counts $p_1, p_2, ..., p_n$, the entropy of F is defined as:
$$H(F \mid P) = -\sum_{i=1}^{n} \frac{p_i}{\sum_{j=1}^{n} p_j} \log_2 \frac{p_i}{\sum_{j=1}^{n} p_j} \qquad (7.2)$$
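Definition 13 translates directly into code; a minimal sketch:

import math

def partition_entropy(partition_counts):
    # H(F | P) of Eq. (7.2), computed from the partition totals p_1..p_n.
    total = sum(partition_counts)
    return -sum((p / total) * math.log2(p / total)
                for p in partition_counts if p > 0)

print(partition_entropy([2, 4, 12]))  # entropy of the Fig. 7.1 partitioning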
Table 7.1 summarizes the notations used throughout this chapter.
7.2.2 Trajectory Modeling with OD Matrices
Conventional OD matrices allow analysts to determine how many individuals traveled
between pairs of locations, e.g., between the central business district (CBD) and a suburb.
Increasing availability of mobile data and their use in complex planning problems makes it
important to expand the expressiveness of OD matrices, by allowing one to include
intermediate stops, which essentially amounts to supporting queries on trajectories.
Furthermore, conventional OD matrices tend to use abstract representations of locations,
where the spatial information may be lost, e.g., by tabulating counts of individuals traveling
between pairs of zipcodes. Proximity of zipcodes may be lost in the process, and if one
wishes to change the representation granularity, or perform range-based queries (e.g., find
how many users traveled from a 1km circle centered at point A to a 1km circle centered at
B), such functionality is not possible.
Our proposed multi-dimensional histograms produce a hierarchical partitioning of the
data domain that preserves locality and proximity information. It allows flexible queries,
and captures intermediate points along a trajectory, as shown in Figure 7.2. Assume a
trajectory representation where one wishes to capture daily activities across several time
frames, e.g., morning->noon->evening. Trajectory T1 corresponds to a person who lives in
a suburb, works in CBD and goes to see a play in the evening. This can be captured using
a multi-dimensional histogram where the first pair of spatial coordinates corresponds to the
morning location (suburb), followed by another pair in the CBD, and finally the evening in
Figure 7.2: Capturing Trajectory Data Using OD Matrices.
the theater district. Each of the time frames can be partitioned independently, resulting in
the structure on the right half of Figure 7.2 (due to space constraints, we do not represent
the evening time frame). Each trajectory corresponds to a single entry in this
multi-dimensional matrix, according to each location at each time frame.
An important advantage of this representation is that the specific partitioning used for a
particular dimension is customized to the data corresponding to that time frame. For
instance, the same part of the space can be present in different frames, but with different
granularities. In this example, the CBD area has low granularity for the morning time
frame, since few people live there, but high granularity in the noon frame. Similarly, a
theater district will not be of interest for queries in the first or second time frame, but will likely be of high interest in the evening frame. Conventional OD matrices cannot
accommodate such scenarios.
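As a small illustration of the data structure (the cell ids and the three time frames below are assumptions made for the example), a multi-dimensional OD matrix can be built as a d-dimensional count array with one dimension per time frame:

import numpy as np

# Each trajectory is reduced to one grid cell per time frame,
# e.g., (morning_cell, noon_cell, evening_cell), using flat cell ids
# in a toy discretization with 4 cells per frame.
trajectories = [(0, 2, 3), (0, 2, 3), (1, 2, 0)]

n_cells, n_frames = 4, 3
F = np.zeros((n_cells,) * n_frames, dtype=int)
for cells in trajectories:
    F[cells] += 1            # one matrix entry per trajectory

print(F[0, 2, 3])            # 2 trips: suburb -> CBD -> theater district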
7.3 Data-Independent Approaches
In this section, we introduce two data-independent approaches for the sanitization of
frequency matrices with arbitrary dimensionality. These are extensions of existing work,
particularly the technique in [102]. In Section 7.4 we will introduce more advanced
data-dependent techniques that account for data distribution.
7.3.1 Extended Uniform Grid (EUG)
We extend the work in [102], originally proposed for two-dimensional frequency matrices.
We refer to that algorithm as Uniform Grid (UG). The main idea of UG was to sanitize the
total count of the frequency matrix and substitute it in a formula that results in a constant
value m that represents the granularity of dividing each dimension of a 2D frequency
matrix. After partitioning, the count in each of the partitions is sanitized using the Laplace
mechanism.
While the approach in [102] only works for two-dimensional data, EUG provides a
detailed analytical model that finds the optimal m value for uniform partitioning in any
number of dimensions. EUG is formally presented in Algorithm 7. Suppose that the
frequency matrix F has d dimensions represented by a F1 × F2 × ... × Fd array, and let N
denote the total count of F. The objective is to find a value of m such that, by updating the granularity of F to $m^d$ and applying the Laplace mechanism, the utility of the published private histogram is maximized. The algorithm starts by utilizing a small amount of budget, denoted as $\epsilon_0$, to obtain a noisy count of the total number of entries in the frequency matrix:
$$\overline{N} = N + Lap(s/\epsilon_0) \qquad (7.3)$$
where $\overline{N}$ denotes the sanitized count. The sanitized count is used for the estimation of m by formulating an optimization problem.
The value of m can be estimated by considering the existing error sources, i.e., noise error and non-uniformity error. The former is due to the Laplace noise used for sanitization of counts, and the latter is due to the assumption that data in each partition are uniform. Consider a query that selects a fraction $r$ of F, calculated by dividing its covered entries over the total number of entries. Hence, the query entails $r m^d$ entries of F. On the one hand, given that the noise added to each partition has a variance of $2/\epsilon^2$, the total additive noise variance sums up to $\frac{2 r m^d}{\epsilon^2}$, or equivalently a standard deviation of $\frac{\sqrt{2r}\, m^{d/2}}{\epsilon}$.
On the other hand, the query can be seen as a d-orthotope whose side length is proportional to $\sqrt[d]{r}$. Thus, each side of the orthotope spans $\sqrt[d]{r} \times m$ cells, and the number of points located inside the query is on average $\sqrt[d]{r} \times m \times \frac{\overline{N}}{m^d}$. The term $\overline{N}/m^d$ comes from the assumption of data uniformity in F. By further assuming that the non-uniformity error on average is some portion of the total density of the cells on the query border, we have the non-uniformity error as $\sqrt[d]{r} \times \frac{\overline{N}}{c_0 m^{d-1}}$ for some constant $c_0$. Therefore, the aim is
to find the optimal value of m that minimizes the sum of the two errors, i.e.,
$$\min_m \; \frac{\sqrt{2r}\, m^{d/2}}{\epsilon} + \sqrt[d]{r} \times \frac{\overline{N}}{c_0 m^{d-1}} \qquad (7.4)$$
Solving based on the stationary conditions of the above convex problem,
$$\frac{d}{2} \cdot \frac{\sqrt{2r}\, m^{d/2-1}}{\epsilon} - (d-1)\, \sqrt[d]{r} \times \frac{\overline{N}}{c_0 m^{d}} = 0 \qquad (7.5)$$
results in the optimal m given by:
$$m = \left( \frac{2(d-1)}{d} \times r^{(1/d - 1/2)} \times \frac{\overline{N} \epsilon}{\sqrt{2}\, c_0} \right)^{2/(3d-2)} \qquad (7.6)$$
The base case of the problem occurs when the frequency matrix has two dimensions, and results in the same equation proposed by [102]:
$$m = \sqrt{\frac{\overline{N} \epsilon}{\sqrt{2}\, c_0}} \qquad (7.7)$$
For higher dimensions, if the query size is known in advance, Equation (7.6) can be used with the given $r$ to estimate the value of m; otherwise, by assuming that all query sizes are equally likely, integration over $r$ leads to Equation (7.11). For the derivation, let us define an auxiliary variable $\alpha$ as
$$\alpha = \left( \frac{2(d-1)}{d} \times \frac{\overline{N} \epsilon}{\sqrt{2}\, c_0} \right)^{2/(3d-2)} \qquad (7.8)$$
Integration over $r$ leads to
$$\int_0^1 \alpha \, r^{\frac{2-d}{d(3d-2)}} \, dr \;=\; \frac{\alpha}{\frac{2-d}{d(3d-2)} + 1} \; r^{\frac{2-d}{d(3d-2)} + 1} \Big|_0^1 \qquad (7.9)$$
$$= \alpha \times \frac{d(3d-2)}{3d^2 - 3d + 2}, \qquad (7.10)$$
and ultimately results in:
$$m = \left( \frac{2(d-1)}{d} \times \frac{\overline{N} \epsilon}{\sqrt{2}\, c_0} \right)^{2/(3d-2)} \times \frac{d(3d-2)}{3d^2 - 3d + 2}. \qquad (7.11)$$
Once the value of m is calculated, each dimension of matrix F is divided into m equal intervals, generating $m^d$ partitions. The entries in each partition are set to the partition's sanitized total count divided by the number of entries it contains. The sanitized total count of a partition is generated by adding up its entries and applying the Laplace mechanism with a privacy budget of $\epsilon_{tot} - \epsilon_0$.
Algorithm 7 Extended Uniform Grid (EUG)
Input: F, ϵtot, ϵ0, s;
1: N̄ ← SUM(F) + Lap(s/ϵ0)
2: ϵtot ← ϵtot − ϵ0
3: d ← Number of dims in F
4: m ← (2(d − 1)/d × r^(1/d − 1/2) × N̄ϵ/(√2 c0))^(2/(3d−2))
5: // UPDATE GRANULARITY
6: Divide each dimension by m
7: for each new partition i do
8:   N′ ← SUM(i)
9:   N̄′ ← N′ + Lap(s/ϵtot)
10:  for each entry j in i do
11:    j ← N̄′/|i|
12:  end for
13: end for
14: return F
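A compact Python sketch of EUG for a d-dimensional numpy frequency matrix follows; it uses the average-case m of Equation (7.11) and the empirical c0 = 10/sqrt(2) mentioned in Section 7.3.2, and is a sketch rather than the evaluated implementation:

import numpy as np

rng = np.random.default_rng()

def eug(F, eps_tot, eps0, s=1.0, c0=10 / np.sqrt(2)):
    d = F.ndim
    n_bar = F.sum() + rng.laplace(scale=s / eps0)          # sanitized total
    eps = eps_tot - eps0
    m = ((2 * (d - 1) / d * n_bar * eps / (np.sqrt(2) * c0)) ** (2 / (3 * d - 2))
         * d * (3 * d - 2) / (3 * d ** 2 - 3 * d + 2))     # Eq. (7.11)
    m = max(1, int(round(m)))
    out = np.zeros_like(F, dtype=float)
    edges = [np.linspace(0, n, m + 1).astype(int) for n in F.shape]
    for idx in np.ndindex(*(m,) * d):                      # the m^d partitions
        sl = tuple(slice(edges[a][i], edges[a][i + 1]) for a, i in enumerate(idx))
        block = F[sl]
        if block.size:
            noisy = block.sum() + rng.laplace(scale=s / eps)
            out[sl] = noisy / block.size                   # uniformity within partition
    return out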
7.3.2 Entropy-based Partitioning (EBP)
A critical point in the EUG algorithm is how to determine the value of m. We propose
Entropy-based Partitioning (EBP), a method for estimating a good value of m based on the
concept of entropy. In addition to providing better accuracy, EBP also addresses the issue
with EUG's arbitrary choice of the constant $c_0$, which is empirically set to $10/\sqrt{2}$. EBP proposes a more informed parameter selection process that does not require arbitrary value settings.
Consider a d-dimensional frequency matrix F with dimensions $F_1 \times F_2 \times ... \times F_d$, and let $N$ represent the total count of F. Moreover, denote the privacy budget allocated for the calculation of m by $\epsilon$. As in the case of Algorithm 7, the objective is to find a value of m such that, by updating the granularity of F to $m^d$ and applying the Laplace mechanism, the utility of the published private histogram is maximized. We look at the problem from an information-theoretic perspective. Once the granularity of F is updated, the variance of the total Laplace noise used to sanitize partitions adds up to $\frac{2 m^d}{\epsilon^2}$, leading to a total standard deviation of $\frac{\sqrt{2}\, m^{d/2}}{\epsilon}$. The entropy of the noise imposed on the frequency matrix is therefore,
$$H\!\left(\frac{\sqrt{2}\, m^{d/2}}{\epsilon}\right) = -\log_2 \frac{\epsilon}{\sqrt{2}\, m^{d/2}}. \qquad (7.12)$$
On the other hand, consider the amount of information loss that occurs due to the change in granularity. To calculate the information loss, the amount of information before and after changing the granularity of F is required. The information contained in F before the change of granularity can be calculated as $H(F)$, the entropy of F. After partitioning is conducted, the entropy is reduced to $H(F|m)$, the entropy calculated based on the updated frequency matrix with granularity $m^d$. Thus, the amount of information loss incurred due to the change in granularity is:
$$\text{Information Loss} = H(F) - H(F|m). \qquad (7.13)$$
An optimization problem can be formulated to find the optimal value of m that minimizes the average query error:
$$\min_m \; H\!\left(\frac{\sqrt{2}\, m^{d/2}}{\epsilon}\right) + H(F) - H(F|m). \qquad (7.14)$$
By increasing the value of m, the information loss becomes smaller, but the induced noise grows larger. The optimal value of m is reached when the noise equals the information loss. Unfortunately, entropy cannot be directly calculated due to privacy concerns; however, an approximation can be employed as follows. We assume that the number of entries is on the order of the number of data points, and that data points are uniformly distributed over the $m^d$ partitions. Entropy before/after changing granularity can be approximated as
$$H(F) \approx -\log_2 (1/N), \quad H(F|m) \approx -\log_2 (1/m^d) \qquad (7.15)$$
To preserve the privacy of users, the value of $N$ is sanitized beforehand based on the Laplace mechanism. The value of m minimizing the optimization problem is derived as
$$-\log_2 \frac{\epsilon}{\sqrt{2}\, m^{d/2}} = -\log_2 (1/\overline{N}) + \log_2 (1/m^d) \qquad (7.16)$$
$$\rightarrow\; \log_2 \frac{\epsilon}{\sqrt{2}\, m^{d/2}} = \log_2 (m^d/\overline{N}) \;\rightarrow\; m = \sqrt[3d/2]{\frac{\overline{N} \epsilon}{\sqrt{2}}} \qquad (7.17)$$
The derived formula in Equation (7.17) is an alternative method to calculate the value
of m in the EUG algorithm. Therefore, the pseudocode in Algorithm 7 applies to EBP by
replacing the formula in line 4 with Equation (7.17).
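The closed form in Equation (7.17) amounts to a one-line replacement for line 4 of Algorithm 7 (sketch):

import numpy as np

def ebp_m(n_bar, eps, d):
    # Eq. (7.17): the (3d/2)-th root of N_bar * eps / sqrt(2).
    return (n_bar * eps / np.sqrt(2)) ** (2.0 / (3.0 * d))

print(ebp_m(n_bar=10_000, eps=1.0, d=3))   # per-dimension granularity for a 3-D matrix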
7.4 Data-Dependent Approaches
7.4.1 Overview
Data-independent algorithms overlook critical information about the distribution of data
points, as they always assume uniform distribution. This is particularly problematic for
higher dimensional frequency matrices, due to their tendency to be sparse.
To improve accuracy when publishing higher dimensional frequency matrices, we
propose a tree-based approach called Density-Aware Framework (DAF) that takes into
account density variation across different regions of the space. In addition, DAF introduces
a key feature that enables custom stop conditions for partitioning. Intuitively, denser parts
of the space should be split in more granular fashion, while for sparse areas the partitioning
can stop earlier, since most likely large regions of the space are empty. The decision of
when to stop partitioning the frequency matrix is made privately, and avoids
over-partitioning which can lead to large errors in higher dimensional frequency matrices.
DAF is a hierarchical partitioning approach that resembles a tree index. Each node
covers a portion of the frequency matrix, with the root node covering all entries.
Descendants of a node are generated by a non-overlapping split of the parent node’s entries.
The split is conducted based on the depth of the node, such that nodes at depth i are
created by dividing dimension i of their parent node’s partition. The maximum index
height is d + 1. The fanout and the split point are customized at each node based on
sanitized local information about the data. We propose two DAF alternatives based on
different split objective functions: (i) DAF-Entropy (Section 7.4.2) which uses entropy
information to estimate good split parameters, and (ii) DAF-Homogeneity (Section 7.4.3)
which focuses on creating partitions with high intra-region homogeneity. Section 7.4.4
introduces privacy budget allocation considerations that are relevant to both approaches.
7.4.2 DAF-Entropy
DAF-Entropy has the recursive structure presented in Algorithm 8. It receives as inputs
the current node to split denoted by x, privacy budget ϵtot, variable acc tracking the budget
spent so far (initially set to zero), and a constant m0 set in the first round of the recursion
which is used for budget allocation purposes at all levels of the tree (more details are
provided in Section 7.4.4). Each tree node x is an object with four attributes: (i) x.F, the node's associated entries in the frequency matrix; (ii) x.count, the actual sum of entries in x.F; (iii) x.ncount, the sanitized (or noisy) count; and (iv) x.depth, the node's depth in the tree. The initial run of the function is performed for the root node, representing the whole
frequency matrix.
DAF-Entropy sanitizes the total count of the root node and utilizes Equation (7.17) to
partition the first dimension of the frequency matrix. New nodes are generated for each new
partition assigned as one of the node’s children. The algorithm recursively visits children
and repeats the same process with the key difference that the split is done based on the
Algorithm 8 DAF-Entropy
1: Global Constants: ϵtot, m0
2: function DAF-Entropy(x, acc)
3:   d ← Number of dimensions
4:   d′ ← x.depth
5:   if d′ = d then
6:     x.ncount ← x.count + Lap(1/(ϵtot − acc))
7:     return TRUE
8:   end if
9:   if d′ = 0 then
10:    x.ncount ← x.count + Lap(1/(ϵtot/100))
11:    acc ← acc + ϵtot/100
12:    m0, m ← ((x.ncount) × (ϵtot − acc)/√2)^(2/(3(d − d′)))
13:  else
14:    mem ← ϵtot × m0^(d′/3) × (1 − m0^(1/3)) / (m0^(1/3) × (1 − m0^(d/3)))
15:    x.ncount ← x.count + Lap(1/mem)
16:    acc ← acc + mem
17:    m ← ((x.ncount) × (ϵtot − acc)/√2)^(2/(3(d − d′)))
18:  end if
19:  if Stop Conditions = TRUE then
20:    mem ← ϵtot − acc
21:    x.ncount ← x.count + Lap(1/mem)
22:    return TRUE
23:  end if
24:  M ← Split (d′ + 1)-th dimension into m intervals
25:  for i = 1 to m do
26:    create a new node x′
27:    x′.F ← x.F with the (d′ + 1)-th dimension restricted to M[i]
28:    x′.depth ← d′ + 1
29:    x′.count ← SUM(x′.F)
30:    DAF-Entropy(x′, acc)
31:  end for
32: end function
second dimension. More generally, upon reaching a node at depth i, the split is conducted
in the (i + 1)-th dimension.
Once a new node is visited, its count is sanitized, and stop conditions are tested on the
sanitized count. If satisfied, the tree is pruned, and the node turns into a leaf. A special
technique is used in such a scenario to enhance accuracy. The algorithm uses the remaining
amount of budget that was supposed to be used while visiting children to update the
sanitized count. This technique improves accuracy as budget allocation is such that lower
levels of the tree are allocated more budget. Thus, it is worth updating the sanitized count
based on the remaining amount of budget. Note that stop conditions can be selected based
on application-specific details; however, the most prominent stop condition that can help
avoid over-partitioning is to stop when the sanitized count is below a certain threshold. The
algorithm continues until reaching depth d, indicating that partitioning on all d dimensions
has been implemented successfully or a stop condition is reached. Finally, the sanitized
counts of the leaves are published, representing the private frequency matrix.
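To make the recursion concrete, below is a minimal Python sketch of this traversal. It is illustrative only: it assumes a NumPy frequency matrix, a hypothetical fanout() function standing in for Equation (7.17), a per-level budget function eps_level() implementing Equation (7.30), and a simple noisy-count threshold as the stop condition.

    import numpy as np

    def lap(scale, rng=np.random.default_rng()):
        # One sample of zero-mean Laplace noise with the given scale.
        return rng.laplace(0.0, scale)

    def daf_entropy(F, depth, eps_tot, acc, eps_level, fanout, threshold):
        # F: entries covered by the current node; depth: x.depth.
        d = F.ndim
        if depth == d:                      # all dimensions split: leaf node
            return [F.sum() + lap(1.0 / (eps_tot - acc))]
        eps = eps_tot / 100.0 if depth == 0 else eps_level(depth)
        ncount = F.sum() + lap(1.0 / eps)   # sanitize the node count
        acc += eps
        if ncount < threshold:              # stop condition: prune the subtree
            # refresh the leaf with the budget left for its unvisited children
            return [F.sum() + lap(1.0 / (eps_tot - acc))]
        m = max(1, int(fanout(ncount, eps_tot - acc)))
        leaves = []
        # split dimension depth+1 (axis `depth`) into m intervals
        for part in np.array_split(F, m, axis=depth):
            leaves += daf_entropy(part, depth + 1, eps_tot, acc,
                                  eps_level, fanout, threshold)
        return leaves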
7.4.3 DAF-Homogeneity
The partitioning process plays a critical role in the private publication of frequency matrices.
Hence, several attempts have been made in prior work [152, 55] to find an efficient splitting
mechanism, including partitioning independent of data, based on medians or using the
frequency matrix’s total count to estimate a viable partitioning granularity. Our earlier
work in [119] shows that partitioning based on homogeneity can significantly improve the
Algorithm 9 DAF-Homogeneity
1: Global Constants: p, q, ϵtot, m0
2: function DAF-Homogeneity(x, acc)
3:   d ← Number of dimensions, d′ ← x.depth
4:   if d′ = d then
5:     x.ncount ← x.count + Lap(1/(ϵtot − acc))
6:     return TRUE
7:   end if
8:   if d′ = 0 then
9:     x.ncount ← x.count + Lap(1/(ϵtot/100))
10:    acc ← acc + ϵtot/100
11:    m0, m ← (x.ncount × (ϵtot − acc)/√2)^{2/(3(d−d′))}  ▷ Equation (7.17)
12:    K1, ..., Kp ← Use m to generate candidate sets
13:    Compute O(K1), O(K2), ..., O(Kp)
14:    O(Ki) ← O(Ki) + Lap(2/(p × ϵprt)), ∀i = 1...p
15:    K ← argmin_{i=1...p} O(Ki)
16:  else
17:    ϵ ← ϵtot × m0^{d′/3} × (1 − m0^{1/3}) / (m0^{1/3}(1 − m0^{d/3}))
18:    acc ← acc + ϵ, ϵprt ← qϵ, ϵdata ← (1 − q)ϵ
19:    x.ncount ← x.count + Lap(1/ϵdata)
20:    m ← (x.ncount × (ϵtot − acc)/√2)^{2/(3(d−d′))}
21:    Execute lines 12 to 15
22:  end if
23:  if Stop Conditions = TRUE then
24:    x.ncount ← x.count + Lap(1/(ϵtot − acc))
25:    return TRUE
26:  else
27:    M ← Split (d′ + 1)-th dimension based on K
28:    for i = 1 to m do
29:      create a new node x′
30:      x′.F ← x.F with (d′ + 1)-th dimension restricted to M[i]
31:      x′.depth ← d′ + 1
32:      x′.count ← SUM(x′.F)
33:      DAF-Homogeneity(x′, acc)
34:    end for
35:  end if
36: end function
utility of private frequency matrices in 2D. The principal idea is to have mechanisms that
can cluster the entries such that data density is homogeneous within each cluster. Recall
that partitioning needs to follow the DP constraint as with any other part of the algorithm.
Here, we extend the approach in [119] to higher dimensional frequency matrices. The
approach is built on top of Algorithm 8, with a key difference that once fanout is calculated
for a node, an alternative method is used to partition the space based on homogeneity.
Suppose that while executing Algorithm 8, a node with depth i is visited.
DAF-Homogeneity starts by dividing the calculated amount of budget into two parts:
sanitization budget (ϵdata), and partitioning budget (ϵprt).
ϵprt = q · ϵi,  ϵdata = (1 − q) · ϵi  (7.18)
Constant q denotes the ratio of the budget assigned for partitioning. This value is
experimentally set to 0.3. Next, the node’s count is sanitized based on the Laplace
mechanism with the privacy budget ϵdata, and substituted in Equation (7.17) to calculate
the fanout m.
Suppose that m = 3 and recall that for nodes at depth i, the split is conducted on
dimension i + 1. Let us denote the interval corresponding to the (i + 1)-th dimension by
[kstart, kend]. In the case of DAF-Homogeneity, given that the fanout is calculated to be 3,
the generated intervals for the (i + 1)-th dimension of the children would be [kstart, k1), [k1, k2), and
[k2, kend], where

k1 = ⌊kstart + (kend − kstart)/3⌋,  k2 = ⌊kstart + 2 × (kend − kstart)/3⌋  (7.19)
Instead of simply selecting k1 and k2 as splitting points, DAF-Homogeneity follows an
alternative method: it generates p candidate partitioning sets K1, K2, ..., Kp, where p
is an input to the algorithm. Each set Kj has a cardinality equal to the desired fanout, and
is generated by drawing uniformly random split positions from every partition. For
example, consider the first candidate set K1 = {k′1, k′2, k′3}, where k′1, k′2, and k′3 are
uniformly random coordinates drawn from the intervals [kstart, k1), [k1, k2), and [k2, kend],
respectively. Furthermore, let us denote by F^j the frequency matrix generated by setting the
(i + 1)-th dimension to the j-th interval. Next, the algorithm computes the homogeneity
objective function for the candidate sets, resulting in O(K1), O(K2), ..., O(Kp), where

O(K) = Σ_{i=1}^{|K|+1} Σ_{fj∈F^i} |fj − µ_{F^i}|,  (7.20)
In the above equation, µ_{F^i} denotes the average of the entries in F^i:

µ_{F^i} = ( Σ_{fj∈F^i} fj ) / |F^i|  (7.21)
Then, the output values are sanitized based on the Laplace mechanism with the privacy
budget reserved for partitioning:

O(Ki) ← O(Ki) + Lap(s/ϵprt), ∀i = 1...p  (7.22)

The optimal candidate set is chosen as the one that results in the minimum sanitized
output:

K = argmin_{i=1...p} O(Ki)  (7.23)
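The following sketch illustrates this private candidate selection in NumPy; it is not the exact implementation. It assumes the node is split along the first axis of its sub-matrix, homogeneity() computes the objective of Equation (7.20), and the noise scale follows line 14 of Algorithm 9 (sensitivity 2, per Lemma 1, divided over the p candidates).

    import numpy as np

    def homogeneity(F, K):
        # O(K): sum of absolute deviations from the mean within each of the
        # |K|+1 partitions induced by the split points in K (Eq. 7.20).
        bounds = [0] + sorted(K) + [F.shape[0]]
        total = 0.0
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            part = F[lo:hi]
            if part.size:
                total += np.abs(part - part.mean()).sum()
        return total

    def private_best_split(F, m, p, eps_prt, rng=None):
        # Draw p candidate sets (one uniform split point per equal-width
        # interval), sanitize each objective, keep the noisy minimizer.
        rng = rng or np.random.default_rng()
        n = F.shape[0]
        edges = [n * j // m for j in range(m + 1)]
        candidates = []
        for _ in range(p):
            K = sorted(int(rng.integers(edges[j],
                                        max(edges[j] + 1, edges[j + 1])))
                       for j in range(m))
            candidates.append(K)
        noisy = [homogeneity(F, K) + rng.laplace(0.0, 2.0 / (p * eps_prt))
                 for K in candidates]
        return candidates[int(np.argmin(noisy))]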
Lemma 1. The sensitivity of the homogeneity objective function is 2.

Proof. In the calculation of the objective function O(K) for a given split index set K, a data
entry's existence or absence only affects one cell and the corresponding cluster. Let us
denote the objective function after the addition or removal of one data record by O(K)′:

O(K)′ = Σ_{i=1}^{|K|+1} Σ_{f′j∈F^i} |f′j − µ′_{F^i}|,  (7.24)

Without loss of generality, assume that the additional record is located in the first
cluster, which results in µ′_{F^1} = µ_{F^1} + 1/|F^1| and µ′_{F^i} = µ_{F^i} for all i = 2, ..., k + 1. Similarly,
Figure 7.3: Intuition Behind DAF Sanitization Approaches: (a) non-adaptive approaches, (b) DAF-Entropy, (c) DAF-Homogeneity.
the counts are equal (f′i = fi) for all entries except a single entry, denoted by x, for which
f′x = fx + 1. Applying the triangle inequality results in

| |fi − µ1 − 1/|F^1|| − |fi − µ1| | ≤ 1/|F^1|,  ∀i = 1...|F^1| − {x}  (7.25)

The sensitivity of the objective function can have the maximum value of two, as proven by
the following inequality:

| |fx + 1 − µ1 − 1/|F^1|| − |fx − µ1| | ≤ (|F^1| − 1)/|F^1|  (7.26)
The DAF-Homogeneity pseudocode is shown in Algorithm 9.
To better understand the reason why the proposed DAF approaches outperform
competitor techniques, Fig. 7.3 provides a heatmap representation of the city of Los Angeles
with 500,000 points sampled from the Veraset dataset (experimental setup details are provided in
Section 6.1). The partitions generated along the first and second dimensions are shown by
green and yellow lines, respectively. For non-adaptive approaches, only the sanitized total
population count is used for partitioning, and therefore both dimensions are divided
equally without considering the user distribution (Fig. 7.3a). Conversely, the DAF-Entropy
approach adaptively adjusts the number of partitions as the dimension changes
(Fig. 7.3b). The DAF-Homogeneity technique goes one step further and adjusts the
number of partitions generated in each dimension by selecting split points such that the
resulting areas exhibit homogeneous intra-bin density, hence reducing the negative
effects of the uniformity assumption and increasing query accuracy.
7.4.4 Budget Allocation
The derivation of the optimal amount of privacy budget allocated for different levels of the
hierarchy is a challenging task as nodes have varying fanouts. We formulate an
optimization problem to achieve a good-quality budget allocation. Denote the fanout of the
root node by m0. We assume that the progression of fanouts is geometric: at depth i, there
exist approximately m0^i nodes. Furthermore, we denote the budget allocated to depth i of
the tree by ϵi. The goal is to minimize the variance of the noise added at each level:

min_{ϵ1,...,ϵd} Σ_{i=1}^{d} m0^i / ϵi²,  subject to  Σ_{i=1}^{d} ϵi = ϵ′tot,  ϵi > 0 ∀i = 1...d  (7.27)
where ϵ′tot = ϵtot − ϵ0. We have intentionally separated the root node's budget, as it will be
used to calculate m0. The optimization problem can be solved by writing the Lagrangian and
KKT conditions:

L(ϵ1, ..., ϵd, λ) = Σ_{i=1}^{d} m0^i/ϵi² + λ(Σ_{i=1}^{d} ϵi − ϵ′tot)  (7.28)

⇒ ∂L/∂ϵi = −2·m0^i/ϵi³ + λ = 0  ⇒  ϵi = (2·m0^i)^{1/3} / λ^{1/3},  (7.29)
which leads to

ϵi = ϵ′tot × m0^{i/3} / Σ_{i=1}^{d} m0^{i/3} = ϵ′tot × m0^{i/3} × (1 − m0^{1/3}) / (m0^{1/3} (1 − m0^{d/3})).  (7.30)
A question arises on how to calculate the value m0 upon which the above optimization
problem is formulated. Note that the formulation only considers depths 1 to d, and the root
node is excluded from the equation. The value of m0 is calculated in the first run of the
recursive Algorithm 8, and we set the root's budget to:

ϵ0 = ϵtot / 100  (7.31)
Therefore, a comparably small amount of budget is allocated to the root node to derive m0.
Based on the above formulation, one can see that lower levels of the tree benefit from
significantly higher levels of budget. This helps to improve the utility of the published
private histogram, as the sanitized leaf set of the tree represents the counts published by
our approach.
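For concreteness, the closed form of Equation (7.30), together with the root budget of Equation (7.31), can be computed as in the following Python sketch:

    def level_budgets(eps_tot, m0, d):
        # Root budget (Eq. 7.31), then per-depth shares proportional to
        # m0**(i/3) for i = 1..d (Eq. 7.30).
        eps0 = eps_tot / 100.0
        eps_rest = eps_tot - eps0            # eps'_tot
        shares = [m0 ** (i / 3.0) for i in range(1, d + 1)]
        total = sum(shares)
        return eps0, [eps_rest * s / total for s in shares]

    # Example: eps_tot = 1.0, root fanout m0 = 4, d = 3 dimensions.
    eps0, eps = level_budgets(1.0, 4, 3)
    # eps[0] < eps[1] < eps[2]: deeper levels receive larger budgets, and
    # eps0 + sum(eps) == eps_tot by sequential composition.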
Table 7.2: Summary of Compared Approaches

  Strategy                                    Symbol
  Baseline algorithms                         IDENTITY [39], UNIFORM [39]
  Non-adaptive sanitization approaches        EUG, EBP, MKM [71]
  Density-aware, without partitioning budget  DAF-Entropy
  Density-aware, with partitioning budget     DAF-Homogeneity
7.5 Experimental Evaluation
7.5.1 Experimental Setup
Synthetic Datasets. We generate synthetic frequency matrices according to both
Gaussian and Zipf distribution. To generate a d-dimensional Gaussian frequency matrix F
with dimensions F1 × F2 × ... × Fd, a uniformly random integer is sampled in each
dimension: ci ∼ Uniform(1, Fi), ∀i = 1...d. The generated point (c1, c2, ..., cd) is selected as
the cluster center and 1 million datapoints are generated with respect to the cluster center
according to a normal distribution. Specifically, each data point (x1, x2, ..., xd) ∈ Z^d is
sampled from a multivariate Gaussian distribution (X1, X2, ..., Xd), where Xi ∼ N(ci, var).
Changing the variance var allows us to adjust the degree of data skewness (lower values of
var correspond to more skewed data). Zipfian data are generated by sampling each
datapoint from a multivariate Zipf distribution (Y1, Y2, ..., Yd), where Pr(Yi = x) = x^{−a}/ζ(a).
Here ζ(·) denotes the Riemann zeta function, and parameter a controls the skew in the
frequency matrix. As opposed to the variance in the Gaussian distribution, a higher value
of a results in a more skewed Zipf distribution.
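The following sketch illustrates this generation process (illustrative NumPy code; it assumes a single Gaussian cluster per matrix, and uses NumPy's univariate zipf sampler applied independently per dimension as a stand-in for the multivariate Zipf draw):

    import numpy as np

    def gaussian_matrix(dims, n_points=1_000_000, var=10.0, rng=None):
        # Frequency matrix with one Gaussian cluster at a random center.
        rng = rng or np.random.default_rng()
        center = [int(rng.integers(0, s)) for s in dims]
        pts = rng.normal(center, np.sqrt(var), size=(n_points, len(dims)))
        pts = np.clip(np.rint(pts), 0, np.array(dims) - 1).astype(int)
        F = np.zeros(dims, dtype=np.int64)
        np.add.at(F, tuple(pts.T), 1)        # histogram the points
        return F

    def zipf_matrix(dims, n_points=1_000_000, a=2.0, rng=None):
        # Frequency matrix with each coordinate drawn from a Zipf law.
        rng = rng or np.random.default_rng()
        pts = np.column_stack([np.minimum(rng.zipf(a, n_points) - 1, s - 1)
                               for s in dims])
        F = np.zeros(dims, dtype=np.int64)
        np.add.at(F, tuple(pts.T), 1)
        return F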
Real-world datasets. We use a subset of the Veraset² dataset [144], including location
measurements of cell phones in three US cities: New York, Denver and Detroit. For each
city, we consider a large geographical region covering a 70 × 70 km2 area centered at the
city’s central latitude and longitude. These are chosen to represent cities with high,
moderate and low densities, respectively. Cities are modeled by a 1000 × 1000 frequency
matrix where each entry represents the number of data points in the corresponding region
of the city. The selected data generates a frequency matrix of 1 million data points during
the time period March 1-7, 2020.
Based on the real location data, we construct origin-destination matrices: in each city,
300, 000 trajectories are sampled, and their origin, destination and intermediate points are
included in the OD matrix. The data are stored as a multi-dimensional frequency matrix
generated as follows: the map of each city is discretized to a 1000 × 1000 grid, and for every
trajectory with the origin coordinates of (xo, yo) and destination coordinates of (xd, yd), the
element F[xo, yo, xd, yd] in the frequency matrix is incremented by one. A similar process is
conducted for intermediate points, with the distinction that the matrix dimension count
increases.
² Veraset is a data-as-a-service company that provides anonymized population movement data collected
through signals of cell phones across the USA.
Figure 7.4: Synthetic Dataset Results, Gaussian Distribution, Random Shape and Size Queries. Panels (a)-(i): 2D, 4D, and 6D matrices (rows) under ϵtot = 0.1, 0.3, and 0.5 (columns).
The evaluation metric used to compare the results is Mean Relative Error (MRE),
formally defined in Section 7.2, Eq. (7.1). We evaluate the accuracy of the considered
approaches on the basis of:
• Varying Data Skewness/Distribution. The generation of synthetic datasets is
conducted for Gaussian and Zipfian random variables with distinct variances; for
real-world datasets we select cities with a wide range of skewness properties.
Figure 7.5: Synthetic Dataset Results, Zipf Distribution, Random Shape and Size Queries, ϵtot = 0.1. Panels: (a) 2D, (b) 4D, (c) 6D.
• Varying Query Shape/Size. Each data point in our experiments is the average MRE
result of 1000 queries generated based on random shapes and sizes. Additionally, the
impact of small, medium and large queries is evaluated.
• Varying Privacy Budget. The experiments consider three privacy budget values of
0.1, 0.3, 0.5 modeling high, moderate, and low privacy constraints.
• Varying dimensionality. We run experiments on frequency matrices with
dimensionality from two to six.
7.5.2 Results on Synthetic Datasets
Figure 7.4 presents evaluation results on synthetic datasets. For each distinct
dimensionality (i.e., row), we consider low, medium and high privacy budget settings. The
width of the frequency matrix in each dimension is set to the d-th root of N (i.e., N^{1/d}).
For the 2D case, results are shown in Figures 7.4a-7.4c. EBP and DAF-Entropy provide
superior accuracy compared to other techniques, followed by DAF-Homogeneity and EUG.
Figure 7.6: Population Histograms in 2D for Real Datasets. Panels (a)-(l): New York, Denver, and Detroit (rows) under random queries and 1%, 5%, and 10% query coverage (columns).
The MKM and IDENTITY algorithms exhibit similar performance, and we observed that
MKM reaches the maximum granularity of the frequency matrix. This is justified by
the fact that the MKM approach does not follow the epsilon-scale exchangeability principle
identified in [55]. In general, there exist two scenarios in which data-independent algorithms
perform better: (i) the data points are distributed almost uniformly (i.e., high variance),
and (ii) the data points are densely populated in the cluster center in a handful of matrix
entries (i.e., low variance). The superior performance of the DAF framework becomes more
evident in higher dimensions. In almost all experiments conducted, the DAF framework
outperformed the data-independent sanitization approaches. Among the two objective
functions that we developed for DAF, DAF-Entropy generally outperforms
DAF-Homogeneity.
We also evaluate the studied approaches for Zipf synthetic distribution of data.
Figure 7.5 shows similar relative trends, with the proposed approaches outperforming
existing work by an order of magnitude. The error increases as the skew parameter a
increases.
7.5.3 Results on Real-World Datasets
Figure 7.6 shows the accuracy of all studied methods on 2D data, for various query
workloads: a mix of random queries, as well as fixed coverage queries with range from 1%
to 10% of dataspace side. As in the case of synthetic data, the IDENTITY and MKM
benchmarks underperform by an order of magnitude. For all methods, the error decreases
when the query range increases, which is expected, since coarser queries can be accurately
answered using most methods. However, the more challenging case is that of small query
ranges, which provide more detailed information to the analyst.
Figure 7.7: Population Histograms in 2D on Real Datasets, no Baselines. Panels (a)-(l): New York, Denver, and Detroit (rows) under random queries and 1%, 5%, and 10% query coverage (columns).
Due to their poor performance, we exclude IDENTITY and MKM from the rest of the
results, and focus on studying the relative performance of the proposed approaches,
illustrated in Figure 7.7 on a linear scale. The EUG algorithm results in poorer accuracy
overall. For Detroit and New York, EBP performs better than the competing techniques.
The EBP and DAF results are comparable for the Denver datasets, with DAF-Homogeneity
providing the highest accuracy. The EBP algorithm performs better in cities where the
entropy of the population histogram is higher. This aligns with our expectations, as greater
entropy can be an indicator of higher skewness, where EUG performs worse. When
increasing the privacy budget, the error of all algorithms decreases consistently, since the
noise required to satisfy the privacy bound becomes lower. Fig. 7.8 presents the results for
higher-dimensionality matrices. Similar to the results observed for synthetic datasets,
DAF-Entropy has superior accuracy on average compared to the other techniques. The
relative accuracy gain achieved by DAF is observed to increase as the number of dimensions
increases.
Table 7.3 shows the execution time for all techniques. The DAF methods have faster
execution time, because they adapt to data and do not perform unnecessary splits. In all
cases, the proposed techniques complete execution in less than five minutes.
Discussion. Data-independent methods perform better when data are highly uniform
or highly concentrated around the cluster center. However, most location datasets do not
fall in either of these cases, hence there is a need for carefully-designed density-aware
approaches, like the ones we proposed. In lower dimensions, the EBP algorithm
outperforms competitor approaches on both real-world and synthetic datasets. In higher
dimensions, the density-aware algorithms outperform data-independent algorithms. The
improvement margin increases as the number of dimensions grows. On average,
Figure 7.8: Origin-Destination Matrices in 4D, Real Datasets. Panels (a)-(l): New York, Denver, and Detroit (rows) under random queries and 1%, 5%, and 10% query coverage (columns).
DAF-Entropy outperforms its homogeneity-based counterpart due to the additional budget
required for evaluating homogeneity metrics of candidate splits in the latter.
Table 7.3: Running Time of Algorithms (Seconds), 2D, ϵ = 0.1

            IDENTITY  EUG  EBP  MKM  DAF-Entropy  DAF-Homogeneity
  New York  89        87   87   177  0.47         0.50
  Denver    91        91   94   182  0.38         0.46
  Detroit   111       111  110  272  0.34         0.48
Chapter 8
Privacy-Preserving Publication of Electricity Datasets
Smart grids serve as a crucial source of data for analyzing consumer behavior and informing
decisions on energy policy. In particular, time-series of power consumption over
geographical areas are essential in deciding the optimal placement of expensive resources
(e.g., transformers, storage elements) and their activation schedules. However, publication
of such data raises significant privacy issues, as it may reveal sensitive details about
personal habits and lifestyles. DP is well-suited for sanitization of individual data, but
current DP techniques for time series lead to significant loss in utility, due to the existence
of temporal correlation between data readings. In this chapter, we introduce STPT
(Spatio-Temporal Private Timeseries), a novel method for DP-compliant publication of
electricity consumption data that analyzes spatio-temporal attributes and captures both
micro and macro patterns by leveraging RNNs. Additionally, it employs a partitioning
method for releasing electricity consumption time series based on identified patterns. We
demonstrate through extensive experiments, on both real-world and synthetic datasets, that
STPT significantly outperforms existing benchmarks, providing a well-balanced trade-off
between data utility and user privacy. This chapter is based on the publication in [120]¹.
8.1 Introduction
Analysis of electricity consumption data plays a critical role in planning and developing
power grid infrastructures for smart cities. Such data are represented as time series and
serve as a critical information source, providing detailed knowledge to policymakers about
energy usage trends, leading to effective energy policy decisions. Additionally, utility
companies and third parties may leverage such data to forecast energy demand and manage
grid operations more efficiently. The data also play a significant role in environmental
studies, helping to assess the impact of energy consumption on the environment and aiding
efforts to reduce carbon footprints.
Despite its numerous advantages, publication of electricity consumption time series
raises significant privacy concerns. Electricity usage data may unintentionally reveal
personal habits and lifestyles, such as individuals’ daily routines, working hours and
absence periods, potentially leading to privacy violations. Moreover, the risk of third-party
exploitation by marketers and advertisers poses a threat of unwanted privacy intrusions, as
consumers may be targeted based on their specific energy usage behavior. These privacy
concerns require robust data protection measures for electricity time series data.
¹ Shaham, Sina, Ghinita, Gabriel, Krishnamachari, Bhaskar and Shahabi, Cyrus, 2024. Differentially
Private Publication of Smart Electricity Grid Data. In submission.
Prevailing approaches for protecting electricity time series information rely on the
powerful Differential Privacy (DP) model [36]. DP achieves privacy by adding noise to the
data, thereby minimizing the likelihood of re-identification. The success of DP-based
privacy protection hinges on the privacy budget which determines the amount of noise
needed to achieve a specified level of protection. However, when dealing with time series
data, the existence of temporal correlations leads to increased re-identification risk over
time, causing DP to add excessive noise to offset this risk, and thus destroying data utility.
Current techniques designed to sanitize time series, including those based on Kalman
filters [45] or Fourier transforms [72], lead to a substantial reduction in data utility.
To tackle this issue, we propose a model that jointly takes into account both time and
space attributes of electricity consumption data. Our Spatio-Temporal Private Timeseries
(STPT) algorithm trains a Recurrent Neural Network (RNN) to identify spatio-temporal
electricity consumption patterns. The patterns are subsequently used to partition the time
series into a spatio-temporal histogram that is used by DP mechanisms to sanitize and
release the data. A key innovation of STPT is its unique RNN training methodology which
incorporates spatial distribution alongside temporal sequencing. We start with a
low-granularity aggregation of time series data to identify macro consumption trends,
followed by several increasingly higher-granularity aggregations to discern micro trends.
This dual focus on macro and micro trends allows for a nuanced representation of
consumption patterns, enhancing STPT's ability to preserve data utility while enforcing
DP protection requirements.
Our specific contributions include:
• We introduce a novel method for modeling and representing time series data of
electricity, which takes into account both the spatial and temporal properties during
the data sanitization process.
• We propose STPT, an innovative algorithm that integrates a unique approach for
training RNNs across both time and space dimensions on differentially private data.
• We design a customized technique for STPT that clusters electricity consumption
data across time and space, thereby improving data utility when applying
DP-compliant mechanisms for sanitization of time series.
• We perform an extensive experimental evaluation on both real-world and synthetic
datasets, demonstrating STPT’s notable improvements in data utility compared to
existing benchmarks.
8.2 Preliminaries
Consider a two-dimensional map that encloses a set of N households U = {u1, ..., uN }. We
denote the electricity consumption for user i at time t by xi,t (we use the terms household
and power grid user interchangeably). Each household meter sends its electricity reading to
an aggregator at regular intervals ∆ × t (t = 1, ..., T), where ∆ ∈ R. The dataset of meter
readings is denoted as:
D = (xi,t), i = 1, ..., N; t = 1, ..., T  (8.1)
The goal is to release the dataset D according to the requirements of DP, thus
preventing identification of any individual user. We start our discussion by explaining the
system model commonly used for the publication of DP electricity consumption time series,
followed by an illustration of the foundational concepts related to DP.
8.2.1 System Model
Figure 8.1 depicts the system architecture which adheres to the industry-standard model for
publishing electricity data, and consists of households, a data aggregator, and data
recipients which perform analyses on the released consumption data. The specific functions
of each party are detailed below.
• Households equipped with smart meters are generators of data and are considered to
be trustworthy in the system model. The electricity consumption of users is recorded
hourly using their meter and sent to the data aggregator.
• Data Aggregator is a trusted party that collects the time series generated by users and
publishes their aggregated data in a privacy-preserving way. The sanitization process
is done based on DP, preventing adversaries from inferring any individual-level
consumption pattern.
Figure 8.1: System Model
• Data Recipients leverage the private aggregated data for diverse applications, from
forecasting to planning. Their objective is to utilize consumption values over specific
spatial regions and time periods. The recipients are considered to be honest but
curious, and they may attempt to infer individual user consumption from aggregated
data. Individual consumption details must be protected, as they can be used to infer
sensitive details about users, such as activity patterns, lifestyle habits, etc.
8.2.2 Differential Privacy Principles
Two databases D and D′ are called neighboring or sibling if they differ in a single record t,
i.e., D′ = D ∪ {t} or D′ = D \ {t}.

Definition 14 (ϵ-Differential Privacy [36]). A randomized mechanism A provides ϵ-DP if
for any pair of neighboring datasets D and D′, and any a ∈ Range(A),

Pr(A(D) = a) / Pr(A(D′) = a) ≤ e^ϵ  (8.2)
Table 8.1: Summary of Notations

  Symbol      Description
  D           Time series database
  N           Number of households
  U           Set of households (or power grid users)
  xi,t        User i's consumption at time t
  Ccons       Actual consumption matrix
  Cnorm       Normalized consumption matrix
  Cpattern    Pattern estimate matrix
  Csanitized  Sanitized consumption matrix
  ϵtot        Total privacy budget
  ϵpattern    Pattern recognition budget
  ϵsanitize   Sanitization budget
  P           Partition set
  si          Sensitivity of partition i
  Ttrain      RNN training time
Parameter ϵ is referred to as the privacy budget. ϵ-DP requires that the output obtained by
executing mechanism A does not significantly change by adding or removing one record in
the database. Thus, an adversary is not able to infer with significant probability whether
an individual’s record was included or not in the database.
Aside from the amount of privacy budget, another factor that plays a critical role in
achieving ϵ-DP is the concept of sensitivity, which captures the maximal difference achieved
in the output by adding or removing a single record from the database.
Definition 15 (L1-Sensitivity [39]). Given sibling datasets D, D′, the L1-sensitivity of a set
g = {g1, ..., gm} of real-valued functions is:

s = max_{D,D′} Σ_{i=1}^{m} |gi(D) − gi(D′)|  (8.3)
A widely-used mechanism to achieve ϵ-DP is the Laplace mechanism. This approach
adds to the output of every query function noise drawn from the Laplace distribution Lap(b)
with scale b and mean 0, where b depends on the sensitivity and privacy budget:

Lap(x|b) = (1/2b) · e^{−|x|/b},  where b = s/ϵ  (8.4)

To simplify notation, we denote Laplace noise by Lap(s/ϵ).
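For reference, a minimal Python sketch of the Laplace mechanism for a single count query:

    import numpy as np

    def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
        # Release true_value under eps-DP by adding Lap(s/eps) noise (Eq. 8.4).
        rng = rng or np.random.default_rng()
        return true_value + rng.laplace(0.0, sensitivity / epsilon)

    # Example: a count with sensitivity 1 released under epsilon = 0.1.
    noisy = laplace_mechanism(1250, sensitivity=1.0, epsilon=0.1)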
In our work, we make extensive use of the following three essential results in differential
privacy:
Theorem 7 (Sequential Composition [85]). Let A1 and A2 be two DP mechanisms that
provide ϵ1- and ϵ2-differential privacy, respectively. Then, applying in sequence A1 and A2
over the dataset D achieves (ϵ1 + ϵ2)-differential privacy.
Theorem 8 (Parallel Composition [85]). Let A1 and A2 be two DP mechanisms that
provide ϵ1- and ϵ2-differential privacy, respectively. Then, applying A1 and A2 over two
disjoint partitions of the dataset D1 and D2 achieves (max (ϵ1, ϵ2))-differential privacy.
Theorem 9 (Post-Processing Immunity [36]). Let A be an ϵ-differentially private
mechanism and g be an arbitrary mapping from the set of possible output sequences O to an
arbitrary set. Then, g ◦ A is ϵ-differentially private.
8.3 Problem Statement
8.3.1 Time-Series Representation
Consider a spatial grid of size Cx × Cy overlaid on a 2D map, dividing the spatial domain
into smaller regions. Additionally, we divide the time dimension into Ct equal-length
intervals. The electricity consumption data is thus captured by a three-dimensional matrix
Ccons, called the consumption matrix, with Cx × Cy × Ct elements. Each element cijk in this
matrix represents the electricity consumption within region (i, j) during the time interval
from ∆ × k to ∆ × (k + 1), where ∆ is the time resolution. For ease of analysis, especially
when conducting sensitivity studies in relation to data publication under DP, we assume
without loss of generality that ∆ = 1. This assumption implies that each data point in the
time series corresponds to a distinct time interval, meaning that Ct is effectively the length
of the time series (Ct = T).
8.3.2 Problem Formulation
Data recipients are interested in answering multi-dimensional range queries on top of the
electricity consumption matrix.
Definition 16. (Range Query) A range query on the consumption matrix is a 3-orthotope
with dimensions denoted as d1 × d2 × d3, where di represents a continuous interval in
dimension i.
To evaluate the accuracy of query results, we use the Mean Relative Error (MRE)
metric. For a query q with true aggregated consumption p and noisy consumption value p̄,
MRE is calculated as

MRE(q) = (|p − p̄| / p) × 100  (8.5)
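As a concrete illustration, the MRE of one range query over a consumption matrix could be computed as follows (a sketch; the 3-orthotope is represented by three (start, stop) index pairs):

    import numpy as np

    def mre(C_true, C_sanitized, query):
        # MRE (Eq. 8.5) of a range query; assumes the true answer p > 0.
        sl = tuple(slice(lo, hi) for lo, hi in query)
        p = C_true[sl].sum()              # true aggregated consumption
        p_bar = C_sanitized[sl].sum()     # noisy aggregated consumption
        return abs(p - p_bar) / p * 100.0

    # Example: a 10 x 10 x 10 query anchored at the origin:
    # err = mre(C_cons, C_sanitized, [(0, 10), (0, 10), (0, 10)])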
Armed with this knowledge, the problem we seek to solve is formulated in Problem 4.
Problem 4. Given a consumption matrix Ccons, generate an ϵ-DP matrix Csanitized such that
the MRE is minimized.
8.3.3 A Simple Strategy
One simple strategy to publish the electricity consumption matrix is the Identity
algorithm [154]. This algorithm was initially designed for population histograms, and works
by adding independent Laplace noise to every matrix cell. When applying this technique to
the consumption matrix, it is essential to note that time series have temporal correlations.
As a result, every snapshot in time should have its own distinct privacy budget,
according to the sequential composition theorem (Theorem 7). Conversely, since at each
timestamp the spatial grid creates disjoint partitions of the map, parallel composition
applies within each time interval (Theorem 8). The following important result emerges
which quantifies the sensitivity of a query on each cell of the electricity consumption matrix.
Theorem 10. The sensitivity of range queries of size 1 × 1 × 1 on the electricity
consumption matrix Ccons is given by max_{i,t} x_{i,t}.
Figure 8.2: Consumption Matrix in Different Stages: (a) electricity consumption matrix; (b) generated time series for training the RNN unit; (c) clustering hypercube cells for sanitization purposes.
Proof. Recall that the consumption matrix is constructed such that the time series
resolution matches the time axis resolution. As a result, each matrix cell contains no more
than a single data point of an individual household/user. Consequently, adding or removing
a user from the data can alter the value in a matrix cell by at most max_{i,t} x_{i,t}. If the
time series are normalized to values between 0 and 1, then this sensitivity would be 1.
Theorem 11. The consumption matrix follows sequential composition in time and
parallel composition in space.

Proof. Sequential composition in time is due to the correlation of the time series over
time. Parallel composition of the privacy budget over space is due to the fact that
the time series of users are spatially bounded in the matrix and independent of the values
in other cells.
Armed with this knowledge, the Identity algorithm allocates an equal amount of privacy
budget to each time slice. Therefore, if ϵtot represents the entire budget allocated for
sanitization, the budget for each time slice amounts to ϵtot/Ct. Then, each cell of the
matrix is sanitized by the addition of Laplace noise with sensitivity 1 and privacy budget
ϵtot/Ct, given that the time series are normalized in advance.
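A minimal sketch of this Identity baseline on a normalized consumption matrix:

    import numpy as np

    def identity_sanitize(C_norm, eps_tot, rng=None):
        # Sequential composition over the Ct time slices, parallel
        # composition within each slice; sensitivity 1 after normalization,
        # so every cell receives Lap(1 / (eps_tot / Ct)) noise.
        rng = rng or np.random.default_rng()
        Ct = C_norm.shape[2]
        scale = 1.0 / (eps_tot / Ct)
        return C_norm + rng.laplace(0.0, scale, size=C_norm.shape)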
8.4 STPT Algorithm
8.4.1 Overview
The STPT algorithm starts by generating two matrices Ccons and Cnorm out of the collected
time series from different neighborhoods of the map. The former matrix denotes the
consumption matrix based on the actual values of the time series, and the latter is the
consumption matrix generated based on the normalized time series. The STPT algorithm
continues by conducting two sequential core procedures to generate the DP consumption
matrix, namely the Pattern Recognition Step, followed by the Sanitization Step. The
workflow of the approach is shown in Figure 8.3.
Pattern Recognition Step. The primary aim of pattern recognition is to create a
sanitized version of the normalized consumption matrix, Cnorm, which is referred to as
Cpattern. The core objective here is to achieve rough estimations of the normalized time
series values while utilizing a minimal privacy budget. The choice of using Cnorm over Ccons
is strategic as it helps in bounding the sensitivity of the cells during the sanitization process.
The generation of sanitized estimated values in Cpattern involves using a short segment of
the time series, Ttrain, to predict future consumption while preserving privacy. The training
data are sanitized through a novel hierarchical method, considering both time and space
dimensions. The sanitized data are used to train an RNN, which is responsible for estimating
the remaining values in Cpattern. The total privacy budget allocated for this phase is
denoted by ϵpattern.
Sanitization Step. This algorithm’s primary objective is to perform an intelligent
partitioning of the matrix, based on the private estimates in Cpattern, and then to sanitize
and release the values of Ccons. Since the estimates in Cpattern are private, the resulting
matrix partitioning is also privacy-preserving, being derived from private data. The
partitioning approach for the consumption matrix, both temporally and spatially, is
predicated on the principle of spatial homogeneity. This principle, which contributes to
enhanced data utility, essentially involves grouping cells with similar values into the same
partition. Post partitioning, the true values in each partition, extracted from Ccons, are
Figure 8.3: Workflow of Proposed Approach. Steps: generation of Ccons and Cnorm; selection of the initial part of Cnorm for training; generation of a 3D quadtree over the selected data; creation and sanitization of the training dataset (private training of the RNN); use of the RNN to generate Cpattern (a private estimate of Cnorm); k-quantization for grouping cells in time and space; clustering-based sanitization and publication of Csanitized.
aggregated and sanitized. The final output of this procedure is the matrix Csanitized,
representing the differentially private version of Ccons. The privacy budget for the
sanitization algorithm is denoted as ϵsanitize. This leads to a total privacy budget of ϵtot
for the STPT algorithm, where

ϵtot = ϵsanitize + ϵpattern  (8.6)

Therefore, STPT publishes an ϵtot-differentially private version of the original consumption
matrix (Ccons).
The pseudocode for STPT is provided in Algorithm 10 and referred to throughout the
section.
8.4.2 Pattern Recognition
The goal of the pattern recognition phase in the STPT algorithm is to effectively use a
designated privacy budget, ϵpattern, to develop a method for privately generating
approximate estimates for cells within the normalized consumption matrix. The primary
means to accomplish this is through the private training of an RNN unit. The input for this
algorithm comprises household time series data along with their corresponding geographic
locations on the map. The pattern recognition process commences as outlined in
Section 8.3. It involves generating the consumption matrix Ccons from the time series data
and creating Cnorm, the consumption matrix based on normalized time series. For
normalization, we employ min-max normalization at a global level. Under this scheme, the
normalized consumption for user i at time j is computed as follows:
x_{i,j} := (x_{i,j} − min_{i,t} x_{i,t}) / (max_{i,t} x_{i,t} − min_{i,t} x_{i,t})  (8.7)
Next, the initial time segment (Ttrain) of the consumption matrix Cnorm is allocated for
training, resulting in a matrix of dimensions Cx × Cy × Ct[0 : Ttrain]. The notation
Ct[0 : Ttrain] denotes the selection of indices 0 to Ttrain of the time dimension. A
critical challenge is determining an efficient training method for the RNN, ensuring it
comprehensively learns both micro and macro trends in neighborhoods while minimizing the
amount of privacy budget utilized for training. One straightforward training method for the
RNN model involves the sanitization strategy described in Section 8.3.3. By adopting this
method, every time snapshot is allocated a budget of ϵpattern/Ttrain, and each matrix entry
undergoes Laplace noise perturbation with a sensitivity of one and budget ϵpattern/Ttrain,
which translates to Lap(1/(ϵpattern/Ttrain)), given that the time series are normalized.
Despite its feasibility, this method introduces excessive noise into the training data,
adversely impacting model performance. To mitigate this problem, we introduce an
approach centered on the generation of a spatio-temporal quadtree (lines 5 to 14 in
Algorithm 10). Assuming Cx < Cy, the process initiates by segmenting time into
log2(Cx) + 1 levels, resulting in a time span of T′train for each interval, derived by

T′train = ⌈Ttrain / (log2(Cx) + 1)⌉  (8.8)
The matrix corresponding to the i-th interval is Cx × Cy × Ct[i · T′train : (i + 1) · T′train],
symbolizing the quadtree's i-th depth. In the first segment of the matrix, corresponding to
the tree's root, all cells are presumed to be part of a unified neighborhood. However, in the
subsequent sub-matrix (depth 1), the previous matrix's neighborhoods are subdivided into
four distinct quadrants. Given that the quadtree operates independently of the data, it does not
demand a privacy budget for its divisions. Once the partitioning in space and time of the
training matrix is completed, at depth i there exist 4^i neighborhoods. The next critical
step of the algorithm is generating a single time series representing each neighborhood. The
representative time series is generated by the element-wise average of all time series in the
neighborhood over the time allocated for the tree's level. Consider a neighborhood at the i-th
depth of the tree, and without loss of generality suppose time series 1, ..., j fall in this
neighborhood. For each point in time t lying in the interval [i · T′train : (i + 1) · T′train], the
value in the representative time series is the average consumption of all users in that
neighborhood at that specific time, derived by

x_{rep,t} = (1/j) Σ_{i=1}^{j} x_{i,t}  (8.9)
It should be emphasized that the generated time series are stacked and not sequential. To
produce training data for subsequent training phases, a time window is swept across
each time series individually. An illustration of the method is given in Figure 8.2b. As
observed, we utilize a 4 × 4 × 6 matrix for the training process. The entire duration of
training is segmented into 3 parts, which translates to a duration of 6/(log2(4) + 1) = 2 for
each. This involves the creation of 3 submatrices, each having dimensions of 4 × 4 × 2. The
root node of the 3D quadtree includes only one neighborhood, indicated by the cell numbers,
which results in one distinct time series shown beneath. Depths 1 and 2 of the tree align
with 4 and 16 time series, respectively. Altogether, 21 time series are employed for the
next step.
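The construction of these stacked training series can be sketched as follows (illustrative Python; it assumes a square grid with Cx = Cy a power of two, normalized series, and the depth-dependent sensitivity of Theorem 12):

    import numpy as np

    def quadtree_training_series(C_norm, eps_pattern, T_train, rng=None):
        # One representative series per neighborhood and depth: depth d uses
        # 2**d x 2**d quadrants, averages the series inside each, and adds
        # Laplace noise with sensitivity 1 / 4**(log2(Cx) - d) (Theorem 12)
        # and budget eps_pattern / T_train.
        rng = rng or np.random.default_rng()
        Cx, Cy = C_norm.shape[0], C_norm.shape[1]
        levels = int(np.log2(Cx)) + 1
        T_seg = int(np.ceil(T_train / levels))            # Eq. 8.8
        res = []
        for d in range(levels):
            seg = C_norm[:, :, d * T_seg:(d + 1) * T_seg]
            sens = 1.0 / 4 ** (np.log2(Cx) - d)
            scale = sens / (eps_pattern / T_train)
            for bx in np.array_split(np.arange(Cx), 2 ** d):
                for by in np.array_split(np.arange(Cy), 2 ** d):
                    rep = seg[np.ix_(bx, by)].mean(axis=(0, 1))
                    res.append(rep + rng.laplace(0.0, scale, rep.shape))
        return res     # stacked, sanitized representative time series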
Once partitioning is conducted, there exist 4^i representative time series at depth i of the
tree. The time series next undergo sanitization. The key advantage of hierarchical
partitioning lies in the sensitivity analysis of the approach during sanitization. As outlined in
Theorem 12, the sensitivity of each datapoint corresponding to a time series at depth i
stands at 1/4^{log2(Cx)−i}. The underlying principle is that macro trends can be
captured with heightened precision, since the sensitivity of the time series is reduced,
allowing for a smaller amount of Laplace noise during sanitization. The stacked sanitized
time series correspond to the list res in Algorithm 10. After finalizing the time series, training
data for the RNN is produced by sweeping a time window across the time series and
organizing them into batches. This training data is then employed for training an RNN.
Subsequently, the RNN is utilized to create private estimations of the matrix Cnorm in
Cpattern.
Theorem 12. The sensitivity of a cell at depth i of the matrix is 1/4^{log2(Cx)−i}.

Proof. Consider a cell at time t corresponding to a sub-region at depth i of the tree, and all
users j falling in the sub-region. Let us denote the consumption of user i before and after
the removal of an individual by x_{i,t} and x′_{i,t}, respectively. The maximum change observed in
the representative time series of the sub-region, denoted by M, at time t can be derived as

|Σ_{i∈M} x_{i,t} − Σ_{i∈M} x′_{i,t}| / 4^{log2(Cx)−i} = |x_{j,t} − x′_{j,t}| / 4^{log2(Cx)−i} ≤ 1 / 4^{log2(Cx)−i}  (8.10)

In the above equation, index j denotes the datapoint corresponding to the user whose
existence in the dataset is altered. Therefore, the addition or removal of an individual can
change the value of the representative point by at most 1/4^{log2(Cx)−i}.
Algorithm 10 STPT
Input: D, ϵpattern, ϵsanitize, Quadtree Depth (depth), Window Size (ws), Training Time (Ttrain), Quantization Level (k).
Output: Sanitized Consumption Matrix
1: Ccons ← Create Consumption Matrix
2: Cnorm ← Min-Max Normalize Ccons
3: Select Ttrain Data Points from Cnorm
4: res ← [ ] ▷ Initialize empty list of time series
5: for d ∈ [0, ..., depth] do
6:   Temp ← Select time interval [d · T′train : (d + 1) · T′train]
7:   Divide x and y Axes of Temp into 2^d, Creating 4^d Neighborhoods
8:   for each neighborhood do
9:     Compute Representative Time Series (Eq. 8.9)
10:    Sanitize Time Series with Budget ϵpattern/Ttrain and Sensitivity 1/4^{log2(Cx)−d}
11:    Append Sanitized Series to res
12:  end for
13: end for
14: Prepare Training Data from res Based on ws
15: Train RNN
16: Generate Cpattern Using RNN
17: P ← k-Quantize Cpattern
18: for each partition Pi ∈ P do
19:   f(Pi) ← Sum of Values in Ccons for Pi
20:   s ← Compute Sensitivity of Pi
21:   Sanitize f(Pi) Using s and Budget from Eq. 8.12
22:   for each cell c ∈ Pi do
23:     Update c in Ccons to f(Pi)/|Pi|
24:   end for
25: end for
26: return Sanitized Ccons, i.e., Csanitized
8.4.3 Sanitization Algorithm
The output of pattern recognition is the matrix Cpattern, with dimensions Cx × Cy × Ct.
Each element of this matrix is created using a differentially private approach. These
elements are sanitized estimates of the normalized time series, providing an idea of
consumption patterns rather than actual consumption amounts. The purpose of the
sanitization algorithm is not only to reveal these patterns but also to provide sanitized
consumption values.

The sanitization algorithm of STPT (lines 17 to 26 in Algorithm 10) starts by
developing a non-overlapping partitioning of the matrix Cpattern. The partitioning's
objective is to group cells with similar values together. For this purpose, we
use a k-quantization of the matrix Cpattern to generate clusters. The formal definition of
k-quantization is provided in Definition 17.
Definition 17 (k-Quantization). Let Cpattern be a 3-dimensional matrix with elements c_{i,j,k},
where i, j, k are the indices of the matrix, and let k be a positive integer representing the
number of quantization levels. The k-quantization of Cpattern is a process defined as follows:

1. Determine Range: Identify the minimum min(Cpattern) and maximum max(Cpattern)
values within the matrix Cpattern.

2. Establish Quantization Buckets: Divide the range [min(Cpattern), max(Cpattern)]
into k equal intervals or 'buckets', each representing a quantization level.

3. Quantize Matrix Values: For each element c_{i,j,k} in the matrix Cpattern, assign it to
a quantization level based on which bucket its value falls into. This assignment is
represented as a function Q(c_{i,j,k}) that maps the value of c_{i,j,k} to one of the k
quantization levels.

The output is a quantized 3-dimensional matrix where each element is represented by
one of the k quantization levels, effectively reducing the original range of values in Cpattern
to k distinct values.
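A sketch of k-quantization with NumPy:

    import numpy as np

    def k_quantize(C_pattern, k):
        # Assign each cell to one of k equal-width buckets spanning
        # [min(C_pattern), max(C_pattern)] (Definition 17).
        lo, hi = C_pattern.min(), C_pattern.max()
        edges = np.linspace(lo, hi, k + 1)
        # digitize against the k-1 interior edges gives labels in 0..k-1
        return np.digitize(C_pattern, edges[1:-1])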
As suggested by the definition, a k-quantization of the matrix leads to the generation of k
non-overlapping clusters of Cpattern, and subsequently of Ccons. We use this non-overlapping
partitioning of the matrix as the basis for sanitizing and releasing the electricity data Ccons.
Once partitioning is completed, the values in each partition are aggregated and sanitized
based on the Laplace mechanism. The accumulated value in each partition is then
uniformly distributed across its corresponding cells. More formally, let the set of generated
non-overlapping partitions based on STPT be denoted by P = {P1, P2, ..., Pk}, where
each Pi is a set of cells. Note that the partitions are generated based on Cpattern, which is
private, and the partitions are then used for the matrix Ccons, i.e., for releasing sanitized
versions of the true values. A partition's cells are not necessarily contiguous and may be
distributed across the matrix. To sanitize and publish the electricity consumption values,
the corresponding values in each partition are summed and sanitized with Laplace noise
to achieve differential privacy. The process can be expressed as
f(Pi) = Σ_{c∈Pi} f(c) + Lap(s/ϵ),  (8.11)
where c denotes a cell, and the function f(·) returns the sum of all cell values when applied
to a partition, or simply the value of a cell when applied to a single cell. Once the sanitized
value of each partition is generated, it is uniformly distributed among its cells. Therefore,
for all c ∈ Pi, the value is updated to f(Pi)/|Pi| in the de facto sanitized matrix Csanitized.
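This aggregation-and-redistribution step can be sketched as follows (illustrative Python; labels is a quantized matrix such as the output of k_quantize above, and the per-partition budgets follow Theorem 14 below):

    import numpy as np

    def sanitize_partitions(C_cons, labels, budgets, sens, rng=None):
        # For each partition: sum its true values, add Lap(s_i/eps_i) noise
        # (Eq. 8.11), then spread the noisy total uniformly over its cells.
        rng = rng or np.random.default_rng()
        C_out = np.zeros_like(C_cons, dtype=float)
        for i in range(len(budgets)):
            mask = labels == i
            if not mask.any():
                continue
            noisy = C_cons[mask].sum() + rng.laplace(0.0, sens[i] / budgets[i])
            C_out[mask] = noisy / mask.sum()    # uniform redistribution
        return C_out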
A critical question that needs to be addressed is the determination of the sensitivity of
each partition and the corresponding allocation of the privacy budget, leading to the highest
utility. Theorem 13 addresses this issue by establishing that the sensitivity of each partition
is equal to the maximum number of cells it contains within a single xy-axis pillar of the
consumption matrix, where an xy-axis pillar refers to all cells that share the same x and y
coordinates. This theorem provides a foundational understanding of how sensitivity is
distributed across the partitions and guides the allocation of the privacy budget accordingly.
Theorem 13. Let Pi ∈ P be a partition of the consumption matrix. The sensitivity of Pi is
the maximum number of cells it contains in any of the xy-axis pillars.

Proof. Denote by s the maximum number of cells in an xy-axis pillar within the partition Pi.
Given that the cell count of Pi in each xy-axis pillar is bounded by s, and each pillar
represents a unique time series, the addition or removal of an individual alters the
cumulative values in the cluster by at most s. Hence, the sensitivity is characterized by this
maximal change.
Armed with this knowledge, the optimal assignment of the privacy budget to each partition
can be derived as follows. Let us denote the sensitivity of partition Pi and the budget allocated
to this partition by si and ϵi, respectively. The optimal assignment of privacy budget to
partitions can be formulated and solved by convex optimization, as shown in Theorem 14.

Theorem 14. Given a non-overlapping partitioning of the consumption matrix
P = {P1, ..., Pm} and the sensitivities of these partitions S = {s1, ..., sm}, the optimal
allocation of the privacy budget to a partition Pi is derived by the following equation:

ϵi = ϵsanitize × si^{2/3} / Σ_{i=1}^{m} si^{2/3},  (8.12)

where ϵsanitize represents the total sanitization budget.
Proof. The amount of noise added to each partition can be quantified using the variance of
the Laplace noise. Here, the goal is to distribute the privacy budget across partitions such
that the total variance of the applied noise is minimized. Equations 8.13 and 8.14 formulate
this goal as a convex optimization problem:

min_{ϵ1,...,ϵm} Σ_{i=1}^{m} si²/ϵi²  (8.13)

subject to  Σ_{i=1}^{m} ϵi = ϵsanitize,  ϵi > 0 ∀i = 1...m  (8.14)

Writing the Karush-Kuhn-Tucker (KKT) [13] conditions, the optimal allocation of budget
can be calculated as:

L(ϵ1, ..., ϵm, λ) = Σ_{i=1}^{m} si²/ϵi² + λ(Σ_{i=1}^{m} ϵi − ϵsanitize)  (8.15)

⇒ ∂L/∂ϵi = −2si²/ϵi³ + λ = 0  (8.16)

⇒ ϵi = 2^{1/3} si^{2/3} / λ^{1/3},  (8.17)

and substituting the ϵi's into the constraint of the problem, the optimal budget for the
i-th partition is derived as

ϵi = ϵsanitize × si^{2/3} / Σ_{i=1}^{m} si^{2/3}.  (8.18)
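The closed form of Theorem 14 is straightforward to compute; the sketch below derives each partition's sensitivity as its largest cell count within any xy-axis pillar (Theorem 13) and then allocates the budget proportionally to si^{2/3} (Eq. 8.12). Together with sanitize_partitions above, this realizes lines 17 to 25 of Algorithm 10.

    import numpy as np

    def partition_budgets(labels, eps_sanitize):
        # labels: quantized Cx x Cy x Ct matrix of partition ids (0..k-1).
        k = int(labels.max()) + 1
        # cells of partition i in each (x, y) pillar, maximized over pillars
        sens = np.array([(labels == i).sum(axis=2).max() for i in range(k)],
                        dtype=float)
        sens = np.maximum(sens, 1.0)          # guard against empty partitions
        w = sens ** (2.0 / 3.0)
        return eps_sanitize * w / w.sum(), sens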
8.5 Experimental Evaluation
In this section, a thorough assessment of the STPT algorithm is presented. Section 8.5.1
details the experimental setup, Section 8.5.2 discusses the comparison and analysis of the
methodology, and Section 8.5.3 delves into an in-depth evaluation of STPT.
8.5.1 Experimental Setup
For a thorough assessment of the STPT method, we have conducted evaluations on both
real-world and synthetic datasets. This includes testing with diverse query shapes and
Table 8.2: Electricity Consumption Data Summary

  Dataset  Number of   Average Hourly     STD of Hourly      Maximum Hourly     Sensitivity
           Households  Consumption (kWh)  Consumption (kWh)  Consumption (kWh)  Clipping Factor
  CER      5000        0.61               1.24               19.62              1.85
  CA       250         0.38               1.13               33.54              1.51
  MI       250         0.48               1.22               49.50              1.70
  TX       250         0.55               1.63               68.86              2.18
lengths, examining the spatial distribution of households, and comparing our method
against leading benchmarks and algorithms.
Datasets & Spatial Distribution. Our experiments are conducted on two publicly
available datasets. The first, which we refer to as the CA dataset, is based on the patterns
of electricity consumption in California households and was published by the authors
in [136]. The second dataset, known as the CER dataset, has been published by the
Commission for Energy Regulation in Ireland. Figure 8.4 and Table 8.2 provide statistics of
the datasets. In more detail,
• CER [42]: The Electricity Smart Metering Customer Behavior Trials were conducted
in Ireland from 2009 to 2010, engaging over 5,000 households and businesses. These
trials aimed to evaluate the effects of smart meters on electricity usage among
consumers, providing insights for a cost-benefit analysis of its nationwide
implementation. The data gathered from these trials, which has been anonymized, is
made available online for public research.
Figure 8.4: Total Weekly Consumption per Week Day of Datasets: (a) CER, (b) CA, (c) MI, (d) TX.
• CA [136]: This dataset functions as a digital twin of a residential energy-use dataset
within the residential sector. We focus on the first 5 counties in CA: Alameda,
Alpine, Amador, Butte, and Calaveras. The household electricity time series is used
on an hourly basis between September and December of 2014. The number of
households participating in the program in each county is 50, adding up to 250
households in total.
The privacy concerns regarding household-level electricity consumption have limited the
availability of publicly accessible datasets, with no geotagged datasets currently accessible
online [8]. Therefore, to account for the distribution of users, we perform our experiments
by distributing households in two settings: Uniform and Normal. The center of the normal
distribution is selected randomly over the map, and the households are located with the
standard deviation equal to one-third of the grid size. The experiment is repeated 10 times
and the average result is shown to ensure repeatability of the experiments. Therefore, in
total, the experiments are conducted on four datasets which are referred to as CA-Uniform,
CER-Uniform, CA-Normal, and CER-Normal.
Comparison Benchmarks. We compare and evaluate the performance of our approach against
the available state-of-the-art approaches detailed below.
• FAST. The framework proposed in [45] is a widely adopted approach focused on
exploiting the Kalman Filter for lowering utility loss while sanitizing time series.
• Fourier Perturbation Algorithm. The methodology initially introduced in [104] and
subsequently refined through sensitivity evaluations in [72], involves processing a time
series with a specified integer k. The procedure begins by executing a Fourier
transform on the time series, followed by the selection and sanitizing of the top k
primary Fourier coefficients. After sanitizing these coefficients, the inverse Fourier
transform is applied, and DP time series are generated. We implement the algorithm
in two settings where k = 10 and k = 20, denoted in the experiments as Fourier-10
and Fourier-20, respectively.
• Wavelet Perturbation Algorithm. By substituting the Fourier transform with the
discrete Haar wavelet transform, Lyu et al. [81] introduced the wavelet perturbation
algorithm for creating DP time series. This method, akin to the Fourier technique,
requires an integer k, which signifies the number of coefficients to be used and
sanitized. We denote this algorithm as Wavelet and implement it in two distinct
scenarios: one with k = 10 and the other with k = 20.
• Identity. The approach illustrated in Subsection 8.3.3 is an adaptation of the original
approach for the publication of time series and will be used as a comparison
benchmark in our experiments. Despite its simplicity, the algorithm is a robust
benchmark outperforming many of the existing approaches [76].
Query Types. As discussed in the problem formulation in Subsection 8.3.2, analysts are
interested in range queries, which are 3-orthotopes with dimensions d1 × d2 × d3,
indicating the consumption of a region of the map over a particular time range. For this
purpose, we use small (1 × 1 × 1) and large (10 × 10 × 10) queries, as well as queries with
random shape and size. For each of the 3 categories, we generated 300 random queries over
the consumption matrix, calculated the MRE, and report the average result.
Hyper-parameters Setting. A grid with dimensions of 32 × 32 is overlaid on a map, and
the households are distributed over the space as previously described, based on both normal
and uniform distributions. The total privacy budget is set to ϵtot = 30, with ϵpattern = 10
allocated for pattern recognition in STPT, and ϵsanitize = 20 for sanitization. To ensure fair
comparison, the same privacy budget is utilized across all algorithms. For training in the
STPT algorithm, 100 datapoints are used, resulting in a training matrix of 32 × 32 × 100.
[Figure 8.5: Assessment of Algorithm Performance Across Varied Datasets and Query Types. Each panel plots the percentage of MRE (log scale) for Fourier-10, Fourier-20, Wavelet-10, Wavelet-20, FAST, IDENTITY, and STPT under the Uniform and Normal distributions. Rows correspond to the four datasets and columns to random shape & size, small, and large queries.]
[Figure 8.6: Detailed Analysis of STPT. (a) Impact of privacy budget on pattern recognition MAE; (b) impact of privacy budget on pattern recognition RMSE; (c) impact of the number of quantization levels on the MRE of random queries; (d) computational complexity of the algorithms (execution time in seconds); (e) impact of quad tree depth on the prediction MAE of pattern recognition; (f) impact of quad tree depth on the prediction RMSE of pattern recognition.]
The test set involves 120 datapoints, leading to a matrix of 32 × 32 × 120. Consequently, the
published consumption matrix has dimensions of 32 × 32 × 120. The sensitivity clipping
factor of the consumption matrix is provided in Table 8.2. The RNN unit comprises a
self-attention mechanism and a GRU unit. Training was conducted over 20 epochs, with a
batch size of 32. The time window is set to encompass 6 datapoints for predicting the next
datapoint. The RMSProp optimizer is employed with a learning rate of 1e-3. The
embedding size and hidden dimension are set to 128 and 64, respectively.
Hardware and Software Setup. Our experiments were performed on a cluster node
equipped with an 18-core Intel i9-9980XE CPU, 125 GB of memory, and two 11 GB
NVIDIA GeForce RTX 2080 Ti GPUs. Furthermore, all neural network models are
implemented based on PyTorch version 1.13.0 with CUDA 11.7 using Python version 3.10.8.
8.5.2 Performance Evaluation
Figure 8.5 illustrates the performance of various algorithms when subjected to queries of
differing shapes and sizes. Each row in the figure is dedicated to one of four datasets:
CA-Uniform, CER-Uniform, CA-Normal, and CER-Normal. Within each row, the leftmost
figure depicts the performance for randomly shaped and sized queries generated over the
consumption matrix. The middle figure shows results for smaller queries, and the rightmost
figure displays the performance for larger queries. On the x-axis of each figure, the
algorithms are listed, while the y-axis represents the MRE in percentage. For queries with
random shapes and sizes, the STPT algorithm exhibited improvements of 60, 31, 54, and 32 percent across the datasets, respectively, following the order of the rows. Notably, the performance
enhancement of the algorithms is more pronounced for smaller-sized queries. This result is
desirable, indicating that more precise information about the consumption matrix can be
conveyed with minimal loss of utility.
In more detail, as anticipated, the IDENTITY algorithm generally shows the least
effective performance among the baseline algorithms. However, it surpasses some of the
more recent algorithms in scenarios where the data exhibits a more uniform shape, as seen
in the first and second rows. An unexpected outcome of our experiments is the relative
performance of Wavelet and Fourier transformations. Although Wavelet transformation was
introduced at a later stage than the Fourier approach, the Fourier method demonstrates
superior performance for queries of random shape and size. Conversely, the Wavelet
approach leads to better results for queries of fixed shape.
Another notable observation is that, on average, all algorithms tend to perform worse
with non-uniform data. This aligns with findings in [119], where a crucial determinant of
performance is the homogeneity in data partitioning. Uniform data distribution contributes
to higher homogeneity, thereby enhancing the performance of the algorithm.
8.5.3 Detailed Evaluation
In this section, we assess the STPT algorithm’s performance, focusing on several critical
factors that affect its efficiency. These include the influence of the privacy budget on the
STPT’s pattern recognition capabilities, the effects of quantization, the computational
complexity of the method, and the extent to which the depth of the tree impacts pattern
recognition.
Privacy Budget. Figures 8.6a and 8.6b are designed to analyze how the allocation of
privacy budget affects pattern recognition performance in the STPT algorithm. In these
figures, while the sanitization budget in the second step remains constant, the budget for
pattern recognition generation varies. For enhanced clarity, the x-axis displays the amount
of budget allocated to each training datapoint of the RNN unit. The y-axis, meanwhile,
indicates the MAE and RMSE of the RNN unit’s predictions. As anticipated, an increase in
the allocated budget enhances prediction accuracy, illustrating the privacy-utility trade-off.
Notably, a significant improvement is observed when the privacy budget for each datapoint
is increased from 0.01 to 0.05, suggesting that the minimal budget required for effective
training lies within this range.
Quantization. Figure 8.6c illustrates the effect of the number of quantization levels on
the performance of the STPT algorithm. The MRE performance is displayed on the y-axis
for queries of varying shapes and sizes. Although there are fluctuations in the results, the
general trend indicates that excessive increase in the number of quantization levels can
negatively impact the effectiveness of STPT. This is understandable, as many points in the
cycle of a time series often exhibit similar values. Consequently, a high degree of
quantization results in excessive partitioning and a reduction in the homogeneity that is
captured in the data.
Computational Complexity. Figure 8.6d presents and compares the runtime of various
algorithms. According to the figure, the execution time for all algorithms is remarkably
small, typically spanning just a few seconds. Although the STPT algorithm shows a slight
rise in computational complexity, it is crucial to note that a significant portion of this
complexity stems from the initial training phase required for pattern recognition, which is a
one-time process. Overall, all algorithms demonstrate comparable execution times in the
order of seconds, indicating that computational complexity does not pose a significant
hurdle to their performance.
Quad Tree Depth. The influence of varying tree depth on pattern recognition efficiency
is showcased in Figures 8.6e and 8.6f. The aim is to explore how changes in tree depth
affect the MAE and RMSE of the RNN unit. It is important to remember the balance between sensitivity and precision in time series produced at various depths. At shallow depths, such as the root node, the sensitivity of the time series to individual data points is relatively low, yet each series encompasses a broader range of households. As the depth increases, this trend inversely shifts. The figures reveal that increasing the tree depth up to a certain point enhances performance, but beyond that, the diminishing number of training data points at each level restricts further performance gains. Consequently, opting for a shallower tree, despite its impact on micro trends, proves to be advantageous.
Chapter 9
Fairness-Aware Emergency Demand Response Program for
Electricity
Smart electricity grids increasingly rely on Demand Response (DR) programs to address
the rising electricity needs, particularly during crises when sudden surges in energy demand
occur. This chapter introduces the Incentive-Driven Load Balancer (ILB), a novel program
designed to manage demand and response efficiently in such emergencies. The ILB
motivates households that can potentially lower their electricity usage by offering them
incentives, thus ensuring preparedness for unforeseen situations. We propose a two-phase,
machine learning-based methodology for selecting participants in the ILB, utilizing a
graph-based strategy to pinpoint households with a high potential for adjusting their
electricity use. This method involves two Graph Neural Networks (GNNs) – one for
identifying consumption patterns and another for choosing households. Our approach also
integrates a fairness mechanism to guarantee non-discriminatory selection of participants.
Extensive tests with household electricity data from California, Michigan, and Texas
demonstrate the ILB's substantial capacity to assist communities in emergency situations.
This chapter is based on the publication in [129]: Shaham, S., Krishnamachari, B. and Kahn, M., 2023. ILB: Graph Neural Network Enabled Emergency Demand Response Program For Electricity. arXiv preprint arXiv:2310.00129.
9.1 Introduction
Demand Response (DR) programs have become essential to the functioning of smart
electricity grids, playing a pivotal role in managing the escalating demand for electricity.
These programs shift the adaptability of electricity usage from the supply side to the
consumer, encouraging households to lower their energy consumption. This shift not only
enhances the grid’s efficiency and adaptability but also brings a range of benefits. Key
advantages of DR programs include reducing the necessity for additional power plants,
cutting down on electricity expenses by avoiding the purchase of costly energy, improving
grid stability to avert power outages, and lessening environmental harm through decreased
fossil fuel use. A notable instance of such a program is Time-of-Use pricing (TOU), which
adjusts electricity prices according to the time of day, generally raising rates during peak
demand periods [139].
Emergency DR programs, a more urgent and critical variant of Demand Response
programs, are activated during crises to handle sudden increases in energy demand.
Typically, these programs are implemented post-incident, without prior arrangements with households, in situations such as extreme weather, equipment failures, or other emergencies. As a
result, they often act as damage control rather than preventive measures, posing significant
risks to households and businesses. For instance, in the heatwave of August 2020 in
California, the California Independent System Operator (CAISO) implemented an
emergency DR program to curb electricity use and avert blackouts [20]. This entailed
requesting businesses to lower energy use during peak times and urging residents to
conserve power during the hottest parts of the day. Similarly, during Texas’ February 2021
winter storm, the Electric Reliability Council of Texas (ERCOT) initiated an emergency
DR program. It called for businesses and individuals to conserve energy by minimizing
heating and hot water use and turning off non-critical appliances and electronics [18].
The need for more sophisticated emergency DR programs is highlighted by their
potential to alleviate stress on the power grid during crises and prevent community losses.
However, several challenges hinder the development of effective DR programs. These
include the necessity for real-time energy usage communication and monitoring, accurate
forecasting of energy demand, and ensuring enough resources are available for heightened
demand during emergencies. Additionally, the impact of DR programs on vulnerable or
low-income households is a significant consideration. These households might struggle to participate in DR programs due to limited resources for investing in energy-efficient technologies, and they may face unjust participant selection driven by their neighborhoods and corresponding socio-economic factors. Fortunately, advancements in Machine Learning
(ML) and the growing implementation of Advanced Metering Infrastructure (AMI) in
homes are paving the way for novel DR programs. AMI meters can measure and record
electricity usage at least hourly and transmit this data to both utilities and customers daily.
These installations range from basic hourly interval meters to advanced real-time meters
with two-way communication, enabling immediate data recording and transmission.
In light of recent technological advancements, we introduce an incentive-based program
integrated with a cutting-edge ML approach. This initiative is designed to overcome current
challenges and assist in maintaining equilibrium between supply and demand during crisis
situations. Specifically, our contributions are as follows:
• We propose a novel incentive-based strategy, named the Incentive-Driven Load
Balancer (ILB), aimed at efficiently managing demand and response in critical
scenarios. This approach focuses on strategically pinpointing households capable of
modifying their electricity usage and employs a carefully crafted incentive system.
This system is designed to encourage participant households to regulate their energy
consumption during times of emergency.
• We develop an ML-driven two-step framework for the efficient selection of participants
in the program. This framework encompasses two distinct GNNs: one for recognizing
patterns and another dedicated to household selection. The first GNN operates as a
time-series forecasting dynamic graph, utilizing an attention mechanism to identify
similarities among users while considering socio-economic factors that affect demand
elasticity. The second GNN is designed to model communities, leveraging the insights
from the pattern recognition GNN to choose appropriate households for participation.
Central to our framework’s design is the consideration of geo-spatial neighborhood
factors, ensuring a comprehensive and effective approach to participant selection.
• To tackle fairness concerns present in the household selection process, we propose a
robust unfairness mitigation algorithm. This algorithm is specifically designed to
guarantee a non-discriminatory selection of households across various neighborhoods.
By employing a Pareto analysis, we demonstrate that our proposed method achieves
Pareto optimality. This ensures a balanced consideration of both fairness and utility
within communities, effectively addressing potential biases in the selection process.
• We have devised and publicly shared a dataset based on the factors that influence the
elasticity of electricity demand for data mining objectives. This dataset is greatly
empowered by a sample from the synthetically generated household-level data in [136],
which offers a digital twin of residential energy consumption. We further enrich this
dataset by integrating education and awareness factors from high schools [142], college
education data from the Census [91], median household income and unemployment
statistics from the US Department of Agriculture [141], and climate data [43].
We compare and evaluate the ILB by conducting an extensive number of experiments on
the household-level electricity consumption of people in California, Michigan, and Texas.
Table 9.1: Summary of Notations.

Symbol                              Description
n, m                                Number of households and neighborhoods
U = {u_1, ..., u_n}                 Set of users
X^supply_d, X^demand_d              Supply and demand power on day d
X^u_d                               Power consumption of user u on day d
k                                   Number of features
x^u                                 Power consumption time series of user u
D, D′                               Set of all days and emergency days
r^u_baseline, r^u_emergency         Baseline and emergency price rates for ILB participants
I^u                                 Incentive for household u
e^u                                 PE of household u
A_real, A_est                       Real and estimated similarity matrices
o_i, a_i, p_i                       Offers made, offers accepted, and population of neighborhood i
9.2 Preliminaries
9.2.1 Notation
Consider a map that contains n households U = {u_1, ..., u_n}, which are distributed among m non-overlapping neighborhoods N = {N_1, ..., N_m}. Each neighborhood N_i contains a population of size |N_i|. We represent the hourly electricity consumption of household u ∈ U using a time series x^u ∈ R^{1×T}, and we denote the set of all time series by X = (x^1, ..., x^n) ∈ R^{n×T}. Additionally, we model the total power consumption of user u on day d as X^u_d, and the set of all days in a billing cycle by D. Table 9.1 summarizes the important notations used throughout the manuscript.
9.2.2 Message Passing Framework
Most GNNs use message passing and aggregation to learn improved node representations, or so-called embeddings, of the graph. At propagation step i, the embedding of node v is derived by

H^i_v = f(aggr({H^{i-1}_u | u ∈ N_1(v)})). (9.1)

In the above formulation, H^i_v represents the i-th set of features for the nodes, f(·) is the function used to transform the embeddings between propagation steps, N_1(·) retrieves the 1-hop neighbors of a node, and aggr combines the embeddings of its 1-hop neighbors. For instance, in GCN, node aggregation and message passing are expressed as

H^i_v = σ(W Σ_{u∈N_1(v)} (1/√(d̂_u d̂_v)) H^{i-1}_u), (9.2)

where σ is the activation function, W is a matrix of learnable weights, and d̂_v is the degree of node v. The propagation length of the message-passing framework is commonly limited to avoid over-smoothing.
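To ground Equation 9.2, the following sketch performs one dense GCN propagation step in NumPy; the toy graph and dense normalization are assumptions for exposition, and a practical implementation would use sparse operations in a GNN library.

```python
import numpy as np

def gcn_step(A, H, W, act=np.tanh):
    """One GCN propagation step per Equation 9.2, in matrix form:
    act(D^{-1/2} A D^{-1/2} H W).
    A: (n, n) adjacency with self-loops; H: (n, f_in); W: (f_in, f_out)."""
    d = A.sum(axis=1)                       # node degrees d_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    A_norm = D_inv_sqrt @ A @ D_inv_sqrt    # symmetric normalization
    return act(A_norm @ H @ W)

# Toy example: a 4-node path graph, 3 input features, 2 output features.
rng = np.random.RandomState(1)
A = np.eye(4) + np.array([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [0, 0, 1, 0]])
H = rng.rand(4, 3)
W = rng.rand(3, 2)
print(gcn_step(A, H, W).shape)  # (4, 2): one new embedding per node
```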
9.3 Proposed Scheme
In this section, we present our proposed emergency DR program.
9.3.1 Program Overview
With the rising trend of electricity consumption, combined with the limited capacity of
supply, electricity disruptions in household networks are becoming inevitable. Suppose the
utility provides X^supply_d kWh of power on a particular day d, but the electricity demand is X^demand_d kWh, where X^supply_d < X^demand_d. This necessitates a strategy to balance supply and demand. Currently, there are two extreme strategies in place: (I) implementing the same level of power outage for all users by reducing each household's electricity by (X^demand_d − X^supply_d)/n, and (II) focusing on only l households and cutting their power for a longer duration, while ensuring uninterrupted power supply for the others. The second strategy results in an outage of (X^demand_d − X^supply_d)/l in this group of households.
The sudden and unexpected power cuts can disrupt the daily life of households,
regardless of their efforts to manage the power demand and response. In order to tackle this
issue, we propose a price-driven demand response program named the Incentive-driven
Load Balancer (ILB), which prioritizes the user’s preferences. The ILB program enables
households to voluntarily participate in a program that offers them financial incentives
upfront, in exchange for paying higher rates during a few unanticipated days. The
households are informed of such occurrences a day before and encouraged to reduce their
electricity consumption during those times. For those who do not opt in to the program,
higher rates will be charged throughout the period to cover the program’s expenses. The
formal definition of the ILB program is provided in Definition 18.
Definition 18. (Incentive-Driven Load Balancer ). Under this strategy, households are
allowed to voluntarily participate in a program, where they agree to pay higher rates for a
number of unplanned days during the upcoming billing cycle, which they will be notified of
only one day in advance. In return for their participation, they receive a monetary incentive
at the start of the program. For the remaining duration of the billing cycle, the standard
rates will be applied. To cover the cost of the incentives, the electricity rate for all other
households is increased.
The aim of ILB is to incentivize flexible users who opt in to reduce their power usage during emergency days, rather than enforcing power outages to balance demand and supply or forcing higher prices on all users. This is achieved by charging the opt-in users higher rates,
encouraging them to adjust their consumption behavior. The incentives provided to the users need to be carefully determined: if they are too low, not enough customers will accept the offer, resulting in an insufficient reduction of demand during the emergency days; if they are too high, too many households may sign up for the incentive, making the program too expensive and raising the burden on non-participating households beyond acceptable levels.
To quantify the utility of the proposed program, we propose the following two indicators.
The first utility function aims to reveal if offers are successful in attracting customers.
Definition 19. (Utility: Acceptance Rate). The acceptance rate indicates the percentage of
households who agree to participate in the program after receiving an offer.
Acceptance Rate = (# accepted offers / # offers made) × 100. (9.3)
The second utility focuses on the amount of reduction made in consumption given the
incentives formulated in Definition 20.
Definition 20. (Utility: Responsiveness Cost). Let U′ denote the set of users who voluntarily participate in the program, and D′ = {d_i, ..., d_j} denote the set of emergency days that occurred during the billing cycle. The responsiveness cost of ILB is defined as

Responsiveness Cost = (Σ_{u∈U′} I^u) / (Σ_{d∈D′} Σ_{u∈U′} ΔX^u_d). (9.4)
In the above formulation, ΔX^u_d = X̄^u_d − X^u_d represents the variation in the consumption of user u on day d. Here, the consumption of user u is represented by X^u_d, and the adjusted consumption due to their participation is represented by X̄^u_d. The symbol I^u refers to the incentive provided to user u at the beginning.
The rationale behind the responsiveness utility function is to determine the amount of
power consumption reduction that can be achieved for a given amount of incentives to
participants. The utility company uses the following optimization formulation to balance
supply and demand while minimizing incentive expenditure. The responsiveness cost is a
metric that measures how well this objective is met in practice.
Minimize  Σ_{u∈U′} I^u
subject to  X^demand_d − X^supply_d ≤ Σ_{u∈U′} ΔX^u_d,  ∀ d ∈ D′. (9.5)
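To make the two indicators concrete, the following sketch evaluates Equations 9.3 and 9.4 on toy data; the array shapes, the 20% reduction on emergency days, and the convention that the denominator is the baseline-minus-adjusted reduction are assumptions for illustration.

```python
import numpy as np

def acceptance_rate(accepted: int, offered: int) -> float:
    """Equation 9.3: percentage of offers that were accepted."""
    return 100.0 * accepted / offered

def responsiveness_cost(incentives, X, X_bar, emergency_days):
    """Equation 9.4: total incentive paid divided by the total
    consumption reduction of participants over the emergency days D'.
    incentives: (n,) incentive I^u per participant.
    X, X_bar: (n, |D|) baseline and adjusted daily consumption."""
    reduction = (X[:, emergency_days] - X_bar[:, emergency_days]).sum()
    return incentives.sum() / reduction

# Toy example: 3 participants, a 30-day cycle, 2 emergency days.
rng = np.random.RandomState(2)
X = rng.uniform(20, 40, size=(3, 30))
X_bar = X.copy()
X_bar[:, [10, 20]] *= 0.8          # 20% reduction on emergency days
I = np.array([100.0, 100.0, 100.0])
print(acceptance_rate(3, 10))                      # 30.0
print(responsiveness_cost(I, X, X_bar, [10, 20]))  # dollars per kWh reduced
```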
9.3.2 Pricing
As utility functions outlined in the previous section reveal, ILB requires a diligent selection
of participants and careful pricing of incentives. In this subsection, we address the latter
and explain how incentives are calculated, and in Section 9.4, the selection process of
applicants is illustrated. Recall that D′ ⊂ D are the emergency days. For any u ∈ U′, let us denote the price rate for emergency days (d ∈ D′) by r^u_emergency and the regular rate for d ∈ D\D′ by r^u_baseline. Thus, by participating in the program, the cost of user u is modified from the baseline cost
Cost_baseline(u) = Σ_{d∈D} X^u_d × r^u_baseline, (9.6)

to the modified cost

Cost_ilb(u) = Σ_{d∈D\D′} X^u_d × r^u_baseline + Σ_{d∈D′} X̄^u_d × r^u_emergency − I^u. (9.7)
When considering the offer, the household should understandably expect

Cost_baseline(u) ≥ Cost_ilb(u). (9.8)

Otherwise, the offer will not be beneficial for the user. The households who are not part of the program are charged an extra rate r_extra and pay the modified rate

r_others = r_baseline + r_extra. (9.9)
The rate hike for users not included in the program is derived based on the incentives provided in ILB, calculated as

r_extra = (Σ_{u∈U′} I^u) / (Σ_{u∈U\U′} Σ_{d∈D} X^u_d). (9.10)
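A minimal sketch of the billing rules in Equations 9.6, 9.7, 9.9, and 9.10; the rates, consumption values, and participant/non-participant split are hypothetical.

```python
import numpy as np

def cost_baseline(Xu, r_baseline):
    """Equation 9.6: baseline bill over all days in the cycle."""
    return (Xu * r_baseline).sum()

def cost_ilb(Xu, Xu_bar, r_baseline, r_emergency, incentive, emergency_days):
    """Equation 9.7: bill of a participant under ILB."""
    normal = np.delete(Xu, emergency_days)   # days in D \ D'
    return (normal * r_baseline).sum() + \
           (Xu_bar[emergency_days] * r_emergency).sum() - incentive

def rate_extra(total_incentives, X_nonparticipants):
    """Equation 9.10: surcharge spread over non-participants' consumption."""
    return total_incentives / X_nonparticipants.sum()

# Toy example: one participant over a 30-day cycle, 2 emergency days.
rng = np.random.RandomState(3)
Xu = rng.uniform(20, 40, 30)
Xu_bar = Xu.copy()
Xu_bar[[10, 20]] *= 0.8
print(cost_baseline(Xu, 0.15))
print(cost_ilb(Xu, Xu_bar, 0.15, 0.30, 100.0, [10, 20]))
# r_others for 200 non-participants funding 50 incentives of $100 each:
print(0.15 + rate_extra(100.0 * 50, rng.uniform(20, 40, (200, 30))))
```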
9.3.3 Rate Hikes
A critical factor in the above formulation is the rate hikes on emergency days. This factor
directly influences the amount of demand response on such days. We derive this rate based
on the price elasticity of demand for electricity.
Definition 21. (Elasticity of Demand). The elasticity of demand refers to the degree to
which demand responds to a change in an economic factor.
The appropriate value of r^u_emergency must be chosen such that the increase in rates results in a change in consumers' demand, which can be measured by the price elasticity of demand. The price elasticity of demand is derived by

PE = (% Change in Quantity) / (% Change in Price). (9.11)

The equation indicates the percentage of change in demanded quantity for a given percentage of change in price. For example, if the price of electricity increases by 10%, and the quantity of electricity demanded decreases by 5%, the price elasticity of demand for electricity can be calculated as

PE = (−5%) / (10%) = −0.5. (9.12)
The PE for electricity has been measured and estimated in many countries in the world
including the US. The factor is commonly considered for the short term and long term. The
average PE for electricity on the state level is estimated to be −0.1 in the short-run and
−1.0 in the long-run [17, 89].
It is crucial to note that the numbers mentioned earlier represent average statistics for
households. However, on an individual household level, the PE for electricity can vary
greatly, which can impact the necessary amount of incentive required for them to accept an
offer. Let us denote the PE for household u by e^u. Hence, given the goal of ILB that every household in the program reduces its consumption by i%, the percentage of change in price for the household is calculated by

% Change in Price = i / e^u. (9.13)

Once the percentage of change in price is calculated, the rates on emergency days are calculated by

r^u_emergency = (1 + % Change in Price) × r^u_baseline. (9.14)
Armed with this knowledge, applying the inequality in Equation 9.8, the minimum incentive value can be derived as

Cost_baseline(u) ≥ Cost_ilb(u) ⟹ (9.15)

Σ_{d∈D} X^u_d × r^u_baseline ≥ Σ_{d∈D\D′} X^u_d × r^u_baseline + Σ_{d∈D′} X̄^u_d × r^u_emergency − I^u ⟹ (9.16)

I^u ≥ Σ_{d∈D′} X̄^u_d × r^u_emergency − Σ_{d∈D′} X^u_d × r^u_baseline. (9.17)

[Figure 9.1: Pattern Recognition Graph.]
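Combining Equations 9.13, 9.14, and 9.17 yields a direct recipe for pricing an individual offer. The sketch below assumes a sign convention in which a negative target change in quantity and a negative PE produce a rate increase; all numeric values are illustrative.

```python
def emergency_rate(r_baseline, target_change_pct, pe):
    """Equations 9.13-9.14: price change needed for a target consumption
    change, given the household's price elasticity of demand (pe < 0)."""
    pct_price_change = target_change_pct / pe    # Equation 9.13
    return (1 + pct_price_change) * r_baseline   # Equation 9.14

def min_incentive(X_emergency, X_bar_emergency, r_baseline, r_emergency):
    """Equation 9.17: smallest incentive making participation worthwhile."""
    return sum(x * r_emergency for x in X_bar_emergency) - \
           sum(x * r_baseline for x in X_emergency)

# Example: a -10% consumption target with PE = -0.25 implies a 40% hike.
r_e = emergency_rate(r_baseline=0.15, target_change_pct=-0.10, pe=-0.25)
print(round(r_e, 3))                               # 0.21 $/kWh
print(min_incentive([30, 35], [27, 31.5], 0.15, r_e))  # minimum I^u
```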
9.4 Implementation Framework
The success of the program relies heavily on the process of selecting candidates. A sequential, machine-learning-based approach can help identify when enough households have been efficiently selected, made offers, and have accepted the incentive. This is essential in ensuring that a satisfactory number of high-demand-flexibility users have accepted the incentives and that the expected reduction in demand meets the shortfall on
emergency days. Additionally, this method prevents over-recruitment, which could result in
unnecessary program expenses.
Our proposed approach for choosing potential participants for the ILB program is based
on two graph models: the pattern recognition GNN and the household selection GNN. The
dynamic pattern recognition model is designed to generate a similarity matrix among users,
which indicates the probability that a household would respond positively to an offer based
on the response of its neighbors and other households. The household selection model, on
the other hand, is a node classification model that aims to identify the candidates who are
most likely to accept the offer.
9.4.1 Pattern Recognition GNN
The main objective of the pattern recognition model is to create an accurate similarity
matrix that reflects the degree of similarity between two individuals based on their
socio-economic status and electricity consumption pattern. This model helps us
understand the likelihood of a household accepting an offer, given the responses of those
already queried. The underlying assumption is that households with a high similarity score
are likely to respond similarly to the offer. Although there may be some margin of error, a
well-designed model can mitigate this issue to a reasonable extent. Our proposed model
considers three factors: (I) socio-economic factors affecting demand response, (II)
intra-series temporal correlations in time series, and (III) inter-series correlations captured
by an attention mechanism that reveals pattern similarity. While the model is primarily
designed to predict future demand, the inter-series attention matrix enables enhanced
modeling of user similarity.
We use A_real ∈ R^{n×n} and A_est ∈ R^{n×n} to denote the actual and predicted probability matrices of accepting the offer, respectively. The element a_ij, located in the i-th row and j-th column of these matrices, represents the likelihood of household u_i accepting the offer, given that household u_j has already accepted the offer. An overview of the model is provided in Figure 9.1, and the components of the model are elaborated layer by layer in the following.
9.4.1.1 Intra-series Temporal Correlation Layer
The first layer of the model aims to capture the temporal correlation in each time series.
For a given window s, the training data X ∈ R^{n×s} is input to the model. Each time series is processed by two essential components within this layer: a Gated Recurrent Unit (GRU) and a self-attention module.

The purpose of the GRU units is to handle sequential data by capturing temporal dependencies. GRU has been shown to use less memory and run faster than Long Short-Term Memory (LSTM). The self-attention components are included after the GRU units to further enhance their performance. The input data X is fed into the GRU units and self-attention components, generating embeddings C = (c_1, ..., c_s) ∈ R^{s×M}, where M is the embedding size.
9.4.1.2 Inter-series Correlation Layer
Next, the generated embeddings are fed to a multi-head attention layer [143]. The
multi-head attention layer consists of multiple attention heads, each of which learns to
attend to different parts of the input sequence. This is the critical step where the
correlation and similarity between households are captured. The ultimate output of this
unit is the matrix Aest, or the so-called attention matrix, representing the pairwise
correlations of households.
The input of the attention layer, as formulated in Equation 9.18, consists of queries (Q), keys (K), and values (V), each of which is a sequence of vectors. To understand the similarity between time series, all inputs are set to the embedding matrix C generated in the previous layer. The layer then applies multiple attention heads, each of which computes a weighted sum of the values using a query and key pair. The outputs from each attention head are concatenated and linearly transformed to produce the final output of the layer:

A_est = MultiHead(Q, K, V) = (head_1 ∥ ... ∥ head_j) W^O, (9.18)

where each head consists of

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i), (9.19)

Attention(Q, K, V) = Softmax(Q K^T / √M) V. (9.20)

In the above formulation, W^O, W^Q_i, W^K_i, and W^V_i are weight matrices learned during training.
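As an illustration of Equations 9.18 to 9.20, the sketch below computes a single-head attention matrix over household embeddings and treats it as A_est; the random embeddings stand in for the GRU/self-attention outputs, and the single head is a simplification of the multi-head layer.

```python
import torch
import torch.nn.functional as F

def attention_similarity(C, Wq, Wk):
    """Single-head analogue of Equations 9.18-9.20:
    softmax(Q K^T / sqrt(M)) over household embeddings yields a
    pairwise similarity (attention) matrix."""
    Q, K = C @ Wq, C @ Wk
    scores = Q @ K.T / (C.shape[1] ** 0.5)
    return F.softmax(scores, dim=-1)   # (n, n), rows sum to 1

torch.manual_seed(0)
n, M = 6, 16                        # 6 households, embedding size 16
C = torch.randn(n, M)               # placeholder embeddings from the GRU layer
Wq, Wk = torch.randn(M, M), torch.randn(M, M)
A_est = attention_similarity(C, Wq, Wk)
print(A_est.shape, float(A_est.sum(dim=1)[0]))  # torch.Size([6, 6]) 1.0
```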
9.4.1.3 Socio-economic Factors
Next, the embedding produced in the preceding layers will be combined with the
socio-economic factors to serve as node features in the dynamic graph. It is important to
note that our focus is on factors that have been proven to have a considerable effect on
demand elasticity, not just on user consumption. For instance, even though the number of
individuals in a household can impact consumption, it has not been identified as a
significant coefficient of demand elasticity in some studies, as cited in [35]. We have
summarized the key determinants in the following. (I) Income: Studies, such as the one by Brounen et al. [14], have demonstrated that the level of income has a notable influence on the elasticity of demand for electricity. (II) Demographic characteristics: Factors such as race and age significantly impact household electricity consumption [47, 35]. (III) Dwelling physical factors: Aspects including the type of building, its size, and its thermal and quality characteristics are linked to the amount of energy consumed by a household [140]. (IV) Climate: Temperature has been demonstrated to significantly impact electricity consumption [35]. (V) Living Area: The geographical location of households, including regional variables such as urban and rural areas, cities and counties, critically affects their elasticity of demand [35]. (VI) Awareness and Education: The level of awareness about policies and the education level of individuals can influence their electricity demand and compliance with ILB, as shown by Du et al. [35], who found the coefficient representing the extent to which users understand the policy to be a significant factor in demand response.
Let us denote the socio-economic features of the i-th household by ĉ_i ∈ R^{M̄}. The concatenation of the embeddings from the time series and the socio-economic features yields the node features of the dynamic graph, mathematically represented by

c′_i = c_i ∥ ĉ_i ∈ R^{(M+M̄)}, (9.21)

resulting in the final node feature matrix C′ = (c′_1, ..., c′_n) ∈ R^{n×(M+M̄)}.
9.4.1.4 Dynamic Graph
In the final stage of the pattern recognition graph, the dynamic GNN is formulated as
G(V, E), where V represents the node set and E represents the edge set. To create the
graph, each household is assigned a node, and their node features are generated using the
concatenated features derived from Equation 9.21. The attention matrix Aest obtained from
Equation 9.18 is utilized as the edge weights of the graph. Subsequently, the GNN block is
applied to the graph. The result of this process is the embeddings for the nodes,
represented by
C″ = F_GNN(G(V, E)) ∈ R^{n×M″}, (9.22)

where F_GNN denotes a custom GNN model. The graph uses the mean square error (MSE) as its objective loss function to predict demand. After the training process, the pattern recognition graph outputs the attention matrix A_est, which is utilized to represent the similarity between households.

[Figure 9.2: Household Selection Graph.]
9.4.2 Household Selection GNN
The purpose of the household selection graph is to efficiently select the users who have the
highest likelihood of participating in the program. Therefore, the graph is a node
classification GNN conducting a semi-supervised task. Let G_HSG = (V_HSG, E_HSG) denote the household selection graph. The similarity matrix generated by the pattern recognition graph provides the weighted edges of the graph, i.e., E_HSG = A_est. At this stage, all socio-economic and power consumption information has been incorporated into the similarity matrix. The set of nodes in the graph consists of a single node for each household, i.e., V_HSG = U. One-hot encodings are used as node features of the graph. For example, if 4 households exist in the network, their feature set would be {0001, 0010, 0100, 1000}.
The goal of this graph is to efficiently label nodes as accepting or rejecting an offer. Two steps are taken to conduct the labeling. First, spectral graph analysis is conducted on the graph to cluster nodes based on similarity. Such an approach allows for a naive understanding of the node labels. Based on the spectral graph analysis, a small portion of households in each neighborhood is queried to obtain the true labels. Then, those labels are used in the graph to conduct semi-supervised graph labeling.
9.4.2.1 Spectral Graph Analysis
Spectral clustering methods rely on the spectrum (eigenvalues) of the data’s similarity
matrix to reduce its dimensions before clustering it in a lower number of dimensions. The
similarity matrix A_est is given as input to the clustering.

First, the method computes the normalized Laplacian of the graph, defined by the following equation:

L = I′ − D″^{−1/2} A_est D″^{−1/2}. (9.23)

In this equation, I′ represents the identity matrix, while D″ is defined as diag(d), where d(i) denotes the degree of node i.
Secondly, it computes the first k eigenvectors, corresponding to the k smallest eigenvalues of the Laplacian matrix. Once the eigenvectors are derived, a new matrix with the k eigenvectors is generated, where each row is treated as the feature set of the corresponding node in the graph. Finally, the nodes are clustered based on these features using the k-means algorithm into 2 clusters, representing people who are likely to accept or reject the offer.
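A compact sketch of this spectral step: Equation 9.23 followed by k-means on the leading eigenvectors. Symmetrizing the attention matrix and the choice of scikit-learn's KMeans are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(A_est, k_eig=2, n_clusters=2, seed=0):
    """Normalized Laplacian L = I - D^{-1/2} A D^{-1/2} (Equation 9.23),
    then k-means on the eigenvectors of the k_eig smallest eigenvalues."""
    A = (A_est + A_est.T) / 2              # symmetrize (assumed preprocessing)
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    features = eigvecs[:, :k_eig]          # one feature row per node
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(features)

rng = np.random.RandomState(4)
A_est = rng.rand(10, 10)                   # placeholder similarity matrix
print(spectral_clusters(A_est))            # cluster label (0/1) per household
```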
9.4.2.2 Semi-Supervised Node Classification
Up to this point, no offers have been made to determine real-world responses. However, using spectral graph analysis, households have been grouped into two clusters and assigned a label. It is important to note that the labels based on these clusters do not necessarily indicate which group is more likely to accept the offers. The ultimate goal at
this stage is to use the perceptions received from clustering to survey a small portion of
individuals from each of the m neighborhoods, and use this data alongside the adjacency
matrix in a semi-supervised approach to identify individuals who are likely to accept the
offers.
To ensure that all parts of the graph are properly discovered and fairly treated, it is crucial to diversify the initial queries across all neighborhoods. For this reason, in each neighborhood, 5% of users from each cluster generated by spectral graph analysis are selected to be offered incentives. Querying this sample of users yields real labels for 10% of the population. With this information in hand, a node classification GNN model is applied on top of G_HSG = (V_HSG, E_HSG) to classify whether other users will accept the offer or not. The resulting labels are used to make offers.
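A minimal sketch of this semi-supervised labeling step: a two-layer dense GCN over G_HSG with one-hot node features, trained with a cross-entropy loss computed only on the queried (labeled) nodes. The toy graph, layer sizes, and training loop are illustrative rather than the tuned configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Dense GCN for semi-supervised node classification on G_HSG."""
    def __init__(self, n, hidden=16, classes=2):
        super().__init__()
        self.w1, self.w2 = nn.Linear(n, hidden), nn.Linear(hidden, classes)

    def forward(self, A_norm, X):
        h = F.relu(self.w1(A_norm @ X))
        return self.w2(A_norm @ h)

n = 20
A = torch.rand(n, n); A = (A + A.T) / 2 + torch.eye(n)  # weighted edges + self-loops
d_inv_sqrt = torch.diag(A.sum(1).rsqrt())
A_norm = d_inv_sqrt @ A @ d_inv_sqrt
X = torch.eye(n)                        # one-hot node features
labeled = torch.arange(4)               # stand-in for the queried households
y = torch.tensor([0, 1, 0, 1])          # their true offer responses

model = TwoLayerGCN(n)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for _ in range(100):                    # loss only on labeled nodes
    opt.zero_grad()
    loss = F.cross_entropy(model(A_norm, X)[labeled], y)
    loss.backward(); opt.step()
print(model(A_norm, X).argmax(1))       # predicted accept/reject labels
```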
9.5 Fairness-Aware Candidate Selection
Ensuring a fair selection of participants is critical when rolling out incentive-based
programs that impact diverse societal groups. One of the primary challenges stems from
the resource-intensive nature of household surveys, both in terms of costs and time, which
can impede the program’s effectiveness. To address this, our framework aims to pinpoint
households more likely to accept offers, thus optimizing expenses and resource allocations.
Nevertheless, neglecting fairness considerations could lead to imbalances in participation
across neighborhoods, especially sidelining economically disadvantaged communities.
Therefore, this section presents our approach to tackle these fairness concerns, with an
emphasis on selecting households for the ILB program.
9.5.1 Fairness Metric & Evaluation
To better understand the underlying reasons for potential unfairness in the household
selection graph, Figure 9.3 showcases the average similarity scores both within
neighborhoods and between different counties. This figure depicts the attention similarity
matrix taken from the pattern recognition graph specific to California (more details
[Figure 9.3: Household Similarity in Five Counties (Alameda, Alpine, Amador, Butte, Calaveras): Higher Values Show Greater Similarity.]
provided in Section 9.6). It is evident from the figure that users within the same
neighborhood show greater similarity compared to users from other neighborhoods. These
pronounced similarities amplify the impact of primary user queries, which subsequently
affect the household selection graph in ILB. As an example, consider a scenario where T
initial queries are made in each of the neighborhoods N1 and N2, and they result in
acceptance rates of k1 and k2, where k1 is significantly less than k2. Such imbalances in the
initial labels, coupled with the heightened similarity within neighborhoods, could lead to a
surge in offers for users in N2, while disadvantaging N1. To address this problem, we start
by defining the utility, fairness metric, and discrimination within the context of the
program. Following that, in the subsequent subsection, we introduce an algorithm designed
to enhance fairness without significant compromise to utility.
For a given neighborhood N_i, let us denote the number of offers made, the number of offers accepted by recipients, and the population of the neighborhood as o_i, a_i, and p_i, respectively. An effective candidate selection algorithm is indicated by a high utility, which is contingent on a high rate of offer acceptance. Thus, in alignment with the utility metric outlined in Definition 19, achieving high utility necessitates an acceptance rate approaching one, or equivalently, one hundred percent, in Equation 9.24:

Utility := (Σ^m_{i=1} a_i) / (Σ^m_{i=1} o_i). (9.24)
To formulate fairness in ILB, we utilize the manifestation of a widely accepted metric called statistical parity [37, 126] for geospatial neighborhoods [123].

Definition 22. (Fairness Metric [123]). A user selection program is said to be fair for neighborhoods N = {N_1, ..., N_m} if:

o_i / p_i = o_j / p_j, ∀ i, j ∈ {1, ..., m}. (9.25)
Algorithm 11 Fairness-Aware Household Selection Framework
Input: Household dataset including electricity consumption and socio-economic features.
Output: Households to be queried
1: Execute pattern recognition graph
2: Conduct spectral graph analysis
3: Query a small portion of each neighborhood based on the spectral graph analysis
4: Calculate k_1, k_2, ..., k_m
5: k_max ← max_i {k_i}
6: w_i ← k_max / k_i, ∀ i ∈ [1...m]
7: for i ∈ D \ D_queried do
8:     for j ∈ D_queried do
9:         if node j has accepted the offer then
10:            A_est[j, i] ← A_est[j, i] × w_N(j)
11:        end if
12:    end for
13: end for
14: Execute the household selection GNN with the updated A_est
15: return Query households based on predicted labels
The fairness measure is designed to ensure a uniform probability of offer distribution
across different neighborhoods. While this metric does not encompass all aspects of fairness
within the network, it is vital in promoting the inclusion of various neighborhoods during
the offer distribution process. Given the fairness measure, the problem we aim to solve is
formally presented in Problem 5.
Problem 5. Given neighborhoods N = {N1, ..., Nm}, we seek to propose a strategy to select
l users with the highest utility such that each neighborhood is fairly represented in the ILB
program.
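The utility and fairness quantities of Equations 9.24 and 9.25 are straightforward to evaluate. In the sketch below, deviation from statistical parity is summarized as the spread of per-neighborhood offer rates, which is one reasonable scalarization and not the only possible one; the counts are hypothetical.

```python
import numpy as np

def utility(accepted, offered):
    """Equation 9.24: overall acceptance ratio across neighborhoods."""
    return accepted.sum() / offered.sum()

def parity_gap(offered, population):
    """Deviation from the fairness condition o_i/p_i = o_j/p_j
    (Equation 9.25): zero means perfectly fair; larger values mean
    more disparity."""
    rates = offered / population
    return rates.max() - rates.min()

o = np.array([40, 10, 25])     # offers made per neighborhood
a = np.array([30, 6, 20])      # offers accepted
p = np.array([400, 100, 250])  # neighborhood populations
print(round(utility(a, o), 3))     # 0.747
print(round(parity_gap(o, p), 3))  # 0.0 -> offers proportional to population
```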
9.5.2 Proposed Approach
We now introduce our fairness-aware strategy, designed to reduce bias across neighborhoods.
Our method employs a reweighting technique to mitigate unfairness, enhancing the chances
of participation across various neighborhoods. Unlike conventional methods in the existing
literature that primarily focus on ensuring statistical parity across gender and, in some
instances, specific racial groups, our approach is crafted to ensure that a broad spectrum of
neighborhoods can participate without compromising substantially on utility.
The objective of our method is to balance the likelihood of positive predictions in the
household selection graph, proportionally aligning with their significance as indicated by
the initial ground truth labels on the map. As explained in Algorithm 11, the process
initiates by querying a subset of each neighborhood, following the procedure outlined in the
household selection algorithm. However, instead of directly employing the labels received
and the attention matrix A_est to execute the model, the fairness mechanism is implemented as follows. Let us denote the ratio of accepted offers in the preliminary query for neighborhood N_i as k_i. Subsequently, the algorithm calculates a reweighting factor associated with each neighborhood on the graph. Let k_max represent the maximum acceptance ratio among neighborhoods:

k_max = max_i {k_i}. (9.26)
We define the reweighting factor for neighborhood i by

w_i = (k_max / k_i)^α, ∀ i ∈ [1...m]. (9.27)
In the given formulation, the training parameter α is utilized to guarantee that the
influence of reweighting factors is sufficiently pronounced to be reflected in the attention
matrix. The necessity of this tuning parameter stems from the fact that the values in the
attention matrix are normalized and might be relatively minor in certain situations. As an
example, consider three neighborhoods N1, N2, and N3 with the initial results of queries
being k1 = 0.8, k2 = 0.4, and k3 = 0.2, respectively. The reweighting factors for these
neighborhoods are calculated as w1 = 1, w2 = 2 and w3 = 4 when α is set to one.
Once the neighborhood reweighting factors are calculated, the algorithm continues by
reweighting the input edges of the similarity matrix based on the following procedure. For
every node i that has not been queried, i.e., i ∈ D \ Dqueried, consider all input edges that
originate from a node with a positive label. All such edges are multiplied by the
reweighting factor of the neighborhood where household i resides. This process is explained
in lines 7 to 14 of the algorithm. The reweighting of an edge happens in line 10, where A_est[j, i] denotes the element in the j-th row and i-th column of the attention matrix, and the function N(j) returns the neighborhood of node j. The underlying rationale of this strategy is that nodes in neighborhoods with reduced participation will exhibit higher similarity in the attention matrix to nodes that have positively responded to offers. As a result, they have a boosted likelihood of being chosen during model training. The updated attention matrix is then used to train the household selection model.
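The reweighting loop of Algorithm 11 (lines 5 to 13) can be realized as follows; the neighborhood_of mapping and the queried/accepted sets are hypothetical stand-ins for the framework's bookkeeping.

```python
import numpy as np

def reweight_attention(A_est, accept_ratio, neighborhood_of,
                       queried, accepted, alpha=1.0):
    """Algorithm 11, lines 5-13: boost edges from positively labeled
    queried nodes, scaled by the reweighting factor of the queried
    node's neighborhood (Equation 9.27).
    accept_ratio: dict {neighborhood: k_i from the preliminary queries}."""
    k_max = max(accept_ratio.values())
    w = {nb: (k_max / k) ** alpha for nb, k in accept_ratio.items()}
    A = A_est.copy()
    for i in range(len(A)):
        if i in queried:
            continue                  # only unqueried nodes are rescaled
        for j in queried:
            if j in accepted:         # node j accepted the offer
                A[j, i] *= w[neighborhood_of[j]]
    return A

# Toy example: 6 households in 2 neighborhoods, nodes 0-1 already queried.
rng = np.random.RandomState(5)
A_est = rng.rand(6, 6)
nb = {0: 'N1', 1: 'N2', 2: 'N1', 3: 'N1', 4: 'N2', 5: 'N2'}
A_new = reweight_attention(A_est, {'N1': 0.8, 'N2': 0.4},
                           nb, queried={0, 1}, accepted={0, 1})
print(A_new[1] / A_est[1])  # node 1 is in N2: edges to unqueried nodes doubled
```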
9.6 Experimental Evaluation
9.6.1 Datasets
We conduct our experiments by combining multiple datasets from various sources, taking
into consideration significant aspects such as geographical diversity, the time series of
household electricity consumption, and socioeconomic factors.
Household-level Electricity Time-Series. The privacy concerns regarding household-level
electricity consumption have limited the availability of publicly accessible datasets, with
current datasets containing no more than 50 households. A comprehensive overview of
these datasets can be found in [8]. Nonetheless, the work in [136] has tackled this issue by
creating a synthetic dataset that covers households throughout the United States. This
dataset functions as a digital twin of a residential energy-use dataset within the residential
sector. For our experiments, we use the time-series data at the household level from this
dataset.
Education and Awareness. To indicate the level of awareness, the average ACT scores
are used, which are obtained from the EdGap dataset [142]. This dataset contains
socioeconomic features of high school students across the United States. The geospatial
coordinates of schools are derived by linking their identification number to data provided by
the National Center for Education Statistics [91], and then the average ACT scores are calculated. Additionally, the percentage of people who have completed some college education, provided by [141], is used to further characterize the level of awareness.

[Figure 9.4: Average Hourly Consumption and Standard Deviation Error of Counties; panels (a)/(d) California, (b)/(e) Michigan, (c)/(f) Texas.]
Median Household Income and Unemployment Percentage. The county-level information
for the median household income as well as the unemployment percentage in 2014 is
extracted from the US Department of Agriculture website [141] and used as static features
for counties.
Climate. Two key attributes provided by the National Centers for Environmental
Information [43] are used as indicators of climate in counties: average temperature and
precipitation amount.
9.6.2 Experimental Setup
Overview. The experiments are divided into two parts, each looking at performance before
and after applying the framework. Subsection 9.6.3 specifically looks at how well the ILB
performs before the selection framework is applied, and investigates how the program
affects responsiveness cost and how much it can reduce the total consumption given a
certain amount of incentive. Subsection 9.6.4 focuses on the performance after the
framework has been applied in which factors such as responsiveness cost, the effects of rate
increases on those not participating in the program, and noise analysis in the selection
process are thoroughly evaluated.
Benchmarks & Comparison. We evaluate and compare the performance of the algorithm in several aspects, demonstrating the effectiveness of the ILB. The comparison covers four key aspects: (I) the overall performance of ILB compared to two state-of-the-art plans; (II) noise analysis to understand the impact of inaccuracies on the model; (III) comparison of the graph's pattern recognition to ensure comparable performance over the existing approaches; and (IV) evaluation of the proposed fairness mechanism against approaches from the existing literature, including node reweighting [77], adaptive selection [23], undersampling [62], and thresholding [87].
Geographic Diversity. Three states of California (CA), Texas (TX), and Michigan (MI)
are selected for the purpose of experiments, including the first five counties based on codes
published by the U.S. Census Bureau [16]. The geospatial location of counties used in
experiments and their corresponding statistics is provided in Figure 9.4.
Hyper-parameters Setting. The household electricity time series is used on an hourly
basis between September and December of 2014. The number of households participating in
the program in each county is 50, adding up to 250 households in each state. The dataset
was separated into three parts: training, evaluation, and testing sets, with a ratio of 7:2:1
respectively. Z-normalization was applied to normalize the input data, and training was
conducted using the RMSProp optimizer with a learning rate of 3e-4. Training took place
over 100 epochs, with a batch size of 32. The number of emergency days per month is set
to three days. The number of participants in the experiment is set to be one-quarter of the
population being considered and the default incentive provided to participants is 100 dollars
unless stated otherwise. The electricity PE for households is randomly selected based on a
Gaussian distribution with a mean of −0.25 and a standard deviation of 0.1.
Hardware and Software Setup. Our experiments were performed on a cluster node
equipped with an 18-core Intel i9-9980XE CPU, 125 GB of memory, and two 11 GB
NVIDIA GeForce RTX 2080 Ti GPUs. Furthermore, all neural network models are
implemented based on PyTorch version 1.13.0 with CUDA 11.7 using Python version 3.10.8.
9.6.3 Performance Analysis of ILB
In this subsection, we focus on the ILB program’s performance independent of the
framework. Initially, we evaluate the ILB’s efficiency in terms of responsiveness cost, and
subsequently, the program’s effectiveness in total demand reduction of electricity.
[Figure 9.5: Evaluation of Total Responsiveness Cost; panels (a) California, (b) Michigan, (c) Texas.]
9.6.3.1 Responsiveness Cost
Figure 9.5 illustrates the results for the ILB program's responsiveness cost. Recall that the responsiveness cost reflects the total incentive amount offered in relation to the observed reduction; thus, a lower cost indicates better performance. The figure showcases the
performance across three states: California, Michigan, and Texas. In each subfigure, given a
specified incentive (highlighted with a bar) offered to the entire community, each individual
assesses the benefit of participating in the program. This assessment is conducted using
Equation 9.17, which allows us to derive the community’s acceptance rate and represent it
on the x-axis, while the corresponding responsiveness cost is plotted on the y-axis. The
figure reveals that an increase in the incentive amount expectedly raises the acceptance rate,
but it also results in a higher responsiveness cost. The rate of increase for responsiveness
cost tends to be lower for smaller incentive amounts and grows for larger values.
9.6.3.2 Total Demand Reduction
Figure 9.6 shows the assessment of the total percentage reduction in consumption on
emergency days. The structure of subfigures is analogous to Figure 9.5, but they illustrate
the impact on the total consumption reduction instead of responsiveness cost on the y-axis.
An increasing trend is observed across all three states indicating that a rise in the incentive
amount and acceptance rate leads to a higher reduction in consumption on emergency days,
enhancing the agility and performance of the electricity network. This pattern hints at a crucial trade-off between responsiveness cost and overall demand reduction, which becomes increasingly apparent as the incentives and participant count increase.

[Figure 9.6: Evaluation of Total Demand Reduction; panels (a) California, (b) Michigan, (c) Texas.]
9.6.4 Evaluating the Integrated Framework and ILB
Unlike the previous subsection, where offers were made to every household, this subsection
evaluates performance based on a proposed framework that incorporates socio-economic
factors and other determinants affecting the elasticity of demand for electricity when
selecting participants.
[Figure 9.7: Responsiveness Performance of ILB; panels (a) California, (b) Michigan, (c) Texas.]

[Figure 9.8: Rate Hike on Non-participants in ILB; panels (a) California, (b) Michigan, (c) Texas.]
9.6.4.1 Responsiveness Cost
The performance evaluation in terms of responsiveness cost is depicted in Figure 9.7. The
x-axis represents the reduction in consumption by ILB program participants during
emergency days, while the y-axis corresponds to the responsiveness cost. A quarter of
households with the highest scores based on our framework are selected to participate in
the program. As expected, once the amount of reduction by participants increases, a
considerable decrease in responsiveness cost is observed in all three states. The horizontal red
lines in the figure indicate the total reduction in demand for the whole community. The
intersection of the curve and the horizontal line shows when the reduction in demand by
ILB users accounts for a certain amount of total reduction in demand specified by the red
line. The figures demonstrate that even small reductions of 10 to 20 percent by participants
can lead to a significant reduction of approximately 5 to 10 percent in total demand. When
the reduction in ILB participants is 50%, the total reduction accounts for 20 to 25 percent.
Therefore, the proposed approach is a viable way to effectively manage supply and demand during emergencies and prevent severe outages across states and counties.
9.6.4.2 Rate Hikes on Non-participants
Figure 9.8 shows the impact of the ILB program on non-participants. It displays the rate
hikes in ¢/kWh for non-participants at different percentages of ILB participation and
corresponding incentives paid. Consistent with previous experiments, the PE for electricity
is considered, and participants with the highest scores based on our framework are selected
for the program. As it can be seen in the figure, increasing the percentage of participants
and the amount of incentives paid leads to a higher cost for non-participants. However, the
figure illustrates that in all three states, with a billing period of one month, keeping the incentive amount within the range of 100 to 200 dollars imposes only a minimal additional cost on the hourly consumption of non-participants.

Table 9.2: Accuracy of ILB Candidate Selection Framework.

Percentage of Noise | Accuracy (California) | Accuracy (Texas) | Accuracy (Michigan)
0%                  | 88                    | 90                | 90
25%                 | 84                    | 86                | 87
50%                 | 81                    | 79                | 86
75%                 | 73                    | 78                | 84
9.6.4.3 Candidate Selection and Noise Analysis
In Table 9.2, the effectiveness of the proposed framework for selecting ILB candidates is presented alongside a noise analysis on the similarity (attention) matrix. To introduce inaccuracies into the attention matrix, uniform random noise drawn from Uniform(0, b) is added to each entry, where b is set to three values: 25%, 50%, and 75% of the average value of the attention matrix. The results demonstrate that when the attention matrix is accurate, the model's accuracy is approximately 90%. As inaccuracies increase, the model's performance consistently declines across all three states. Nevertheless, even when the amount of inaccuracy is large, it does not severely impact the final model, and the accuracy remains within an acceptable range.
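For concreteness, this noise-injection step can be sketched in a few lines of Python. The matrix size, seed, and function name below are illustrative assumptions, not the exact evaluation code:

    import numpy as np

    rng = np.random.default_rng(seed=0)

    # Hypothetical attention (similarity) matrix between households.
    attention = rng.random((100, 100))

    def perturb_attention(A, noise_fraction, rng):
        # Add Uniform(0, b) noise to every entry of A, where b is a
        # fraction (25%, 50%, or 75%) of the average entry of A.
        b = noise_fraction * A.mean()
        return A + rng.uniform(0.0, b, size=A.shape)

    for frac in (0.25, 0.50, 0.75):
        noisy = perturb_attention(attention, frac, rng)
        # 'noisy' replaces the clean matrix during candidate selection,
        # yielding the accuracy degradation reported in Table 9.2.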
Figure 9.9: Pareto Analysis of Algorithms. Panels: (a) California, (b) Michigan, (c) Texas (first row, based on MAD); (d) California, (e) Michigan, (f) Texas (second row, based on MAE). Each panel plots Fairness (x-axis) against Utility (y-axis) for the algorithms Uniform, ILB, Fair-ILB (1.5), Fair-ILB (2), Fair-ILB (2.5), Node Reweighting, Undersampling, and Adaptive Selection.
9.6.5 Fairness-Aware Selection of Participants
In this section, we assess and benchmark the effectiveness of our proposed fairness approach
by conducting a Pareto analysis, as illustrated in Figure 9.9. The state-of-the-art algorithms shown in the figure are detailed below; they serve as benchmarks for evaluating performance.
• ILB: The ILB framework without integrating the developed fairness-aware mechanism.
• Fair-ILB: The ILB algorithm augmented with the fairness mechanism detailed in Section 9.5. Outcomes are shown for α set to 1.5, 2, and 2.5.
• Node Reweighting [77]: The node reweighting approach [77] adjusts the weights of household nodes in the training loss function. Its goal is to increase the significance of positive instances in underrepresented neighborhoods while reducing their effect in overrepresented ones, as determined by the initial queries made to neighborhoods.
• Adaptive Selection: The approach adopted from [23] focuses on adjusting the count of
invitees across different neighborhoods through a reweighting factor. Consequently,
areas with lower reweighting factors will see an increase in invitee numbers. Moreover,
the selection of top candidates for invitation in these neighborhoods will be guided by
their respective likelihood scores.
• Undersampling: The algorithm initially proposed in [62] seeks to lower discrimination
by undersampling, thereby ensuring an even selection of positive and negative
instances from each neighborhood. As a result, the training dataset maintains an
equal count of positive and negative instances from every neighborhood.
• Uniform or Thresholding: The algorithm proposed in [87] applies varied thresholds on
the classifier’s output across neighborhoods to equalize the number of selected
candidates. The objective is to lower selection criteria in neighborhoods with lower
participation and raise them in areas with more participation, thereby promoting
fairness. A minimal illustrative sketch of this baseline appears after this list.
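As referenced above, a minimal sketch of the Uniform/Thresholding baseline is given below. The function and example data are hypothetical illustrations of our reading of the method, not the original code from [87]:

    import numpy as np

    def uniform_thresholding(scores, neighborhoods, k_per_neighborhood):
        # Select the same number of top-scoring candidates from every
        # neighborhood; this implicitly lowers the score threshold in
        # low-participation areas and raises it in high-participation ones.
        selected = []
        for n in np.unique(neighborhoods):
            idx = np.where(neighborhoods == n)[0]
            top = idx[np.argsort(scores[idx])[::-1][:k_per_neighborhood]]
            selected.extend(top.tolist())
        return selected

    # Example: ten candidates in two neighborhoods, two picked from each.
    scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.95])
    neigh = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    print(uniform_thresholding(scores, neigh, 2))  # [0, 3, 9, 5]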
To understand the performance of an algorithm, two key objectives are taken into
consideration: utility and fairness. These two objectives are the two axes of the diagrams in
Figure 9.9. Utility is measured using Equation 9.24, reflecting the effectiveness of selection, while the fairness notion in Definition 22 is used to assess how fair the algorithms are. The discrepancy in the fairness notion is referred to as discrimination. We measure discrimination in two ways: Mean Absolute Deviation (MAD) and Mean Absolute Error (MAE). The MAE of discrimination among neighborhoods is calculated as

\[ \mathrm{MAE} := \frac{2}{m(m-1)} \times \sum_{i<j} \left| \frac{o_i}{p_i} - \frac{o_j}{p_j} \right| \qquad (9.28) \]
The factor m(m − 1)/2 is the number of neighborhood pairs that exist on the map. The
metric essentially calculates the average value of discrimination between every two
neighborhoods. The second metric MAD aims to reveal how different the values are from
the average value, denoting the expected deviation from minimal discrimination.
\[ \mathrm{MAD} := \frac{1}{m} \times \sum_{i=1}^{m} \left| \frac{o_i}{p_i} - \frac{1}{m} \sum_{j=1}^{m} \frac{o_j}{p_j} \right| \qquad (9.29) \]
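Both metrics can be computed directly from the per-neighborhood ratios o_i/p_i. The sketch below uses hypothetical values and our own variable names:

    import numpy as np

    def discrimination_metrics(o, p):
        # r[i] = o_i / p_i, the per-neighborhood ratio in Eqs. 9.28-9.29.
        r = np.asarray(o, dtype=float) / np.asarray(p, dtype=float)
        m = len(r)
        # MAE (Eq. 9.28): average absolute gap over neighborhood pairs.
        # The full m-by-m difference matrix counts each pair twice,
        # which the m*(m-1) denominator absorbs.
        mae = np.abs(r[:, None] - r[None, :]).sum() / (m * (m - 1))
        # MAD (Eq. 9.29): average absolute deviation from the mean ratio.
        mad = np.abs(r - r.mean()).mean()
        return mae, mad

    # Example with three hypothetical neighborhoods.
    print(discrimination_metrics(o=[10, 5, 2], p=[100, 100, 100]))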
The data for utility, MAD, and MAE are provided directly in Table– in Appendix–. For
the purposes of visualization and of carrying out the Pareto analysis, these values are normalized and subtracted from one; the result is referred to as "fairness" in Figure 9.9. In the figure, the first row illustrates performance based on MAD, while the second row is derived from MAE. In all observed areas, the Fair-ILB algorithm emerges as Pareto optimal, striking a balance between achieved utility and fairness across
communities. Without its fairness component, ILB alone yields the highest utility and the
least fairness, as it does not incorporate fairness considerations and focuses solely on
optimal selection. Conversely, the Thresholding method, also known as the Uniform
algorithm in the figure, exhibits the greatest fairness due to its strategy of selecting an
equal number of candidates from each neighborhood, but this results in the lowest utility.
The Node Reweighting technique shows performance most similar to our Fair-ILB. The
primary advantage of Fair-ILB over Node Reweighting lies in the fact that applying a
reweighting factor to a node affects not just its own neighborhood but also its connections
in other neighborhoods. Therefore, despite being somewhat effective due to high
intra-neighborhood correlation, Node Reweighting falls short of Fair-ILB's efficiency because of the existence of inter-neighborhood connections. The other two methodologies,
Undersampling and Adaptive Selection, fall below Fair-ILB in terms of both utility and
fairness. In conclusion, Fair-ILB effectively attains Pareto optimal performance by adeptly
balancing the dual objectives of fairness and utility.
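For completeness, the mapping from discrimination values to the "fairness" axis of Figure 9.9 can be sketched as follows; the exact scaling used for the figure may differ, so this is a hypothetical normalization, not the plotting code itself:

    import numpy as np

    def to_fairness(discrimination):
        # Scale discrimination values to [0, 1] (here by dividing by the
        # maximum) and flip so that higher values mean fairer outcomes:
        # fairness = 1 - normalized discrimination.
        d = np.asarray(discrimination, dtype=float)
        return 1.0 - d / d.max()

    print(to_fairness([0.05, 0.02, 0.00]))  # [0.  0.6 1. ]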
Chapter 10
Conclusion
10.1 Summary
In conclusion, this dissertation has undertaken a thorough investigation into the
enhancement of privacy and fairness in the processing of spatio-temporal data through
advanced algorithms and ML. Initially, we established a solid foundation by providing the
required background and an overview of state-of-the-art techniques. Then, we introduced a DP-focused algorithm for publishing location histograms. Our journey continued with
addressing fairness in processing spatial user data, where we identified and aimed to
mitigate biases inherent in ML models regarding geospatial neighborhoods of users.
We then expanded our focus to include the temporal aspect of spatio-temporal data,
increasing its complexity and delving into the ethical processing of such datasets. A notable
development was an algorithm for maintaining user privacy in sharing energy consumption
data, alongside an incentive-based program for fairness-aware electricity demand
management, considering socio-economic family attributes. Furthermore, we explored the
responsible release of spatio-temporal trajectories for industrial and analytical purposes,
ensuring privacy preservation.
Extensive numerical evaluations, involving both real-world and synthetic datasets, were
a critical part of this dissertation. These evaluations, covering various aspects like query
types, geospatial regions, and data distributions, provided a comprehensive analysis of our
proposed solutions. The findings not only signify advancement over previous works but also
highlight potential future research directions in responsibly handling complex
spatio-temporal data. In summary, this dissertation underscored the importance of
balancing innovation with ethical considerations in AI, particularly in the realms of privacy
and fairness, setting a precedent for future developments in responsible AI applications.
10.2 Future Directions
There are many areas within fairness and privacy that are still underdeveloped and not
extensively researched. The following will outline some of the main recommended research
topics for future exploration.
10.2.1 Privacy and Fairness in Large Language Models
As we step into the era of large language models, the interplay between privacy and fairness becomes an ever more critical concern. As models become more conversational, they will exhibit higher risks of privacy and fairness violations. Several avenues of active research may dictate the future of this area.
The Use of APIs. A significant number of LLMs have restricted access [15, 96, 134], and some of these models can be accessed via APIs. These APIs can host models along with an arsenal of ex-ante and post-hoc qualitative checks that enable API owners to control the privacy as well as the fairness of the model output. These qualitative checks are implemented
via a variety of techniques ranging from simple filters to secondary models that are trained
to detect and process model output in accordance with some chosen policies. As LLMs and
related APIs become more ubiquitous, the qualitative checks implemented by these APIs
are likely going to be key in terms of mitigating bias and privacy issues and improving the
overall responsibility of model output.
Logic-aware Models. When it comes to building models that are fair, the de-biasing process can affect privacy preservation [5]. One way to approach this problem is to explore bias mitigation methods that skip this step. Logic-aware language models [80] can be trained to reduce harmful stereotypes. Instead of typical sentence encoding, they use textual entailment, which learns whether parts of the second sentence entail, contradict, or are neutral with respect to parts of the first. Models trained in this way were significantly less biased than other baselines, without any extra data or additional training paradigms. Logic-aware training methods might be paired with privacy-preservation techniques to build models that are both private and fair. To address privacy preservation, smaller logical language models (i.e., 500x smaller than state-of-the-art models [80]) that are qualitatively measured as fair can be deployed locally with no human-annotated training samples for downstream tasks.
Privacy and Fairness in the Context of Learning from Human Feedback.
Fine-tuning with human feedback [97] has provided a promising way to make large language models align more closely with human intent. This technique can also be utilized to train models that are privacy-preserving and unbiased. Here, models are expected to learn to return content that is preferred by humans, based on a training loop where feedback is provided via a reward model trained to rank model output according to what humans might prefer. The collection of human preferences used to train the reward model can be made in a fair and private way so that the reward model learns these traits. This will in turn enable the foundational model to learn to generalize its behavior based on the feedback provided by the fair and privacy-preserving reward model. This approach has already shown some promise in reducing harmful content [97], but more research in this area is needed.
10.2.2 Fairness Through Privacy
The majority of previous approaches aimed at mitigating bias require access to sensitive
attributes. However, obtaining a significant amount of data with sensitive attributes is
often impractical due to people's growing privacy concerns and legal compliance requirements.
Consequently, a crucial area of research inquiry that merits attention is how to ensure fair
predictions while preserving privacy. This is a persistent challenge faced by technology
companies that seek to balance the goal of ensuring fair ML processing of user data,
including sensitive attributes such as race and gender, while simultaneously protecting
user privacy and restricting the use of sensitive user data.
10.2.3 Fair Privacy Protection
The authors in [40] pose a crucial question that sparked this subsection: Does a system
provide equivalent privacy protections to different groups of individuals? The main idea
behind fair privacy protection is to ensure that privacy mechanisms offer equal levels of
privacy to all users, meaning that users are being treated fairly in terms of the amount of
privacy protection they receive. Although there is a lower bound on the level of privacy achieved, such as in DP, some groups of the population may effectively receive stronger protection than others in a broader context.
The significance of fair privacy protection is illustrated by the following example, predicated on an observation made during the DP publication of the US 2020 census. US Census Bureau researchers observed that the Laplace mechanism in DP appeared to disadvantage low-population areas, such as villages, compared to highly populated metropolitan areas [94]. To demonstrate this, let us consider two cities, A and B, with populations $a$ and $b$, respectively, where $a \ll b$. The populations are sanitized using Laplace noise, with a noise value drawn from a Laplace distribution $\mathrm{Lap}(1/\epsilon)$ added to each population, and the private values are published. At first glance, both cities appear to be sanitized using the same Laplace distribution, and both achieve $\epsilon$-DP. However, a closer inspection examines the amount of noise added per individual in each city. Knowing that the variance of Laplace noise is $2/\epsilon^2$, the noise variance per individual is derived as $2/(a\epsilon^2)$ and $2/(b\epsilon^2)$, respectively. If $a \ll b$, then $2/(b\epsilon^2)$ is much less than $2/(a\epsilon^2)$. In other words, the amount of noise per individual in the low-population city is much higher than in the highly populated city, which raises questions about the fairness of the privacy guarantees imposed.
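A minimal numerical illustration of this disparity, assuming ε = 0.5 and two hypothetical populations, is given below:

    import numpy as np

    rng = np.random.default_rng(0)
    eps = 0.5
    pops = {"city A (village)": 1_000, "city B (metro)": 1_000_000}

    for name, pop in pops.items():
        # Both counts receive noise from the same Lap(1/eps) distribution,
        # so both releases satisfy eps-DP.
        noisy_count = pop + rng.laplace(loc=0.0, scale=1.0 / eps)
        # But the fixed noise variance 2/eps^2 amortizes over far fewer
        # residents in the small city.
        per_capita_var = (2.0 / eps**2) / pop
        print(f"{name}: noisy count = {noisy_count:.1f}, "
              f"per-capita noise variance = {per_capita_var:.2e}")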
10.2.4 Going Beyond Race and Gender
The existing literature primarily focuses on understanding the relationship between fairness
and privacy concerning gender and race, with most studies considering these attributes in
binary settings. However, achieving fairness for other sensitive attributes and their
interaction with privacy remains a significant challenge. This issue is sometimes referred to
as subgroup fairness [86] and is a complex and underexplored area. For instance, spatial
information, which is often a proxy for race [88], includes numerous groups vulnerable to
unfair treatment. This information is crucial in social and political decision-making
processes.
10.2.5 Other Applications
The methods developed, along with the context described, can be utilized and are relevant
to a wide range of other fields. Key applications include:
Health Care: Patient data closely resemble trajectories and represent a critical area
where utmost privacy and fairness measures are essential to guarantee the responsible
handling and modeling of data. Throughout their lives, as patients frequent hospitals, they
receive a unique identifier, and their diagnoses are documented over time [2]. This
time-series data, coupled with their zip code and various locations, offer a substantial
information source for decision-making in areas like cancer treatment [155], insurance
premiums [32], and human studies [156]. The privacy-preserving methods and
fairness-aware approaches presented in the dissertation could be applied to manage such
data more responsibly.
Wireless Sensor Networks: The recent progress in big data and low-energy
consumption sensors has transformed wireless sensor networks into one of the most
challenging and beneficial applications of the century. These networks are employed in
various sectors, including environmental monitoring [95], industrial automation [105, 111],
and the development of smart cities [131]. Typically, their data are transferred and stored
as spatio-temporal information and utilized in a wide range of decision-making processes.
The techniques introduced in this dissertation can be used for improved modeling and more
responsible handling of such data.
Transportation: A primary source of spatio-temporal data is spatial trajectories,
recorded as a sequence of locations that an entity traverses over time. This type of data is
crucial for the planning of smart cities and the allocation of budgets [9]. Furthermore, it
plays a vital role for businesses in selecting strategic locations for establishing their
operations [52]. The methods presented in this dissertation can be utilized for more
responsible sharing, forecasting, and modeling of such data, while also ensuring fair
treatment of both groups and individuals.
Bibliography
[1] Sara Abdali, Sina Shaham, and Bhaskar Krishnamachari. “Multi-modal
misinformation detection: Approaches, challenges and opportunities”. In: ACM
Computing Surveys (CSUR) (2024).
[2] Mervat Abu-Elkheir, Mohammad Hayajneh, and Najah Abu Ali. “Data management
for the internet of things: Design primitives and solution”. In: Sensors 13.11 (2013),
pp. 15582–15612.
[3] Abbas Acar, Hidayet Aksu, A Selcuk Uluagac, and Mauro Conti. “A survey on
homomorphic encryption schemes: Theory and implementation”. In: ACM
Computing Surveys (Csur) 51.4 (2018), pp. 1–35.
[4] Gergely Acs, Claude Castelluccia, and Rui Chen. “Differentially private histogram
publishing through lossy compression”. In: 2012 IEEE 12th International Conference
on Data Mining. IEEE. 2012, pp. 1–10.
[5] Sushant Agarwal. “Trade-Offs between Fairness and Privacy in Machine Learning”.
In: 2020.
[6] Terry H Anderson. The pursuit of fairness: A history of affirmative action. Oxford
University Press, 2004.
[7] Julia Angwin, Jeff Larson, Lauren Kirchner, and Surya Mattu. Machine bias. May
2016.
[8] Toktam Babaei, Hamid Abdi, Chee Peng Lim, and Saeid Nahavandi. “A study and a
directory of energy consumption data sets of buildings”. In: Energy and Buildings 94
(2015), pp. 91–99.
[9] Asma Belhadi, Youcef Djenouri, Kjetil Nørvåg, Heri Ramampiaro, Florent Masseglia,
and Jerry Chun-Wei Lin. “Space–time series clustering: Algorithms, taxonomy, and
case study on urban smart cities”. In: Engineering Applications of Artificial
Intelligence 95 (2020), p. 103857.
[10] Nawal Benabbou, Mithun Chakraborty, and Yair Zick. “Fairness and diversity in
public resource allocation problems”. In: Bulletin of the Technical Committee on
Data Engineering (2019).
[11] Hal Berghel. “Malice domestic: The Cambridge analytica dystopia”. In: Computer
51.05 (2018), pp. 84–89.
[12] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth.
“Fairness in criminal justice risk assessments: The state of the art”. In: Sociological
Methods & Research 50.1 (2021), pp. 3–44.
[13] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization.
Cambridge university press, 2004, pp. 243–244.
[14] Dirk Brounen, Nils Kok, and John M Quigley. “Residential energy use and
conservation: Economics and demographics”. In: European Economic Review 56.5
(2012), pp. 931–945.
[15] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot
Learners. 2020. arXiv: 2005.14165 [cs.CL].
[16] US Census Bureau. American National Standards Institute (ANSI) and Federal
Information Processing Series (FIPS) codes. Mar. 2022.
[17] Paul J Burke and Ashani Abayasekara. “The price elasticity of electricity demand in
the United States: A three-dimensional analysis”. In: The Energy Journal 39.2
(2018).
[18] Joshua W Busby, Kyri Baker, Morgan D Bazilian, Alex Q Gilbert, Emily Grubert,
Varun Rai, Joshua D Rhodes, Sarang Shidore, Caitlin A Smith, and
Michael E Webber. “Cascading risks: Understanding the 2021 winter blackout in
Texas”. In: Energy Research & Social Science 77 (2021), p. 102106.
[19] Toon Calders and Sicco Verwer. “Three naive Bayes approaches for
discrimination-free classification”. In: Data mining and knowledge discovery 21.2
(2010), pp. 277–292.
[20] California ISO. Final Root Cause Analysis Mid-August 2020 Extreme Heat Wave.
Accessed: 2023-06-24. 2020.
[21] Alex Campolo, Madelyn Rose Sanfilippo, Meredith Whittaker, and Kate Crawford.
“AI now 2017 report”. In: (2017).
[22] Simon Caton and Christian Haas. “Fairness in machine learning: A survey”. In:
arXiv preprint arXiv:2010.04053 (2020).
[23] April Chen, Ryan Rossi, Nedim Lipka, Jane Hoffswell, Gromit Chan, Shunan Guo,
Eunyee Koh, Sungchul Kim, and Nesreen K Ahmed. “Graph Learning with
Localized Neighborhood Fairness”. In: arXiv preprint arXiv:2212.12040 (2022).
[24] Canyu Chen, Yueqing Liang, Xiongxiao Xu, Shangyu Xie, Yuan Hong, and Kai Shu.
“When Fairness Meets Privacy: Fair Classification with Semi-Private Sensitive
Attributes”. In: Workshop on Trustworthy and Socially Responsible Machine
Learning, NeurIPS 2022.
[25] Rui Chen, Gergely Acs, and Claude Castelluccia. “Differentially private sequential
data publication via variable-length n-grams”. In: ACM CCS. 2012, pp. 638–649.
[26] Chicago Crime Dataset. 2015.
[27] Alexandra Chouldechova. “Fair prediction with disparate impact: A study of bias in
recidivism prediction instruments”. In: Big data 5.2 (2017), pp. 153–163.
[28] Graham Cormode, Minos Garofalakis, and Michael Shekelyan. “Data-Independent
Space Partitionings for Summaries”. In: Proceedings of the 40th ACM
SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 2021,
pp. 285–298.
[29] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu.
“Differentially private spatial decompositions”. In: 2012 IEEE 28th International
Conference on Data Engineering. IEEE. 2012, pp. 20–31.
[30] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, and Thanh TL Tran.
“Differentially private summaries for sparse data”. In: Proceedings of the 15th
International Conference on Database Theory. 2012, pp. 299–311.
[31] Forbes Tech Council. How Privacy Got on the Calendar. Accessed: 2023-05-01. 2023.
url: https://www.forbes.com/sites/forbestechcouncil/2023/01/24/howprivacy-got-on-the-calendar/?sh=517cc83959f7.
[32] Leemore Dafny, Jonathan Gruber, and Christopher Ody. “More insurers lower
premiums: Evidence from initial pricing in the health insurance marketplaces”. In:
American Journal of Health Economics 1.1 (2015), pp. 53–81.
[33] Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. “Fast and sample near-optimal
algorithms for learning multidimensional histograms”. In: Conference On Learning
Theory. PMLR. 2018, pp. 819–842.
[34] Matthew F Dixon, Igor Halperin, and Paul Bilokon. Machine learning in Finance.
Vol. 1170. Springer, 2020.
[35] Gang Du, Wei Lin, Chuanwang Sun, and Dingzhong Zhang. “Residential electricity
consumption after the reform of tiered pricing for household electricity in China”. In:
Applied Energy 157 (2015), pp. 276–283.
[36] Cynthia Dwork. “Differential privacy: A survey of results”. In: International
conference on theory and applications of models of computation. Springer. 2008,
pp. 1–19.
[37] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel.
“Fairness through awareness”. In: Proceedings of the 3rd innovations in theoretical
computer science conference. 2012, pp. 214–226.
[38] Cynthia Dwork and Christina Ilvento. “Individual fairness under composition”. In:
Proceedings of Fairness, Accountability, Transparency in Machine Learning (2018).
[39] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating
noise to sensitivity in private data analysis”. In: Theory of cryptography conference.
Springer. 2006, pp. 265–284.
[40] Michael D Ekstrand, Rezvan Joshaghani, and Hoda Mehrpouyan. “Privacy for all:
Ensuring fair and equitable privacy protections”. In: Conference on fairness,
accountability and transparency. PMLR. 2018, pp. 35–47.
[41] Ahmed Eldawy and Mohamed F Mokbel. “Spatialhadoop: A mapreduce framework
for spatial data”. In: 2015 IEEE 31st international conference on Data Engineering.
IEEE. 2015, pp. 1352–1363.
[42] Commission for Energy Regulation (CER). “CER Smart Metering Project -
Electricity Customer Behaviour Trial, 2009-2010”. In: (2012). [Dataset].
[43] National Centers for Environmental Information.
[44] Evanthia Faliagka, Athanasios Tsakalidis, and Giannis Tzimas. “An integrated
e-recruitment system for automated personality mining and applicant ranking”. In:
Internet research (2012).
[45] Liyue Fan and Li Xiong. “An adaptive approach to real-time aggregate monitoring
with differential privacy”. In: IEEE Transactions on knowledge and data engineering
26.9 (2013), pp. 2094–2106.
[46] Natasha Fernandes, Annabelle McIver, and Carroll Morgan. “The Laplace
Mechanism has optimal utility for differential privacy over continuous queries”. In:
2021 36th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS).
IEEE. 2021, pp. 1–12.
[47] Massimo Filippini and Shonali Pachauri. “Elasticities of electricity demand in urban
Indian households”. In: Energy policy 32.3 (2004), pp. 429–436.
[48] Brian Fischer. Data science methodology and applications. May 2021.
[49] Anthony W Flores, Kristin Bechtel, and Christopher T Lowenkamp. “False positives,
false negatives, and false analyses: A rejoinder to machine bias: There’s software
used across the country to predict future criminals. and it’s biased against blacks”.
In: Fed. Probation 80 (2016), p. 38.
[50] Siamak Ghodsi, Harith Alani, and Eirini Ntoutsi. “Context matters for fairness–a
case study on the effect of spatial distribution shifts”. In: arXiv preprint
arXiv:2206.11436 (2022).
[51] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. “On calibration of
modern neural networks”. In: International conference on machine learning. PMLR.
2017, pp. 1321–1330.
[52] Arash Hajisafi, Haowen Lin, Sina Shaham, Haoji Hu, Maria Despoina Siampou,
Yao-Yi Chiang, and Cyrus Shahabi. “Learning dynamic graphs from all contextual
information for accurate point-of-interest visit forecasting”. In: Proceedings of the
31st ACM International Conference on Advances in Geographic Information Systems.
2023, pp. 1–12.
[53] Moritz Hardt, Eric Price, and Nati Srebro. “Equality of opportunity in supervised
learning”. In: Advances in neural information processing systems 29 (2016),
pp. 3315–3323.
[54] Alan M Hay. “Concepts of equity, fairness and justice in geographical studies”. In:
Transactions of the Institute of British Geographers (1995), pp. 500–508.
[55] Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, and Dan Zhang.
“Principled evaluation of differentially private algorithms using dpbench”. In:
Proceedings of the 2016 International Conference on Management of Data. 2016,
pp. 139–154.
[56] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. “Boosting the
Accuracy of Differentially Private Histograms through Consistency”. In: Proc. VLDB
Endow. 3.1–2 (Sept. 2010), pp. 1021–1032.
[57] Erhu He, Yiqun Xie, Xiaowei Jia, Weiye Chen, Han Bao, Xun Zhou, Zhe Jiang,
Rahul Ghosh, and Praveen Ravirathinam. “Sailing in the location-based fairness-bias
sphere”. In: Proceedings of the 30th International Conference on Advances in
Geographic Information Systems. 2022, pp. 1–10.
[58] Joanne Hinds, Emma J Williams, and Adam N Joinson. “ “It wouldn’t happen to
me”: Privacy concerns and perspectives following the Cambridge Analytica scandal”.
In: International Journal of Human-Computer Studies 143 (2020), p. 102498.
[59] Naoise Holohan, Spiros Antonatos, Stefano Braghin, and Pól Mac Aonghusa. “The
bounded Laplace mechanism in differential privacy”. In: arXiv preprint
arXiv:1808.10410 (2018).
[60] Ali Inan, Murat Kantarcioglu, Gabriel Ghinita, and Elisa Bertino. “Private record
matching using differential privacy”. In: Proceedings of the 13th International
Conference on Extending Database Technology. 2010, pp. 123–134.
[61] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. “Fairness
in learning: Classic and contextual bandits”. In: Advances in neural information
processing systems 29 (2016).
[62] Faisal Kamiran and Toon Calders. “Data preprocessing techniques for classification
without discrimination”. In: Knowledge and information systems 33.1 (2012),
pp. 1–33.
[63] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma.
“Fairness-aware classifier with prejudice remover regularizer”. In: Joint European
conference on machine learning and knowledge discovery in databases. Springer. 2012,
pp. 35–50.
[64] David Kernert, Frank Köhler, and Wolfgang Lehner. “SpMacho-Optimizing Sparse
Linear Algebra Expressions with Probabilistic Density Estimation.” In: EDBT. 2015,
pp. 289–300.
[65] David Kernert, Wolfgang Lehner, and Frank Köhler. “Topology-aware optimization
of big sparse matrices and matrix multiplications on main-memory systems”. In:
2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE.
2016, pp. 823–834.
[66] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. “A Notion of
Individual Fairness for Clustering”. In: arXiv preprint arXiv:2006.04960 (2020).
[67] Fragkiskos Koufogiannis, Shuo Han, and George J Pappas. “Optimality of the laplace
mechanism in differential privacy”. In: arXiv preprint arXiv:1504.00065 (2015).
[68] Nikita Kozodoi, Johannes Jacob, and Stefan Lessmann. “Fairness in credit scoring:
Assessment, implementation and profit implications”. In: European Journal of
Operational Research 297.3 (2022), pp. 1083–1094.
[69] Matt J Kusner, Joshua R Loftus, Chris Russell, and Ricardo Silva. “Counterfactual
fairness”. In: arXiv preprint arXiv:1703.06856 (2017).
[70] Michelle Seng Ah Lee and Luciano Floridi. “Algorithmic fairness in mortgage
lending: from absolute conditions to relational trade-offs”. In: Minds and Machines
31.1 (2021), pp. 165–191.
[71] Jing Lei. “Differentially private m-estimators”. In: Advances in Neural Information
Processing Systems 24 (2011), pp. 361–369.
[72] Franklin Leukam Lako, Paul Lajoie-Mazenc, and Maryline Laurent.
“Privacy-preserving publication of time-series data in smart grid”. In: Security and
Communication Networks 2021 (2021), pp. 1–21.
[73] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. “A data-and workload-aware
algorithm for range queries under differential privacy”. In: Proceedings of the VLDB
Endowment 7.5 (2014), pp. 341–352.
[74] Chao Li and Gerome Miklau. “An Adaptive Mechanism for Accurate Query
Answering under Differential Privacy”. In: Proc. VLDB Endow. 5.6 (2012),
pp. 514–525.
[75] Chao Li and Gerome Miklau. “Optimal error of query sets under the
differentially-private matrix mechanism”. In: Proceedings of the 16th International
Conference on Database Theory. 2013, pp. 272–283.
[76] Ninghui Li, Min Lyu, Dong Su, and Weining Yang. Differential privacy: From theory
to practice. Springer, 2017.
[77] Peizhao Li and Hongfu Liu. “Achieving fairness at no utility cost via data reweighing
with influence”. In: International Conference on Machine Learning. PMLR. 2022,
pp. 12917–12930.
[78] Haowen Lin, Sina Shaham, Yao-Yi Chiang, and Cyrus Shahabi. “Generating
Realistic and Representative Trajectories with Mobility Behavior Clustering”. In:
Proceedings of the 31st ACM International Conference on Advances in Geographic
Information Systems. 2023, pp. 1–4.
[79] Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, and Zihuai Lin.
“When machine learning meets privacy: A survey and outlook”. In: ACM Computing
Surveys (CSUR) 54.2 (2021), pp. 1–36.
[80] Hongyin Luo and James Glass. Logic Against Bias: Textual Entailment Mitigates
Stereotypical Sentence Reasoning. 2023. arXiv: 2303.05670 [cs.CL].
[81] Lingjuan Lyu, Yee Wei Law, Jiong Jin, and Marimuthu Palaniswami.
“Privacy-preserving aggregation of smart metering via transformation and
encryption”. In: 2017 IEEE Trustcom/BigDataSE/ICESS. IEEE. 2017, pp. 472–479.
[82] Zheng Xiang Ma, Min Zhang, Sina Shaham, Shu Ping Dang, and Jessica Hart.
“Literature review of the communication technology and signal processing
methodology based on the smart grid”. In: Applied Mechanics and Materials 719
(2015), pp. 436–442.
[83] Sepideh Mahabadi and Ali Vakilian. “Individual fairness for k-clustering”. In:
International Conference on Machine Learning. PMLR. 2020, pp. 6586–6596.
[84] Ryan McKenna, Gerome Miklau, Michael Hay, and Ashwin Machanavajjhala.
“Optimizing Error of High-Dimensional Statistical Queries under Differential
Privacy”. In: Proc. VLDB Endow. 11.10 (2018), pp. 1206–1219.
[85] Frank D McSherry. “Privacy integrated queries: an extensible platform for
privacy-preserving data analysis”. In: Proceedings of the 2009 ACM SIGMOD
International Conference on Management of data. 2009, pp. 19–30.
[86] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and
Aram Galstyan. “A survey on bias and fairness in machine learning”. In: ACM
Computing Surveys (CSUR) 54.6 (2021), pp. 1–35.
[87] Aditya Krishna Menon and Robert C Williamson. “The cost of fairness in
classification”. In: arXiv preprint arXiv:1705.09055 (2017).
[88] Vishwali Mhasawade, Yuan Zhao, and Rumi Chunara. “Machine learning and
algorithmic fairness in public and population health”. In: Nature Machine
Intelligence 3.8 (2021), pp. 659–666.
[89] Mark Miller and Anna Alberini. “Sensitivity of price elasticity of demand to
aggregation, unobserved heterogeneity, price trends, and price endogeneity: Evidence
from US Data”. In: Energy Policy 97 (2016), pp. 235–249.
[90] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. “Obtaining well
calibrated probabilities using bayesian binning”. In: Twenty-Ninth AAAI Conference
on Artificial Intelligence. 2015.
[91] National Center for Education Statistics.
[92] Anand Nayyar, Lata Gadhavi, and Noor Zaman. “Machine learning in healthcare:
review, opportunities and challenges”. In: Machine Learning and the Internet of
Medical Things in Healthcare (2021), pp. 23–45.
[93] Kee Yuan Ngiam and Wei Khor. “Big data and machine learning algorithms for
health-care delivery”. In: The Lancet Oncology 20.5 (2019), e262–e273.
[94] William P O’Hare. Differential undercounts in the US census: who is missed?
Springer Nature, 2019.
[95] Luís ML Oliveira and Joel JPC Rodrigues. “Wireless Sensor Networks: A Survey on
Environmental Monitoring.” In: J. Commun. 6.2 (2011), pp. 143–151.
[96] OpenAI. GPT-4 Technical Report. 2023. arXiv: 2303.08774 [cs.CL].
[97] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray,
John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe.
Training language models to follow instructions with human feedback. 2022. arXiv:
2203.02155 [cs.CL].
[98] Dana Pessach and Erez Shmueli. “A review on fairness in machine learning”. In:
ACM Computing Surveys (CSUR) 55.3 (2022), pp. 1–44.
[99] John Platt et al. “Probabilistic outputs for support vector machines and
comparisons to regularized likelihood methods”. In: Advances in large margin
classifiers 10.3 (1999), pp. 61–74.
[100] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger.
“On fairness and calibration”. In: Advances in neural information processing systems
30 (2017).
[101] David Pujol and Ashwin Machanavajjhala. “Equity and Privacy: More Than Just a
Tradeoff”. In: IEEE Security & Privacy 19.6 (2021), pp. 93–97.
[102] Wahbeh Qardaji, Weining Yang, and Ninghui Li. “Differentially private grids for
geospatial data”. In: 2013 IEEE 29th international conference on data engineering
(ICDE). IEEE. 2013, pp. 757–768.
[103] Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy. “Mitigating bias
in algorithmic hiring: Evaluating claims and practices”. In: Proceedings of the 2020
conference on fairness, accountability, and transparency. 2020, pp. 469–481.
[104] Vibhor Rastogi and Suman Nath. “Differentially private aggregation of distributed
time-series with transformation and encryption”. In: Proceedings of the 2010 ACM
SIGMOD International Conference on Management of data. 2010, pp. 735–746.
[105] Mohsin Raza, Nauman Aslam, Hoa Le-Minh, Sajjad Hussain, Yue Cao, and
Noor Muhammad Khan. “A critical analysis of research potential, challenges, and
future directives in industrial wireless sensor networks”. In: IEEE Communications
Surveys & Tutorials 20.1 (2017), pp. 39–95.
[106] Christopher Riederer and Augustin Chaintreau. “The Price of Fairness in Location
Based Advertising”. In: (2017).
[107] George Rutherglen. “Disparate impact under title VII: an objective theory of
discrimination”. In: Va. L. Rev. 73 (1987), p. 1297.
[108] Dimitris Sacharidis, Giorgos Giannopoulos, George Papastefanatos, and
Kostas Stefanidis. “Auditing for Spatial Fairness”. In: arXiv preprint
arXiv:2302.12333 (2023).
[109] Andrew D Selbst. “Disparate impact in big data policing”. In: Ga. L. Rev. 52 (2017),
p. 109.
[110] Sina Shaham. “Location Privacy in the Era of Big Data and Machine Learning”.
PhD thesis. 2019.
[111] Sina Shaham, Shuping Dang, Miaowen Wen, Shahid Mumtaz, Varun G Menon, and
Chengzhong Li. “Enabling cooperative relay selection by transfer learning for the
industrial internet of things”. In: IEEE Transactions on Cognitive Communications
and Networking 8.2 (2022), pp. 1131–1146.
[112] Sina Shaham, Ming Ding, Matthew Kokshoorn, Zihuai Lin, Shuping Dang, and
Rana Abbas. “Fast channel estimation and beam tracking for millimeter wave
vehicular communications”. In: IEEE Access 7 (2019), pp. 141104–141118.
[113] Sina Shaham, Ming Ding, Bo Liu, Shuping Dang, Zihuai Lin, and Jun Li. “Privacy
preservation in location-based services: A novel metric and attack model”. In: IEEE
Transactions on Mobile Computing 20.10 (2020), pp. 3006–3019.
[114] Sina Shaham, Ming Ding, Bo Liu, Shuping Dang, Zihuai Lin, and Jun Li. “Privacy
preserving location data publishing: A machine learning approach”. In: IEEE
Transactions on Knowledge and Data Engineering 33.9 (2020), pp. 3270–3283.
[115] Sina Shaham, Ming Ding, Bo Liu, Zihuai Lin, and Jun Li. “Machine learning aided
anonymization of spatiotemporal trajectory datasets”. In: IEEE INFOCOM
2019-IEEE conference on computer communications workshops (INFOCOM
WKSHPS). IEEE. 2019, pp. 1–6.
[116] Sina Shaham, Ming Ding, Bo Liu, Zihuai Lin, and Jun Li. “Transition-Entropy: a
novel metric for privacy preservation in location-based services”. In: IEEE
INFOCOM 2019-IEEE Conference on Computer Communications Workshops
(INFOCOM WKSHPS). IEEE. 2019, pp. 1–6.
[117] Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, and Cyrus Shahabi.
“HTF: Homogeneous Tree Framework for Differentially Private Release of Large
Geospatial Datasets with Self-tuning Structure Height”. In: ACM Transactions on
Spatial Algorithms and Systems 9.4 (2023), pp. 1–30.
[118] Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, and Cyrus Shahabi.
“HTF: Homogeneous Tree Framework for Differentially-Private Release of Large
Geospatial Datasets with Self-Tuning Structure Height”. In: ACM Transactions on
Spatial Algorithms and Systems (2022).
[119] Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, and Cyrus Shahabi.
“HTF: homogeneous tree framework for differentially-private release of location
data”. In: Proceedings of the 29th International Conference on Advances in
Geographic Information Systems. 2021, pp. 184–194.
[120] Sina Shaham, Gabriel Ghinita, Bhaskar Krishnamachari, and Cyrus Shahabi.
“Differentially Private Publication of Smart Electricity Grid Data”. In: In submission
(2024).
[121] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. “Differentially-Private
Publication of Origin-Destination Matrices with Intermediate Stops”. In: Proceedings
of the 25th International Conference on Extending Database Technology (EDBT).
2022.
[122] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. “Enhancing the performance of
spatial queries on encrypted data through graph embedding”. In: IFIP Annual
Conference on Data and Applications Security and Privacy. Springer. 2020,
pp. 289–309.
[123] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. “Fair Spatial Indexing: A
paradigm for Group Spatial Fairness”. In: arXiv preprint arXiv:2302.02306 (2023).
[124] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. “Models and mechanisms for
spatial data fairness”. In: Proceedings of the VLDB Endowment. International
Conference on Very Large Data Bases. Vol. 16. 2. NIH Public Access. 2022, p. 167.
[125] Sina Shaham, Gabriel Ghinita, and Cyrus Shahabi. “Supporting secure dynamic
alert zones using searchable encryption and graph embedding”. In: The VLDB
Journal 33.1 (2024), pp. 185–206.
[126] Sina Shaham, Arash Hajisafi, Minh K Quan, Dinh C Nguyen,
Bhaskar Krishnamachari, Charith Peris, Gabriel Ghinita, Cyrus Shahabi, and
Pubudu N Pathirana. “Holistic Survey of Privacy and Fairness in Machine Learning”.
In: arXiv preprint arXiv:2307.15838 (2023).
[127] Sina Shaham, Matthew Kokshoorn, Ming Ding, Zihuai Lin, and
Mahyar Shirvanimoghaddam. “Extended kalman filter beam tracking for millimeter
wave vehicular communications”. In: 2020 IEEE International Conference on
Communications Workshops (ICC Workshops). IEEE. 2020, pp. 1–6.
[128] Sina Shaham, Matthew Kokshoorn, Zihuai Lin, Ming Ding, and Yi Wu. “Raf:
Robust adaptive multi-feedback channel estimation for millimeter wave mimo
systems”. In: 2018 IEEE Wireless Communications and Networking Conference
(WCNC). IEEE. 2018, pp. 1–6.
[129] Sina Shaham, Bhaskar Krishnamachari, and Matthew Kahn. “ILB: Graph Neural
Network Enabled Emergency Demand Response Program For Electricity”. In: arXiv
preprint arXiv:2310.00129 (2023).
[130] Sina Shaham, Saba Rafieian, Ming Ding, Mahyar Shirvanimoghaddam, and
Zihuai Lin. “On the importance of location privacy for users of location based
applications”. In: arXiv preprint arXiv:1911.01633 (2019).
[131] Himanshu Sharma, Ahteshamul Haque, and Frede Blaabjerg. “Machine learning in
wireless sensor networks for smart cities: a survey”. In: Electronics 10.9 (2021),
p. 1012.
[132] Suraj Shetiya, Ian P Swift, Abolfazl Asudeh, and Gautam Das. “Fairness-aware
range queries for selecting unbiased data”. In: 2022 IEEE 38th International
Conference on Data Engineering (ICDE). IEEE. 2022, pp. 1423–1436.
[133] Daniel J Solove. “Understanding privacy”. In: (2008).
[134] Saleh Soltan, Shankar Ananthakrishnan, Jack G. M. FitzGerald, Rahul Gupta,
Wael Hamza, Haidar Khan, Charith Peris, Stephen Rawls, Andy Rosenbaum,
Anna Rumshisky, Chandana Satya Prakash, Mukund Sridhar, Fabian Triefenbach,
Apurv Verma, Gokhan Tur, and Prem Natarajan. “AlexaTM 20B: Few-shot learning
using a large-scale multilingual seq2seq model”. In: arXiv (2022).
[135] Shuang Song, Yizhen Wang, and Kamalika Chaudhuri. “Pufferfish privacy
mechanisms for correlated data”. In: Proceedings of the 2017 ACM International
Conference on Management of Data. 2017, pp. 1291–1306.
[136] Swapna Thorve, Young Yun Baek, Samarth Swarup, Henning Mortveit,
Achla Marathe, Anil Vullikanti, and Madhav Marathe. “High resolution synthetic
residential energy use profiles for the United States”. In: Scientific Data 10.1 (2023),
p. 76.
[137] Financial Times. “Facebook privacy breach”. In: Financial Times (2020), pp. 11–12.
[138] Ali Tizghadam, Hamzeh Khazaei, Mohammad HY Moghaddam, and Yasser Hassan.
Machine learning in transportation. 2019.
[139] Jacopo Torriti. “Time of Use (ToU) electricity tariffs: Assessing the impacts in terms
of energy consumption, peak demand and costs”. In: Energy Policy 49 (2012),
pp. 423–430.
[140] Geoffrey KF Tso and Kelvin KW Yau. “A study of domestic energy usage patterns
in Hong Kong”. In: Energy 28.15 (2003), pp. 1671–1682.
[141] US Department of Agriculture.
[142] Peter VanWylen. Visualizing the education gap.
[143] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In:
Advances in neural information processing systems 30 (2017).
[144] Veraset. Veraset Movement data for the USA, The largest, deepest and broadest
available movement dataset (anonymized GPS signals).
https://datarade.ai/data-products/veraset-movement-data-for-the-usathe-largest-deepest-and-broadest-available-movement-dataset-veraset.
[Online; accessed 19-May-2021]. 2021.
[145] Caroline Wang, Bin Han, Bhrij Patel, and Cynthia Rudin. “In pursuit of
interpretable, fair and accurate machine learning for criminal recidivism prediction”.
In: Journal of Quantitative Criminology (2022), pp. 1–63.
[146] Yongzhi Wang, Hua Lv, and Yuqing Ma. “Geological tetrahedral model-oriented
hybrid spatial indexing structure based on Octree and 3D R*-tree”. In: Arabian
Journal of Geosciences 13 (2020), pp. 1–11.
[147] Zhongli Wang, Shuping Dang, Sina Shaham, Zhenrong Zhang, and Zhihan Lv.
“Basic research methodology in wireless communications: The first course for
research-based graduate students”. In: IEEE Access 7 (2019), pp. 86678–86696.
[148] Craig D Wenger, Douglas H Phanstiel, M Violet Lee, Derek J Bailey, and
Joshua J Coon. “COMPASS: A suite of pre-and post-search proteomics software
tools for OMSSA”. In: Proteomics 11.6 (2011), pp. 1064–1074.
[149] Leonard Weydemann, Dimitris Sacharidis, and Hannes Werthner. “Defining and
measuring fairness in location recommendations”. In: Proceedings of the 3rd ACM
SIGSPATIAL international workshop on location-based recommendations, geosocial
networks and geoadvertising. 2019, pp. 1–8.
[150] Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. “Differential privacy via
wavelet transforms”. In: IEEE Transactions on knowledge and data engineering 23.8
(2010), pp. 1200–1214.
[151] Yonghui Xiao, Li Xiong, Liyue Fan, Slawomir Goryczka, and Haoran Li. “DPCube:
Differentially Private Histogram Release through Multidimensional Partitioning”. In:
7.3 (2014), pp. 195–222.
[152] Yonghui Xiao, Li Xiong, and Chun Yuan. “Differentially private data release through
multidimensional partitioning”. In: Workshop on Secure Data Management. Springer.
2010, pp. 150–168.
[153] Yiqun Xie, Erhu He, Xiaowei Jia, Weiye Chen, Sergii Skakun, Han Bao, Zhe Jiang,
Rahul Ghosh, and Praveen Ravirathinam. “Fairness by “Where”: A
Statistically-Robust and Model-Agnostic Bi-level Learning Framework”. In:
Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 11. 2022,
pp. 12208–12216.
[154] Jia Xu, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Ge Yu, and Marianne Winslett.
“Differentially private histogram publication”. In: The VLDB journal 22 (2013),
pp. 797–822.
[155] Yiwen Xu, Ahmed Hosny, Roman Zeleznik, Chintan Parmar, Thibaud Coroller,
Idalid Franco, Raymond H Mak, and Hugo JWL Aerts. “Deep learning predicts lung
cancer treatment response from serial medical imaging”. In: Clinical Cancer Research
25.11 (2019), pp. 3266–3275.
[156] Scott L Zeger, Rafael Irizarry, and Roger D Peng. “On time series analysis of public
health and biomedical data”. In: Annu. Rev. Public Health 27 (2006), pp. 57–79.
[157] Chengyuan Zhang, Ying Zhang, Wenjie Zhang, and Xuemin Lin. “Inverted linear
quadtree: Efficient top k spatial keyword search”. In: IEEE Transactions on
Knowledge and Data Engineering 28.7 (2016), pp. 1706–1721.
[158] Jun Zhang, Xiaokui Xiao, and Xing Xie. “Privtree: A differentially private algorithm
for hierarchical decompositions”. In: Proceedings of the 2016 International
Conference on Management of Data. 2016, pp. 155–170.
[159] Li Zhang, Yuwen Qian, Ming Ding, Chuan Ma, Jun Li, and Sina Shaham. “Location
privacy preservation based on continuous queries for location-based services”. In:
IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops
(INFOCOM WKSHPS). IEEE. 2019, pp. 1–6.
242
[160] Xiaojian Zhang, Rui Chen, Jianliang Xu, Xiaofeng Meng, and Yingtao Xie. “Towards
accurate histogram publication under differential privacy”. In: Proceedings of the
2014 SIAM international conference on data mining. SIAM. 2014, pp. 587–595.
[161] Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. “Differentially private
real-time release of sequential data”. In: ACM Transactions on Privacy and Security
26.1 (2022), pp. 1–29.
[162] Michael J Zimmer. “Emerging uniform structure of disparate treatment
discrimination litigation”. In: Ga. L. Rev. 30 (1995), p. 563.