A Function Approximation View of Database Operations for Efficient, Accurate, Privacy-Preserving & Robust Query Answering with Theoretical Guarantees

by Sepanta Zeighami

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2024

Copyright 2024 Sepanta Zeighami

Table of Contents

List of Tables
List of Figures
Abstract

I Preliminaries

Chapter 1: Introduction
  1.1 Background
  1.2 Thesis Statement and Summary
    1.2.1 The Neural Database Framework
      1.2.1.1 NeuroDB for Approximate Query Processing
      1.2.1.2 NeuroDB for Privacy-Preserving Query Answering
      1.2.1.3 NeuroDB for Incomplete Relational Data
    1.2.2 Theoretical Analysis of Learned Database Operations
      1.2.2.1 Theoretical Performance Guarantees for Static Learned Indexes
      1.2.2.2 Theoretical Performance Guarantees for Dynamic Learned Database Operations
      1.2.2.3 Required Model Size for Learned Database Operations
  1.3 Thesis Outline

Chapter 2: The Function Approximation View of Database Operations
  2.1 Operation Functions
    2.1.1 Rank Operation Function
    2.1.2 Cardinality Operation Function
    2.1.3 Range-Aggregate Operation Function
    2.1.4 Discussion
  2.2 Learned Database Operations

II The Neural Database Framework

Chapter 3: Overview of The Neural Database Framework

Chapter 4: A Neural Database for Answering Range Aggregate Queries
  4.1 Introduction
  4.2 Problem Definition
  4.3 DQD Bound for Neural Networks Answering RAQs
    4.3.1 DQD Bound Statement
      4.3.1.1 Incorporating Data Distribution
      4.3.1.2 Theorem Statement
      4.3.1.3 Impact of Distribution and LDQ
      4.3.1.4 Measuring Complexity in Practice
    4.3.2 DQD Bound Proof
      4.3.2.1 Analysis Framework
      4.3.2.2 Bounding Approximation Error
      4.3.2.3 Bounding Sampling Error
      4.3.2.4 Completing the Proof
    4.3.3 Other Query Functions and Model Choices
      4.3.3.1 AVG Aggregation Function
      4.3.3.2 Other Query Functions
      4.3.3.3 DQD for Query Modelling Approaches
  4.4 NeuroSketch
    4.4.1 NeuroSketch Overview
    4.4.2 NeuroSketch Details
    4.4.3 General RAQs and Real-World Application
  4.5 Empirical Study
    4.5.1 Experimental Setup
    4.5.2 Baseline Comparisons
      4.5.2.1 Results Across Datasets
      4.5.2.2 Results Across Different Workloads
    4.5.3 Model Architecture Analysis
      4.5.3.1 Time/Space/Accuracy Trade-Offs of Model Architectures
      4.5.3.2 Visualizing NeuroSketch for Different Model Depth
    4.5.4 NeuroSketch Generalization Analysis
    4.5.5 Ablation Study of Partitioning
    4.5.6 NeuroSketch Preprocessing Time Analysis
    4.5.7 Confirming DQD Bound with NeuroSketch
  4.6 Related Work
  4.7 Conclusion
  4.8 Appendix
    4.8.1 Proofs
      4.8.1.1 Proof of Theorem 2
      4.8.1.2 Proof of Technical Lemmas for Theorem 2
      4.8.1.3 Proof of Theorem 3
      4.8.1.4 Proof of Lemma 1
    4.8.2 Utilizing Construction in Practice

Chapter 5: A Neural Database for Differentially Private Spatial Range Queries
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Differential Privacy
    5.2.2 Problem Definition
  5.3 Spatial Neural Histograms (SNH)
    5.3.1 Baseline Solution using DP-SGD
    5.3.2 A different learning paradigm for RCQs
    5.3.3 Proposed approach: SNH
  5.4 Technical Details
    5.4.1 Step 1: Data Collection
    5.4.2 Step 2: SNH Training
    5.4.3 Model Utilization
  5.5 End-to-End System Aspects
    5.5.1 System Tuning with ParamSelect
      5.5.1.1 ParamSelect for ρ
      5.5.1.2 Generalizing ParamSelect to any system parameter
    5.5.2 Privacy and Security Discussion
  5.6 Experimental Evaluation
    5.6.1 Experimental Settings
      5.6.1.1 Datasets
      5.6.1.2 SNH system parameters
      5.6.1.3 Other experimental settings
    5.6.2 Comparison with Baselines
    5.6.3 Ablation Study for SNH
      5.6.3.1 Modeling choices
      5.6.3.2 Balancing Uniformity Errors
      5.6.3.3 ParamSelect and ρ
      5.6.3.4 SNH Learning Ability in Non-Uniform Datasets
  5.7 Related Work
  5.8 Conclusion
  5.9 Appendix
    5.9.1 DP Proof and Security Discussion
    5.9.2 Complementary Experimental Results
    5.9.3 Differentially Private STHoles Implementation
    5.9.4 ParamSelect Feature Engineering and Feature Selection

Chapter 6: A Neural Database for Queries on Incomplete Relational Data
  6.1 Introduction
  6.2 Definitions and Overview
    6.2.1 NeuroComplete Framework
  6.3 Training Set Creation
  6.4 Query Embedding
    6.4.1 Overview
    6.4.2 Row Relevance Calculation
    6.4.3 Row Aggregation
    6.4.4 Multiple Tables and Final Embedding Algorithm
  6.5 End-to-End System and Discussion
    6.5.1 End-To-End System
    6.5.2 Further Considerations
  6.6 Empirical Study
    6.6.1 Experimental Setup
    6.6.2 Comparison Results
    6.6.3 Training vs. Test Query Distribution Analysis
    6.6.4 Multiple Incomplete Tables
    6.6.5 Scalability and Efficiency Analysis
    6.6.6 Number of Training Samples
    6.6.7 Case-Study: Estimating AVG Visit Duration
  6.7 Related Work
  6.8 Conclusion

III Theoretical Analysis of Learned Database Operations

Chapter 7: Overview of Theoretical Analysis

Chapter 8: Theoretical Performance Guarantees for Static Learned Indexes
  8.1 Introduction
  8.2 Preliminaries and Related Work
    8.2.1 Problem Definition
    8.2.2 Related Work
  8.3 Asymptotic Behaviour of Learned Indexing
    8.3.1 Constant Time and Near-Linear Space
    8.3.2 Log-Logarithmic Time and Constant Space
    8.3.3 Log-Logarithmic Time and Quasi-Linear Space
    8.3.4 Distributions with Other Domains
  8.4 Proofs
    8.4.1 Proof of Theorem 6: PCA Index
      8.4.1.1 Approximating Rank Function
      8.4.1.2 Index Construction and Querying
      8.4.1.3 Complexity Analysis
    8.4.2 Proof of Theorem 7: RDS Algorithm
      8.4.2.1 Approximating Rank Function
      8.4.2.2 Querying
      8.4.2.3 Complexity Analysis
    8.4.3 Proof of Theorem 8: RDA Index
      8.4.3.1 Approximating Rank Function
      8.4.3.2 Index Construction and Querying
      8.4.3.3 Complexity Analysis
  8.5 Experiments
    8.5.1 Results on Synthetic Datasets
    8.5.2 Results on Real Datasets
  8.6 Conclusion
  8.7 Proofs

Chapter 9: Theoretical Guarantees for Dynamic Learned Database Operations
  9.1 Introduction
    9.1.1 Summary of Results
  9.2 Preliminaries
    9.2.1 Problem Setting
    9.2.2 Learned Database Operations
  9.3 Analysis through Distribution Learnability
    9.3.1 The Modeling Problem
      9.3.1.1 Defining Distribution Learnability
      9.3.1.2 Proving Distribution Learnability
    9.3.2 The Model Utilization Problem
  9.4 Results
    9.4.1 Indexing Dynamic Data
    9.4.2 Cardinality Estimation
    9.4.3 Sorting
  9.5 Related Work
  9.6 Conclusion
  9.7 Appendix
    9.7.1 Formalized Setup and Operations
    9.7.2 Distribution Learnability Through Function Approximation
    9.7.3 Proofs
      9.7.3.1 Lemma 9
      9.7.3.2 Theorem 12 (formally Theorem 19)
      9.7.3.3 Lemma 16
      9.7.3.4 Learned Indexing
      9.7.3.5 Cardinality Estimation
      9.7.3.6 Sorting

Chapter 10: Required Model Size for Learned Database Operations
  10.1 Introduction
    10.1.1 Our Results
  10.2 Preliminaries
  10.3 Lower Bounds On Model Size for Database Operations
    10.3.1 Bounds Considering Worst-Case Error
    10.3.2 Bounds Considering Average-Case Error with Uniform Distribution
      10.3.2.1 Learned Indexing
      10.3.2.2 Learned Cardinality Estimation
      10.3.2.3 Range-Sum Estimation
    10.3.3 Bounds Considering Average-Case Error with Arbitrary Distribution
  10.4 Empirical Results
  10.5 Related Work
  10.6 Discussion
  10.7 Conclusion
  10.8 Proofs
    10.8.1 Intuition
    10.8.2 Background
    10.8.3 Results with ∞-Norm
      10.8.3.1 Proof of Theorem 20
      10.8.3.2 Proof of Lemma 19
    10.8.4 Results with 1-Norm
      10.8.4.1 Proof of Theorem 21
      10.8.4.2 Proof of Theorem 22
      10.8.4.3 Proofs for Range-Sum Estimation
    10.8.5 Results with µ-Norm
      10.8.5.1 Proof of Theorem 25
      10.8.5.2 Proof of Lemma 20
    10.8.6 Proof of Technical Lemmas

IV Conclusion

Chapter 11: Conclusions

Bibliography

List of Tables

4.1 Dataset information
4.2 Median visit duration for general rectangles
4.3 Improvement of partitioning over no partitioning
4.4 DQD Bound on 2D Real/Benchmark Datasets
5.1 Summary of Notations
5.2 Urban datasets characteristics
5.3 Validation set error of ParamSelect in predicting ρ
6.1 Incomplete dataset generation setup
6.2 Testing query predicate and group by attributes
9.1 Summary of results for data sampled from a distribution learnable class X (CE: cardinality estimation; †: for simplicity assuming $S_n^X$, $B_n^X$ are at most linear in data size; see Theorem 13 and Theorem 17 for general cases)
9.2 Asymptotic complexities of some distribution learnable classes for the rank function defined in Lemma 16
10.1 Our bounds on required model size in terms of data size, n, dimensionality, d, tolerable error, ϵ, and domain size, u. Each column shows the result when ϵ is the tolerable error for the specified error scenario. X: no non-trivial bound possible

List of Figures

4.1 (left) Database of location signals. (right) Avg. visit duration query function. Color shows visit duration in hours
4.2 Constructed neural network and its architecture. Values on edges and nodes show edge weight and unit bias
4.3 Neural Network Construction Steps
4.4 NeuroSketch Framework
4.5 Measure column distribution (shared y-axis)
4.6 RAQs on different datasets
4.7 Varying query range
4.8 Varying no. of active attributes
4.9 Varying agg. function
4.10 Time/Space/Accuracy Trade-Off with Different Model Architectures
4.11 Learned NeuroSketch Visualization
4.12 Generalization Study
4.13 Preprocessing Time Study
4.14 DQD Bound on Synthetic Datasets
4.15 2D data subsets
4.16 Learned and True Query Functions on 2D Datasets
4.17 Function surface of $\hat{g}_i(x)$ for a 2-dimensional x
4.18 Non-linearities in 3 dimensions. Figure shows the input space partitioned by hyperplanes corresponding to points of non-linearity
4.19 Construction vs. SGD
5.1 Spatial Neural Histogram System
5.2 SNH Overview
5.3 Data Collection: map view (left), true cell count heatmap (middle), ε-DP heatmap with noisy counts (right)
5.4 Model Training: Augmented query sets of size $r_1$ to $r_k$ (top) are used to learn neural network models (bottom)
5.5 Model utilization: 30m query answered from 25m network (left), 90m query from 100m network (right)
5.6 Impact of privacy budget: VS, SPD-VS and CABS datasets
5.7 Impact of privacy budget: GW dataset
5.8 Impact of data and query size
5.9 Study of modeling choice
5.10 Impact of uniformity assumption
5.11 Impact of ρ and ParamSelect
5.12 SNH learns patterns on GMM dataset of 16 components. Color shows number of data points
5.13 Impact of data skewness (ε = 0.2)
5.14 Milwaukee (VS), ε = 0.2, n = 100k
5.15 Replacing uniformity error with noise
5.16 Study of ParamSelect
5.17 Impact of k
5.18 Impact of model depth
5.19 ε = 0.05, σ = 14
5.20 ε = 0.2, σ = 14
5.21 ε = 0.05, σ = 7
5.22 ε = 0.2, σ = 7
5.23 ε = 0.05, σ = 3.5
5.24 ε = 0.2, σ = 3.5
6.1 Running Example of Apartments Dataset
6.2 NeuroComplete Framework
6.3 Query Embedding Example
6.4 Row Relevance Calculation
6.5 Multi table query embedding
6.6 Dataset information [49]
6.7 Results for H1 AVG Queries
6.8 Results for H2 AVG Queries
6.9 Results for M1 AVG Queries
6.10 Results for M2 AVG Queries
6.11 Results for H1 COUNT Queries
6.12 Results for H2 COUNT Queries
6.13 Results for M1 COUNT Queries
6.14 Results for M2 COUNT Queries
6.15 (a) and (b): visualizing training and test distributions. (c): Avg. distance to the nearest training query from test queries
6.16 Robustness to Missing Attributes
6.17 Query Time
6.18 Training size and duration
6.19 Comparison of Sampling and Learning
8.1 A learned index used to solve the rank problem
8.2 Approximation with a piecewise constant function
8.3 Approximation with c.d.f.
8.4 RMI of height log log n with piecewise constant models
8.5 Constant Query and Near-Linear Space
8.6 Log-Logarithmic Query and Constant Space
8.7 Log-Logarithmic Query and Quasi-Linear Space
8.8 Constant Query Time on Real Datasets
8.9 Near-Linear Space on Real Datasets
8.10 Log-Logarithmic Query on Real Datasets
8.11 Quasi-Linear Space on Real Datasets
9.1 Structure of the learned dynamic index
9.2 Insertion Causing a split in index
10.1 Theoretical Bounds in Practice
Abstract

Machine learning models have recently been used to replace various database components (e.g., index, cardinality estimator) and have shown substantial performance enhancements over existing non-learned alternatives. Such approaches take a function approximation view of database operations. They consider a database operation as a function that can be approximated (e.g., an index is a function that maps items to their location in a sorted array) and learn a model to approximate the operation's output. However, the theoretical characteristics of such approaches have not been well understood. This lack of theoretical guarantees on their performance greatly limits their practical applicability. Moreover, from a practical perspective, existing approaches only optimize specific components within a database system, leaving the accuracy and efficiency of the database system as a whole unoptimized for a specific workload.

This thesis addresses the above two shortcomings. It provides the first-ever theoretical guarantees for various learned database operations and presents novel practical solutions to improve the performance of learned database systems. From a practical perspective, we develop the Neural Database (NeuroDB) framework, which extends the function approximation view of database operations by considering the entire database system as a function that can be approximated. In this framework, we train neural networks that take queries as input and output query answer estimates. Using this framework, we show substantial performance benefits for various important database problems, including approximate query processing, privacy-preserving query answering, and query answering on incomplete datasets. From a theoretical perspective, we present a pioneering theoretical study of the function approximation view of database operations, providing the first-ever theoretical analysis of various learned database operations, including indexing, cardinality estimation, sorting, and range-sum estimation. Our analysis provides theoretical guarantees on the performance of the learned models, showing why and when they perform well. Furthermore, we theoretically study model size requirements, showing how model size needs to be set to achieve a desired accuracy level. Our results build a foundation for the theoretical analysis of learned database operations that enhances our understanding of the learned operations and provides the much-needed theoretical guarantees on their performance for robust practical deployment.

Part I: Preliminaries

Chapter 1: Introduction

1.1 Background

Database systems are the backbone of many real-world applications, including data analytics, decision-making, and machine learning, supporting various data operations (answering queries, sorting data, inserting data in an index, etc.)
as needed by downstream tasks. Systems often perform (part of) such data operations approximately, for a variety of reasons spanning different applications: (1) often an approximate answer can be refined to find the exact answer more efficiently [63, 62, 160]; (2) exact answers may be too computationally expensive to obtain, and fast, accurate estimates can be preferable [51, 27]; (3) to protect privacy, one may have to provide approximate answers to avoid revealing too much information about an individual (e.g., to preserve differential privacy) [8, 108]; and (4) due to missing records or attribute values, true query answers may be unknown, and systems may only be able to estimate them using the available data [164, 49]. Consequently, approximate database operations have been ubiquitous in database systems and research.

Recently, machine learning methods have been used to perform such database operations, and have shown significant practical benefits across various tasks [63, 62, 61, 51]. Such learned database operations replace a specific database operation, e.g., indexing or cardinality estimation, with a learned model. The models are trained to perform a specific desired operation well and, by utilizing existing patterns in the data and queries, are often much faster than the non-learned alternatives. An example is learned indexing, where instead of building an index on an array to search its elements, a model is learned that predicts the location of an item in the array.

Learned database operations follow a function approximation view of database operations, where a specific database operation is seen as a function that takes an input (i.e., the query) and produces an output (i.e., the query answer). Then, a model is learned that approximates this function well. For example, in the case of learned indexing, a learned model approximates the function that maps a query (input) to its location in a sorted array (output). Many existing results show significant practical benefits to the use of learned models in this framework, with use cases in indexing [62, 40], cardinality estimation [61, 89], sorting [63], query optimization [131], etc. These benefits are attributed to being able to optimize the performance of the learned model for the specific data and query workload.
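To make the learned indexing example concrete, the following is a minimal sketch (ours, for illustration; not the constructions analyzed in this thesis) assuming a single linear model fit to the key-position relationship of a sorted array. The recorded worst-case training error, `max_err` (an illustrative name), bounds the local search that corrects the model's prediction:

```python
import numpy as np

def build_learned_index(keys):
    """Fit position ~ a*key + b on a sorted array and record the worst-case
    prediction error over the keys, which bounds the fallback search window."""
    positions = np.arange(len(keys))
    a, b = np.polyfit(keys, positions, deg=1)  # least-squares line
    max_err = int(np.ceil(np.max(np.abs(a * keys + b - positions)))) + 1
    return a, b, max_err

def lookup(keys, index, q):
    """Predict q's position, then binary-search only a small window around it."""
    a, b, max_err = index
    guess = int(np.clip(a * q + b, 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + np.searchsorted(keys[lo:hi], q)  # position of q within keys

keys = np.sort(np.random.default_rng(0).uniform(0, 1, 10_000))
index = build_learned_index(keys)
assert lookup(keys, index, keys[42]) == 42
```

Practical learned indexes replace the single linear model with a hierarchy of models, but the predict-then-correct structure is the same: when the model fits the data well, the search window is small and lookups beat a full binary search.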
Yet, there remain open questions, both theoretical and practical, that limit the effectiveness of the learned models and undermine their deployment in the real world.

First, from a practical perspective, the entire database system as a whole may not be well-optimized, both in terms of efficiency and accuracy, to perform a specific data and query workload. Regarding efficiency, although existing approaches use machine learning to optimize specific components within the database system, the components used (e.g., a specific index, or a choice of data representation) may not be the best choice for a specific workload. This leads to database systems that are suboptimal for the data and query workload in terms of efficiency. Moreover, the accuracy of database systems is often affected by real-world constraints, including the need to guarantee user privacy while answering queries, and the existence of incomplete data records (e.g., when some records are missing due to data integration or incomplete data collection) within a database. The database system needs to be well-optimized for a query workload to answer queries from the workload as accurately as possible under such real-world constraints. Overall, to design truly instance-optimized database systems, one needs to be able to optimize the end-to-end database system for the specific data and query workload at hand to achieve the best accuracy and efficiency under real-world constraints.

Second, even after addressing the above question in practice, learned methods lack theoretical guarantees on their performance. This lack of theoretical guarantees poses a significant hurdle to their practical deployment, especially since the non-learned alternatives often provide the required theoretical guarantees (e.g., [5, 100, 48, 13]). Such guarantees are needed to ensure the reliability of the learned operations at deployment time, that is, to ensure consistent performance of the learned model on databases where the learned model had not been a priori evaluated. It is unknown why and when such learned models perform well; this lack of explainability makes the choice of when to use learned models over non-learned alternatives difficult. Theoretical guarantees are especially important for dynamic datasets (that is, datasets that support insertion of new data points). In such cases, one needs to know, before deploying a specific learned approach in practice, how well it will perform in the future, possibly in the presence of a shifting data distribution.

1.2 Thesis Statement and Summary

This thesis presents a thorough treatment of the above two questions, taking a significant step in the evolution of learned database operations. We introduce novel solutions for building fully optimized database systems in practice, and provide an in-depth theoretical analysis of learned database operations that builds the foundation for their theoretical understanding. By doing so, this thesis demonstrates that:

Thesis Statement. "Learned models can be used to perform specific database operations efficiently and accurately under real-world constraints on dynamic datasets with theoretical guarantees."

Part I of this thesis discusses the required preliminaries and formalizes notation. Then, in Part II, we discuss how we can use learned models to perform database operations efficiently and accurately under real-world constraints, such as query answering while preserving privacy and answering queries on incomplete datasets. Part III shows how this can be done on dynamic datasets with theoretical guarantees. This thesis is summarized below.

1.2.1 The Neural Database Framework

To design database systems that are fully optimized for a data and query workload, in Part II, we present the neural database framework (or NeuroDB for short), a fundamentally new approach to data management that trains models to directly answer queries. The framework is more formally described in Chapter 3, where we discuss how, in this framework, neural networks are trained, using supervised learning, to take queries as input and output query answer estimates. Datasets are represented by neural network weights and are queried through a model forward pass. By training neural networks end-to-end, NeuroDB optimizes the entire query processing pipeline, thus building learned models that are fully optimized for a specific workload. NeuroDB extends the function approximation view of database operations by considering the entire query answering pipeline as a function that can be approximated, and learns neural networks to approximate it.
NeuroDB always outputs an estimate of the query answer (not the exact answer), and we show significant benefits to this framework when approximate query answers are acceptable. Specifically, we next discuss the use of NeuroDB for approximate query processing (Sec. 1.2.1.1), privacy-preserving query answering (Sec. 1.2.1.2), and querying incomplete datasets (Sec. 1.2.1.3), scenarios where approximate answers are acceptable.

1.2.1.1 NeuroDB for Approximate Query Processing

We discuss the application of NeuroDB to approximate query processing in Chapter 4, where the goal is to provide efficient, approximate query answers. We specifically consider answering range aggregate queries (RAQs), which are the building block of many real-world applications (e.g., calculating net profit for a period from sales records). RAQs filter a dataset by a range predicate and ask for an aggregation of some attribute in the filtered dataset. Due to the large volume of data, exact answers can take too long to compute, and fast approximate answers may be preferred. There is a time/space/accuracy trade-off, where algorithms can sacrifice accuracy for time or space. We show that applying the NeuroDB framework can provide better trade-offs than existing learned or non-learned methods. Our results show that, following the NeuroDB framework, we can train small neural networks that can be efficiently evaluated and provide accurate query answer estimates.

1.2.1.2 NeuroDB for Privacy-Preserving Query Answering

Chapter 5 discusses the application of NeuroDB to privacy-preserving query answering, where the goal is to answer queries on a database while preserving the privacy of the users who contributed records to the dataset. Specifically, we consider answering COUNT queries on location datasets, that is, answering how many people are in a certain location. Such queries are common on location datasets, often collected from mobile apps and used for various purposes such as optimizing traffic or studying disease spread. Differential privacy is often used to protect privacy and ensure that the location of a specific user cannot be inferred when releasing aggregate location density information from such datasets. We show how the NeuroDB framework can be applied to this setting by training a model, while preserving differential privacy, to answer the queries. Our results show that using the NeuroDB framework, one can answer queries much more accurately than existing methods, while providing the same privacy guarantees.

1.2.1.3 NeuroDB for Incomplete Relational Data

Finally, in Chapter 6, we discuss how NeuroDB can help with answering queries on incomplete datasets, where the goal is to provide accurate query answers while the observed dataset is known to contain missing records. Real-world databases are often incomplete for various reasons, including the cost of data collection, privacy considerations, or as a side effect of data integration/preparation. For instance, to know housing prices in an area, collecting information for every house is costly, if not impossible, but Airbnb already provides a sample for free [54] (the dataset is a sample because it only contains Airbnb housing and not other housing sources). In such scenarios, some records are entirely missing from the dataset. Meanwhile, OLAP applications require answering aggregate queries on such incomplete datasets. We study this problem, aiming to provide accurate query answers while only having access to such an incomplete dataset.
Our results show that the NeuroDB framework can provide much more accurate query answers than existing methods when some records are systematically missing from a database.

1.2.2 Theoretical Analysis of Learned Database Operations

In Part III of this thesis, we focus on the theoretical study of learned database operations, as overviewed in Chapter 7. Specifically, we broadly study various learned database operations (including indexing, sorting, and cardinality estimation), developing theoretical tools that build a foundation for the theoretical analysis of learned database operations. Our analysis focuses on two problems. First, we study how to provide performance guarantees for learned database operations under specific modeling choices, where we show bounds on the time and space complexity of such learned database operations on static (i.e., fixed) datasets (Sec. 1.2.2.1) and on dynamic datasets (i.e., when new data records can be inserted) in the presence of distribution shift (Sec. 1.2.2.2). Second, we study what modeling choices are needed to perform database operations well, where we specifically show lower bounds on the required model size to perform various database operations to a desired accuracy (Sec. 1.2.2.3).

1.2.2.1 Theoretical Performance Guarantees for Static Learned Indexes

First, in Chapter 8, we study the problem of indexing, where experimental results show significant benefits to the use of learned solutions [44, 62, 40, 33]. In this fundamental problem in data management, the goal is to find, given a query, the elements in the dataset that match the query (e.g., given a number q, find the student with grade=q, where "grade=q" is the query on a dataset of students). When the query is on a single attribute (e.g., we filter students only based on grade), and the data is sorted based on this attribute, B-trees and their variants find the answer in $O(\log n)$. Experimental results, however, show that learned indexes can be significantly faster than B-trees (and other non-learned approaches). Meanwhile, no theoretical result has justified their superior practical performance. We theoretically study this problem, showing that, in expectation, learned approaches are indeed orders of magnitude, and asymptotically, faster than non-learned alternatives. Specifically, we show that using the same space overhead as traditional indexes (e.g., a B-tree), and under mild assumptions on the data distribution, a learned index can answer queries in $O(\log\log n)$ operations in expectation, a significant asymptotic improvement over the $O(\log n)$ of traditional indexes. We also show that with a slightly higher, but still near-linear, space consumption of $O(n^{1+\epsilon})$, for any $\epsilon > 0$, a learned index can achieve $O(1)$ expected query time.

1.2.2.2 Theoretical Performance Guarantees for Dynamic Learned Database Operations

Next, in Chapter 9, we show that the theoretical tools developed in our analysis of static learned indexes can be generalized both to multiple different database operations and to the study of these operations on dynamic datasets and under distribution shift. The theoretical study of learned database operations on dynamic datasets is particularly important. This is because, for dynamic datasets, the significant empirical benefits of learned models are often accompanied by the caveat that, especially when the data distribution changes, the models' performance may deteriorate after new insertions [33, 89, 138], possibly to worse than non-learned methods [141].
Thus, it is theoretically unclear why and when learned models outperform non-learned methods, and no existing theoretical work shows any advantage to using learned methods on dynamic datasets and under distribution shift. We present the first known theoretical bounds on the performance of learned models for indexing and cardinality estimation in the presence of insertions, painting a thorough picture of why and when they outperform non-learned alternatives for these fundamental database operations. Our analysis develops the notion of distribution learnability, a characteristic of data distributions that helps quantify the performance of learned database operations for data from such distributions. Using this notion, our results are distribution dependent (as one expects bounds on learned operations should be), without making unnecessary assumptions about the data distribution. Our theoretical framework builds a foundation for the analysis of learned database operations in the future. To show its broader applicability, we present a theoretical analysis of learned sorting, showing theoretical bounds on its performance and proving why and when it outperforms non-learned methods.

1.2.2.3 Required Model Size for Learned Database Operations

Finally, in Chapter 10, we study the model size required to perform various database operations to a desired accuracy on a dataset of a specific size. We thoroughly study the relationship between model size, data size, and accuracy under different error metrics and for various database operations. We present the first known bounds on the model size needed to achieve a desired accuracy when using machine learning to perform indexing, cardinality estimation, and range-sum estimation. Our results can be interpreted in two ways. In the first interpretation, given a model size and data size, our results provide a lower bound on the worst-case possible error. This bound shows what error can be guaranteed by a model of a certain size (and how bad the model can get) after it is deployed in practice. This is important because datasets change in practice, and our bound helps quantify whether a model of a given size can guarantee a desired accuracy level when the dataset changes (or whether we need a larger model). In the second interpretation, our results provide a lower bound on the model size required to achieve a desired accuracy level across datasets. This shows how large the model needs to be to guarantee the desired accuracy, and has significant implications for resource management in database systems. For instance, it helps a cloud service provider decide how many resources it needs to allocate (and calculate the cost) for learned models to guarantee an accuracy level across all its database instances.

1.3 Thesis Outline

The rest of this thesis is organized as follows. Part I presents the required preliminaries, where Chapter 2 formalizes the discussion of the function approximation view of database operations and gives an overview of existing learned approaches. Part II discusses the NeuroDB framework, where Chapter 3 formalizes the framework, Chapter 4 discusses its application to approximate query processing, Chapter 5 discusses its application to privacy-preserving query answering, and Chapter 6 discusses its application to answering queries on incomplete datasets.
Part III presents our theoretical analysis of learned database operations, where Chapter 7 gives an overview of our analysis framework, Chapter 8 presents our analysis of static learned indexes, Chapter 9 generalizes the analysis to various dynamic learned database operations under distribution shift, and Chapter 10 discusses model size requirements for answering database queries using learned models. Finally, Chapter 11 discusses conclusions from this thesis.

Chapter 2: The Function Approximation View of Database Operations

This thesis broadly studies the function approximation view of database operations, where database operations are seen as functions that can be approximated. Consequently, machine learning models are used for the purpose of approximation. We first introduce the notion of operation functions in Sec. 2.1, which helps us formalize this function approximation view. Then, in Sec. 2.2, we give a broad overview of learned database operations that use this function approximation view to perform database operations.

2.1 Operation Functions

Consider a d-dimensional dataset, D, with n records. Let $f_D(x)$ be a function that takes an input, x, performs a desired operation on the dataset, and returns an output. For instance, in the case of cardinality estimation, $f_D(x)$ can be the function that takes a query x as input and returns the number of points in D that match the query x. Another example is indexing, where, for a sorted one-dimensional D, $f_D$ takes x as input and returns the location of x in the dataset D. We refer to $f_D$ as an operation function. This thesis specifically considers operation functions used for indexing, sorting, cardinality estimation, and range-aggregate query answering. We next discuss the operation functions used for these operations.

2.1.1 Rank Operation Function

We first discuss the rank function, which is used for indexing and sorting. Consider a 1-dimensional sorted dataset D (i.e., a sorted 1-dimensional array). Given a query q, the goal is to return the index $i^* = \sum_{i=1}^{n} \mathbb{I}_{D_i \leq q}$, where $\mathbb{I}$ is the indicator function and $D_i$ is the i-th element of D. $i^*$ is the index of the largest element no greater than q, and is 0 if no such element exists. Furthermore, if $q \in D$, q will be at index $i^* + 1$. $i^*$ is referred to as the rank of q. Define the rank function of the dataset D as $r_D(q) = \sum_{i=1}^{n} \mathbb{I}_{D_i \leq q}$, which takes a query as an input and outputs its rank.

2.1.2 Cardinality Operation Function

Next, we discuss the cardinality function, which is used for cardinality estimation, and consequently in query optimization. Consider a d-dimensional dataset. A query predicate $q = (c_1, ..., c_d, r_1, ..., r_d)$ specifies the condition that the i-th attribute is in the interval $[c_i, c_i + r_i]$. Define $\mathbb{I}_{p,q}$ as an indicator function equal to one if a d-dimensional point $p = (p_1, ..., p_d)$ matches a query predicate $q = (c_1, ..., c_d, r_1, ..., r_d)$, that is, if $c_j \leq p_j \leq c_j + r_j$ for all $j \in [d]$ (where $[k]$ is defined as $[k] = \{1, ..., k\}$ for integers k). Then, the answer to a cardinality estimation query is the number of points in D that match the query q, i.e., $c_D(q) = \sum_{i \in [n]} \mathbb{I}_{D_i, q}$. We refer to $c_D$ as the cardinality function of the dataset D, which takes a query as an input and outputs the cardinality of the query.
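To make these definitions concrete, the following minimal sketch (ours, for illustration only; not from the thesis) computes the rank and cardinality operation functions exactly as defined above:

```python
import numpy as np

def rank_function(D_sorted, q):
    """r_D(q): the number of elements of the sorted 1-d array D no greater than q."""
    return int(np.sum(D_sorted <= q))  # equals np.searchsorted(D_sorted, q, side="right")

def cardinality_function(D, c, r):
    """c_D(q) for q = (c, r): count points p with c_j <= p_j <= c_j + r_j for all j."""
    return int(np.sum(np.all((D >= c) & (D <= c + r), axis=1)))

rng = np.random.default_rng(0)
D1 = np.sort(rng.uniform(0, 1, 1_000))   # sorted 1-d dataset
D2 = rng.uniform(0, 1, size=(1_000, 2))  # 2-d dataset
print(rank_function(D1, 0.5))            # ~500 in expectation
print(cardinality_function(D2, np.array([0.2, 0.2]), np.array([0.3, 0.3])))  # ~90
```

These exact computations take O(n) (or O(log n) with an index); learned operations aim to approximate their outputs in time independent of n.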
2.1.3 Range-Aggregate Operation Function

Finally, we discuss the range-aggregate functions used for approximate query processing, and for query answering on incomplete and private datasets, as discussed in Part II of this thesis. Consider a d-dimensional dataset D with n records and attributes $A_1, ..., A_d$, and consider the following SQL query:

SELECT AGG(A_m) FROM D WHERE c_1 ≤ A_1 < c_1 + r_1 AND ... AND c_d ≤ A_d < c_d + r_d

For any i, $c_i$ and $c_i + r_i$ are the lower and upper bounds on the attribute $A_i$; $c_i$ and $r_i$ can be $-\infty$ and $+\infty$, respectively, in which case there are no restrictions on the values of $A_i$ in the query. AGG is a user-defined aggregation function, with examples including the SUM, AVG and COUNT aggregation functions. $A_m$ is called the measure attribute, where m is an integer between 1 and d. Let $c = (c_1, ..., c_d)$ and $r = (r_1, ..., r_d)$ be d-dimensional vectors. We call the pair $q = (c, r)$ a query instance. Different query instances correspond to different range predicates for the measure attribute $A_m$ and aggregation function AGG. We define the range-aggregate function $a_D(\cdot)$ so that, for a query q, $a_D(q)$ is the answer to the above SQL statement.

2.1.4 Discussion

As mentioned, the rank function is used for indexing/sorting (as the function indicates the correct location of an item in a sorted array), the cardinality function is used for cardinality estimation and query optimization, and the range-aggregate function is used for approximate query processing. Nonetheless, these functions are related. Specifically, $r_D$, $c_D$ and $a_D$ are, in that order, generalizations of each other. That is, $r_D(x) = c_D(-\infty, x)$ for a one-dimensional dataset D, and $c_D$ is the same as $a_D$ with aggregation function COUNT. Consequently, many of the techniques developed in our theoretical analysis are broadly applicable to all the functions, and, for the purpose of theoretical analysis, we often study them together. Moreover, the general framework of learned database operations follows the same blueprint across the operations, while extra considerations are needed for privacy-preserving query answering or query answering on incomplete datasets. In the rest of this thesis, we use the term operation function (or query function) to collectively refer to the rank, cardinality and range-aggregate functions, and use the notation $f_D \in \{r_D, c_D, a_D\}$ to refer to all three functions $r_D$, $c_D$ and $a_D$ (for instance, $f_D \geq 0$ is equivalent to the three independent statements $r_D \geq 0$, $c_D \geq 0$ and $a_D \geq 0$). We drop the dependence on D if it is clear from context and use $f(x)$.

2.2 Learned Database Operations

Learned database operations use machine learning to approximate the operation functions. Although unsupervised methods are possible (e.g., [51, 75]), this thesis mostly focuses on supervised methods, where most existing work uses the following framework. First, during training, a function approximator, $\hat{f}(\cdot; \theta)$, is learned to approximate the function f, for $f \in \{r, c, a\}$: for different inputs, the operations are performed on the database to find the ground-truth answer, and the models are optimized through a mean squared loss (although other losses are also used, e.g., [88]). Subsequently, at test time, for a test input x, $\hat{f}(x; \theta)$ is used as an estimate of the operation's output, obtained by performing a forward pass of the model. Overall, this function approximation view has been followed by many learned database operations and across various applications [62, 61, 51, 86, 81, 75, 8]. We provide a more detailed review of the existing work for each application and database operation as we discuss the specific applications throughout.
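As a concrete illustration of this supervised framework, the sketch below (our own minimal example; the model size and query distribution are arbitrary choices, not from the thesis) labels sampled queries with the ground-truth cardinality function and fits a small neural network with a mean squared loss:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
D = rng.uniform(0, 1, size=(10_000, 2))  # 2-d dataset

def c_D(c, r):
    """Ground truth: the operation is performed on the database during training."""
    return np.sum(np.all((D >= c) & (D <= c + r), axis=1))

# Label sampled queries q = (c, r) with their true cardinalities.
C = rng.uniform(0, 1, size=(5_000, 2))
R = rng.uniform(0, 0.5, size=(5_000, 2))
X = np.hstack([C, R])  # query representation fed to the model
y = np.array([c_D(c, r) for c, r in zip(C, R)])

# f_hat(.; theta): a small neural network optimized with a mean squared loss.
f_hat = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)

# At test time, a forward pass of the model yields the estimate.
q = np.array([[0.2, 0.2, 0.3, 0.3]])
print(f_hat.predict(q)[0], c_D(q[0, :2], q[0, 2:]))  # estimate vs. ground truth
```

Note that the test-time cost is a single forward pass, whose running time depends on the model size rather than on the data size n.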
Part II
The Neural Database Framework

Chapter 3
Overview of The Neural Database Framework

We propose the Neural Database (NeuroDB) framework, which extends the function approximation view of database operations by observing that the entire database system is a function that can be approximated. The NeuroDB framework trains neural networks to predict query answers. The neural networks are trained in a supervised learning fashion, where the model takes a query as input and outputs an estimate of the query answer. Datasets are represented by neural network weights and are queried through a model forward pass.

For a database D, consider an operation function $f_D(q)$ that takes a query, q, as an input and outputs its correct query answer. NeuroDB trains a neural network, $\hat{f}(q; \theta)$, that takes a query q as its input and outputs an estimate of the query answer. The training objective is to ensure $\hat{f}$ and $f_D$ are similar, e.g., that $\sum_{q \in Q} |\hat{f}(q; \theta) - f_D(q)|$ is minimized for a query workload Q, so that query answer estimates are accurate. Following this framework, to apply NeuroDB to a specific application scenario, we need to specify a query representation and a training procedure for the application. NeuroDB inherently produces query answer estimates, not exact answers, and thus applies to scenarios where exact answers are not required.

In the next three chapters, we discuss the application of NeuroDB to three settings: (1) approximate query processing (Chapter 4), (2) privacy-preserving query answering (Chapter 5), and (3) querying incomplete datasets (Chapter 6). Overall, our results show that NeuroDB, by optimizing the entire query-answering pipeline for a specific query workload, is able to provide accurate and efficient answers for the three abovementioned applications.

Chapter 4
A Neural Database for Answering Range Aggregate Queries

4.1 Introduction

Range aggregate queries (RAQs) are intrinsic to many real-world applications, e.g., calculating net profit for a period from sales records or average pollution level for different regions for city planning [75]. Due to the large volume of data, exact answers can take too long to compute, and fast approximate answers may be preferred. In such scenarios, there is a time/space/accuracy trade-off, where algorithms can sacrifice accuracy for time or space. For example, consider a geospatial database containing latitude and longitude of location signals of individuals and, for each location signal, the duration the individual stayed in that location. A potential RAQ on this database, useful for understanding the popularity of different Points of Interest, is to calculate the average time spent by users in an area. Approximate answers within a few minutes of the exact answer can be acceptable in such applications. We use this scenario as our running example.

Research on RAQs has focused on improving the time/space/accuracy trade-offs. Various methods such as histograms, wavelets and data sketches (see [27] for a survey) have been proposed to model the data for this purpose. Recent efforts use machine learning (ML) [75, 129, 51] to improve the performance. Such approaches learn models of the data to answer RAQs. Experimental results show ML-based methods outperform non-learning methods in practice. Nonetheless, there is no theoretical understanding of when and why an ML-based approach performs well. This is because modeling data makes it difficult to reason about the performance of specific queries.
That is, some queries may be easier to answer than others: e.g., the average value of one attribute may be constant across different query ranges, while that of another attribute might change drastically. Furthermore, modeling the data misses the opportunity to utilize information about queries in practice. For instance, patterns in query answers can be used to learn a compact representation of the data with respect to the queries, improving the performance, while there may be no such patterns within the entire dataset.

In this chapter, instead of learning data models, we propose to learn query models. In our example of calculating the average visit duration for a POI, the input to a query model is the POI location, and the model is trained to output the average visit duration for the POI. Query modeling skips explicitly learning the data distribution and instead learns query answers, so that we can explicitly relate errors in modeling to errors in query answering. Nevertheless, this is non-trivial and requires a detailed study of modeling errors. To the best of our knowledge, no existing attempt in the literature theoretically relates data and query properties to the error of a learned model when answering RAQs.

We utilize neural networks as our query model. Specifically, we consider training a neural network that takes as input an RAQ and outputs the answer to the query. We theoretically study this approach and provide, for the first time, a Data distribution and Query Dependent error bound (hereafter referred to as DQD bound) for neural networks when answering RAQs. The DQD bound theoretically relates properties of the data distribution and the RAQ to the accuracy a neural network can achieve when answering the query. In our theoretical analysis, we consider AVG, COUNT and SUM queries, assume the database is a collection of i.i.d. samples from a data distribution, and make a suitable Lipschitz assumption on the query and data distribution. We then use VC sampling theory and our novel result on neural network approximation power to show the existence of a neural network that can answer the queries on the database with bounded error. The bound gets tighter (i.e., more accurate neural networks can be learned) as the data size, or the query range, increases. Alternatively, a smaller neural network can be used to answer queries with a fixed desired error when the data size, or the query range, increases. Intuitively, this is a result of the reduction in variance (due to sampling) of query answers when the database is larger, because more data points are sampled from the data distribution. Furthermore, our results utilize the Lipschitz property to provide a complexity measure that quantifies the difficulty of answering a query from a data distribution. Using the complexity measure, our results show settings where the existence of a small neural network with low query answering error is guaranteed.

To confirm our theoretical results, we design NeuroSketch, a neural network framework that answers RAQs orders of magnitude faster than the state of the art and with better accuracy. NeuroSketch uses the DQD results to allocate more model capacity to queries that are difficult to answer, thereby reducing error without increasing query time. While DQD provides a theoretical grounding for NeuroSketch, in practice NeuroSketch is not limited to some of the assumptions we made to prove the DQD bounds; for example, it can answer more general RAQs, such as STD and MEDIAN.
To summarize, our major contributions are:

• We present the first theoretical analysis of using ML to answer RAQs. This includes a novel analysis framework, a novel use of VC sampling theory and a novel result on neural network approximation power.

• We show theoretically how data distribution, data size, query range and aggregation function are related to the neural network error when answering RAQs. This opens the possibility for a query optimizer that, for a data distribution, decides when to build and use a neural network for query processing.

• To confirm our theoretical results, we design NeuroSketch, the first neural network framework to answer generic RAQs.

• Extensive experiments show that NeuroSketch provides orders of magnitude gain in query time and better accuracy over the state of the art (DBEst [75] and DeepDB [51]) on real-world, TPC-benchmark and synthetic datasets.

We present our problem definition in Sec. 4.2, the DQD bound in Sec. 4.3, NeuroSketch in Sec. 4.4, our empirical study in Sec. 4.5, related work in Sec. 4.6, and conclude in Sec. 4.7.

4.2 Problem Definition

Problem Setting. Consider a dataset D with n records and $\bar{d}$ attributes, $A_1, ..., A_{\bar{d}}$. Assume records of D are random i.i.d. samples from a data distribution $\chi$ and that $A_i \in [0, 1]$ with probability 1 for all $1 \leq i \leq \bar{d}$ (otherwise the attributes can be normalized). We first consider the following SQL query and discuss extensions to general RAQs in Sec. 4.4.3.

SELECT AGG(Am) FROM D
WHERE c1 ≤ A1 < c1 + r1 AND ... AND cd̄ ≤ Ad̄ < cd̄ + rd̄

For any i, $c_i$ and $c_i + r_i$ are the lower and upper bounds on the attribute $A_i$. $c_i$ and $r_i$ can be 0 and 1, respectively, in which case there are no restrictions on the values of $A_i$ in the query. We say that an attribute is not active in the query in that case, and that it is active otherwise. AGG is a user-defined aggregation function, with examples including the SUM, AVG and COUNT aggregation functions. $A_m$ is called the measure attribute, where m is an integer between 1 and $\bar{d}$. Let $c = (c_1, ..., c_{\bar{d}})$ and $r = (r_1, ..., r_{\bar{d}})$ be $\bar{d}$-dimensional vectors. We call the pair $q = (c, r)$ a query instance. Different query instances correspond to different range predicates for the measure attribute $A_m$ and aggregation function AGG. We define the function $f_D(\cdot)$ so that, for a query q, $f_D(q)$ is the answer to the above SQL statement. We call $f_D : [0, 1]^d \to \mathbb{R}$ a query function, where $d = 2\bar{d}$ is the dimensionality of the query function. Furthermore, we define $\mathcal{Q} = \{(c, r) \in [0, 1]^d : c_i + r_i \leq 1\ \forall i\}$ as the set of all possible queries.

Example 1. Consider a database of user location reports and the duration a user stayed in the reported location, shown in Fig. 4.1 (left). On this database, consider the RAQ that returns the average visit duration of users in a 50m×50m rectangle with bottom-left corner at the geo-coordinate $(c_1, c_2)$. The query function, $f_D(c_1, c_2) := f_D(c_1, c_2, 50m, 50m)$, takes as input the geo-coordinate of the rectangle and outputs the average visit duration of data points in the rectangle (we have fixed $r_1$ and $r_2$ to 50m in this example). Fig. 4.1 (right) plots $f_D(c_1, c_2)$, which shows $f_D(-95.3615, 29.758, 50m, 50m) = 9$, i.e., for the query instance (-95.3615, 29.758, 50m, 50m) the answer is 9.

Figure 4.1: (left) Database of location signals. (right) Avg. visit duration query function; color shows visit duration in hours.
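For concreteness, the query function of Example 1 can be computed by brute force as below (our sketch on synthetic data; the 50m-to-degrees conversion and the data-generation choices are rough illustrations, not the thesis's preprocessing):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the location table: (longitude, latitude, visit hours).
data = np.column_stack([
    rng.uniform(-95.37, -95.35, 10_000),
    rng.uniform(29.75, 29.77, 10_000),
    rng.exponential(2.0, 10_000),
])

DEG = 50 / 111_000  # ~50 meters expressed in degrees (crude approximation)

def f_D(c1, c2, r1=DEG, r2=DEG):
    """Example 1 query function: AVG visit duration of records whose
    (lon, lat) falls in [c1, c1 + r1) x [c2, c2 + r2)."""
    in_range = ((data[:, 0] >= c1) & (data[:, 0] < c1 + r1) &
                (data[:, 1] >= c2) & (data[:, 1] < c2 + r2))
    matched = data[in_range, 2]
    return matched.mean() if matched.size else 0.0

print(f_D(-95.3615, 29.758))
```

NeuroSketch's goal, described next, is to approximate such a function with a model that avoids scanning the data at query time.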
Neural Networks to Answer RAQs. We learn a neural network, $\hat{f}(\cdot; \theta)$, to approximate the query function, $f_D(\cdot)$. The neural network takes as input an RAQ, q, and the model forward pass outputs an answer, $\hat{f}(q; \theta)$. The goal is to train a neural network so that its answer to the query, $\hat{f}(q; \theta)$, is similar to the ground truth, $f_D(q)$. If such a neural network is small and can be evaluated fast, we can use it to directly answer the RAQ efficiently and accurately by performing a forward pass of the model.

Problem Statement. Let $\Sigma(\hat{f})$ be the space complexity of the neural network, which is the amount of space required to store all its parameters. Let $\tau(\hat{f})$ be its query time complexity, which is the time it takes to perform a forward pass of the neural network. We study the error, $\|f_D - \hat{f}\|$, in answering queries, where we mostly consider the 1-norm, defined as $\|f_D - \hat{f}\|_1 = \int_{q \in \mathcal{Q}} |f_D(q) - \hat{f}(q)|\,dq$, or the $\infty$-norm, defined as $\|f_D - \hat{f}\|_\infty = \sup_{q \in \mathcal{Q}} |f_D(q) - \hat{f}(q)|$. The problem studied in this chapter is learning to answer range aggregate queries with time and space constraints, formulated as follows.

Problem 1. Given a query function $f_D$, a class of possible neural networks, F, and time and space requirements t and s, find

$\arg\min_{\hat{f} \in F} \|f_D - \hat{f}\| \quad \text{s.t.} \quad \Sigma(\hat{f}) \leq s,\ \tau(\hat{f}) \leq t.$

Notation. Boldface letters, e.g., c, denote vectors, and subscripts denote the elements of a vector, e.g., $c_i$ is the i-th element of c.

4.3 DQD Bound for Neural Networks Answering RAQs

We theoretically study the relationship between the accuracy a neural network can achieve when answering RAQs and data and query properties. Sec. 4.3.1 states our Data distribution and Query Dependent error bound (DQD bound) when considering the SUM and COUNT aggregation functions, and discusses its implications. We prove the bound in Sec. 4.3.2. We present results for the AVG query function in Sec. 4.3.3 and discuss how our techniques can be generalized to other query functions and modeling choices.

4.3.1 DQD Bound Statement

4.3.1.1 Incorporating Data Distribution

The data distribution, $\chi$, underlying a database, D, impacts the difficulty of answering queries on the database with a neural network. For instance, in Example 1, if all users have the same visit duration for all their visits, the query function $f_D(c_1, c_2)$ will be constant, and thus can be easily modeled. On the other hand, skewness in the data distribution, as depicted in Fig. 4.1, can make answering queries more difficult, as the query function $f_D(c_1, c_2)$ changes drastically from one location to another. Importantly, this is a property of the data distribution, $\chi$, and not only of the observed database D. For instance, we expect similar skewness in observations if we collect more user data (i.e., as D grows), or if location data are collected from a different period of time not covered in D (i.e., a different sample of $\chi$). Thus, by incorporating the data distribution in our analysis, we are able to study the impact of data size as well as properties intrinsic to the distribution (that will be unaffected by the randomness in observations) on answering RAQs. To do so, (1) we need to capture the dependence of query answers on the data distribution, and (2) we need a means of measuring the complexity of modeling query answers when data follows a certain distribution.

To capture the dependence on the data distribution, we define the distribution query function, $f_\chi(q)$, as the expected value of the query function, i.e., $f_\chi(q) = \mathbb{E}_{D \sim \chi}[f_D(q)]$, where D is sampled from the data distribution, $\chi$. We refer to the query function, $f_D(q)$, as the observed query function to distinguish it from the distribution query function.
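The distinction between the observed and distribution query functions is easy to simulate; the sketch below (ours, with an illustrative Gaussian χ) estimates $f_\chi(q)$ for a COUNT query by averaging $f_D(q)$ over many databases drawn from χ:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_D(D, c, r):
    """Observed COUNT query function on one sampled database D."""
    return np.sum((D >= c) & (D < c + r))

def f_chi(c, r, n=1000, trials=200):
    """Monte Carlo estimate of f_chi(q) = E_{D ~ chi}[f_D(q)],
    here with chi = N(0.5, 0.1)."""
    return np.mean([f_D(rng.normal(0.5, 0.1, n), c, r) for _ in range(trials)])

# A single observed database can deviate from the expectation:
D = rng.normal(0.5, 0.1, 1000)
print(f_D(D, 0.4, 0.2), f_chi(0.4, 0.2))
```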
To capture the difficulty of modeling a function, we use the ρ-Lipschitz property. A function, f, is ρ-Lipschitz if $|f(x) - f(x')| \leq \rho\|x - x'\|_1$ for all x and x' in the domain of the function, where we consider the ρ-Lipschitz property in the 1-norm. Intuitively, ρ captures the magnitude of correlation between x and f(x): it bounds how much f(x) can change with a change in x. If ρ is large, f can change abruptly even with a small change in x. This makes the function more difficult to approximate, as more model parameters will be needed to account for all such possible abrupt changes.

Combining the above, we propose to use the Lipschitz constant of the normalized Distribution Query function, referred to as LDQ, as a measure of the difficulty of answering RAQs. LDQ is the Lipschitz constant of the function $\bar{f}(q) = \frac{f_\chi(q)}{n} = \frac{1}{n}\mathbb{E}_{D \sim \chi}[f_D(q)]$. We normalize the distribution query function by data size to account for its change in magnitude when data size increases (for SUM and COUNT queries, the magnitude of $f_D(q)$ increases as data size increases). LDQ is a property of $\chi$ and $f_D$. For ease of reference, we often implicitly assume a given data distribution $\chi$ and refer to LDQ as a property of a query function.

4.3.1.2 Theorem Statement

Let $f_D^S$ and $f_D^C$ be query functions with aggregation functions SUM and COUNT, respectively, and let $\rho_S$ and $\rho_C$ be their respective LDQs. For $i \in \{S, C\}$, we study the time, space and accuracy of a neural network, $\hat{f}$, when approximating $f_D^i$, as formalized below.

Theorem 1 (DQD Bound). For $i \in \{S, C\}$, there exists a neural network $\hat{f}$ with space and query time complexity $\tilde{O}(d(\kappa_1 \rho d \varepsilon_1^{-1} + 1)^d)$, where $\tilde{O}$ hides logarithmic factors, s.t.

$\mathbb{P}_{D \sim \chi}\left[\frac{1}{n}\|\hat{f} - f_D^i\|_1 \geq \varepsilon_1 + \varepsilon_2\right] \leq \kappa_2^{d+1}\, d\, \varepsilon_2^{-d} \exp(-\kappa_2^{-1} \varepsilon_2^2 n),$

where $\kappa_1$ and $\kappa_2$ are universal constants.

The proof of Theorem 1 is presented in Sec. 4.3.2. Here, we discuss the theorem statement and its implications.

A Confidence/Error Analysis. The DQD bound relates, with a desired probability (i.e., confidence level), the error a neural network can achieve when answering RAQs to its query time and space complexity, through data-dependent properties. The error is scaled by the data size, n, to account for the change in the magnitude of query answers when data size changes. The parameter $\varepsilon_1$ allows trading off accuracy for space or time complexity, and $\varepsilon_2$ allows trading off accuracy for confidence in the bound. The probability is over sampling a database from the data distribution. That is, DQD states that, when observing a database D that follows a distribution $\chi$, with high probability there exists a neural network that can answer RAQs on D and achieve the specified time/space/accuracy trade-off.

Distribution-Dependent Complexity Measure. The DQD bound establishes the LDQ of the query function as a measure of complexity when answering RAQs with neural networks. It implies that query time will be faster when LDQ is small. LDQ is a property of the data distribution and the query in question. Thus, Theorem 1 allows us to quantify how easy or difficult it is to approximate query answers for a data distribution using a neural network. We provide specific examples of LDQ for different data distributions in Sec. 4.3.1.3 and empirically confirm the impact of LDQ on query answering in Sec. 4.5.7.

Faster on Larger Databases. Let the confidence in the DQD bound be $\delta = \kappa_2^{d+1} d \varepsilon_2^{-d} \exp(-\kappa_2^{-1} \varepsilon_2^2 n)$. Fixing the value of $\delta$, we observe that n and $\varepsilon_2$ are negatively correlated, where increasing the data size n leads to a reduction in $\varepsilon_2$.
That is, for a fixed confidence parameter, the error of a neural network decreases as data size increases. Now let $\varepsilon = \varepsilon_1 + \varepsilon_2$ be the total neural network error. Fixing $\varepsilon$ in addition to $\delta$, but allowing $\varepsilon_1$ to vary, we observe that an increase in data size results in smaller query time and space complexity, for a fixed neural network error and confidence level. Thus, the DQD bound shows the counter-intuitive result that, when answering queries with a neural network, query time can be lowered by increasing the database size. We empirically confirm this phenomenon in Sec. 4.5.7. Intuitively, this happens because when the data size is larger, the model only needs to learn the patterns in the data distribution, while for smaller databases, the observed database can be different from the data distribution and the model has to memorize all the points, making it more challenging.

Low-Error Cases. The DQD bound shows that a neural network can answer queries fast and accurately if the data size is large and the LDQ of the query function is small. Thus, the DQD bound identifies scenarios where using a neural network provides good performance, and presents a property of the data distribution that can guarantee low error for a neural network framework when answering RAQs. Nonetheless, it does not preclude neural networks from performing well in other scenarios, which requires further theoretical investigation.

Achieving Zero Error. For a fixed and small data size, even if the neural network size is allowed to approach infinity, the DQD bound provides a non-zero error bound. That is, letting the neural network size go to infinity by reducing $\varepsilon_1$ to zero does not achieve total zero error (we empirically verify this in Sec. 4.5.7), as the total error in that case will be equal to $\varepsilon_2$ (which can be large depending on n). This is because $f_D$ can be discontinuous even though $f_\chi$ is assumed to be Lipschitz continuous, so that no neural network can approximate it exactly. Points of discontinuity can be seen in Fig. 4.1 (right), where the query answer can suddenly change. Such points of discontinuity occur when the query boundary matches a data point, because in such cases arbitrarily small changes to the query boundary can change the query answer. As data size increases, $f_D$ behaves more like a continuous function (because $f_\chi$ is Lipschitz continuous), so the error achievable by a neural network goes down. Note that techniques that create a discontinuous function approximator, e.g., quantizing the query space, can potentially help a neural network achieve zero error, as a large enough neural network can memorize a finite set of points exactly [155]. However, our DQD bound is for queries over the space of reals (i.e., approximation of an infinite set of points), and without input preprocessing or quantization.

4.3.1.3 Impact of Distribution and LDQ

The model complexity needed to answer RAQs depends on the data distribution through the LDQ of $f_D^S$ and $f_D^C$. We provide examples of LDQ for different distributions.

Example 2. Let $\chi$ be a 1-dimensional uniform distribution. By definition, we have $f_\chi^C(c_1, r_1) = n\,\mathbb{P}_{p \sim \chi}[p \in (c_1, r_1)]$, where $(c_1, r_1)$ defines a query range (see Sec. 4.2) and p is a data point sampled from $\chi$. Since $\chi$ is uniform, $\mathbb{P}_{p \sim \chi}[p \in (c_1, r_1)] = r_1$. Differentiating and using the definition, $\frac{\partial f_\chi^C(c_1, r_1)}{\partial c_1} = 0$ and $\frac{\partial f_\chi^C(c_1, r_1)}{\partial r_1} = n$, so that $\frac{1}{n} f_\chi^C(c_1, r_1)$ is ρ-Lipschitz with ρ = 1. A similar result also holds for $\frac{1}{n} f_\chi^S(c_1, r_1)$. The small Lipschitz constant matches the intuition that the uniform distribution is easy to approximate.
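The calculation in Example 2 is straightforward to verify numerically. The sketch below (ours) estimates the normalized distribution query function $\frac{1}{n} f_\chi^C(c_1, r_1)$ on uniform data and checks that its finite-difference slopes match the derivatives above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5000, 400

def norm_count(c1, r1):
    """Estimate (1/n) * f_chi^C(c1, r1) for uniform chi on [0, 1]."""
    total = 0
    for _ in range(trials):
        D = rng.random(n)
        total += np.sum((D >= c1) & (D < c1 + r1))
    return total / (trials * n)

print(norm_count(0.2, 0.3))                                    # ~0.3 = r1
print((norm_count(0.2, 0.31) - norm_count(0.2, 0.3)) / 0.01)   # ~1, slope in r1
print((norm_count(0.21, 0.3) - norm_count(0.2, 0.3)) / 0.01)   # ~0, slope in c1
```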
Example 3. Let $\chi$ be a 1-dimensional Gaussian distribution with standard deviation $\sigma$ and $\mu = 0$. We have

$\left|\frac{\partial\, \mathbb{P}_{p \sim \chi}[p \in (c_1, r_1)]}{\partial c_1}\right| = \frac{1}{\sigma\sqrt{2\pi}}\left|e^{-\frac{1}{2}\left(\frac{c_1 + r_1}{\sigma}\right)^2} - e^{-\frac{1}{2}\left(\frac{c_1}{\sigma}\right)^2}\right| \leq \frac{2}{\sigma\sqrt{2\pi}}$

and

$\left|\frac{\partial\, \mathbb{P}_{p \sim \chi}[p \in (c_1, r_1)]}{\partial r_1}\right| = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{c_1 + r_1}{\sigma}\right)^2} \leq \frac{1}{\sigma\sqrt{2\pi}},$

so that $\frac{1}{n} f_\chi^C(c_1, r_1)$ is ρ-Lipschitz with $\rho = \frac{3}{\sigma\sqrt{2\pi}}$. Thus, for smaller $\sigma$ the function becomes more difficult to model, as the neural network has to model a sharper change in the function.

4.3.1.4 Measuring Complexity in Practice

The DQD bound can help decide whether to use neural networks to answer RAQs, or to design complexity-aware algorithms for practical use cases (as we do in Sec. 4.4). Such use cases require measuring LDQ, which can be difficult in practice. For two queries q and q', the Lipschitz constant bounds the maximum change in the function, f, normalized by distance, $\frac{|f(q) - f(q')|}{\|q - q'\|}$. Since this maximum is calculated over all query pairs, it is difficult to estimate. Furthermore, it depends on the data distribution itself, while we only have access to samples from the distribution. In practice, we observed that the Average Query function Change, AQC, can be used as a proxy for LDQ. Specifically, we define

$\mathrm{AQC} = \frac{1}{\binom{|Q|}{2}} \sum_{q, q' \in Q} \frac{|f(q) - f(q')|}{\|q - q'\|},$

where $Q \subseteq \mathcal{Q}$ is a set of queries sampled from all possible queries. We experimentally verify the usefulness of this complexity measure in Sec. 4.5.5.
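A direct implementation of the AQC proxy over a sampled query set could look as follows (our sketch; we use the 1-norm for query distance, matching the Lipschitz convention above, and `f` can be any ground-truth query function):

```python
from itertools import combinations
import numpy as np

def aqc(f, queries):
    """Average Query function Change: the mean of |f(q) - f(q')| / ||q - q'||
    over all pairs of sampled queries, used as a proxy for LDQ."""
    ratios = []
    for q, qp in combinations(queries, 2):
        dist = np.linalg.norm(np.asarray(q) - np.asarray(qp), ord=1)
        if dist > 0:
            ratios.append(abs(f(q) - f(qp)) / dist)
    return float(np.mean(ratios))

# Example: 200 random (c1, r1) queries over [0, 1].
rng = np.random.default_rng(0)
Q = [(c, r * (1 - c)) for c, r in rng.random((200, 2))]
# aqc(lambda q: my_query_function(*q), Q)   # plug in a query function
```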
4.3.2 DQD Bound Proof

4.3.2.1 Analysis Framework

For a neural network $\hat{f}$ modeling a query function, $f_D$, we decompose its error, $\Delta = \frac{1}{n}\|f_D - \hat{f}\|_1$, into two terms, the approximation error and the sampling error:

$\Delta \leq \underbrace{\tfrac{1}{n}\|f_\chi - \hat{f}\|_1}_{\text{approximation error, } \Delta_a} + \underbrace{\tfrac{1}{n}\|f_\chi - f_D\|_1}_{\text{sampling error, } \Delta_s} \qquad (4.1)$

The approximation error, $\Delta_a$, quantifies how accurately the neural network can approximate the distribution query function. $\Delta_a$ depends on the space/time complexity of the neural network; for instance, larger neural networks have more representation power and can approximate a distribution query function more accurately. The sampling error, $\Delta_s$, quantifies the difference, due to sampling, between the distribution and observed query functions. $\Delta_s$ depends on data size: the more data sampled, the more similar the observed and distribution query functions will be (the latter is the expected value of the former). We bound each term separately in Secs. 4.3.2.2 and 4.3.2.3. Sec. 4.3.2.4 combines the results, which yields Theorem 1.

4.3.2.2 Bounding Approximation Error

For a desired bound on the approximation error, $\Delta_a$, we characterize the time/space complexity required for a neural network to achieve the error bound. The universal function approximation theorem [104, 56] guarantees the existence of a neural network of arbitrary time/space complexity that can achieve any desired error value, but does not show its time/space complexity. Recent work (e.g., [71, 152, 102]) studies the number of neural network parameters needed to achieve a desired error. However, the number of neural network parameters cannot be related to its space complexity, because the magnitude of the parameters can be unbounded, thus leading to unbounded storage cost even for a fixed number of parameters. We present the following theorem, showing the time/space complexity required to achieve a desired error bound, $\varepsilon_1$ (see Sec. 4.6 for a comprehensive discussion of related work).

Theorem 2. Given a ρ-Lipschitz function f, there exists a neural network, $\hat{f}$, with space and time complexity $\tilde{O}(d(\kappa\rho d\varepsilon_1^{-1} + 1)^d)$, where $\tilde{O}$ hides logarithmic factors in ρ, d and k, such that

(a) $\|f - \hat{f}\|_1 \leq \varepsilon_1$; and

(b) furthermore, if $d \leq 3$, $\|f - \hat{f}\|_\infty \leq \varepsilon_1$,

where κ is a universal constant.

Theorem 2 (a) bounds $\Delta_a$ by considering $f_\chi$ as the function, f, in the theorem statement. Theorem 2 (b) provides a stronger guarantee that yields an $\infty$-norm DQD bound in low dimensions. For conciseness, we have not stated that version of the DQD bound, since the ideas are similar. Theorem 2 is a step towards characterizing neural network approximation power in a data management context. We expect tighter characterizations to be possible, especially in high dimensions. Our theoretical framework for the DQD bound can readily benefit from such tighter characterizations. Nonetheless, d is small for many practical applications when answering RAQs. For instance, in Example 1, which mimics a real-world use case, the query function is 4-dimensional.

Proof Sketch of Theorem 2. We uniformly partition the space into cells and construct a neural network that estimates cell vertices exactly. This memorization property is used to bound the error within each cell. For instance, Fig. 4.2 (a) shows the distribution query function for a COUNT query with fixed range r = 0.1 on a two-dimensional Gaussian data distribution. A 3×3 grid on the input space creates 16 vertices, shown in Fig. 4.2 (a). Our construction ensures that the error for these 16 vertices is zero, as shown in Fig. 4.2 (b).

Network Architecture. We construct a ReLU neural network, $\hat{f}$, with two hidden layers, shown in Fig. 4.2 (c). $\hat{f}$ can be written as a summation of k smaller units, called g-units. Each g-unit ensures that a cell vertex is memorized correctly, and k controls the neural network size. The i-th g-unit, $\hat{g}_i$ for $1 \leq i \leq k$, is constructed as shown in Fig. 4.2 (c). It has d inputs, d units in its first layer and 1 unit in its second layer. Each input is only connected to one of the units in the first layer, with weight -1. All units in the first layer are connected to the unit in the second layer, and their weight is $-M$, where M is a constant at least equal to 1. The j-th unit, $1 \leq j \leq d$, in the first layer has bias $b_{j,i}$, and the unit in the second layer has bias $\frac{1}{t}$ for an integer t. The output of the second-layer unit is multiplied by a parameter $a_i$. Then, the neural network is $\hat{f}(x) = \sum_{i=1}^{k} \hat{g}_i(x) + b$, where b is the bias of the third layer. The tunable parameters of the neural network are $a_i$, $b_{j,i}$, and b for $1 \leq i \leq k$ and $1 \leq j \leq d$.

Figure 4.2: Constructed neural network and its architecture. Values on edges and nodes show edge weights and unit biases.

Network Parameters. Let the set of cell vertices in the uniform grid be $P = \{(i_1, ..., i_d)/t,\ i_r \in \mathbb{Z},\ 0 \leq i_r \leq t\}$, for an integer t, so that $k = |P| = (t + 1)^d$ (recall that the input space is $[0, 1]^d$). Also let $\pi^i$ be the base-(t+1) representation of an integer i written as a vector, i.e., $\pi^i = (\pi^i_1, ..., \pi^i_d)$, so that $i = \sum_{r=1}^{d} \pi^i_r (t + 1)^{d-r}$. For example, when t = 3, $\pi^6 = (1, 2)$, since $6 = 1 \times (t + 1) + 2$. Note that $\frac{\pi^i}{t} \in P$ and $\langle \frac{\pi^0}{t}, ..., \frac{\pi^{k-1}}{t} \rangle$ is an ordering of the cell vertices. Alg. 1 enumerates the cell vertices using this ordering and sets, at the i-th iteration, the parameters of the i-th g-unit so that $\frac{\pi^i}{t}$ is correctly memorized. It calculates $\hat{y}$, the estimate of the neural network for the point $\frac{\pi^i}{t}$, based on the g-units set before the i-th iteration (line 3).
Then it sets the parameters of the i-th g-unit to account for the difference between $\hat{y}$ and the true value, $f(\frac{\pi^i}{t})$. Fig. 4.3 shows this process in our example. On the left, Fig. 4.3 shows, at the end of each iteration i, the function $b + \sum_{j=1}^{i} \hat{g}_j(x)$ (where we define $\sum_{j=1}^{0} \hat{g}_j(x) = 0$). On the right, it shows that at the 10-th iteration, the model sets $\hat{g}_{10}$ to memorize the 10-th point correctly. Alg. 1 and the g-unit architecture are designed so that when the 10-th point is memorized, the neural network value at the previously memorized points does not change.

Figure 4.3: Neural Network Construction Steps

Algorithm 1 Neural Network Construction
Input: A function f, a parameter t
Output: Neural network $\hat{f}$
1: $b \leftarrow f(0)$
2: for $i \leftarrow 1$ to $(t + 1)^d - 1$ do
3:   $\hat{y} \leftarrow b + \sum_{j=1}^{i-1} \hat{g}_j(\frac{\pi^i}{t})$
4:   for $r \leftarrow 1$ to $d$ do
5:     $b_{r,i} \leftarrow \frac{\pi^i_r}{t}$
6:   $a_i \leftarrow t(f(\frac{\pi^i}{t}) - \hat{y})$
7: return $\hat{f}$
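Alg. 1 is concrete enough to implement directly. The numpy sketch below (ours; it fixes M = 1 and uses an arbitrary Lipschitz test function) builds $\hat{f}(x) = \sum_i \hat{g}_i(x) + b$ with g-units $\hat{g}_i(x) = a_i\,\mathrm{ReLU}(\frac{1}{t} - M\sum_j \mathrm{ReLU}(b_{j,i} - x_j))$, and checks the zero-error-at-vertices property stated in Lemma A.1:

```python
import numpy as np
from itertools import product

def build_memorizing_net(f, t, d, M=1.0):
    """Construct the two-hidden-layer ReLU network of Alg. 1 for a
    function f on [0, 1]^d, memorizing all (t + 1)^d grid vertices."""
    relu = lambda z: np.maximum(z, 0.0)
    # Vertices pi^i / t in base-(t + 1) order (first coordinate most significant).
    vertices = [np.array(v, dtype=float) / t
                for v in product(range(t + 1), repeat=d)]

    def g(x, b_vec, a):
        return a * relu(1.0 / t - M * np.sum(relu(b_vec - x)))

    units, b = [], f(vertices[0])              # line 1: b <- f(0)
    for p in vertices[1:]:                     # lines 2-6
        y_hat = b + sum(g(p, bv, a) for bv, a in units)
        units.append((p, t * (f(p) - y_hat)))  # b_{.,i} = pi^i / t, a_i
    return lambda x: b + sum(g(np.asarray(x, dtype=float), bv, a)
                             for bv, a in units)

f = lambda x: np.sin(3 * x[0]) + x[1]          # an arbitrary Lipschitz function
net = build_memorizing_net(f, t=4, d=2)
errs = [abs(net(np.array(v) / 4) - f(np.array(v) / 4))
        for v in product(range(5), repeat=2)]
print(max(errs))  # ~0 up to floating-point error
```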
Proving the Bound. We provide a proof sketch for Theorem 2 (a), using lemmas formally stated and proven in Sec. A of our technical report [10]. The proof for Theorem 2 (b) is similar. Lemma A.1 states that $\hat{f}(x)$ achieves zero error at the cell vertices, i.e.,

$f(x) = \hat{f}(x), \quad \forall x \in P. \qquad (4.2)$

Furthermore, f is ρ-Lipschitz, so its change is bounded within each cell. That is, for $x, x' \in C^i$, where $C^i = \{\frac{\pi^i}{t} + z,\ z \in [0, \frac{1}{t}]^d\}$ is the subset of the input space in the i-th cell, the Lipschitz property implies

$|f(x) - f(x')| \leq \frac{\rho d}{t}. \qquad (4.3)$

Lemma A.2 proves that the change of $\hat{f}$ is bounded within each cell, i.e.,

$|\hat{f}(x) - \hat{f}(x')| \leq \phi(d, \rho, t, x, x') \qquad (4.4)$

for some function $\phi$ specified in Lemma A.2. $\phi$ depends on x and x', since the bound differs depending on where in the space x and x' are. Using the triangle inequality with Eqs. 4.3 and 4.4, we have

$|\hat{f}(x) - f(x) - (\hat{f}(x') - f(x'))| \leq \frac{d\rho}{t} + \phi(d, \rho, t, x, x'). \qquad (4.5)$

Letting $x' = \frac{\pi^i}{t}$ in Eq. 4.5 and using Eq. 4.2, we obtain

$|\hat{f}(x) - f(x)| \leq \frac{d\rho}{t} + \phi(d, \rho, t, x, \tfrac{\pi^i}{t}). \qquad (4.6)$

Lemma A.3 shows that integrating the right-hand side of Eq. 4.6 over x and across cells yields $\frac{3\rho d}{t}$, so we bound the 1-norm error as

$\|\hat{f} - f\|_1 \leq \frac{3\rho d}{t}. \qquad (4.7)$

Lemma A.4 shows that the space and time complexity of $\hat{f}$ is $\tilde{O}(kd)$. Setting $\varepsilon_1 = \frac{3\rho d}{t}$ and $\kappa = 3$, recalling that $k = (t + 1)^d$, and substituting $k = (\kappa\rho d\varepsilon_1^{-1} + 1)^d$ in the space/time complexity expression proves Theorem 2 (a). The lemma proofs require a detailed study of neural network behaviour; see Sec. A of the technical report [10].

4.3.2.3 Bounding Sampling Error

We present the following theorem, which bounds the sampling error with high probability.

Theorem 3. Let $f_\chi^C$ and $f_\chi^S$ be distribution query functions for the COUNT and SUM aggregation functions, and $f_D^C$ and $f_D^S$ the corresponding observed query functions for a database, D, of n points in d dimensions sampled from $\chi$. For $i \in \{S, C\}$,

$\mathbb{P}_{D \sim \chi}\left[\frac{1}{n}\|f_\chi^i - f_D^i\|_\infty > \varepsilon_2\right] \leq \kappa^{d+1} d \varepsilon_2^{-d} \exp(-\kappa^{-1} \varepsilon_2^2 n),$

where κ is a universal constant.

Theorem 3 provides a high-probability bound on $\Delta_s$ in Eq. 4.1. The proof of Theorem 3 uses VC sampling theory, which presents a novel use of VC theory for the database literature. VC theory helps us understand the impact of the distribution a database follows on operations performed on the database (e.g., answering RAQs). In fact, Theorem 3 is independent of our use of learned models, and simply characterizes the impact of sampling when answering RAQs on a database that follows a certain data distribution. This is different from the typical use of VC theory in machine learning, where the goal is to study the generalization of a trained model to unseen testing data. We present a proof sketch for the case of COUNT. The proof for SUM is similar, but uses a generalization of VC-dimension.

Proof Sketch of Theorem 3 for COUNT. We start by rewriting the query function. Define the indicator function h as

$h_q^C(p) = \begin{cases} 1 & \text{if } \forall i,\ c_i \leq p_i < c_i + r_i \\ 0 & \text{otherwise.} \end{cases}$

Then $f_D(q) = \sum_{p \in D} h_q^C(p)$ and $f_\chi(q) = n\,\mathbb{E}_{p \sim \chi}[h_q^C(p)]$. Let $H^C = \{h_q^C,\ \forall q\}$; to bound the error $\sup_q \frac{1}{n}|f_D(q) - f_\chi(q)|$, we bound

$\sup_{h \in H^C} \left|\frac{1}{n}\sum_{p \in D} h(p) - \mathbb{E}_{p \sim \chi}[h(p)]\right|. \qquad (4.8)$

The VC-dimension of $H^C$ is known to be 2d [121] (see Lemma A.12 of the technical report [10]), so applying VC theory bounds [11] (stated in Theorem A.11 of the technical report [10]) to Eq. 4.8 proves the theorem.

4.3.2.4 Completing the Proof

Let $\varepsilon_1$ and $\varepsilon_2$ be the two error parameters, and let $\hat{f}$ be the neural network in Theorem 2 that achieves error $\varepsilon_1$. Furthermore, let $E_1$ be the event that $\frac{1}{n}\|f_\chi^i - f_D^i\|_\infty \leq \varepsilon_2$ holds for a random D sampled from $\chi$. Observe that if $E_1$ holds, then by the triangle inequality, the event $E_2$, defined as $\frac{1}{n}\|\hat{f} - f_D^i\|_1 \leq \varepsilon_2 + \varepsilon_1$, also holds. Thus, $\mathbb{P}[E_1] \leq \mathbb{P}[E_2]$. Taking the complement of both events, and observing that the probability of the complement of $E_1$ is bounded by Theorem 3, yields Theorem 1.

4.3.3 Other Query Functions and Model Choices

The proof of the DQD bound for the SUM and COUNT aggregation functions decomposes the error into an approximation error and a sampling error. Theorem 2, which bounds the approximation error, is independent of the aggregation function used and applies to any function. To utilize the theoretical framework for other query functions, we need to bound the corresponding sampling error (Theorem 3 is specific to SUM and COUNT). In Sec. 4.3.3.1, we discuss this for the AVG aggregation function, and we provide a general discussion for other query functions in Sec. 4.3.3.2. In Sec. 4.3.3.3 we discuss the applicability of our analysis framework to other modeling choices.

4.3.3.1 AVG Aggregation Function

Our study of the AVG aggregation function is a variation of that of SUM and COUNT. We discuss the differences, then present our sampling error bound. First, we consider a variation of the distribution query function, defined as $\bar{f}_\chi^A(q) = \frac{f_\chi^S(q)}{f_\chi^C(q)}$, which we found to be easier to study theoretically ($\bar{f}_\chi^A$ is not the expected answer to the AVG query, but the expected answer to the SUM query divided by the expected answer to the COUNT query). Since it depends on the data distribution, it still allows us to study the impact of the data distribution on query answering. Second, we define LDQ as the Lipschitz constant of $\bar{f}_\chi^A$. LDQ in this case is not normalized by data size (as it was for SUM and COUNT in Sec. 4.3.1.1), since the magnitude of query answers for AVG does not change as data size changes. Third, for small query ranges, few points in the database may match the query, even if the data size is large. In such cases, for the AVG aggregation function, the observed query function will be a poor estimate of the distribution query function. For COUNT or SUM query functions, few data points in a range means that both the SUM and COUNT values are small, but this is not the case for the AVG function, whose distribution query answer is independent of the number of points sampled in the range. To capture this dependence on query range, we define $Q_\xi = \{q\ \text{s.t.}\ f_\chi^C(q) \geq \xi\}$. Our bound depends on $\xi$, which captures the probability of observing a point in a range.

Lemma 1. Recall that $f_D^A(q) = \frac{f_D^S(q)}{f_D^C(q)}$ is the AVG query function. Let $\mathrm{err}(q) = \frac{|\bar{f}_\chi^A(q) - f_D^A(q)|}{|\bar{f}_\chi^A(q)| + 1}$.
We have

$\mathbb{P}_{D \sim \chi}\left[\sup_{q \in Q_\xi} \mathrm{err}(q) \geq \varepsilon\right] \leq \kappa^{d+1} d \left(\frac{1 + \varepsilon}{\xi\varepsilon}\right)^d \exp\left(-\kappa^{-1}\left(\frac{\xi\varepsilon}{1 + \varepsilon}\right)^2 n\right),$

where κ is a universal constant.

Proof Sketch. The proof applies Theorem 3 to the numerator and denominator of the AVG query function (Sec. A.4 of the technical report [10]).

Combining Lemma 1 and Theorem 2 shows that discussions similar to those in Sec. 4.3.1.2 on the dependence on data distribution and size also apply to AVG queries. Lemma 1 also shows the impact of query range.

More Accurate on Larger Ranges. The impact of query range is modeled through the parameter $\xi$. Larger $\xi$ means the bound applies to larger ranges, where the confidence in the bound increases with $\xi$. Fixing the confidence level, observe that $\xi$ and $\varepsilon$ are negatively correlated: increasing the query ranges considered reduces the sampling error. Thus, if the LDQ of the query function is small (the approximation error is low) and the query range is large (the sampling error is low), a neural network can answer AVG RAQs accurately and efficiently. LDQ can be calculated similarly to the examples in Sec. 4.3.1.3.

4.3.3.2 Other Query Functions

Bounding the sampling error for queries with COUNT, SUM or AVG aggregation functions but different range predicates (e.g., a circular predicate (c, r) matching points p with $\|p - c\|_2 \leq r$) can be done similarly to the proof of Theorem 3 (only the range predicate's VC-dimension needs further study). However, the applicability of VC theory depends on the aggregation function.

4.3.3.3 DQD for Query Modelling Approaches

Our analysis framework allows for providing DQD bounds for other query modeling approaches, where we define query modeling as an approach that directly models the query answers. Furthermore, our analysis of the sampling error (Theorem 3, Lemma 1) does not depend on modeling choices and is generic to query modeling approaches. Thus, insights about the role of data size can be applicable to other query modeling approaches. For instance, consider answering COUNT queries on uniformly distributed data in the range [0, 1], as in Example 2. As the data size n increases, the number of data points matching a query $(c_1, r_1)$ becomes more similar to $r_1 \times n$, which is the expected number of points that fall in any range of length $r_1$. Thus, one can estimate the answer to the COUNT query with a model $\hat{g}$ defined as $\hat{g}(c_1, r_1) = n \times r_1$. Answering queries with $\hat{g}$ takes constant time (it is a single operation), and its accuracy improves as data size increases, as supported by Theorem 3.

4.4 NeuroSketch

The DQD bound formalizes how the complexity of answering RAQs relates to data and query properties. In this section, we present a novel complexity-aware neural network framework, NeuroSketch, that utilizes results from the DQD bound to allocate model capacity. We first present an overview of NeuroSketch, then discuss its details, and finally discuss how it can be used in real-world database systems together with our DQD bound.

Figure 4.4: NeuroSketch Framework

4.4.1 NeuroSketch Overview

The key idea behind the NeuroSketch design is that, even on the same database, some queries can be more difficult to answer than others (e.g., larger ranges vs. smaller ranges; see Sec. 4.3.3.1). By allocating more model capacity to queries that are more difficult, we can improve the performance. We do so by partitioning the query space and training independent neural networks for each partition. The partitioning allows diverting model capacity to harder queries, which our DQD bound allows us to quantify.
By creating models specialized for a specific part of the query space, query specialization allows us to control how model capacity is used across the query space.

Fig. 4.4 shows an overview of NeuroSketch. During a pre-processing step, (1) we partition and index the query space using a kd-tree. The partitioning is done based on our query specialization principle, with the goal of training a specialized neural network for different parts of the query space. (2) To account for the complexity of the underlying function in our partitioning, we merge the nodes of the kd-tree that are easier to answer based on our DQD bound, so that our model only has to specialize for the parts of the space that are estimated to be more difficult. (3) After some nodes of the kd-tree have been merged, we train a neural network for each of the remaining leaves of the kd-tree. Finally, to answer queries at query time, we traverse the kd-tree to find the leaf node a query falls inside, and perform a forward pass of the corresponding neural network.

4.4.2 NeuroSketch Details

Training. NeuroSketch uses a training query set $Q \subseteq \mathcal{Q}$. Q can be sampled from $\mathcal{Q}$ according to a workload distribution, or can be a uniform sample in the absence of any workload information. We do not assume access to workload information, but our framework can take advantage of the query workload if available.

Partitioning & Indexing. To partition the space, we choose partitions that are smaller where the queries are more frequent and larger where they are less frequent. This allows us to divert more model capacity to more frequent queries, thereby boosting their accuracy if workload information is available. We achieve this by partitioning the space such that all partitions are equally probable. To do so, we build a kd-tree on our query set, Q, where the split points in the kd-tree can be considered as estimates of the median of the workload distribution (conditioned on the current path from the root) along one of its dimensions. We build the kd-tree by specifying a maximum height, h, and splitting every node until all leaf nodes have height h, which creates $2^h$ partitions. Splitting of a node N is done based on the median of one of the dimensions of the subset, N.Q, of the queries, Q, that fall in N. Alg. 2 shows this procedure (a short Python sketch follows the pseudocode). To build an index with height h rooted at a node, $N_{root}$ (note that $N_{root}.Q = Q$), we call partition_&_index($N_{root}$, h, 0). We note that other partitioning methods (e.g., clustering the queries to perform partitioning) are also possible, but we observed the kd-tree to be a simple practical solution with little overhead that performed well.

Algorithm 2 partition_&_index(N, h, i)
Input: A kd-tree node N, tree height h and dimension i to split the node N on
Output: A kd-tree with height h rooted at N
1: if h = 0 then
2:   return
3: N.val ← median of N.Q along the i-th dimension
4: N.dim ← i
5: Q_left ← {q | q ∈ N.Q, q[N.dim] ≤ N.val}
6: Q_right ← {q | q ∈ N.Q, q[N.dim] > N.val}
7: for x ∈ {left, right} do
8:   N_x ← new node
9:   N_x.Q ← Q_x
10:  N.x ← N_x   ▷ Adding N_x as the left or right child of N
11:  partition_&_index(N_x, h − 1, (N.dim + 1) mod d)
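The pseudocode translates almost line for line into Python; the sketch below (ours, with queries stored as numpy rows) builds the equi-probable partitions:

```python
import numpy as np

class Node:
    """kd-tree node over the query space (Alg. 2)."""
    def __init__(self, queries):
        self.Q = queries                 # training queries in this node
        self.dim = self.val = None
        self.left = self.right = None

def partition_and_index(node, h, i):
    """Split until all leaves have height h, cycling through dimensions."""
    if h == 0:
        return
    d = node.Q.shape[1]
    node.dim, node.val = i, float(np.median(node.Q[:, i]))
    node.left = Node(node.Q[node.Q[:, i] <= node.val])
    node.right = Node(node.Q[node.Q[:, i] > node.val])
    for child in (node.left, node.right):
        partition_and_index(child, h - 1, (i + 1) % d)

# Example: 2^3 = 8 (approximately) equi-probable partitions of a 4-d query space.
rng = np.random.default_rng(0)
root = Node(rng.random((10_000, 4)))
partition_and_index(root, 3, 0)
```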
Merging. We merge some of the kd-tree leaves using the DQD bound. As discussed in Sec. 4.3.1.4, LDQ can be difficult to measure in practice, so we use AQC as a proxy, as shown in Alg. 3. At each iteration, we first measure the approximation complexity of the leaf nodes in line 3, where the approximation complexity, $\mathrm{AQC}_N$, for a leaf node N is calculated based on the queries that fall in the node N. Then, we mark the node with the smallest $\mathrm{AQC}_N$ for merging. When two sibling leaf nodes are marked, they are merged together, as shown in line 8. The process continues until the number of remaining leaf nodes reaches the desired threshold. In practice, we observed that the quantity $\mathrm{AQC}_N$ is correlated with the error of the neural networks, which empirically justifies this design choice (see Sec. 4.5.5).

Algorithm 3 merge(N, s)
Input: kd-tree root node N and desired number of partitions s
Output: kd-tree with s leaf nodes
1: repeat
2:   for all leaf nodes N do
3:     $\mathrm{AQC}_N \leftarrow \frac{1}{\binom{|N.Q|}{2}} \sum_{q, q' \in N.Q,\, q \neq q'} \frac{|f_D(q) - f_D(q')|}{\|q - q'\|}$
4:   N ← the leaf node with the smallest $\mathrm{AQC}_N$
5:   N.marked ← true
6:   for all sibling leaf nodes N1, N2 do
7:     if N1.marked = N2.marked = true then
8:       Merge N1 and N2
9: until there are s leaf nodes

Training Neural Networks. We train an independent model for each of the remaining leaf nodes after merging. For a leaf node, N, the training process is a typical supervised learning procedure, shown in Alg. 4 for completeness. The answers to the training queries, used in line 4 of Alg. 4, can be collected through any known algorithm, where a typical algorithm iterates over the points in the database, pruned by an index, and for a candidate data point checks whether it matches the RAQ predicate or not. This is a pre-processing step and is only performed once to train our model. The process is embarrassingly parallelizable across training queries, if preprocessing time is a concern. Furthermore, if the data is disk-resident, we keep partial SUM/COUNT answers for each training query while scanning the data from disk, so a single scan of the data is sufficient (similar to building disk-based indexes) to collect the training query answers. Once trained, NeuroSketch is much smaller than the data and expected to fit in memory, so it will be much faster than disk-based solutions.

Algorithm 4 Model Training
Input: A dataset D, a kd-tree node N
Output: Neural network $\hat{f}$ for node N
1: Initialize the parameters, θ, of a neural network $\hat{f}(\cdot; \theta)$
2: repeat
3:   Sample $Q_{batch}$, a subset of N.Q
4:   Update θ in direction $-\nabla_\theta \frac{\sum_{q \in Q_{batch}} (\hat{f}(q; \theta) - f_D(q))^2}{|Q_{batch}|}$
5: until convergence
6: return $\hat{f}$

We use the Adam optimizer [60] for training and train a fully connected neural network for each of the partitions. The architecture is the same for all the partitions and consists of $n_l$ layers, where the input layer has dimensionality d, the first layer consists of $l_{first}$ units, the next layers have $l_{rest}$ units and the last layer has 1 unit. We use ReLU activations for all layers (except the output layer). $n_l$, $l_{first}$ and $l_{rest}$ are hyper-parameters of our model. Although approaches from neural architecture search [170] can be applied to find them, they are computationally expensive. Instead, we do a grid search to find the hyper-parameters so that NeuroSketch satisfies the space and time constraints in Problem 1 while maximizing its accuracy.

Answering Queries. As shown in Alg. 5, to answer a query, q, the kd-tree is first traversed to find the leaf node that the query q falls into. The answer to the query is a forward pass of the neural network corresponding to that leaf node.

Algorithm 5 answer_query(N, q)
Input: kd-tree root node N and query q
Output: Answer to q
1: while N is not a leaf do
2:   if q[N.dim] ≤ N.val then
3:     N ← N.left
4:   else
5:     N ← N.right
6: return N.model.forward_pass(q)
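Query answering is then a kd-tree descent plus one forward pass. A sketch reusing the `Node` class from the partitioning example, assuming a trained Keras-style `model` has been attached to each leaf during training (that field is our assumption, not part of Alg. 2):

```python
def answer_query(node, q):
    """Alg. 5: descend to the leaf containing q and run its network."""
    while node.left is not None:          # internal nodes have two children
        node = node.left if q[node.dim] <= node.val else node.right
    return node.model.predict(q.reshape(1, -1), verbose=0)[0, 0]
```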
4.4.3 General RAQs and Real-World Application

General RAQs. NeuroSketch can be used for more general RAQs than those defined in Sec. 4.2. An RAQ consists of a range predicate and an aggregation function, AGG. In NeuroSketch, we make no assumption on the aggregation function AGG, and our empirical results evaluate NeuroSketch on SUM, AVG, COUNT, MEDIAN and STD. We consider range predicates that can be represented by a query instance q and a binary predicate function, $P_f(q, x)$, that takes as inputs a point in the database, $x \in D$, and the query instance q, and outputs whether x matches the predicate or not. Then, given a predicate function and an aggregation function, range aggregate queries can be represented by the query function $f_D(q) = \mathrm{AGG}(\{x : x \in D,\ P_f(x, q) = 1\})$. We avoid specifying how the predicate function should be defined, to keep our discussion generic to arbitrary predicate functions, but some examples follow. To represent the RAQs of the form discussed in Sec. 4.2, q can be defined as lower and upper bounds on the attributes, and $P_f(q, x)$ defined as the WHERE clause in Sec. 4.2. We can also have $P_f(q, x)$ indicate whether $x[1] > x[0] \times q[0] + q[1]$, so that $P_f(q, x)$ and q define a half-space above a line specified by q. For many applications, WHERE clauses in SQL queries are written in a parametric form [95, 96, 97] (e.g., WHERE X1 > ?param1 OR X2 > ?param2, where ?param is the common SQL syntax for parameters in a query). Such queries can be represented as query functions by setting q to be the parameters of the WHERE clause.
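These predicate-function examples are straightforward to state in code; the sketch below (ours; aggregating over the last column as the measure attribute is our illustrative convention) shows the axis-aligned range predicate of Sec. 4.2 and the half-space predicate as interchangeable callables:

```python
import numpy as np

def pf_range(q, x):
    """Axis-aligned range predicate of Sec. 4.2: q = (c_1..c_d, r_1..r_d);
    x matches if c_i <= x_i < c_i + r_i for every attribute i."""
    d = len(x)
    c, r = np.asarray(q[:d]), np.asarray(q[d:])
    return bool(np.all((np.asarray(x) >= c) & (np.asarray(x) < c + r)))

def pf_halfspace(q, x):
    """Half-space above a line: x matches if x[1] > x[0] * q[0] + q[1]."""
    return x[1] > x[0] * q[0] + q[1]

def query_function(D, q, pf, agg=np.mean):
    """Generic RAQ: f_D(q) = AGG over records matching Pf(q, x),
    aggregating the last column as the measure attribute."""
    matched = np.array([x[-1] for x in D if pf(q, x)])
    return float(agg(matched)) if matched.size else 0.0
```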
NeuroSketch and DQD in Practice. Possible RAQs correspond to various query functions, and NeuroSketch learns different models for different query functions. This follows the query specialization design principle, where a specialized model is learned to answer a query function well. A query processing engine can be used to decide which query functions to use NeuroSketch for. This can happen both on the fly, when answering queries, and during database maintenance. During maintenance, the DQD bound can be used to decide which queries to build NeuroSketch for (e.g., queries with small LDQs). Moreover, after NeuroSketch is built for a query function, DQD can be used to decide on the fly whether or not to use NeuroSketch for a specific query instance. For instance, queries with large ranges (which NeuroSketch answers accurately according to DQD) can be answered by NeuroSketch, while queries with smaller ranges can be asked directly from the database.

Figure 4.5: Measure column distribution (shared y-axis); panels: PM (PM2.5, µg/m³), TPC (net profit, $), VS (visit duration, h), GMM (measure column).
Figure 4.6: RAQs on different datasets: (a) error (normalized MAE), (b) query time (µs), (c) storage (MB), comparing NeuroSketch, TREE-AGG, VerdictDB, DeepDB and DBEst.

4.5 Empirical Study

4.5.1 Experimental Setup

System Setup. Experiments are performed on a machine with Ubuntu 18.04 LTS, an Intel i9-9980XE CPU (3GHz), 128GB RAM and a GeForce RTX 2080 Ti NVIDIA GPU.

Datasets. Table 4.1 shows the datasets used in our experiments, with details discussed below. Fig. 4.5 shows the histograms of the measure column values used in the experiments.

PM. PM [68] contains Fine Particulate Matter (PM2.5) measurements of air pollution and other statistics (e.g., temperature) for locations in Beijing. Similar to [75], PM2.5 is the measure attribute.

TPC-DS. We used TPC-DS [85], a synthetic benchmark dataset, with scale factors 1 and 10, respectively referred to as TPC1 and TPC10. Since we study RAQs, we use the numerical attributes in the store_sales table as our dataset, and net_profit as the measure attribute.

Table 4.1: Dataset information

Dataset  | G5, G10, G20 | PM [68]   | TPC1, TPC10 [85]       | VS
# Points | 10^5         | 4.17×10^4 | 2.65×10^6, 2.65×10^7   | 10^5
Dim      | 5, 10, 20    | 4         | 13                     | 3

Figure 4.7: Varying query range: (a) error, (b) query time.
Figure 4.8: Varying no. of active attributes: (a) error, (b) query time.

Veraset. As in our running example, we use the Veraset dataset, which contains anonymized location signals of cell phones across the US collected by Veraset [135], a data-as-a-service company. Each location signal contains an anonymized id, a timestamp and the latitude and longitude of the location. We performed stay point detection [153] on this dataset (to, e.g., remove location signals when a person is driving) and extracted location visits where a user spent at least 15 minutes; for each visit, we also recorded its duration. 100,000 of the extracted location visits in downtown Houston were sampled to form the dataset used in our experiments, which contains three columns: latitude, longitude and visit duration. We let visit duration be the measure attribute.

GMMs. We study data dimensionality with synthetic 5, 10 and 20 dimensional data from Gaussian mixture models (GMMs) (100 components, random means and co-variances), referred to as G5, G10 and G20. GMMs are often used to model real data distributions [114].

Query Distribution. Our experiments consider query functions consisting of AVG, SUM, STDEV (standard deviation) and MEDIAN aggregation functions together with two different predicate functions. First, similar to [75], our experiments show the performance on the predicate function defined by the WHERE clause in Sec. 4.2. We consider up to 3 active attributes in the predicate function. To generate a query instance with r active attributes, we first select, uniformly at random, r active attributes (from a total of d possible attributes). Then, for the selected active attributes, we randomly generate a range (a short code sketch of this procedure follows below). Unless otherwise stated, the range for each active attribute is uniformly distributed. This can be thought of as a more difficult scenario for NeuroSketch, as it requires approximating the query function equally well over all of its domain, while also giving a relative advantage to the other baselines, since they are unable to utilize the query distribution. Unless otherwise stated, for all datasets except Veraset, we report the results for one active attribute and use the AVG aggregation function. For Veraset, we report the results setting latitude and longitude as active attributes. Second, to show how NeuroSketch can be applied to application-specific RAQs, in Sec. 4.5.2.2 we discuss answering the query of median visit duration given a general rectangle on the Veraset dataset.

Figure 4.9: Varying agg. function: (a) error, (b) query time.
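The query-generation procedure just described might be implemented as follows (our sketch; the exact range sampling used in the experiments may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_query(d, n_active):
    """One query instance (c, r) with n_active active attributes.
    Inactive attributes get (c, r) = (0, 1), i.e., no restriction."""
    c, r = np.zeros(d), np.ones(d)
    for i in rng.choice(d, size=n_active, replace=False):
        c[i] = rng.uniform(0.0, 1.0)
        r[i] = rng.uniform(0.0, 1.0 - c[i])   # keep c_i + r_i <= 1
    return np.concatenate([c, r])

train_queries = np.array([gen_query(d=4, n_active=1) for _ in range(1000)])
```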
Measurements. In addition to query time and space used, we report the normalized absolute error for a query in the set of test queries, T, defined as

$\frac{|f_D(q) - \hat{f}(q; \theta)|}{\frac{1}{|T|}\sum_{q \in T} |f_D(q)|}.$

We ensure that none of the test queries are in the training set. The error is normalized by the average magnitude of the query results to allow for comparison over different data sizes and datasets whose results follow different scales.

Learned Baselines. We use DBEst [75] and DeepDB [51] as the state-of-the-art model-based AQP engines. Both algorithms learn data models to answer RAQs. We use the open-source implementation of DBEst available at [74] and of DeepDB at [50]. For DBEst, we perform a grid search on its MDN architecture (number of layers, layer width, number of Gaussian components) and optimize it per dataset. For DeepDB, we optimize its RDC threshold for each dataset. We do not use [129] as a baseline, which samples new data points at query time from a learned model to answer queries, because the results in [129] show worse accuracy and the same query time ([129] improves storage) compared with sampling directly from the data (which we have included as a baseline). We also modified NeuroCard [148], a learned cardinality estimation method, to answer RAQs, but we observed the modified approach to perform worse than DeepDB on RAQs. We do not present the results for [148], since it is not designed for RAQs and performed worse than DeepDB.

Sampling-based Baselines. We use VerdictDB [99] as our sampling-based baseline, using its publicly available implementation [98]. We also implemented a sampling-based baseline designed specifically for range aggregate queries, referred to as TREE-AGG. In a pre-processing step and for a parameter k, TREE-AGG samples k data points from the database uniformly. Then, for performance enhancement and easy pruning, it builds an R-tree index on the samples, which is well-suited for range predicates. At query time, by using the R-tree, finding the data points matching the query is done efficiently, and most of the query time is spent on iterating over the points matching the predicate to compute the aggregate attribute required. For both TREE-AGG and VerdictDB, we set the number of samples so that the error is similar to that of DeepDB.

NeuroSketch Training and Evaluation. NeuroSketch training is performed in Python 3.7 and TensorFlow 2.1, with the implementation publicly available at [9]. Model training is done on GPU. Models are saved after training. For evaluation, a separate program written in C++ and running on CPU loads the saved model and, for each query, performs a forward pass of the model. Model evaluation is done in C++ and on CPU, without any parallelism, for all of the algorithms. Unless otherwise stated, the model depth is set to 5 layers, with the first layer consisting of 60 units and the rest of 30 units. The height of the kd-tree is set to 4, and the parameter s = 8, so that the kd-tree has 8 leaf nodes after merging.

4.5.2 Baseline Comparisons

4.5.2.1 Results Across Datasets

Fig. 4.6 (a) shows the error on different datasets, where NeuroSketch provides a lower error rate than the baselines. Fig. 4.6 (b) shows that NeuroSketch achieves this while providing multiple orders of magnitude improvement in query time. NeuroSketch has a relatively constant query time because, across all datasets, NeuroSketch's architecture only differs in its input dimensionality, which only impacts the number of parameters in the first layer of the model and thus changes the model size very little.
Due to our use of small neural networks, we observe that the model inference time for NeuroSketch is very small, on the order of a few microseconds, while DeepDB and DBEst answer queries multiple orders of magnitude more slowly. DBEst does not support multiple active attributes, and thus its performance is not reported for VS. The results on G5 to G20 show the impact of data dimensionality on the performance of the algorithms. As suggested by our theoretical results, for NeuroSketch, the error increases as dimensionality increases. A similar impact can be seen for DeepDB, manifesting itself in increased query time. Furthermore, the R-tree index of TREE-AGG often allows it to perform better than the other baselines, especially for low-dimensional data. Finally, Fig. 4.6 (c) shows the storage overhead of each method. NeuroSketch answers queries accurately while taking less than one MB of space, whereas DeepDB's storage overhead increases with data size, to more than one GB.

4.5.2.2 Results Across Different Workloads

We use TPC1 and VS to study the impact of the query workload on the performance of the algorithms. Unless otherwise stated, results are on TPC1. Due to its poor performance on TPC1 and its lack of support for multiple active attributes (needed for VS queries), we exclude DBEst from the experiments here.

Impact of Query Range. We set the query range to x percent of the domain range, for x ∈ {1, 3, 5, 10}, and present the results in Fig. 4.7. The error of NeuroSketch increases for smaller query ranges, as our theoretical results suggest. As mentioned before, this is because for smaller ranges NeuroSketch needs to memorize where exactly each data point is, rather than learning the overall distribution of the data points. Nevertheless, NeuroSketch provides better accuracy than the baselines for query ranges of at least 3 percent, and answers queries orders of magnitude faster for all ranges. If more accurate answers are needed for smaller ranges, increasing the model size of NeuroSketch can improve its accuracy at the expense of query time (see Sec. 4.5.3).

Impact of No. of Active Attributes. In Fig. 4.8, we vary the number of active attributes in the range predicate from one to three. The accuracy of all the algorithms drops when there are more active attributes, with NeuroSketch outperforming the other algorithms in both accuracy and query time. Having more active attributes is similar to having smaller ranges, since fewer points will match the query predicate. Thus, our theoretical results explain the drop in accuracy.

Impact of Aggregation Function. Fig. 4.9 shows how different aggregation functions impact the performance of the algorithms. NeuroSketch outperforms the other algorithms for all aggregation functions. The VerdictDB and DeepDB implementations do not support STDEV, so no result is reported for STDEV for these methods.

Figure 4.10: Time/Space/Accuracy Trade-Off with Different Model Architectures: (a) accuracy/time trade-off, (b) accuracy/space trade-off; lines (h, 120, 5), (h, 30, 5), (0, w, 5), (0, 30, d), (0, 120, d), DeepDB, TREE-AGG, VerdictDB.
Figure 4.11: Learned NeuroSketch Visualization: (a) depth = 10, (b) depth = 5.

Median Visit Duration Query Function. We consider the query of median visit duration given a general rectangular range.
This is a common query for real-world location data, and data aggregators such as SafeGraph [116] publish such information. Table 4.2 shows the results for this query function. Neither DeepDB nor DBEst can answer this query: the predicate function is not supported by those methods, and extending them to support it is not trivial. NeuroSketch, on the other hand, can answer this query function with performance similar to other queries on the VS dataset. Although VerdictDB can be extended to support this query function, its current implementation does not support the aggregation function, so we do not report results for VerdictDB.

[Figure 4.12: Generalization Study — (a) Impact of Training Size, (b) Test-to-Train Distance]

Table 4.2: Median visit duration for general rectangles

Metric          | NeuroSketch | TREE-AGG | DeepDB & VerdictDB
Norm. MAE       | 0.045       | 0.052    | N/A
Query time (µs) | 25          | 601      | N/A

4.5.3 Model Architecture Analysis

4.5.3.1 Time/Space/Accuracy Trade-Offs of Model Architectures

Setup. We study the different time/space/accuracy trade-offs achievable by NeuroSketch and the other methods in Fig. 4.10 under different system parameters. For NeuroSketch, we vary the number of layers (the depth of the neural network), d, the number of units per layer (the width of the neural network), w, and the height of the kd-tree, h, to see their impact on time/space/accuracy (we avoid merging kd-tree nodes here, and study the impact of merging separately in Sec. 4.5.5). Fig. 4.10 shows several possible combinations of the hyperparameters. For each line in Fig. 4.10, NeuroSketch is run with two of the hyperparameters kept constant and one changing. The line labels are of the form (height, width, depth), where two of height, width and depth have numerical values and are the constant hyperparameters for that particular line, while the remaining one is written as h, w or d and is the variable hyperparameter for the plotted line. For example, the line labelled (h, 120, 5) corresponds to a NeuroSketch architecture with 120 units per layer and 5 layers, where each plotted point corresponds to a different value of the kd-tree height, and the label (0, 30, d) means the experiments are run with varying neural network depth, kd-tree height 0 (i.e., only one partition) and neural network width 30. The hyperparameter values are as follows. For the lines (h, 120, 5) and (h, 30, 5), the kd-tree height is varied from 0 to 4; for the line labelled (0, w, 5), the neural network width is in {15, 30, 60, 120}; and for the lines (0, 120, d) and (0, 30, d), the neural network depth is in {2, 5, 10, 20}. TREE-AGG and VerdictDB are plotted for sampling sizes of 100%, 50%, 20% and 10% of the data size. For DeepDB, we report results for RDC thresholds in [0.1, 1] (the minimum error is at RDC threshold 0.3; the error increases for values less than 0.1 or more than 1).

Results. Fig. 4.10 (a) shows the trade-off between query time and accuracy.
NeuroSketch performs well when fast answers are required but some accuracy can be sacrificed, while if accuracy close to an exact answer is required, TREE-AGG can perform better. Fig. 4.10 (b) shows the trade-off between space consumption and accuracy. Similar to the time/accuracy trade-off, we observe that when the error requirement is not too stringent, NeuroSketch can answer queries while taking only a very small fraction of the data size. Finally, NeuroSketch outperforms DeepDB in all the metrics. Comparing TREE-AGG with VerdictDB shows that, on this particular dataset, the sampling strategy of VerdictDB does not improve upon the uniform sampling of TREE-AGG, while the R-tree index of TREE-AGG improves query time over VerdictDB.

Moreover, Fig. 4.10 shows the interplay between the different hyperparameters of NeuroSketch. Increasing the depth and width of the neural networks improves accuracy, but after a certain accuracy level the improvement plateaus, and accuracy even worsens if the depth of the neural network is increased while the width is too small (i.e., the red line). Nevertheless, the partitioning method allows for further improving the time/accuracy trade-off, as it improves accuracy at almost no cost to query time. We also observe that the kd-tree improves the space/accuracy trade-off compared with increasing the width or depth of the neural networks. This shows that our paradigm of query specialization is beneficial: learning multiple specialized models, each for a different part of the query space, performs better than learning a single model for the entire space. We discuss these results in the context of our DQD bound in Sec. 4.5.7.

4.5.3.2 Visualizing NeuroSketch for Different Model Depths

Fig. 4.11 shows the function NeuroSketch has learned for our running example, for two neural networks with the same architecture but with depths 5 and 10. Comparing Fig. 4.11 with Fig. 4.1, we observe that NeuroSketch learns a function with patterns similar to the ground truth, but with the sharp drops in the output smoothed out. We also observe that the learned function becomes more similar to the ground truth as we increase the number of parameters. Note that the neural networks are of size about 9% and 3.8% of the data size.

4.5.4 NeuroSketch Generalization Analysis

Fig. 4.12 studies the generalization ability of NeuroSketch from training to test queries across datasets. The results are for NeuroSketch with tree height 0 (i.e., no partitioning), neural network depth 5, and neural network widths of 30 and 120. Fig. 4.12 (a) shows that a training size of about 100,000 sampled query points is sufficient for both architectures to achieve close to their lowest error. Furthermore, when the sample size is very small, the smaller architecture generalizes better, while the larger neural network improves performance when enough samples are available.

[Figure 4.13: Preprocessing Time Study — (a) Training Set Generation, (b) Architecture Search, (c) Training Duration]

[Figure 4.14: DQD Bound on Synthetic Datasets]

In Fig. 4.12 (b), we plot the average Euclidean distance from test queries to their nearest training query, referred to as dist. NTQ.
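A minimal sketch of how dist. NTQ can be computed, assuming queries are represented as vectors; the per-dimension scaling to [0, 1] mirrors the normalization discussed next, and the function name is ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def dist_ntq(train_queries, test_queries):
    # Scale all query vectors to [0, 1] per dimension, then average the Euclidean
    # distance from each test query to its nearest training query.
    allq = np.vstack([train_queries, test_queries])
    lo, hi = allq.min(axis=0), allq.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    tree = cKDTree((train_queries - lo) / scale)
    dists, _ = tree.query((test_queries - lo) / scale, k=1)
    return dists.mean()
```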
To compare across datasets, the datasets are scaled to be in [0, 1] for this plot; the difference in dist. NTQ values across datasets is due to different data dimensionalities and numbers of active attributes in the queries. We ensure that none of the test queries appear in the training set but, as the number of training samples increases, dist. NTQ decreases. Nonetheless, when the model size is small, even though increasing the number of samples beyond 100,000 decreases dist. NTQ, model accuracy does not improve. This suggests that for small neural networks, the error is due to the capacity limit of the model in learning the query function, and not to a lack of training data.

4.5.5 Ablation Study of Partitioning

We study the impact of merging in the preprocessing step of NeuroSketch. Recall that we set the tree height to 4, so that the partitioning step creates 16 partitions that are merged using AQC, after which 8 partitions remain. We compare this approach with two alternatives: (1) we perform no partitioning and train a single neural network to answer any query; and (2) we set the tree height to 3 so that we obtain 8 partitions without performing any merging. Table 4.3 shows the result of this comparison. First, performing partitioning, either with or without merging, is better than no partitioning across all datasets. Second, for almost all datasets, merging provides better or equal performance compared with no merging. Thus, in practice, using AQC as an estimate of function complexity to merge nodes is beneficial. In fact, we observed a correlation coefficient of 0.61 between AQC and the error of the trained models, which quantifies the benefit of using AQC as an estimate of function complexity. It also implies that AQC can be used to decide whether a query function is too difficult to approximate. For instance, in a database system, the query optimizer may build NeuroSketches for query functions with smaller AQC, and use a default query processing engine to answer query functions with larger AQC.

Table 4.3: Improvement of partitioning over no partitioning

Dataset              | Normalized AQC STD | % Improved (Merging) | % Improved (No Merging)
VS                   | 1.02               | 47.6                 | 44.1
PM                   | 0.30               | 22.8                 | 18.6
TPC1                 | 0.17               | 23.5                 | 6.7
G5                   | 0.41               | 12.0                 | 13.2
G10                  | 0.10               | 6.8                  | 6.8
G20                  | 0.07               | 14.6                 | 14.6
Correlation with STD |                    | 0.87                 | 0.94

[Figure 4.15: 2D data subsets]

[Figure 4.16: Learned and True Query Functions on 2D Datasets — (a) VS (2D), (b) PM (2D), (c) TPC (2D)]

Furthermore, Table 4.3 shows that the benefit of partitioning is dataset dependent. We observed a strong correlation between the standard deviation of AQC estimates across the leaf nodes of the kd-tree and the improvement gained from partitioning. Specifically, let $R = \{AQC_N, \forall \text{ leaf } N\}$, as calculated in line 3 of Alg. 3. We calculate STD(R)/AVG(R) as the normalized AQC STD for each dataset; this measurement is reported in the second column of Table 4.3. The last row of the table shows the correlation of the improvement of the partitioning methods with this measure. The large correlation suggests that when the difference in the complexity of approximation across different parts of the space is large, partitioning is more beneficial. This matches our intuition for using partitioning, where our intention is to allow specialized models to focus on the complex parts of the query space: partitioning is beneficial if some parts of the space are more complex than others.
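As a small illustration, the normalized AQC STD and its correlation with the improvement columns can be reproduced directly from Table 4.3 (values copied from the table; variable names are ours):

```python
import numpy as np

def normalized_aqc_std(aqc_per_leaf):
    # R = {AQC_N for each kd-tree leaf N}; normalized AQC STD = STD(R) / AVG(R)
    r = np.asarray(aqc_per_leaf, dtype=float)
    return r.std() / r.mean()

norm_std = np.array([1.02, 0.30, 0.17, 0.41, 0.10, 0.07])   # VS, PM, TPC1, G5, G10, G20
improved = np.array([47.6, 22.8, 23.5, 12.0, 6.8, 14.6])    # "% Improved (Merging)" column
print(np.corrcoef(norm_std, improved)[0, 1])                # ~0.87, the table's last row
```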
4.5.6 NeuroSketch Preprocessing Time Analysis

Training Set Generation. Fig. 4.13 (a) shows that generating the training set of 100,000 queries takes at most 60 seconds, with most datasets taking only a few seconds. The reported results are obtained by answering the queries in parallel on GPU; the queries are answered by scanning all the database records per query, with no indexing. We expect faster training set generation when using indexes.

Architecture Search. Fig. 4.13 (b) shows the time needed to perform the architecture search for each dataset. We use Optuna [92], a tool that uses Bayesian optimization to perform hyperparameter search. We use the query time and space requirements (to solve Problem 1) to limit the maximum number of neural network parameters, and then use Optuna to find the width and depth of the neural network that minimize the error. We run Optuna for a total of one hour and set the model size limit to be equal to the neural network size in our default setting. For a point in time t, we report the ratio of the error of the best model found by Optuna up to time t to the error of our default model architecture; this ratio over time is plotted in Fig. 4.13 (b). The figure shows that Optuna finds a model that provides accuracy within 10% of our default architecture in around 20 minutes. It also finds a better architecture than our default for the VS dataset, showing that NeuroSketch accuracy can be improved by dataset-specific parameter optimization. Optuna trains models in parallel (multiple models fit in a single GPU) and stops training early if a setting is not promising, so that more than 300 parameter settings are evaluated in the presented one hour for each dataset.

Training Time. Fig. 4.13 (c) shows the accuracy of the neural networks during training. Models converge within 5 minutes of training across datasets, and the error fluctuates when training for longer. Models with larger width converge faster.

4.5.7 Confirming the DQD Bound with NeuroSketch

Model Size and DQD. We revisit Fig. 4.10 in the context of our DQD bound. First, unsurprisingly, we observe that the overall trend of improved accuracy for larger models matches DQD. More interestingly, Fig. 4.10 shows that increasing model size increases accuracy, but only up to a certain point, after which further increasing the model size has little impact. This also matches DQD where, in Theorem 1, increasing the size, which reduces ε1, only reduces the total error (i.e., ε1 + ε2) until ε1 = 0. After ε1 = 0, the error cannot be reduced further by increasing the number of parameters. As discussed in Sec. 4.3.1.2, this is because fD, unlike fχ, may be a discontinuous function, so the error of a neural network is not guaranteed to ever go to zero (i.e., Theorem 2 does not apply to fD).

Data Size, LDQ and DQD. We corroborate the observations made in the DQD bound with NeuroSketch using synthetic datasets, so that we can calculate the corresponding LDQs. We sample n points from uniform, Gaussian and two-component GMM distributions (see Sec. 4.3.1.3 for how to calculate their LDQs) and answer RAQs with the COUNT aggregation function on the sampled datasets, varying the value of n. We train NeuroSketch with partitioning disabled to isolate the neural network's ability to answer queries. Fig. 4.14 shows the results of this experiment. In Fig. 4.14 (a), we fix the neural network architecture so that query time and space complexity are fixed (we use one hidden layer with 80 units) and train NeuroSketch for different data sizes and distributions.
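A minimal sketch of this synthetic setup, shown in one dimension for brevity; the distribution parameters below are illustrative and not necessarily those used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(dist, n):
    # Synthetic data distributions as in Sec. 4.3.1.3 (parameters illustrative).
    if dist == "uniform":
        return rng.uniform(0, 1, n)
    if dist == "gaussian":
        return rng.normal(0.5, 0.1, n)
    if dist == "gmm":  # two-component Gaussian mixture
        comp = rng.integers(0, 2, n)
        return np.where(comp == 0, rng.normal(0.25, 0.05, n), rng.normal(0.75, 0.05, n))

def count_raq(data, c, r):
    # COUNT of points falling in the range [c, c + r)
    return int(np.count_nonzero((data >= c) & (data < c + r)))
```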
We observe that, as the DQD bound suggests, the error decreases for larger data sizes. Furthermore, the uniform distribution, which has the smallest LDQ, achieves the lowest error, followed by the Gaussian, whose LDQ is larger, and finally the GMM, which has the largest LDQ. Fig. 4.14 (b) shows similar observations, but with the accuracy fixed to 0.01 and the space and time complexity allowed to change. Specifically, we perform a grid search on model width, training NeuroSketch for different model widths and finding the smallest width for which the error is at most 0.01. We report the query time of the model found by this grid search in Fig. 4.14 (b). As the DQD bound suggests, query time and space consumption decrease as data size increases. The same observations hold for storage cost; we have not plotted those results, as they look identical to Fig. 4.14 (b) (both storage cost and query time are constant multiples of the number of parameters of the neural network, and thus of each other). Interestingly, for small data sizes, the difficulty of answering queries across distributions does not follow their LDQ order: the uniform distribution is harder when n = 100 than the Gaussian distribution. When the data size is small, a neural network has to memorize the location of all the data points, which can be more difficult for the uniform distribution, as the observed points may not follow any recognizable pattern. Nonetheless, as the data size increases, the error, query time and space complexity improve as suggested by the DQD bound, and the difficulty of answering queries from different distributions depends on their LDQ.

DQD and Real/Benchmark Distributions. To further investigate the impact of data distribution on accuracy, we visualize 2D subsets of PM, VS and TPC1. We perform RAQs that ask for the AVG of the measure attribute where the predicate column falls between c and c + r, where r is fixed to 10% of the column range and c is the query variable (and the input to the query function). Fig. 4.15 plots the datasets, and Fig. 4.16 shows the corresponding true query functions and the functions learned by NeuroSketch (without partitioning). Sharp changes in the VS dataset cause difficulties for NeuroSketch, leading to inaccuracies around such sharp changes. This is reflected in both the AQC and MAE values shown in Table 4.4 (Norm. AQC is the AQC of the functions after they are scaled to [0, 1] to allow for comparison across datasets), where PM and TPC, which have fewer such changes, have smaller AQC and MAE.

Table 4.4: DQD Bound on 2D Real/Benchmark Datasets

Dataset   | VS (2D) | PM (2D) | TPC (2D)
Norm. MAE | 0.035   | 0.014   | 0.0029
Norm. AQC | 1.28    | 0.95    | 0.77

We use Fig. 4.16 (a) to illustrate why abrupt changes (i.e., a large LDQ) make function approximation difficult. Observe in Fig. 4.16 (a) such an abrupt change in the query function where the latitude is between 29.73 and 29.8 (the beginning and end of the linear piece are marked in the figure with vertical lines). We see that a single linear piece is assigned to approximate the function in that range (recall that ReLU neural networks are piece-wise linear functions). Such a linear piece has high error, as it cannot capture the (non-linear) change in the function. The error resulting from this approximation grows as the magnitude of the abrupt change in the true function increases. Alternatively, more linear pieces are needed to model the change in the function, which results in a larger neural network.
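For concreteness, a minimal sketch of the 1D query functions visualized in Fig. 4.16, assuming each 2D subset is given as a predicate column and a measure column (the names and the value returned for empty ranges are our own choices):

```python
import numpy as np

def avg_query_function(predicate_col, measure_col, r):
    # f(c) = AVG(measure) over records with c <= predicate < c + r,
    # where r is fixed to 10% of the predicate column's range.
    def f(c):
        mask = (predicate_col >= c) & (predicate_col < c + r)
        return float(measure_col[mask].mean()) if mask.any() else 0.0
    return f

# e.g., for the VS (2D) subset: predicate = latitude, measure = visit duration
# r = 0.1 * (lat.max() - lat.min()); f = avg_query_function(lat, duration, r)
```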
4.6 Related Work

Answering RAQs. Methods for answering RAQs can be divided into sampling-based methods [48, 5, 22, 99] and model-based methods [27, 118, 75, 129, 51, 156, 7]. Sampling-based methods use different sampling strategies (e.g., uniform sampling [48], stratified sampling [22, 99]) and answer queries based on the samples. Model-based methods develop a model of the data that is used to answer queries; the models can take the form of histograms, wavelets, data sketches (see [27] for a survey), or learning-based regression and density models [75, 129, 51]. In the case of learned models, a model is created that learns the data, in contrast with NeuroSketch, which predicts the query answer. That is, the regression and density based models of [75], the generative model of [129] and the sum-product network of [51] are models of the data, created independently of potential queries. We experimentally showed that our modeling choice allows for orders of magnitude performance improvement. Secondly, data models can answer only specific queries (e.g., [75] answers only COUNT, SUM, AVG, VARIANCE, STDDEV and PERCENTILE aggregations), while our framework can be applied to any aggregation function. Finally, our theoretical analysis of using a learned model is novel, in that it studies why and when a neural network can perform well; such a study is missing across all existing learning-based methods. Furthermore, learned cardinality estimation [61, 142, 55, 149, 148] is related to our work, in that it answers COUNT queries. However, we consider general aggregation functions, for which such methods do not apply (we also observed that modifying a representative of such approaches, [148], to answer RAQs performed worse than DeepDB in practice). [61] uses neural networks for cardinality estimation, so our theoretical results are applicable to justify its success. Furthermore, [55] theoretically studies the training size needed to learn the selectivity function, which is orthogonal to our work.

Neural Network Approximation. To approximate a function f with a neural network, similar to Theorem 2 but under different settings, existing work [102, 16, 71, 123, 150, 152, 151, 56, 124] characterizes the neural network size, s, in terms of its error, ε, in the form $s = C_1\varepsilon^{-dC_2}$, where $C_1$ and $C_2$ depend on properties of f. The works differ in their notions of size and their assumptions on f, leading to different $C_1$ and $C_2$ values. Closest to our setting, [56, 123, 124, 102] bound the approximation error for Lipschitz functions given a number of neural network parameters, but do not consider the storage cost. Storage cost cannot be related to the number of parameters if the magnitudes of the parameters are unbounded, as is the case in [56, 123, 124]. [102] also does not explicitly bound the storage cost, but analyzing their construction yields a bound that, compared to our result, is exponentially worse in ρ and polynomially worse in d.

4.7 Conclusion

We presented the first DQD bound for an ML method answering RAQs. Our DQD bound shows how the error of a neural network relates to the data distribution, the data size and the query function. Based on our DQD bound, we introduced NeuroSketch, a neural network framework for efficiently answering RAQs, with orders of magnitude improvement in query time over the state-of-the-art algorithms. A NeuroSketch trained for a query function is typically much smaller than the data and answers RAQs without accessing the data.
This is beneficial for the efficient release and storage of data. For instance, location data aggregators (e.g., SafeGraph [116]) can train a NeuroSketch to answer the average visit duration query and release it to interested parties instead of the dataset, improving storage, transmission and query processing costs for all parties.

Future work can focus on DQD bounds for high dimensions and on studying the approximation error for specific function classes. Our Lipschitz assumption is very generic (it only assumes a bound on the magnitude of the function's derivative), and can yield a loose bound in high dimensions or for some function classes (e.g., linear functions, which can have large derivative magnitude but are easy to approximate). Additional directions include modeling the impact of the query workload on neural network accuracy, as well as studying parallelism and model pruning methods [15] that remove unimportant model weights for faster evaluation. Support for dynamic data is another interesting future direction: one approach is to frequently test NeuroSketch and re-train the neural networks whose accuracy falls below a certain threshold. We conjecture that DQD can be used to decide how often retraining is required.

4.8 Appendix

4.8.1 Proofs

4.8.1.1 Proof of Theorem 2

To bound the approximation error, we first establish that the memorization is correct at the vertices of all the cells. Then, we ensure that the change of the neural network is bounded within each cell. Since the neural network is exactly accurate at the vertices of the cells and it does not change too much within each cell, its error within each cell is bounded. We present a sequence of lemmas to formally establish this argument, with proofs deferred to Sec. 4.8.1.2.

First, we establish the correct memorization property.

Lemma 2 (Memorization). For all $p \in P$, $|f(p) - \hat{f}(p)| = 0$.

Next, we bound the change of the neural network in the following lemma. For the purpose of the lemma, define $C^*_i = \{x \in \mathbb{R}^d : x_r \in [\frac{\pi^i_r}{t}, \frac{\pi^i_r}{t} + \frac{1}{t}]\ \forall 1 \le r \le d\}$, which is the subset of the input space that falls in the $i$-th cell. Also define $C_i = \{x \in \mathbb{R}^d : x_r \in [\frac{\pi^i_r}{t}, \frac{\pi^i_r}{t} + (\frac{1}{t} - \frac{1}{Mt})]\ \forall 1 \le r \le d\}$, which is a subset of $C^*_i$, and let $C'_i = C^*_i \setminus C_i$. The lemma divides each cell into the two regions $C_i$ and $C'_i$ and bounds the change of the neural network in each region. When $d \le 3$, we are able to prove a tighter bound on the change of the neural network, which helps prove the tighter bound in low dimensions of Theorem 2.

Lemma 3 (Bounded Change). For any $i \in \{0, ..., (t+1)^d - 1\}$:

(a) For all $x \in C_i$, we have $\hat{f}(x) = f(\frac{\pi^i}{t})$.

(b) For all $x \in C_i$, $x' \in C'_i$, and for all $x \in C'_i$, $x' \in C'_i$, we have $|\hat{f}(x) - \hat{f}(x')| \le \frac{2^{d-1}d^3\rho k}{t}$.

(c) If $d \le 3$, for all $x, x' \in C^*_i$, we have $|\hat{f}(x) - \hat{f}(x')| \le \frac{36\rho d}{t}$.

Using the above lemma, together with the $\rho$-Lipschitz property of $f$ and the triangle inequality, integrating over $x$ to obtain the 1-norm, or taking the $\infty$-norm for $d \le 3$, gives the following bound on the error of the neural network.

Lemma 4 (Bounded Error). The error of the neural network is bounded as follows.

(a) $\|\hat{f} - f\|_1 \le \frac{3\rho d}{t}$.

(b) If $d \le 3$, $\|\hat{f} - f\|_\infty \le \frac{37\rho d}{t}$.

Furthermore, we bound the space and time complexity of the neural network as follows.

Lemma 5 (Space and Time Complexity). The number of bits needed to store the parameters of the neural network is $O(kd\log\rho + d\log d + \log k) = \tilde{O}(kd)$, and a forward pass of the neural network requires $O(kd)$ operations.

Theorem 2 follows by setting $\varepsilon_1 = \frac{\kappa\rho d}{t}$ for $\kappa = 37$, and recalling that $k = (t+1)^d$, so that $k = (\kappa\rho d\varepsilon_1^{-1} + 1)^d$.
Thus, the error is bounded by $\varepsilon_1$, and the space and time complexity are $\tilde{O}(d(\kappa\rho d\varepsilon_1^{-1} + 1)^d)$.

4.8.1.2 Proof of Technical Lemmas for Theorem 2

Proof of Memorization Lemma 2. Intuitively, the memorization property follows from the construction of the g-units, as shown in Fig. 4.17. As the figure shows, g-units are non-zero only on a quadrant of the space, whose location can be controlled with the g-unit parameters. This ensures that as the construction iteratively memorizes new points, the neural network does not forget the values of previously memorized points. The proof formalizes this idea.

The following proposition first establishes some properties of the construction.

Proposition 1. Based on the construction in Alg. 1, the following properties hold.

(a) For any $i, j \in \{0, ..., (t+1)^d - 1\}$, we have
$$\hat{g}_j\left(\frac{\pi^i}{t}\right) = \begin{cases}\frac{a_j}{t} & \text{if } \forall r,\ \pi^j_r \le \pi^i_r\\ 0 & \text{otherwise.}\end{cases}$$

(b) At the $i$-th iteration of Alg. 1, we have $b + \sum_{j=1}^{i}\hat{g}_j(\pi^i/t) = f(\pi^i/t)$.

Proof of Prop. 1. To prove part (a), first note that, based on the construction, a g-unit can be written as
$$\hat{g}_j(x) = a_j\sigma\left(\sum_{r=1}^{d} -M\sigma\left(-x_r + \frac{\pi^j_r}{t}\right) + \frac{1}{t}\right). \quad (4.9)$$

Assume that for some $r$ we have $\frac{\pi^j_r}{t} - \frac{\pi^i_r}{t} > 0$, so $\sigma(-\frac{\pi^i_r}{t} + \frac{\pi^j_r}{t}) = -\frac{\pi^i_r}{t} + \frac{\pi^j_r}{t}$. Together with Eq. 4.9 we get
$$\hat{g}_j(x) = a_j\sigma\left(M\left(\frac{\pi^i_r}{t} - \frac{\pi^j_r}{t}\right) + \frac{1}{t} + \sum_{r'=1, r'\neq r}^{d} -M\sigma\left(-\frac{\pi^i_{r'}}{t} + \frac{\pi^j_{r'}}{t}\right)\right).$$
$\pi^i_r$ and $\pi^j_r$ are integers, so $\pi^i_r \le \pi^j_r - 1$, and recall that $M \ge 1$. Thus, $M(\frac{\pi^i_r}{t} - \frac{\pi^j_r}{t}) + \frac{1}{t} \le 1\cdot(\frac{\pi^j_r - 1}{t} - \frac{\pi^j_r}{t}) + \frac{1}{t} = 0$. Given that $\sum_{r'=1, r'\neq r}^{d} -M\sigma(-\frac{\pi^i_{r'}}{t} + \frac{\pi^j_{r'}}{t}) \le 0$, we have $\sum_{r'=1}^{d} -M\sigma(-\frac{\pi^i_{r'}}{t} + \frac{\pi^j_{r'}}{t}) + \frac{1}{t} \le 0$, and thus $\hat{g}_j(\frac{\pi^i}{t}) = 0$. If $\forall r,\ \pi^j_r \le \pi^i_r$, then $\sigma(-\frac{\pi^i_r}{t} + \frac{\pi^j_r}{t}) = 0$ for all $r$, so $\hat{g}_j(\frac{\pi^i}{t}) = \frac{a_j}{t}$.

[Figure 4.17: Function surface of $\hat{g}_i(x)$ for a 2-dimensional $x$]

To prove part (b), by line 6 of the algorithm, $\frac{a_i}{t} = f(\frac{\pi^i}{t}) - (b + \sum_{j=1}^{i-1}\hat{g}_j(\frac{\pi^i}{t}))$. The result follows using part (a).

Next, to prove Lemma 2, by Prop. 1 (b), for any $p \in P$, where $p = \frac{\pi^i}{t}$ for some $i$, the $i$-th iteration of Alg. 1 ensures that $\sum_{j=1}^{i}\hat{g}_j(p) + b = f(p)$. For the $j$-th iteration with $j > i$, we have $\pi^j_r > \pi^i_r$ for some $r$, so by Prop. 1 (a), $\hat{g}_j(p) = 0$. Therefore, $\hat{f}(p) = \sum_{j=1}^{k}\hat{g}_j(p) + b = f(p)$.
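To make the construction tangible, here is a minimal NumPy sketch of a g-unit (Eq. 4.9) and of the iterative memorization that Prop. 1 formalizes; the function names and the choice M = 1 are ours, and Alg. 1's exact bookkeeping may differ.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def g_unit(x, a_j, anchor, M, t):
    # Eq. 4.9 with anchor = pi^j / t: the unit outputs a_j / t on the quadrant
    # x >= anchor (coordinate-wise) and 0 once any coordinate is a grid step below.
    return a_j * relu(np.sum(-M * relu(anchor - np.asarray(x))) + 1.0 / t)

def f_hat(x, a, anchors, b, M, t):
    return b + sum(g_unit(x, aj, p, M, t) for aj, p in zip(a, anchors))

def memorize(f, d, t, M=1.0):
    # Visit grid points in lexicographic order: whenever pi^j <= pi^i coordinate-wise,
    # j precedes i, so setting a_i from the residual at pi^i / t (line 6 of Alg. 1)
    # never disturbs previously memorized points (Prop. 1).
    anchors = [np.array(idx) / t for idx in np.ndindex(*([t + 1] * d))]
    b = f(anchors[0])          # assumes b = f(0), consistent with the size analysis
    a = []
    for p in anchors:
        a.append(t * (f(p) - f_hat(p, a, anchors, b, M, t)))
    return a, anchors, b

# Sanity check: exact memorization at all grid vertices (Lemma 2).
f = lambda x: float(np.sin(x).sum())
a, anchors, b = memorize(f, d=2, t=4)
assert all(abs(f_hat(p, a, anchors, b, 1.0, 4) - f(p)) < 1e-9 for p in anchors)
```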
Proof of Bounded Change Lemma 3. First, we prove the following lemma, which bounds the magnitude of the weights of the neural network.

Lemma 6. For any $i$, $|a_i| \le 2^{d-1}d\rho$.

Proof. For convenience, define $\hat{g}_0(x) = b$. Then, by line 6 of Alg. 1, we have
$$|a_i| = t\left|f\left(\frac{\pi^i}{t}\right) - \sum_{j=0}^{i-1}\hat{g}_j\left(\frac{\pi^i}{t}\right)\right|. \quad (4.10)$$
Consider $\sum_{j=0}^{i-1}\hat{g}_j(\frac{\pi^i}{t})$. By Prop. 1 (a), we have $\sum_{j=0}^{i-1}\hat{g}_j(\frac{\pi^i}{t}) = \sum_{j\in I_i}\frac{a_j}{t}$, where $I_i = \{j \in \mathbb{Z} : 0 \le \pi^j_r \le \pi^i_r\} \setminus \{i\}$. Define $I^r_i = \{j \in \mathbb{Z} : 0 \le \pi^j_{r'} \le \pi^i_{r'}\ \forall r' \neq r,\ 0 \le \pi^j_r \le \pi^i_r - 1\}$. Clearly, $\cup_{r=1}^{d} I^r_i = I_i$. Thus, we use the inclusion-exclusion principle to rewrite $\sum_{j\in I_i}\frac{a_j}{t}$, making sure each term in the sum is present exactly once. We have
$$\sum_{j\in I_i}\frac{a_j}{t} = \sum_{\emptyset\neq S\subseteq\{1,...,d\}}(-1)^{|S|+1}\sum_{j\in\cap_{r\in S}I^r_i}\frac{a_j}{t}.$$
For any $S$, consider the index $j_S$ such that $\pi^{j_S}_r = \pi^i_r$ if $r \notin S$ and $\pi^{j_S}_r = \pi^i_r - 1$ otherwise. Observe that $\cap_{r\in S}I^r_i = \{j \in \mathbb{Z} : 0 \le \pi^j_{r'} \le \pi^i_{r'},\ 0 \le \pi^j_r \le \pi^i_r - 1,\ \forall r \in S, r' \notin S\} = I_{j_S}$, so that $\sum_{j\in\cap_{r\in S}I^r_i}\frac{a_j}{t} = \sum_{j\in I_{j_S}}\frac{a_j}{t}$. Then, by Prop. 1 (a) and Lemma 2, $\sum_{j\in I_{j_S}}\frac{a_j}{t} = \hat{f}(\pi^{j_S}/t) = f(\pi^{j_S}/t)$. Putting this in Eq. 4.10, we get
$$|a_i| = t\left|f\left(\frac{\pi^i}{t}\right) - \sum_{\emptyset\neq S\subseteq\{1,...,d\}}(-1)^{|S|+1}f(\pi^{j_S}/t)\right| = t\left|\sum_{S\subseteq\{1,...,d\}}(-1)^{|S|}f(\pi^{j_S}/t)\right| \le 2^{d-1}d\rho,$$
where the last inequality follows from the $\rho$-Lipschitz property of $f$, the fact that every two points $\pi^{j_S}$ and $\pi^{j_{S'}}$ for $S, S' \subseteq \{1, ..., d\}$ are at most $\frac{d}{t}$ apart, and that there are $2^{d-1}$ positive and $2^{d-1}$ negative terms in the summation.

Next, we provide the following lemma to bound the change of a piece-wise linear function.

Lemma 7. For a piece-wise linear function $\hat{f}$ where the magnitude of the gradient of each piece is bounded by $B$, and for two points $x'$ and $x^*$ in the domain of the function, we have $|\hat{f}(x') - \hat{f}(x^*)| \le B\|x^* - x'\|$.

Proof. Let $h(\alpha) = \hat{f}(\alpha x^* + (1-\alpha)x')$ for $\alpha \in [0, 1]$, so we are interested in $|h(0) - h(1)|$. Let $\alpha_1, ..., \alpha_l$ be the points of non-linearity of $\hat{f}$ on the line $\{x \in \mathbb{R}^d : x = \alpha x^* + (1-\alpha)x',\ 0 \le \alpha \le 1\}$ (i.e., where $\nabla\hat{f}$ does not exist). So $h(\alpha_i) - h(\alpha_{i+1}) = \hat{f}(\alpha_i x^* + (1-\alpha_i)x') - \hat{f}(\alpha_{i+1}x^* + (1-\alpha_{i+1})x') = m_i\cdot(\alpha_i x^* + (1-\alpha_i)x' - (\alpha_{i+1}x^* + (1-\alpha_{i+1})x'))$ for a vector $m_i$, which is the gradient of the $i$-th linear piece of $\hat{f}$. Letting $\alpha_0 = 0$ and $\alpha_{l+1} = 1$, we have $|h(0) - h(1)| = |\sum_{i=0}^{l}h(\alpha_i) - h(\alpha_{i+1})| \le \sum_{i=0}^{l}|h(\alpha_i) - h(\alpha_{i+1})|$. Since $\|m_i\| \le B$ for all $i$, we have
$$\sum_{i=0}^{l}|h(\alpha_i) - h(\alpha_{i+1})| \le B\sum_{i=0}^{l}\|\alpha_i x^* + (1-\alpha_i)x' - (\alpha_{i+1}x^* + (1-\alpha_{i+1})x')\| = B\sum_{i=0}^{l}(\alpha_{i+1} - \alpha_i)\|x^* - x'\| = B\|x' - x^*\|.$$

Finally, we are ready to prove Lemma 3.

Proof of Part (a). For any $i$, we study the behaviour of $\hat{g}_j(x)$ for $x \in C_i$ and all $j$. Note that $x_r$, the $r$-th dimension of $x$, can be written as $x_r = \frac{\pi^i_r}{t} + z_r$, where $0 \le z_r \le \frac{1}{t} - \frac{1}{Mt}$ for all $r$. If $\exists r$ where $x_r < \frac{\pi^j_r}{t}$, then $\frac{\pi^i_r}{t} < \frac{\pi^j_r}{t}$, so that $x_r - \frac{\pi^j_r}{t} \le \frac{\pi^j_r - 1}{t} + \frac{1}{t} - \frac{1}{Mt} - \frac{\pi^j_r}{t} = -\frac{1}{Mt}$. So $M(x_r - \frac{\pi^j_r}{t}) + \frac{1}{t} \le 0$. Given that $\sum_{r'=1, r'\neq r}^{d} -M\sigma(-x_{r'} + \frac{\pi^j_{r'}}{t}) \le 0$, we have $\sum_{r'=1}^{d} -M\sigma(-x_{r'} + \frac{\pi^j_{r'}}{t}) + \frac{1}{t} \le 0$, and thus $\hat{g}_j(x) = 0$. If $\forall r,\ x_r \ge \frac{\pi^j_r}{t}$, then $\sigma(-x_r + \frac{\pi^j_r}{t}) = 0$ for all $r$, so $\hat{g}_j(x) = \frac{a_j}{t}$. Thus, for all $j$, $\hat{g}_j(x)$ is constant for $x \in C_i$, which implies the neural network is constant on $C_i$. Given that $\hat{f}(\frac{\pi^i}{t}) = f(\frac{\pi^i}{t})$ and that $\frac{\pi^i}{t} \in C_i$, we have $\hat{f}(x) = f(\frac{\pi^i}{t})$ for all $x \in C_i$.

Proof of Part (b). Note that $\hat{f}$ is a piece-wise linear function. If $x \in C'_i$, let $x^* = x$. Otherwise, let $x^*$ be the closest point in $C_i$ to $x'$. Since $\hat{f}$ is constant in $C_i$, $|\hat{f}(x) - \hat{f}(x')| = |\hat{f}(x^*) - \hat{f}(x')|$, so we only need to prove the result for $x^*$. Note that $\|x^* - x'\| \le \frac{d}{Mt}$. Using Lemma 7, we have
$$|\hat{f}(x) - \hat{f}(x')| \le B\frac{d}{Mt}. \quad (4.11)$$
It remains to find $B$, the bound on the magnitude of the gradient of $\hat{f}$ over all linear pieces. The derivative in every direction is bounded by $\sum_{j=1}^{k}M|a_j|$, so that the gradient norm is at most
$$B \le d\sum_{j=1}^{k}M|a_j|. \quad (4.12)$$
Combining Eq. 4.11 with Eq. 4.12 and Lemma 6, we obtain
$$|\hat{f}(x) - \hat{f}(x')| \le d\sum_{j=1}^{k}M|a_j|\cdot\frac{d}{Mt} \le \frac{2^{d-1}d^3\rho k}{t},$$
which completes the proof.

Proof of Part (c). For ease of discussion, we first provide this elementary lemma, used to bound the derivative of a linear function.

Lemma 8. Consider a linear function $f : \mathbb{R}^d \to \mathbb{R}$ and two points $x, x' \in \mathbb{R}^d$ where $x' = x + hx^i$ for $h > 0$ and $x^i \in \mathbb{R}^d$, where $x^i_j = 1$ for $i = j$ and $x^i_j = 0$ otherwise. The derivative of $f$ in the direction of $x^i$ is $\frac{|f(x) - f(x')|}{h}$.

Proof. Follows trivially from the definition of the derivative and the linearity of $f$.

To prove part (c), we set $M = 1$.
We say $f(x)$ is linear at $x$ in the direction of $u$ if there exist an $\varepsilon > 0$ and constants $a$ and $b$ such that for all $0 < \varepsilon' < \varepsilon$, $f(x + \varepsilon'u) = ax + b$. As before, we say $f(x)$ is linear at $x$ if it is linear at $x$ in all directions, and that $x$ is a point of non-linearity if $f(x)$ is not linear at $x$. Observe that non-linearities happen only when the input to a ReLU unit is zero. We first study the non-linearities created by the $i$-th g-unit, for any $i \in \{0, ..., t^d - 1\}$. The first-layer ReLU units create non-linearities where $x_r = \frac{\pi^i_r}{t}$. The second-layer ReLU units create non-linearities where $\sum_{r=1}^{d} -\sigma(-x_r + \frac{\pi^i_r}{t}) + \frac{1}{t} = 0$. Thus, the non-linearities are where $x \in \{x : \sum_{r\in S} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} = 0,\ \emptyset \neq S \subseteq \{1, ..., d\}\}$. Hence, the set of all non-linearities of the neural network is $\{x : \sum_{r\in S} x_r = \frac{1}{t}(\sum_{r\in S}\pi^i_r - 1),\ \emptyset \neq S \subseteq \{1, ..., d\},\ 0 \le i \le t^d - 1\}$.

Consider a cell created by the first-layer non-linearities with its maximum corner at $\frac{\pi^i}{t}$. For any $S$ and the $i$-th and $j$-th g-units with $j \neq i$, observe that the hyperplanes $\sum_{r\in S} x_r = \frac{1}{t}(\sum_{r\in S}\pi^i_r - 1)$ and $\sum_{r\in S} x_r = \frac{1}{t}(\sum_{r\in S}\pi^j_r - 1)$ are parallel. Specifically, they either define the same hyperplane, if $\pi^j_r = \pi^i_r\ \forall r \in S$, or they are at least $\frac{1}{t}$ apart. Thus, consider the uniform partitioning of the space into cells of width $\frac{1}{t}$ done in the construction of the neural network. The hyperplanes passing through the $i$-th cell are $\{x : \sum_{r\in S} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} = 0,\ \emptyset \neq S \subseteq \{1, ..., d\}\}$. These points of non-linearity partition each cell into linear pieces. We consider the $i$-th cell and bound the error for each linear piece. Furthermore, if $|S| = 1$, the points of non-linearity overlap the borders of the cell, which are the same points of non-linearity as those of the first-layer ReLU units. So we consider $\{x : \sum_{r\in S} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} = 0,\ S \subseteq \{1, ..., d\},\ |S| \ge 2\}$.

The case of d = 2. There is only one non-linearity hyperplane, $\sum_{r=1}^{2} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} = 0$, in each cell. Thus, each cell consists of two linear pieces, $L_1 = \{x : \sum_{r=1}^{2} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} \ge 0\}$ and $L_2 = \{x : \sum_{r=1}^{2} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} \le 0\}$. For a set $I$ of integers, we define $\pi^{i,I}$ as the vector such that $\pi^{i,I}_r = \pi^i_r - 1$ for $r \in I$ and $\pi^{i,I}_r = \pi^i_r$ otherwise. Thus, the set $C = \{\frac{\pi^{i,I}}{t}, I \subseteq \{1, 2\}\}$ is the set of all four corners of the $i$-th cell. By the memorization property of the construction, $\forall p \in C$, $f(p) = \hat{f}(p)$. We define $I_1 = \{1\}$, $I_2 = \{2\}$ and $I_3 = \{1, 2\}$. Since $\hat{f}$ is linear in $L_i$, we can write $\hat{f}(x) = m_i\cdot x + b_i$ for $x \in L_i$. To bound $\|m_1\|$, we apply Lemma 8 to the points $\frac{\pi^i}{t}$ and $\frac{\pi^{i,I_1}}{t}$, and again to $\frac{\pi^i}{t}$ and $\frac{\pi^{i,I_2}}{t}$. Since the memorization is exact at these points, the change in $\hat{f}$ is at most $\frac{\rho}{t}$ for every pair of points, and each pair is $\frac{1}{t}$ apart, so $\|m_1\| \le 2\rho$. Replacing $\pi^i$ with $\pi^{i,I_3}$ and repeating the same argument, we also bound $\|m_2\|$ by $2\rho$. This bounds the gradient for each linear piece. Thus, using Lemma 7 with $B = 2\rho$, and observing that every pair of points in a cell is at most $\frac{2}{t}$ apart, we obtain $|\hat{f}(x) - \hat{f}(x')| \le \frac{4\rho}{t}$, which proves the lemma in this case.

[Figure 4.18: Non-linearities in 3 dimensions. The figure shows the input space partitioned by the hyperplanes corresponding to points of non-linearity.]

The case of d = 3. The argument is similar to d = 2, but now there are four hyperplanes partitioning a cell: three corresponding to $|S| = 2$ and one to $|S| = 3$. Let $S_1 = \{2, 3\}$, $S_2 = \{1, 3\}$, $S_3 = \{1, 2\}$ and $S^* = \{1, 2, 3\}$.
For any such set $S$, we define $H_S = \{x : \sum_{r\in S} x_r - \frac{\pi^i_r}{t} + \frac{1}{t} = 0\}$, $H^+_S$ as the set of points above $H_S$, and $H^-_S$ as the set of points below $H_S$. Together with the cell boundaries, the cell is partitioned into polytopes, where within each polytope $\hat{f}$ is a linear function. We bound the error of each of the linear pieces one by one. Note that the polytopes can be defined by which side of each hyperplane they fall on. Fig. 4.18 shows how the $i$-th cell is partitioned based on the hyperplanes discussed above.

(1) The linear piece below all the hyperplanes, $\cap_S H^-_S$, contains the points $\frac{\pi^{i,I}}{t}$ for $I \subseteq \{1, ..., 3\}$ and $|I| \ge 2$. Thus, applying Lemma 8 three times, we bound the derivative in each direction by $\rho$.

(2) For any $j$, $1 \le j \le 3$, let $\mathcal{S} = \{S^*\} \cup \{S_z : 1 \le z \le 3, z \neq j\}$ and consider the polytope $C = (\cap_{S\in\mathcal{S}} H^-_S) \cap H^+_{S_j}$. Note that the derivative of the function in this linear piece w.r.t. $x_j$ is the same as in case (1), because $H_{S_j}$ does not depend on $x_j$. To bound the derivative w.r.t. $x_z$ for $z \neq j$, without loss of generality assume $j = 1$ and observe that $\frac{\pi^{i,I}}{t} \in C$ for $I = \{1, 3\}, \{1, 2\}$ and $\{1\}$. So applying Lemma 8 twice bounds the derivatives w.r.t. $x_2$ and $x_3$ by $\rho$.

(3) For any $j$, $1 \le j \le 3$, let $\mathcal{S} = \{S_z : 1 \le z \le 3, z \neq j\}$ and consider the polytope $C = (\cap_{S\in\{S^*, S_j\}} H^-_S) \cap (\cap_{S\in\mathcal{S}} H^+_S)$. Without loss of generality, assume $j = 1$. Now, the derivative w.r.t. $x_2$ is the same as in $(\cap_{S\in\{S^*, S_1, S_2\}} H^-_S) \cap (H^+_{S_2})$, and the derivative w.r.t. $x_3$ is the same as in $C_3 = (\cap_{S\in\{S^*, S_1, S_3\}} H^-_S) \cap (H^+_{S_2})$. Hence, we only need to bound the derivative w.r.t. $x_1$. Consider a point $p$ on the hyperplane $H_{S_3}$ and take the derivative in the direction of $u = (1/\sqrt{2}, -1/\sqrt{2}, 0)$, written as $D^p_u$. Note that $D^p_u$ is defined because $\hat{f}$ is linear at $p$ in the direction of $u$, since $p + \varepsilon u \in H_{S_3}$ for small enough positive $\varepsilon$. Furthermore, since $p \in C$ and $p \in C_3$, and both $C$ and $C_3$ are linear pieces, for any point $p' \in C \cup C_3$, $D^{p'}_u = D^p_u$. This shows that the directional derivative in the direction of $u$ is the same for all points in both $C$ and $C_3$. Thus, bounding $D^p_u$ with the gradient of $\hat{f}$ in $C_3$, we get $D^p_u \le 3\rho$. At the same time, for points $p'$ in $C$ we can write $|D^{p'}_u| = |\nabla_{p'}\cdot u| \le \frac{1}{\sqrt{2}}|\partial_{x_1} - \partial_{x_2}|$. Therefore, $\frac{1}{\sqrt{2}}||\partial_{x_1}| - |\partial_{x_2}|| \le |\partial_{x_1} - \partial_{x_2}| \le 3\rho$, so that $|\partial_{x_1}| \le 3\rho + |\partial_{x_2}|$. Given that $|\partial_{x_2}| \le \rho$, the derivative w.r.t. $x_1$ is at most $4\rho$.

(4) Let $\mathcal{S} = \{S_1, S_2, S_3\}$ and consider the polytope $(\cap_{S\in\mathcal{S}} H^+_S) \cap H^-_{S^*}$. The derivative w.r.t. $x_1$ is the same as in $(\cap_{S\in\{S^*, S_1\}} H^-_S) \cap (\cap_{S\in\{S_2, S_3\}} H^+_S)$, and the derivatives w.r.t. $x_2$ and $x_3$ can similarly be calculated based on previously bounded derivatives.

(5) Finally, note that when $x \in H^+_{S^*}$, we have $x \in H^+_{S_i}$ for $1 \le i \le 3$, and thus all cases are considered. In this final case, the polytope contains the points $\frac{\pi^{i,I}}{t}$ for $I \subseteq \{1, ..., 3\}$ and $|I| \le 2$, so the gradient is bounded by $3\rho$ by applying Lemma 8 three times.

Putting all the cases together, the magnitude of the gradient is at most $12\rho$. Thus, using Lemma 7 with $B = 12\rho$ and observing that every pair of points in a cell is at most $\frac{3}{t}$ apart, we obtain $|\hat{f}(x) - \hat{f}(x')| \le \frac{36\rho}{t}$, which proves the lemma in this case.

Proof of Bounded Error Lemma 4. Lemma 4 (b) directly follows from Lemma 3 (c): for $x \in C^*_i$ for any $i$,
$$|\hat{f}(x) - f(x)| = |\hat{f}(x) - \hat{f}(\tfrac{\pi^i}{t}) - (f(x) - \hat{f}(\tfrac{\pi^i}{t}))| \le |\hat{f}(x) - \hat{f}(\tfrac{\pi^i}{t})| + |f(x) - f(\tfrac{\pi^i}{t})| \le \frac{36\rho d}{t} + \frac{\rho d}{t}.$$
Next, we prove Lemma 4 (a). By Lemma 3 (a) and (b), $\hat{f}$ is constant on each $C_i$, while its change on each $C'_i$ is bounded.
We bound the error separately for each part of the space. In the constant region, that is, $x \in C_i$ for any $i$, by Lemma 3 (a), $\hat{f}(x) = f(\frac{\pi^i}{t})$, so
$$|\hat{f}(x) - f(x)| = |f(\tfrac{\pi^i}{t}) - f(x)| \le \rho\|x - \tfrac{\pi^i}{t}\| \le \frac{\rho d}{t}.$$
Next, consider an $x \in C'_i$ for any $i$, and let $x^*$ be the closest point in $C_i$ to $x$. We have
$$|\hat{f}(x) - f(x)| = |\hat{f}(x) - \hat{f}(x^*) - (f(x) - \hat{f}(x^*))| \le |\hat{f}(x) - \hat{f}(x^*)| + |f(x) - \hat{f}(x^*)|.$$
First, consider $|f(x) - \hat{f}(x^*)|$. We have
$$|f(x) - \hat{f}(x^*)| = |f(x) - f(x^*) - (\hat{f}(x^*) - f(x^*))| \le |f(x) - f(x^*)| + |\hat{f}(x^*) - f(x^*)| \le \frac{2\rho d}{t}.$$
Moreover, $|\hat{f}(x) - \hat{f}(x^*)| \le \frac{2^{d-1}d^3\rho k}{t}$ by Lemma 3 (b), so that
$$|f(x) - \hat{f}(x)| \le \frac{2^{d-1}d^3\rho k}{t} + \frac{2\rho d}{t} = \frac{\rho d}{t}(2^{d-1}d^2 k + 2).$$
Thus, the 1-norm error is
$$\int_q|f(q) - \hat{f}(q)| = \int_{q\in\cup_i C_i}|f(q) - \hat{f}(q)| + \int_{q\in\cup_i C'_i}|f(q) - \hat{f}(q)| \le \frac{\rho d}{t}\left(1 - \frac{1}{M}\right)^d + \frac{\rho d}{t}(2^{d-1}d^2 k + 2)\left(1 - \left(1 - \frac{1}{M}\right)^d\right).$$
Finally, we set $M$ so that $2^{d-1}d^2 k(1 - (1 - \frac{1}{M})^d) = 1$ and thus
$$M = \frac{1}{1 - (1 - \frac{1}{2^{d-1}d^2 k})^{1/d}},$$
which yields $\int_q|f(q) - \hat{f}(q)| \le \frac{\rho d}{t}((1 - \frac{1}{M})^d + 1 + 2(1 - (1 - \frac{1}{M})^d)) \le \frac{3\rho d}{t}$.

Proof of Space/Time Complexity Lemma 5. The number of operations for a forward pass is proportional to the number of neural network parameters, which is $O(kd)$. Next, we study the space complexity. Note that we only need to store $a_i$ for $1 \le i \le k$, $b$ and $M$. Assuming a number $C$ can be stored in $O(\log C)$ bits and using Lemma 6 to bound the magnitude of $a_i$, the total space consumption is $k\log(2^{d-1}d\rho) + \log(M) + kd + \log(f(0)) = O(kd\log\rho + \log M)$. To study $\log M$, note that $2^{d-1}d^2 k \le (d^2t)^d$ for $d \ge 2$. So we study
$$\frac{1}{1 - (1 - (\frac{1}{td^2})^d)^{1/d}} = \frac{1}{1 - \left(\frac{(td^2)^d - 1}{(td^2)^d}\right)^{1/d}} = \frac{td^2}{td^2 - ((td^2)^d - 1)^{1/d}}.$$
So $\log M \le \log(td^2) + \log\left(\frac{1}{td^2 - ((td^2)^d - 1)^{1/d}}\right)$. Next, for ease of notation, we consider $\frac{1}{x - (x^d - 1)^{1/d}}$ for $x = td^2$. Assume $d = 2^s$ for an integer $s$ (or otherwise increase $d$ by a constant factor so that it can be written as a power of 2). By repeated multiplication of the numerator and denominator, we have
$$\frac{1}{x - (x^d-1)^{1/d}} = \frac{x + (x^d-1)^{1/d}}{x^2 - (x^d-1)^{2/d}} = \frac{(x^2 + (x^d-1)^{2/d})(x + (x^d-1)^{1/d})}{x^4 - (x^d-1)^{4/d}} = \cdots = \prod_{i=0}^{s-1}\left(x^{2^i} + (x^d-1)^{2^i/d}\right),$$
since the denominator telescopes to $x^d - (x^d - 1) = 1$. Taking the log, we obtain
$$\log\left(\frac{1}{x - (x^d-1)^{1/d}}\right) = \sum_{i=0}^{s-1}\log(x^{2^i} + (x^d-1)^{2^i/d}) \le \sum_{i=0}^{s-1}\log(2x^{2^i}) = \log x\sum_{i=0}^{s-1}2^i + s\log 2 \le 2^s\log x + s\log 2 = \log(d)\log(2) + d\log(d^2t).$$
So $\log M \le O(d\log(dt)) = O(d\log d + \log k)$. Thus, the total size is $O(kd\log\rho + d\log d + \log k) = \tilde{O}(kd)$.

4.8.1.3 Proof of Theorem 3

Consider a query with the COUNT aggregation function. Define the indicator function
$$h^C_{c,r}(p) = \begin{cases}1 & \text{if } \forall i,\ c_i \le p_i < c_i + r_i\\ 0 & \text{otherwise.}\end{cases}$$
We can write $f_D(c, r) = \sum_{p\in D}h^C_{c,r}(p)$, and $f_\chi(c, r) = n\mathbb{E}_{p\sim\chi}[h^C_{c,r}(p)]$. Thus, to study the error $\frac{1}{n}|f_D(c, r) - f_\chi(c, r)|$, we consider
$$\sup_{c,r}\left|\frac{1}{n}\sum_{p\in D}h^C_{c,r}(p) - \mathbb{E}_{p\sim\chi}[h^C_{c,r}(p)]\right|.$$
We define the class of functions $\mathcal{H}^C = \{h^C_{c,r}, \forall c, r\}$ and rewrite the above expression as
$$\sup_{h\in\mathcal{H}^C}\left|\frac{1}{n}\sum_{p\in D}h(p) - \mathbb{E}_{p\sim\chi}[h(p)]\right|. \quad (4.13)$$
Now, we can bound the above error in terms of properties of $\mathcal{H}^C$. Observe that we can repeat the procedure for the SUM aggregation function.
Assume we would like to take the sum of the attribute at location $*$, and define
$$h^S_{c,r}(p) = \begin{cases}p_* & \text{if } \forall i,\ c_i \le p_i < c_i + r_i\\ 0 & \text{otherwise.}\end{cases}$$
Observe that $f_D(c, r) = \sum_{p\in D}h^S_{c,r}(p)$, and define $\mathcal{H}^S = \{h^S_{c,r}, \forall c, r\}$. Thus, we can similarly write the error for the SUM aggregation function as in Eq. 4.13 by replacing $\mathcal{H}^C$ with $\mathcal{H}^S$. Note that $\mathcal{H}^C$ and $\mathcal{H}^S$ depend both on the aggregation function and on the range predicates. Next, we present some definitions and results from VC theory that allow us to provide the required bounds.

Definition 1 (Pseudo-shattering [11]). Let $I$ be a countable subset of $[0, 1]^d$. $I$ is said to be pseudo-shattered by $\mathcal{H}$ if for some function $g : I \to \mathbb{R}$, for every $J \subseteq I$ there exists $h_J \in \mathcal{H}$ such that $h_J(x) \le g(x)$ for $x \in J$ and $h_J(x) > g(x)$ for $x \in I \setminus J$.

Definition 2 (Pseudo-dimension [11]). The pseudo-dimension of $\mathcal{H}$ is defined as $vc(\mathcal{H}) = \sup\{|I| : I \text{ is pseudo-shattered by } \mathcal{H}\}$.

Theorem 4 (VC-Theorem [11]). For a class of functions $\mathcal{H}$, where $h : \mathbb{R}^d \to [0, 1]$ for all $h \in \mathcal{H}$, and a set $D$ consisting of $n$ i.i.d. samples from a distribution $\chi$,
$$P\left[\sup_{h\in\mathcal{H}}\left|\frac{1}{n}\sum_{p\in D}h(p) - \mathbb{E}_{p\sim\chi}h(p)\right| \ge \varepsilon\right] \le 8e^{d}(32e/\varepsilon)^{d}e^{-\frac{\varepsilon^2 n}{32}},$$
where $d = vc(\mathcal{H})$.

We are interested in bounding Eq. 4.13, which can readily be done using the above VC-Theorem after finding $vc(\mathcal{H})$. This is done in the following lemma.

Lemma 9. For $\mathcal{H}^S$ and $\mathcal{H}^C$ defined as above, $vc(\mathcal{H}^S) \le 2d$ and $vc(\mathcal{H}^C) \le 2d$.

Proof. We note that $\mathcal{H}^C$ is the class of axis-parallel rectangle classifiers, whose VC-dimension is well known to be $2d$ [121]. Our proof below uses a similar but slightly more general argument to account for both $\mathcal{H}^C$ and $\mathcal{H}^S$. We show that no set of size $2d + 1$ can be pseudo-shattered by $\mathcal{H}^S$. Let $I = \{p^1, ..., p^{2d+1}\}$. First, note that if $p^i_* = 0$ for some $i$, the set cannot be pseudo-shattered. To see this, consider $J_2 = \{p^i\}$ and $J_1 = I \setminus J_2$. For any $h \in \mathcal{H}^S$, $h(p^i) = 0$. Now, for some $g$, we need $h^{J_1}(p^i) > g(p^i)$ and $h^{J_2}(p^i) \le g(p^i)$, implying $h^{J_1}(p^i) > h^{J_2}(p^i)$, which is a contradiction because $h^{J_1}(p^i) = h^{J_2}(p^i) = 0$.

Define $S = \{p : \exists i,\ p_i = \min_{p'\in I}p'_i \text{ or } p_i = \max_{p'\in I}p'_i\}$. Note that $1 \le |S| \le 2d$. For the purpose of contradiction, assume that there exists some $g$ that satisfies the conditions of Def. 1, specifically that the conditions are satisfied for $J_1 = S$ and $J_2 = I \setminus S$ simultaneously. Note that by definition, $h(p)$ is either zero or $p_*$ for $h \in \mathcal{H}^S$. Since $|S| \le 2d$, $|J_2 \cap I| \ge 1$, so let $p' \in J_2 \cap I$. We have $h^{J_1}(p') > g(p')$ and $h^{J_2}(p') \le g(p')$, so that $h^{J_1}(p') > h^{J_2}(p')$. Since $0 < p'_* \le 1$ (and, specifically, $p'_*$ is positive), the only solution to the inequality is $h^{J_1}(p') = p'_*$ and $h^{J_2}(p') = 0$. A similar argument for all $p \in J_1$ shows that $h^{J_1}(p) = 0$ and $h^{J_2}(p) = p_*$. Now, since $h^{J_2}(p) = p_*$ holds $\forall p \in J_1$, it must be that $h^{J_2}(p') = p'_*$ (this is because if a range predicate contains all the points in $S$, it must contain all the points in $I$). However, this contradicts $h^{J_2}(p') = 0$, which completes the proof for $vc(\mathcal{H}^S)$. To bound $vc(\mathcal{H}^C)$, repeat the same argument with $p_* = 1$.

Theorem 3 follows directly from the above lemma and the VC-Theorem.

4.8.1.4 Proof of Lemma 1

Let $f^C_D(q) = f^C_\chi(q) + \varepsilon^q_c$ and $f^S_D(q) = f^S_\chi(q) + \varepsilon^q_s$.
Then, for any $q$,
$$|\bar{f}^A_\chi(q) - f^A_D(q)| = \left|\frac{f^S_\chi(q)}{f^C_\chi(q)} - \frac{f^S_\chi(q) + \varepsilon^q_s}{f^C_\chi(q) + \varepsilon^q_c}\right| = \left|\frac{\varepsilon^q_c f^S_\chi(q) - \varepsilon^q_s f^C_\chi(q)}{f^C_\chi(q)(f^C_\chi(q) + \varepsilon^q_c)}\right| \le \left|\frac{f^S_\chi(q)}{f^C_\chi(q)}\right|\left|\frac{\varepsilon^q_c}{f^C_\chi(q) + \varepsilon^q_c}\right| + \left|\frac{\varepsilon^q_s}{f^C_\chi(q) + \varepsilon^q_c}\right| = \left|\frac{f^S_\chi(q)}{f^C_\chi(q)}\right|\left|\frac{\varepsilon^q_c}{f^C_D(q)}\right| + \left|\frac{\varepsilon^q_s}{f^C_D(q)}\right|.$$
For any $\varepsilon$, by Theorem 3 and the union bound, $P[\sup_q|\varepsilon^q_c| \ge \varepsilon \text{ or } \sup_q|\varepsilon^q_s| \ge \varepsilon] \le 16e^d(32e/\varepsilon)^d e^{-\frac{\varepsilon^2 n}{32}}$. Define $Q_\xi = \{q : f^C_\chi(q) \ge \xi\}$. Note that the event $A = \{\forall q \in Q_\xi,\ |\varepsilon^q_c| < \varepsilon \text{ and } |\varepsilon^q_s| < \varepsilon\}$ implies $\forall q \in Q_\xi,\ f^C_D(q) > \xi - \varepsilon$, and thus the event $\{\forall q \in Q_\xi,\ |\frac{\varepsilon^q_c}{f^C_D(q)}| < \frac{\varepsilon}{\xi-\varepsilon} \text{ and } |\frac{\varepsilon^q_s}{f^C_D(q)}| < \frac{\varepsilon}{\xi-\varepsilon}\}$. Therefore, event $A$ implies the event $B = \{\forall q \in Q_\xi,\ |\frac{f^S_\chi(q)}{f^C_\chi(q)}||\frac{\varepsilon^q_c}{f^C_D(q)}| + |\frac{\varepsilon^q_s}{f^C_D(q)}| < |\frac{f^S_\chi(q)}{f^C_\chi(q)}|\frac{\varepsilon}{\xi-\varepsilon} + \frac{\varepsilon}{\xi-\varepsilon}\}$. So $P[A] \le P[B]$. Considering the complements of events $A$ and $B$, we obtain
$$P\left[\exists q \in Q_\xi : \left|\frac{f^S_\chi(q)}{f^C_\chi(q)}\right|\left|\frac{\varepsilon^q_c}{f^C_D(q)}\right| + \left|\frac{\varepsilon^q_s}{f^C_D(q)}\right| \ge \left|\frac{f^S_\chi(q)}{f^C_\chi(q)}\right|\frac{\varepsilon}{\xi-\varepsilon} + \frac{\varepsilon}{\xi-\varepsilon}\right] \le 16e^d(32e/\varepsilon)^d e^{-\frac{\varepsilon^2 n}{32}}.$$
Therefore,
$$P\left[\sup_{q\in Q_\xi}\frac{|\bar{f}^A_\chi(q) - f^A_D(q)|}{|\bar{f}^A_\chi(q)| + 1} \ge \frac{\varepsilon}{\xi - \varepsilon}\right] \le 16e^d(32e/\varepsilon)^d e^{-\frac{\varepsilon^2 n}{32}}$$
and
$$P\left[\sup_{q\in Q_\xi}\frac{|\bar{f}^A_\chi(q) - f^A_D(q)|}{|\bar{f}^A_\chi(q)| + 1} \ge \varepsilon\right] \le 16e^d\left(32e\frac{1+\varepsilon}{\xi\varepsilon}\right)^d e^{-\frac{(\xi\varepsilon)^2 n}{(1+\varepsilon)^2\cdot 32}}.$$

4.8.2 Utilizing the Construction in Practice

We study the benefits of using the theoretical construction of Sec. 4.3.2.2 in practice. We consider two variations. The first, referred to as CS, uses the construction exactly as in Sec. 4.3.2.2. The second, referred to as CS+SGD, uses the construction as an initialization for the SGD algorithm: we first construct the neural network and then further optimize its parameters using SGD. This replaces line 1 of Alg. 4 with a call to Alg. 1 to initialize the parameters.

[Figure 4.19: Construction vs. SGD]

Fig. 4.19 shows how the above two algorithms compare with training fully connected neural networks of different depths. Lines labeled FNN+SGD (x) refer to a randomly initialized fully connected neural network (FNN) of depth x trained with SGD. The number of parameters per model is fixed for each setting, so that as depth increases, the width of the FNNs decreases. We consider 2- and 4-dimensional queries in this experiment. The 2-dimensional query asks for the average visit duration with the range fixed to 0.2; thus, the query function takes only latitude and longitude as inputs and outputs the average visit duration. The 4-dimensional query is the usual query of average visit duration, where the query function takes the minimum and maximum latitude and longitude as its 4 inputs and outputs the average visit duration. None of the algorithms use partitioning. Fig. 4.19 shows that for the 2-dimensional query, CS+SGD performs better than all the other architectures, while CS's accuracy is close to that of the FNNs. However, for the 4-dimensional query, CS is much worse than the FNNs, and although CS+SGD performs similarly to the FNNs, it is always outperformed by them. This shows that for low-dimensional queries, CS can be useful in practice as an initialization for SGD.
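As a minimal illustration of CS+SGD, the sketch below fine-tunes only the output weights $a$ and bias $b$ of the constructed network; in these parameters the prediction is linear, so plain SGD on the squared loss has closed-form gradients. The full CS+SGD above optimizes all network parameters, and all names here are our own.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def unit_features(x, anchors, M, t):
    # u_j(x) = relu(sum_r -M * relu(pi_j_r/t - x_r) + 1/t); f_hat(x) = b + a . u(x)
    return np.array([relu(np.sum(-M * relu(p - x)) + 1.0 / t) for p in anchors])

def finetune(a, b, X, y, anchors, M=1.0, t=4, lr=0.1, epochs=50):
    # SGD on (f_hat(x) - y)^2 / 2; the gradients are err * u(x) w.r.t. a and err w.r.t. b.
    a = np.asarray(a, dtype=float).copy()
    for _ in range(epochs):
        for x, target in zip(X, y):
            u = unit_features(x, anchors, M, t)
            err = (b + a @ u) - target
            a -= lr * err * u
            b -= lr * err
    return a, b
```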
Chapter 5

A Neural Database for Differentially Private Spatial Range Queries

5.1 Introduction

[Figure 5.1: Spatial Neural Histogram System — mobile users' location updates reach a trusted data aggregator; Stage 1 (data collection, with an ε-DP mechanism and data augmentation) and Stage 2 (training, with ParamSelect over public datasets) produce trained neural networks published across the trust barrier for research and business use]

Mobile apps collect large amounts of individual location data, used to optimize traffic, study disease spread, or improve point-of-interest placement. When using such data, preserving location privacy is essential, since even aggregate statistics can leak details about individual whereabouts. Existing solutions publish a noisy version of the dataset, transformed according to differential privacy (DP) [35], the de facto standard for releasing statistical data. The goal of DP mechanisms is to ensure privacy while keeping the query answers as accurate as possible. For spatial data, range queries are the most popular query type, used as building blocks in most processing tasks.

A DP-compliant representation of a spatial dataset is created by partitioning the data domain into bins and then publishing a histogram with the noisy count of points that fall within each bin. Domain partitioning is commonly adopted [67, 108, 167, 47, 145, 28], e.g., uniform and adaptive grids [108] or hierarchical partitioning [167, 28]. At query time, the noisy histogram is used to compute answers by considering the counts of all bins that overlap the query. When a query partially overlaps a bin, the uniformity assumption is used to estimate what fraction of the bin's count should be added to the answer: since DP mechanisms release only the (noisy) count for each bin, it is assumed that data points are distributed uniformly within the partition, hence the estimate is calculated as the product of the bin count and the ratio of the overlapping area to the total area of the bin. This is often a poor estimate, since location datasets tend to be highly skewed in space (e.g., a shopping mall in a suburb increases mobile user density in an otherwise sparse region). Thus, in addition to DP sanitization noise, uniformity error is a major cause of inaccuracy for existing work on DP release of spatial data.

We propose a paradigm shift towards learned representations of data, which have been shown to accurately capture data distributions in non-private approximate query processing [75, 51, 162]. Such results show that learning exploits data patterns to represent the data accurately and compactly. As such, learning can be used to combat data modelling errors, which are also present in the DP setting. Nonetheless, due to the impact of DP noise on the learning process, creating learned differentially private data representations is non-trivial.

Recent attempts at creating learned DP data representations [166, 80] propose the use of learned models to answer queries in non-spatial domains (e.g., categorical data). While these approaches perform well for categorical data, they cannot model the intrinsic properties of location datasets, which exhibit both high skewness and strong correlation among regions with similar designations. For instance, two busy city areas (e.g., a stadium and a street bar area) will exhibit similar density patterns, while the regions in between may be sparse. These busy areas may also be correlated, since people are likely to congregate at bars after they see a game at the stadium. Models with strong representational power in the continuous domain are necessary to learn such patterns. Meanwhile, training complex models while preserving DP is difficult.
For neural networks, existing techniques [1] utilize gradient perturbation to train differentially private models. However, the sensitivity of this process, defined as the influence a single input record may have on the output (see Section 5.2 for a formal definition), is high. DP-added noise is proportional to sensitivity, and as a result meaningful information encoded in the gradients is obliterated. The learning process has to be carefully crafted to the unique properties of spatial data, or accuracy will deteriorate.

We propose Spatial Neural Histograms (SNH), a neural network system specifically designed to answer differentially private spatial range queries. SNH models range queries as a function approximation task, where we learn a function approximator that takes as input a spatial range and outputs the number of points that fall within that range. Training SNH consists of two stages (Figure 5.1): the first perturbs training query answers according to DP, while the second trains neural networks from the noisy answers. The first stage, called data collection, prepares a differentially private training set for our model while ensuring low sensitivity, such that the signal-to-noise ratio is good. However, due to the privacy constraints imposed by DP, we can only collect a limited amount of training data. Thus, in the second stage, we synthesize more training samples based on the collected data to boost learning accuracy, in a step called data augmentation. Then, we employ a supervised learning process with a carefully selected set of training samples comprising spatial ranges and their answers. SNH learns from training queries at varying granularities and placements to capture subtle correlations present within the data. Finally, an extensive private parameter tuning process (ParamSelect) is performed using publicly available data, without the need to consume valuable privacy budget. The fully trained SNH can then be released publicly, and answering a query requires only a single forward pass, making it highly efficient at runtime.

SNH is able to learn complex density variation patterns that are specific to spatial datasets, and it reduces the negative impact of noise and of the uniformity assumption when answering range queries, significantly boosting accuracy. The use of machine learning when answering test queries (i.e., at runtime) is beneficial because, through learning, SNH combines evidence from multiple training queries over distinct regions. In fact, gradient computation during training can be seen as a novel means of aggregating information across the space. We show that neural networks can learn the underlying patterns in location data from imprecise observations (e.g., observations collected with noise and uniformity error), and use those patterns to answer queries accurately, thereby mitigating noise and uniformity errors. In contrast, existing approaches are limited to using imprecise local information only (i.e., within a single bin): when the noise introduced by differential privacy or the error caused by the uniformity assumption is large for a particular bin, the answers to queries evaluated using that bin will be inaccurate.

Contributions and organization. In this chapter, we:

• Formulate the problem of answering spatial range count queries as a function approximation task (Sec. 5.2);

• Propose a novel system that leverages neural networks to represent spatial datasets while accurately capturing location-specific density and correlation patterns (Sec. 5.3, 5.4);
• Introduce a comprehensive framework for tuning system parameters on public data (Sec. 5.5); and

• Conduct an extensive experimental evaluation on a broad array of public and private real-world location datasets with heterogeneous properties, showing that SNH outperforms all the state-of-the-art solutions (Sec. 5.6).

We survey related work in Section 5.7 and conclude in Section 5.8.

5.2 Preliminaries

5.2.1 Differential Privacy

ε-differential privacy [35] provides a rigorous privacy framework with formal protection guarantees. Given a privacy budget parameter ε ∈ (0, +∞), a randomized mechanism M satisfies ε-differential privacy iff for all datasets D and D′, where D′ can be obtained from D by either adding or removing one tuple, and for all E ⊆ Range(M),
$$\Pr[M(D) \in E] \le e^\varepsilon\Pr[M(D') \in E], \quad (5.1)$$
where Pr[M(D) ∈ E] denotes the probability of mechanism M outputting an outcome in the set E for a database D, and Range(M) is the co-domain of M. M hides the presence of an individual in the data, since the difference in probability of any set of outcomes obtained on two datasets differing in a single tuple never exceeds $e^\varepsilon$. The protection provided by DP is stronger when ε approaches 0.

The sensitivity of a function (e.g., a query) f, denoted by $Z_f$, is the maximum amount the value of f can change when adding or removing a single individual's records from the data. The ε-DP guarantee can be achieved by adding random noise derived from the Laplace distribution $Lap(Z_f/\varepsilon)$. For a query $f : \mathcal{D} \to \mathbb{R}$, the Laplace mechanism M returns $f(D) + Lap(Z_f/\varepsilon)$, where $Lap(Z_f/\varepsilon)$ is a sample drawn from the probability density function $Lap(x \mid Z_f/\varepsilon) = \frac{\varepsilon}{2Z_f}\exp(-|x|\varepsilon/Z_f)$ [35].

The composability property of DP helps quantify the amount of privacy attained when multiple functions are evaluated on the data. Specifically, when mechanisms M1 and M2 with privacy budgets ε1 and ε2 are applied in succession on overlapping data partitions, the sequential composition property [35] states that the budget consumption is ε1 + ε2. Conversely, when M1 and M2 are applied on disjoint data partitions, the parallel composition property states that the resulting budget consumption is max(ε1, ε2). The post-processing property of differential privacy [35] states that given any arbitrary function h and an ε-DP mechanism M, the mechanism h(M) is ε-DP. Lastly, we note that DP is robust to side-channel information [35]; that is, the privacy guarantee on the DP release of D holds irrespective of any publicly available information about the users in D.
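A minimal sketch of the Laplace mechanism just described (names are ours); for a single count query the sensitivity is 1, matching the notation $\bar{f}(q) = f(q) + Lap(1/\varepsilon)$ used later in Table 5.1.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, eps, rng=None):
    # Release f(D) + Lap(Z_f / eps); for a single count query, Z_f = 1.
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / eps)
```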
Table 5.1: Summary of Notations
  ε — DP privacy budget
  Q, QW — query distribution and workload query set
  QD, YD — data collection query set and its answers
  QA, YA — augmented query set and its answers
  R, k — set and number of query sizes for training
  l, u — lower and upper bound on query sizes
  f(q), f̂(q; θ) — count of records in q calculated from D (estimated from θ)
  f̄(q) — f(q) + Lap(1/ε)
  ρ, C — grid granularity, set of bottom-left corners of grid cells
  ψ — smoothing factor in relative error
  Φ, ϕ — ParamSelect model, dataset features
  D, DT, DI — all public datasets, ParamSelect training and inference datasets
  πα(D, ε) — function denoting the best value of system parameter α for dataset D and budget ε
  π̂α(D, ε) — empirical estimate of πα(D, ε)

5.2.2 Problem Definition

Consider a database D that covers a spatial region SR ⊆ R², and contains n records, each describing an individual's geo-coordinate. Given a privacy budget ε, the problem studied in this chapter is to return the answer to an unbounded number of spatial range count queries (RCQs). An RCQ consists of a spatial range predicate, and its answer is the number of records in D that satisfy the range predicate. We consider spatial range queries that are axis-parallel and square-shaped, defined by their bottom-left corner c (where c is a vector in SR) and their side length r. An RCQ, q, is then defined by the pair q = (c, r). We say r is the query size and c is its location coordinate. For a database D, the answer to the RCQ q = (c, r) can be written as a function f(q) = |{p | p ∈ D, c[i] ≤ p[i] < c[i]+r, ∀i ∈ {0, 1}}|, where z[0] and z[1] denote the latitude and longitude of any coordinate z, respectively. We assume RCQs follow a distribution Q and, for any RCQ q, we measure the utility of its estimated answer, y, using the relative error metric, defined as ∆(y, f(q)) = |y − f(q)| / max{f(q), ψ}, where ψ is a smoothing factor necessary to avoid division by zero.
The typical way to solve the problem of answering an unbounded number of RCQs is to design an ε-DP mechanism M and a function f̂ such that (1) M takes as an input the database D and outputs a differentially private representation of the data, θ; and (2) the function f̂(q; θ) takes the representation θ, together with any input query q, and outputs an estimate of f(q). In practice, M is used exactly once to generate the representation θ. Given such a representation, f̂(q; θ) answers any RCQ, q, without further access to the database. For instance, in [108], M is a mechanism that outputs noisy counts of cells of a 2-dimensional grid overlaid on D. Then, to answer an RCQ q, f̂(q; θ) takes the noisy grid, θ, and the RCQ, q, as inputs and returns an estimate of f(q) using the grid. The objective is to design M and f̂ such that the relative error between f̂(q; θ) and f(q) is minimized, that is, to minimize E_{θ∼M} E_{q∼Q}[∆(f̂(q; θ), f(q))].
Let f̂ be a function approximator and define M to be a mechanism that learns its parameters. The learning objective of M is to find a θ such that f̂(q; θ) closely mimics f(q) for different RCQs, q. The representation of the data, θ, is the set of learned parameters of a function approximator. Mechanism M outputs a representation θ, and any RCQ, q, is answered by evaluating the function f̂(q; θ). However, M is now defined as a learning algorithm and f̂ as a function approximator. Our problem is formally defined as follows:

Problem 2. Given a privacy budget ε, design a function approximator, f̂ (let the set of possible parameters of f̂ be Θ), and a learning algorithm, M, such that M satisfies ε-DP and finds

arg min_{θ∈Θ} E_{q∼Q}[∆(f̂(q; θ), f(q))]
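For concreteness, a small Python sketch of f(q) and ∆ over an in-memory point set follows (identifiers are illustrative only, not part of the system):

import numpy as np

def f(points, c, r):
    # f(q) = |{p in D : c[i] <= p[i] < c[i] + r for i in {0, 1}}|.
    c = np.asarray(c)
    return int(np.all((points >= c) & (points < c + r), axis=1).sum())

def relative_error(y, true_answer, psi):
    # Delta(y, f(q)) = |y - f(q)| / max{f(q), psi}.
    return abs(y - true_answer) / max(true_answer, psi)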
5.3 Spatial Neural Histograms (SNH)

Our goal is to utilize models that can learn patterns within the data in order to answer RCQs accurately. We employ neural networks as the function approximator f̂, due to their ability to learn complex patterns effectively. Prior work [1] introduced a differentially private stochastic gradient descent (DP-SGD) approach to privately train a neural network. Thus, a seemingly straightforward solution to Problem 2 is using a simple fully connected neural network and learning its parameters with DP-SGD. Sec. 5.3.1 discusses this baseline approach and outlines the limitations of using DP-SGD in our setting, which leads to poor accuracy. Next, in Sec. 5.3.2, we discuss how we improve the training process to achieve good accuracy. In Sec. 5.3.3 we provide an overview of our proposed Spatial Neural Histogram (SNH) solution. Table 5.1 summarizes the notation.

5.3.1 Baseline Solution using DP-SGD

Learning Setup. We define f̂(·; θ) to be a fully connected neural network with parameter set θ. We train the neural network so that for an RCQ q, its output f̂(q; θ) is similar to f(q). A training set, T, is created, consisting of (q, f(q)) pairs, where q is the input to the neural network and f(q) is the training label for the input q (we call RCQs in the training set training RCQs). To create the training set, similar to [67, 79], we assume we have access to a set of workload RCQs, QW, that resembles the RCQs a query issuer would ask (e.g., they are sampled from Q or a similar distribution) and is assumed to be public. Thus, we can define our training set T to be {(q, f(q)) | q ∈ QW}. We define the training loss as

L = Σ_{q∈QW} (f̂(q; θ) − f(q))²    (5.2)

In a non-private setting, a model can be learned by directly optimizing Eq. (5.2) using a gradient descent approach. The model can then answer any new RCQ q similarly to the ground truth f(q).
Incorporating Privacy. DP-SGD [1] incorporates differential privacy for training neural networks. It modifies SGD by clipping each sample gradient to have norm at most equal to a given clipping threshold, B, and obfuscating the gradients with Gaussian noise. Intuitively, the clipping threshold, B, disallows learning more information than a set quantity from any given training sample (no matter how different it is from the rest), and the standard deviation of the Gaussian noise added is scaled with B to ensure obfuscation is proportional to the amount of information gained per sample. Specifically, in each iteration: (1) a subset, S, of the training set is sampled; (2) for each sample, s = (x, y) ∈ S, the gradient g_s = ∇_θ(f̂(x; θ) − y)² is computed and clipped (i.e., truncated) to a maximum ℓ2-norm of B as ḡ_s = (g_s/∥g_s∥₂) · min(∥g_s∥₂, B); (3) the average clipped gradient value for samples in S is obfuscated with Gaussian noise as

g = Σ_{s∈S} ḡ_s + N(0, σ²B²)    (5.3)

(4) the parameters are updated in the direction opposite to g.
DP-SGD Challenges. In our problem setting, the training set is created by querying D to obtain the training labels, and our goal is to ensure the privacy of records in D. On the other hand, DP-SGD considers the training set itself to be the dataset whose privacy needs to be secured. This changes the sensitivity analysis of DP-SGD. In our setting, to compute the sensitivity of the gradient sum Σ_{s∈S} ḡ_s in step (3) of DP-SGD, we have to consider the worst-case effect the presence or absence of a single geo-coordinate record p can have on the sum (as opposed to the worst-case effect of the presence or absence of a single training sample). Removing p can potentially affect every ḡ_s for all s ∈ S, so the sensitivity of the gradient sum is 2|S| × B, and Gaussian noise of N(0, σ²·4|S|²B²) must be added to the gradient sum to achieve DP (cf. noise in step (3) above). After this adjustment, per-iteration and total privacy consumption of DP-SGD is amplified, impairing learning. We experimentally observed that, for any reasonable privacy budget, training loss does not improve at all during training due to the large added noise.
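For illustration, a minimal NumPy sketch of the clip-and-noise steps (2)-(3) of DP-SGD follows; the function name and shapes are our own, and the per-sample gradient computation is assumed to happen elsewhere.

import numpy as np

def clip_and_noise(per_sample_grads, B, sigma, rng=np.random.default_rng()):
    # per_sample_grads: array of shape (|S|, d), one gradient g_s per sample s.
    clipped = np.empty_like(per_sample_grads)
    for i, g in enumerate(per_sample_grads):
        norm = np.linalg.norm(g)
        # Truncate g to l2-norm at most B: g_bar = g * min(1, B / ||g||_2).
        clipped[i] = g * min(1.0, B / max(norm, 1e-12))
    # Obfuscate the clipped gradient sum with Gaussian noise N(0, sigma^2 B^2).
    noise = rng.normal(0.0, sigma * B, size=per_sample_grads.shape[1])
    return clipped.sum(axis=0) + noise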
5.3.2 A different learning paradigm for RCQs

Next, we introduce three design principles (P1-P3) we follow when training neural networks to answer RCQs. These principles are then used in Sec. 5.3.3 to build our solution.
P1: Separation of noise addition from training. The main reason DP-SGD fails in our problem setting is that too much noise needs to be added when calculating gradients privately. Recall that DP-SGD uses the quantity g, defined in Eq. (5.3), as the differentially private estimate of the gradient of the loss function. Here, we investigate the private gradient computation in more detail to provide an alternative method that calculates the gradient with differential privacy. Recall that the goal is to obtain the gradient of the loss function, L, defined in Eq. (5.2), with respect to the model parameters. We thus differentiate L and obtain:

∇_θ L = Σ_{q∈QW} 2 × (f̂(q; θ) − f(q)) × ∇_θ f̂(q; θ)    (5.4)

In Eq. (5.4), the terms f̂(q; θ) and ∇_θ f̂(q; θ) are data-independent, and only f(q) is data-dependent, i.e., accesses the database. This is because the training RCQs in QW (i.e., the inputs to the neural network) are created independently of the database. The data-dependent term requires computing private answers to f(q) for an RCQ q, hence it must consume budget, while the data-independent terms can be calculated without spending any privacy budget. This decomposition of the gradient into data-dependent and data-independent terms is possible because, different from typical machine learning settings, differential privacy is defined with respect to the database D and not the training set (as discussed in Sec. 5.3.1). Instead of directly using g (Eq. (5.3)) as the differentially private estimate of the gradient (where the gradients are clipped and noise is added to the clipped gradients), we calculate a differentially private value of the training label f(q), called f̄(q), by adding noise to the label (define f̄(q) = f(q) + Lap(1/ε)) and calculate the gradient from that. The differentially private estimate of the gradient is then

g = Σ_{q∈QW} 2 × (f̂(q; θ) − f̄(q)) × ∇_θ f̂(q; θ)    (5.5)

A crucial benefit is that f̄(q) does not change over successive learning iterations. That is, the differentially private value f̄(q) can be computed once and used for all training iterations. This motivates our first design principle of separating noise addition from training. This way, training becomes a two-step process: first, for all q ∈ QW, we calculate the differentially private training label f̄(q). We call this step data collection. Then, we use a training set consisting of pairs (q, f̄(q)) for all q ∈ QW for training. Since DP-compliant data measurements are obtained, all future operations that use these measurements as input are also ε-differentially private according to the post-processing property of differential privacy [35]. Thus, the training process proceeds as in a non-private setting, where a conventional SGD algorithm can be applied (i.e., we need not add noise to gradients), and differential privacy is still satisfied.
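The following sketch contrasts this two-step process with DP-SGD, using a toy linear model in place of the neural network (all names are ours and the model is purely illustrative): Laplace noise is added to the labels exactly once, and training is then conventional SGD on the sanitized labels.

import numpy as np

rng = np.random.default_rng(0)

def collect_noisy_labels(queries, true_answer_fn, epsilon):
    # Data collection: the privacy budget is spent here, exactly once.
    # Count queries have sensitivity 1, so noise is drawn from Lap(1/epsilon).
    return [(q, true_answer_fn(q) + rng.laplace(0.0, 1.0 / epsilon))
            for q in queries]

# Toy stand-in for f_hat: a linear model over the query location.
theta = np.zeros(3)

def f_hat(q):
    return theta[0] * q[0] + theta[1] * q[1] + theta[2]

def sgd_epoch(noisy_labels, lr=1e-4):
    # Conventional SGD on Eq. (5.5): no gradient clipping or noising is needed,
    # because training only post-processes the DP labels f_bar(q).
    global theta
    for q, y_bar in noisy_labels:
        grad = 2.0 * (f_hat(q) - y_bar) * np.array([q[0], q[1], 1.0])
        theta -= lr * grad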
P2: Spatial data augmentation through partitioning. Following principle P1, privacy accounting is only needed when answering training queries to collect training labels. Meanwhile, in our experiments, we observed that training accurate neural networks requires a training set containing queries of different sizes (see Sec. 5.6.3.2). Such queries may overlap and, if we answer them directly from the database, the sequential composition theorem would apply to account for the total privacy budget consumption. This way, the more such queries we answer, the more budget needs to be spent. Instead, to avoid spending extra privacy budget while creating more training samples with multiple query sizes, we propose spatial data augmentation through partitioning. First, we use a data collection query set, QD, chosen such that the RCQs in QD do not overlap (i.e., a space partitioning). This ensures parallel composition can be used for privacy accounting, instead of sequential composition, which allows answering all RCQs in QD by spending a budget equal to that of one RCQ. Then, using the partitioning QD, we create and answer new queries, q, of different sizes without spending any more privacy budget, but by making a uniformity assumption across cells in QD that partially overlap q. Even though this approach introduces uniformity error in our training set, it avoids adding the otherwise required large-scale noise, and boosts accuracy. Thus, it allows us to optimize the uniformity/noise trade-off [108, 28] when creating our training set (we present experiments in Sec. 5.9.2 of our technical report [159] to show that data augmentation reduces error).
P3: Learning at multiple granularities. We employ in our solution multiple models that learn at different granularities, each designed to answer RCQs of a specific size. Intuitively, it is more difficult for a model to learn patterns when both query size and location change. Using multiple models allows each model to learn the patterns relevant to the granularity it operates on.

Figure 5.2: SNH Overview

5.3.3 Proposed approach: SNH

Our Spatial Neural Histograms (SNH) design, illustrated in Figure 5.2, consists of three steps: (1) Data Collection, (2) Model Training, and (3) Model Utilization. We provide a summary of each step below, and defer details until Sec. 5.4.
Data Collection. This step partitions the space into non-overlapping RCQs that are directly answered with DP-added noise. The output of this step is a data collection query set, QD, and a set YD which consists of the differentially private answers to the RCQs in QD. This is the only step in SNH that accesses the database. In Fig. 5.2, for example, the query space is partitioned into four RCQs, and a differentially private answer is computed for each.
Training. Our training process consists of two stages. First, we use spatial data augmentation to create more training samples based on QD. An example is shown in Fig. 5.2, where an RCQ covering both the red and yellow squares is not present in the set QD, but it is obtained by aggregating its composing sub-queries (both in QD). Second, the augmented training set is used to train a function approximator f̂ that captures f well. f̂ consists of a set of neural networks, each trained to answer a different query size.

Figure 5.3: Data Collection: map view (left), true cell count heatmap (middle), ε-DP heatmap with noisy counts (right)

Model Utilization. This step decides how any previously unseen RCQ can be answered using the learned function approximator, and how the different neural networks are utilized to answer an RCQ.

5.4 Technical Details

5.4.1 Step 1: Data Collection

This step creates a partitioning of the space into non-overlapping bins, and computes for each bin a differentially private answer. We opt for a simple equi-width grid of cell width ρ as our partitioning method. As illustrated in Fig. 5.3, (1) we overlay a grid on top of the data domain; (2) we calculate the true count for each cell in the grid; and (3) we add noise sampled from Lap(1/ε) to each cell count.
We represent a cell by the coordinates of its bottom-left corner, c, so that getting the count of records in each cell is an RCQ, q = (c, ρ). Let C be the set of bottom-left coordinates of all the cells in the grid. Furthermore, recall that for a query q, f̄(q) = f(q) + Lap(1/ε). Thus, the data collection query set is defined as QD = {(c, ρ), c ∈ C}, and its answers are the set YD = {f̄(c, ρ), c ∈ C}. We use YD[c] to refer to the answer for the query located at c in YD. The output of the data collection step consists of the sets QD and YD.
Even though more complex partitioning structures have been used previously for privately answering RCQs [108, 167], we chose a simple regular grid, for two reasons. First, our focus is on a novel neural database approach to answering RCQs, which can be used in conjunction with any partitioning type; using a simple grid allows us to isolate the benefits of the neural approach. Second, using more complex structures in the data collection step may increase the impact of uniformity error, which we attempt to suppress through our approach. The neural learning step captures density variations well, and conducting more complex DP-compliant operations in the data collection step can have a negative effect on overall accuracy. In our experiments, we observed significant improvements in accuracy with the simple grid approach. While it may be possible to improve the accuracy of SNH by using more advanced data collection methods, we leave that study for future work. The challenge in data collection is choosing the value of ρ to minimize the induced errors. We address this thoroughly in Sec. 5.5.1 and present a method to determine the best granularity of the grid.
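A minimal sketch of the data collection step follows (helper names are ours; the grid is built by histogramming the points):

import numpy as np

def collect_grid(points, extent, rho, epsilon, rng=np.random.default_rng()):
    # extent = (x_min, y_min, x_max, y_max); rho is the cell width.
    x_min, y_min, x_max, y_max = extent
    x_edges = np.arange(x_min, x_max + rho, rho)
    y_edges = np.arange(y_min, y_max + rho, rho)
    true_counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                       bins=[x_edges, y_edges])
    # Cells are disjoint, so by parallel composition adding Lap(1/epsilon)
    # to every cell consumes a total budget of epsilon.
    noisy_counts = true_counts + rng.laplace(0.0, 1.0 / epsilon,
                                             size=true_counts.shape)
    return x_edges, y_edges, noisy_counts  # Y_D, indexed by cell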
5.4.2 Step 2: SNH Training

Given the query set QD and its sanitized answers, we can perform any operation on this set without privacy leakage, due to the post-processing property of DP. As discussed in Sec. 5.3.3, we first perform a data augmentation step using QD to create an augmented training set QA. Then, QA is used for training our function approximator.
Data Augmentation is a common machine learning technique to increase the number of samples for training based on the existing (often limited) available samples [169, 64]. We propose spatial data augmentation for learning to answer RCQs. Our proposed data augmentation approach is based on our design principle P2, discussed in Sec. 5.3.2, where we motivate augmenting the training set through partitioning. In the data augmentation step, we create new queries of different sizes, answer them using the partitioning, and add the answers to our training set, as detailed in the following.

Figure 5.4: Model Training: Augmented query sets of size r1 to rk (top) are used to learn neural network models (bottom)

We use the partitioning defined by QD and the corresponding answers YD to answer queries at the same locations as in QD but of other sizes. Consider a query location c ∈ C and a query size r, r ≠ ρ. We estimate the answer for the RCQ q = (c, r) as Σ_{c′∈C} (|(c, r) ∩ (c′, ρ)| / ρ²) × YD[c′], where |(c, r) ∩ (c′, ρ)| is the overlapping area of the RCQs (c, r) and (c′, ρ). In this estimate, noisy counts of cells in QD fully covered by q are added as-is (since |(c, r) ∩ (c′, ρ)| = ρ²), whereas fractional counts for partially-covered cells are estimated using the uniformity assumption. Fig. 5.4 shows how we perform data augmentation for a query (c, r1) with size r1 at location c. Also observe that, by using queries at the same locations as in QD, the bottom-left corners of all queries in the augmented query set are aligned with the grid. We repeat this procedure for k different query sizes to generate sufficient training data. To ensure coverage of all expected query sizes, we define the set of k sizes to be uniformly spaced.

Algorithm 6 Spatial data augmentation
Input: Query set QD with answers YD, k query sizes
Output: Augmented training set QA with labels YA
1: R ← {l + ((u − l)/k) × (i + 1/2), ∀i, 0 ≤ i < k}
2: for all r ∈ R do
3:   Q^r_A, Y^r_A ← ∅
4:   for (c, ρ) ∈ QD do
5:     Q^r_A.append((c, r))
6:     Y^r_A[c] ← Σ_{(c′,ρ)∈QD} (|(c, r) ∩ (c′, ρ)| / ρ²) × YD[c′]
7: return QA, YA ← {Q^r_A, ∀r ∈ R}, {Y^r_A, ∀r ∈ R}

Specifically, assuming the test RCQs have sizes between l and u, we define the set R as the set of k uniformly spaced values between l and u, and we create an augmented training set for each query size in R. This procedure is shown in Alg. 6. We define Q^r_A for r ∈ R to be the set of RCQs located at C but with query size r, that is, Q^r_A = {(c, r), c ∈ C}, and define Y^r_A to be the set of the estimates for queries in Q^r_A obtained from QD and YD. The output of Alg. 6 is the augmented training set containing training samples for the different query sizes. Note that, as seen in the definition above, Q^r_A, for any r, only contains queries whose bottom-left corner is aligned with the grid used for data collection, to minimize the use of the uniformity assumption. However, uniformity errors can still be present in the answers in Y^r_A. We discuss in Sec. 5.4.3 how training neural networks on top of these answers allows us to mitigate the uniformity error through learning.
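A compact Python rendering of Alg. 6 follows, under the simplifying assumptions that cell corners are stored as tuples and that the overlap of two axis-parallel squares is computed in closed form (identifiers are ours; the inner loop visits all cells for clarity, although in practice only the cells overlapping the query need to be examined):

import numpy as np

def overlap_area(c, r, c_prime, rho):
    # Area of intersection of axis-parallel squares (c, r) and (c_prime, rho).
    dx = min(c[0] + r, c_prime[0] + rho) - max(c[0], c_prime[0])
    dy = min(c[1] + r, c_prime[1] + rho) - max(c[1], c_prime[1])
    return max(dx, 0.0) * max(dy, 0.0)

def augment(corners, Y_D, rho, l, u, k):
    # corners: bottom-left cell corners c' in C as tuples; Y_D: dict c' -> noisy count.
    R = [l + (u - l) / k * (i + 0.5) for i in range(k)]
    Q_A, Y_A = {}, {}
    for r in R:
        Q_A[r] = [(c, r) for c in corners]
        Y_A[r] = {c: sum(overlap_area(c, r, cp, rho) / rho**2 * Y_D[cp]
                         for cp in corners)
                  for c in corners}
    return Q_A, Y_A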
Model architecture. We find that using multiple neural networks, each trained for a specific query size, performs better than using a single neural network to answer queries of all sizes. Thus, we train k different neural networks, one for each r ∈ R. This means that a single neural network trained for query size r only answers queries of size r (we discuss in Sec. 5.4.3 how the neural networks are used to answer other query sizes); accordingly, the input dimensionality of each neural network is two, i.e., the latitude and longitude of the location of the query. We use k identical fully-connected neural networks (specifics of the network architecture are discussed in Sec. 5.6).
Loss function and Optimization. We train each of the k neural networks independently. We denote by Q^r_A the training set for a neural network f̂(·; θr), trained for query size r, and we denote the resulting labels by Y^r_A. We use a mean squared error loss function to train the model, but propose two adjustments to capitalize on the available workload information. First, note that for a query size r ∈ R, Q^r_A is comprised of queries at uniformly spaced intervals, which may not follow the query distribution Q. However, we can exploit properties of the workload queries, QW, to tune the model for queries from Q. Specifically, for any (c, r) ∈ Q^r_A, let w_(c,r) = |{q′ ∈ QW, (c, r) ∩ q′ ≠ ∅}|; that is, w_(c,r) is the number of workload queries that overlap a training query. In our loss function, we weight every query (c, r) by w_(c,r). This workload-adaptive modification to the loss function emphasizes the regions that are more popular for a potential query issuer. Second, we aim at answering queries with low relative error, whereas a mean squared loss puts more emphasis on absolute error. Thus, for a training query (c, r), we also weight the sample by 1/max{Y^r_A[c], ψ}. Putting these together, the loss function optimized for each neural network is

Σ_{(c,r)∈Q^r_A} (w_(c,r) / max{Y^r_A[c], ψ}) × (f̂(c; θr) − Y^r_A[c])²    (5.6)

5.4.3 Model Utilization

To answer a new query (c, r), the model that is trained to answer queries with the size most similar to r is accessed. That is, we find r* = arg min_{r′∈R} |r − r′| and answer the query using the network f̂(c; θ_{r*}). The output answer is scaled to r according to a uniformity assumption, and the scaled answer is returned, i.e., (r/r*)² × f̂(c; θ_{r*}). Fig. 5.5 shows this procedure for two different RCQs.

Figure 5.5: Model utilization: 30m query answered from 25m network (left), 90m query from 100m network (right)

It is important to differentiate the use of the uniformity assumption before learning (i.e., in data augmentation), called uniformity assumption pre-learning, from the use of the uniformity assumption after learning (during model utilization), called uniformity assumption post-learning. The parameter k allows exploring the spectrum between the two cases. Specifically, when k is larger, we train more models and each model is trained for a different query size. For each query size, data augmentation uses the uniformity assumption to generate training samples. Thus, more training samples are created using the uniformity assumption. We call this increasing uniformity assumption pre-learning. On the other hand, since more models are trained, the output of each model will be scaled by a factor closer to one (i.e., in the above paragraph, r* becomes closer to r, so that (r/r*)² becomes closer to 1). We call this decreasing uniformity assumption post-learning. Our experimental results in Sec. 5.6.3.2 show that increasing k improves accuracy, and k should be set as large as possible, so that uniformity assumption post-learning becomes negligible in practice. This follows the SNH motivation (and the observations in Sec. 5.6.3.4) that learning can mitigate the uniformity error. That is, the uniformity assumption should be made pre-learning, so that its impact on final accuracy can be reduced through learning.
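Model utilization thus reduces to a nearest-size lookup followed by quadratic rescaling; a small sketch (names ours):

def answer_rcq(c, r, models, sizes):
    # models: maps each trained size r' in R to its network f_hat(.; theta_r').
    r_star = min(sizes, key=lambda rp: abs(r - rp))
    # Scale the nearest-size answer to size r under a uniformity assumption.
    return (r / r_star) ** 2 * models[r_star](c)

For instance, with R = {30m, 60m, 90m}, an 80m query is answered by the 90m network and its output is scaled by (80/90)².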
5.5 End-to-End System Aspects

5.5.1 System Tuning with ParamSelect

Choosing a good grid granularity, ρ, is crucial for achieving high accuracy in DP spatial data publishing, and has been studied in previous work [108, 46]. Discretizing continuous-domain geo-coordinates creates uniformity errors, and hence the granularity of the grid must be carefully tuned to compensate for the effect of discretization. Existing work [108, 46] makes simplifying assumptions to analytically model the impact of grid granularity on the accuracy of answering queries. However, modelling data- and query-specific factors is difficult, and the simplifying assumptions are often not true in practice, as our experiments show (see Sec. 5.9.2 of our technical report [159]). Instead, we learn a model that is able to predict an advantageous grid granularity for the specific dataset, query distribution and privacy budget. Sec. 5.5.1.1 discusses ParamSelect, our approach to determine ρ. In Sec. 5.5.1.2 we show how to extend ParamSelect to tune other system parameters.

5.5.1.1 ParamSelect for ρ

The impact of grid granularity on privacy-accuracy trade-offs when answering queries is well understood in the literature [108]. In SNH, the grid granularity in the data collection phase impacts performance as follows. On the one hand, smaller grid cells increase the resolution at which the data are collected, thereby reducing the uniformity error. Learning is also improved, due to more training samples being extracted. On the other hand, creating a grid that is too fine can diminish the signal-to-noise ratio for cells with small counts, since at a given ε the magnitude of noise added to any cell count is fixed. Moreover, during data augmentation, aggregating multiple cells leads to an increase in the total noise variance, since the errors of individual cells are summed. SNH is impacted by cell width in multiple ways, and determining a good cell width, ρ, is important for achieving good accuracy.
Capturing an analytical dependence may not be possible, since numerous data, query and modelling factors determine the ideal cell width. If data points are concentrated in some area where the queries fall, a finer grid can more accurately answer queries for the query distribution (even though the signal-to-noise ratio may be poor for parts of the space where queries are not often asked). This factor can be measured only by looking at the actual data and the distribution of queries, and would require spending privacy budget. The best value of ρ depends on the privacy budget ε, the distribution of points in D and the query distribution Q. Define δ(ρ, D, ε) to be the error of SNH with cell width ρ, and define π(D, ε) = arg min_ρ δ(ρ, D, ε), that is, the function that outputs the ideal cell width. We learn a model, Φ, to approximate π(D, ε). We refer to Φ as the regressor, to distinguish it from the SNH model, f̂, discussed in Sec. 5.4. The learning process is similar to any supervised learning task, where for different dataset and privacy budget pairs, (D, ε), we use the label π(D, ε) to train Φ. The input to the regressor is (D, ε) and the training objective is to get the output, Φ(D, ε), to be close to the label π(D, ε).
Feature engineering. Learning a regressor that takes a raw database D as input is infeasible, due to the high sensitivity of learning with privacy constraints. Instead, we introduce a feature engineering step that, for the dataset D, outputs a set of features, ϕD. Training then replaces D with ϕD. Let the spatial region of D be SRD. First, as one of our features, we measure the skewness in the spread of individuals over SRD, since this value directly correlates with the expected error induced by using the uniformity assumption. In particular, we (1) discretize SRD using an equi-width partitioning, (2) for each cell, calculate the probability of a point falling into the cell as the count of points in the cell normalized by the total number of points in D, and (3) take the Shannon entropy hD over the probabilities in the flattened grid. However, calculating hD on a private dataset violates differential privacy. Instead, we utilize publicly available location datasets as an auxiliary source to approximately describe the private data distribution for the same spatial region. We posit that there exist high-level similarities in the distribution of people's locations in a city across different private and public datasets for the same spatial regions, and thus the public dataset can be used as a surrogate. Let D be the set of public datasets that we have access to, and let DI ∈ D be a public dataset covering the same spatial region as D. We estimate hD with hDI. We call DI the public ParamSelect inference dataset.
Second, we use data-independent features: ε, 1/(n×ε) and √(1/(n×ε)), where the product n × ε accounts for the fact that decreasing the scale of the input dataset and increasing epsilon have equivalent effects on the error. This is also understood as epsilon-scale exchangeability [46]. We calculate ϕD,ε = (n, ε, 1/(nε), √(1/(nε)), hDI) as the set of features for the dataset D, without consuming any privacy budget in the process. Lastly, we remark that for regions where an auxiliary source of information is unavailable, we may still utilize the data-independent features to good effect. In our technical report [159], we show that our proposed features achieve reliable accuracy across datasets; in particular, we chose hD amongst several alternative data-dependent features for that reason.

Algorithm 7 ParamSelect training
Input: A set of public training datasets DT ⊆ D and privacy budgets E for training to predict a system parameter α
Output: Regressor Φα for system parameter α
1: procedure ϕ(D, n, ε)
2:   hD ← entropy of D
3:   return (n, ε, 1/(nε), √(1/(nε)), hD)
4: procedure Train_ParamSelect(DT, E)
5:   T ← {(ϕ(D, |D|, ε), π̂α(D, ε)) | ε ∈ E, D ∈ DT}
6:   Φα ← Train regressor using T
7:   return Φα

Training Sample Collection. Generating training samples for Φ is not straightforward, since we do not have an analytical formulation for δ(ρ, D, ε) and thus π(D, ε). Since the exact value of π(D, ε) is unknown, we use an empirical estimate. We run SNH with various grid granularities for data collection and return the grid size, ρD,ε, for which SNH achieves the lowest error. Our experimental results in Sec. 5.6.3 show that δ(ρ, D, ε) is only marginally affected by small changes in ρ (so evaluating δ(ρ, D, ε) at values of ρ five meters apart and selecting the best ρ provides a good estimate of π(D, ε)). Intuitively, one expects the error in the training set to remain the same if the cell width of the data collection grid changes by a few meters, since the induced uniformity errors are similar. Thus, we use this approach to obtain ρD,ε as our training label. Note that the empirically determined value of ρD,ε is dependent on, and hence accounts for, the query distribution on which SNH error is measured. Moreover, when D contains sensitive data, obtaining ρD,ε would require spending privacy budget. Instead, we generate training records from a set of datasets, DT ⊆ D, that have already been publicly released (see Sec. 5.6 for details of the public datasets). We call the datasets in DT public ParamSelect training datasets. Put together, our training set is {(ϕD,ε, ρD,ε) | ε ∈ E, D ∈ DT}, where E is the range of different privacy budgets chosen for training.

Algorithm 8 ParamSelect usage
Input: Spatial extent SR and size n of a sensitive dataset D and privacy budget ε
Output: System parameter value α for private dataset D
1: procedure ParamSelect(SR, n, ε)
2:   DI ← Public dataset with spatial extent SR
3:   α ← Φα(ϕ(DI, n, ε))
4:   return α

Predicting Grid Width with ParamSelect. The training phase of ParamSelect builds the regressor Φ using the training set described above. We observed that models from the decision tree family perform the best for this task. Once the regressor is trained, its utilization for any unseen dataset is straightforward and only requires calculating the corresponding features and evaluating Φ.
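Putting the pieces together, the following Python sketch mirrors Alg. 7 and Alg. 8, assuming scikit-learn for the regressor (our final model is an ExtraTrees regressor; see Sec. 5.6.1.2). The dataset representation, helper names, and the stubbed empirical search are ours.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def entropy_feature(points, extent, n_bins=64):
    # Steps (1)-(3) of the feature engineering: equi-width grid, per-cell
    # probabilities, Shannon entropy over the flattened grid.
    x_min, y_min, x_max, y_max = extent
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=n_bins,
                                  range=[[x_min, x_max], [y_min, y_max]])
    p = counts.flatten() / counts.sum()
    p = p[p > 0]  # drop empty cells to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def features(h, n, eps):
    # phi_{D,eps} = (n, eps, 1/(n eps), sqrt(1/(n eps)), h_{D_I}).
    return [n, eps, 1.0 / (n * eps), np.sqrt(1.0 / (n * eps)), h]

def train_param_select(public_datasets, budgets, best_param_fn):
    # best_param_fn(points, eps) is the empirical search for pi_hat(D, eps),
    # e.g., running SNH at candidate cell widths and keeping the best one.
    X, y = [], []
    for points, extent in public_datasets:
        for eps in budgets:
            X.append(features(entropy_feature(points, extent), len(points), eps))
            y.append(best_param_fn(points, eps))
    # 150 trees of maximum depth 7, as selected in our experiments (Sec. 5.6.1.2).
    return ExtraTreesRegressor(n_estimators=150, max_depth=7).fit(X, y)

def param_select(regressor, public_points, public_extent, n, eps):
    # Inference uses only public data and the (assumed public) size n of D.
    x = features(entropy_feature(public_points, public_extent), n, eps)
    return float(regressor.predict([x])[0])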
5.5.1.2 Generalizing ParamSelect to any system parameter

We can easily generalize the approach in Sec. 5.5.1.1 to any system parameter. Define the function πα(D, ε) that, given a query distribution, outputs the best value of α for a certain database and privacy budget. The goal of ParamSelect is to learn a regressor, using the public datasets DT ⊆ D, that mimics the function πα(·). The ParamSelect functionality is summarized in Alg. 7. First, during a pre-processing step, it defines the feature extraction function ϕ(D, n, ε) that extracts the features described in Sec. 5.5.1.1 from the public dataset D with n records and a privacy budget ε. Second, it creates the training set {(ϕ(D, |D|, ε), π̂α(D, ε)), ε ∈ E, D ∈ DT}, where π̂α(D, ε) estimates the value of πα(D, ε) with an empirical search (i.e., by trying different values of α and selecting the one with the highest accuracy), and DT and E are, respectively, the public datasets and the values of privacy budget used to collect training samples. Lastly, it trains a regressor Φα that takes the extracted features as an input and outputs a value for α.
At the inference stage (Alg. 8), ParamSelect uses a public dataset DI that covers the same spatial region as D, as well as the size of D, n, and the privacy budget ε, to extract the features ϕ(DI, n, ε). The predicted system parameter value for D is then Φα(ϕ(DI, n, ε)).

5.5.2 Privacy and Security Discussion

Let D be a private dataset covering a spatial region SR and D be a set of public datasets. The SNH end-to-end privacy mechanism M is comprised of two parts that compose sequentially: mechanism Mf, that models range count queries using the neural networks, and mechanism MΦ, that trains a regressor to determine the system parameters. Mf operates over D, ε, SR and D. MΦ operates over D and SR for ParamSelect training and inference. Hence, we write the end-to-end system as the SNH mechanism M(D | ε, SR, D) = Mf(D | ε, D, SR, MΦ(D, SR)).

Theorem 5. Mechanism M(D | ε, SR, D) satisfies ε-DP.

Sec. 5.9.1 of our technical report [159] contains a proof of the above theorem and a qualitative discussion of the DP privacy guarantees.

5.6 Experimental Evaluation

Sec. 5.6.1 describes the experimental testbed. Sec. 5.6.2 evaluates SNH in comparison with state-of-the-art approaches. Sec. 5.6.3 provides an ablation study of various design choices. Sec. 5.9.2 of our technical report [159] contains complementary experimental results.

Table 5.2: Urban datasets characteristics (test cities marked with *)
Low Pop. density: Fargo [46.877, -96.789]; Kansas City* [39.09, -94.59]; Salt Lake [40.73, -111.926]; Tulsa [36.153, -95.992]
Medium Pop. density: Phoenix [33.448, -112.073]; Los Angeles [34.02, -118.29]; Houston [29.747, -95.365]; Milwaukee* [43.038, -87.910]
High Pop. density: Miami* [25.801, -80.256]; Chicago [41.880, -87.70]; SF [37.764, -122.43]; Boston [42.360, -71.058]

5.6.1 Experimental Settings

5.6.1.1 Datasets

We first describe all the datasets and then specify how they are utilized in our experiments.
Dataset Description. All datasets comprise user check-ins specified as tuples of: user identifier, latitude and longitude of the check-in location, and timestamp. Our first dataset is a subset of the user check-ins collected by the SNAP project [24] from the Gowalla (GW) network. It contains 6.4 million records from 200k unique users during a time period between February 2009 and October 2010. Our second dataset, SF-CABS-S (CABS) [105], is derived from the GPS coordinates of approximately 250 taxis collected over 30 days in San Francisco. Following [46, 108], we keep only the start point of the mobility traces, for a total of 217k records.
The third dataset is proprietary, obtained from Veraset [133] (VS), a data-as-a-service company that provides anonymized movement data from 10% of the cellphones in the U.S. [134]. For a single day in December 2019, there were 2.6 billion readings from 28 million distinct devices. From VS we generate the fourth dataset, called SPD-VS. We perform Stay Point Detection (SPD) [153] on the data to remove location signals when a person is moving, and to extract POI visits when a user is stationary. SPD is useful for POI services [101], and results in a data distribution consisting of user visits (i.e., fewer points on roads and more at POIs). Following [153], we consider as a location visit a region 100 meters wide where a user spends at least 30 minutes.
To simulate a realistic urban environment, we focus on check-ins from several cities in the U.S. We group cities into three categories based on their population densities [69], measured in people per square mile: low density (lower than 1000/sq mi), medium density (between 1000 and 4000/sq mi) and high density (greater than 4000/sq mi). A total of twelve cities are selected, four in each population density category, as listed in Table 5.2. For each city, we consider a large spatial region covering a 20×20 km² area centered at [lat, lon]. From each density category we randomly select a test city (marked with * in Table 5.2), while the remaining cities are used as training cities. We use the notation <city> (<dataset>) to refer to the subset of a dataset for a particular city, e.g., Milwaukee (VS) refers to the subset of the VS dataset for the city of Milwaukee.
Experiments on VS. Private dataset: Our experiments on Veraset can be seen as a case study of answering RCQs on a proprietary dataset while preserving differential privacy. We evaluate RCQs on the Veraset dataset for the test cities. Due to the enormous volume of data, we sample at random sets of n check-ins, for n ∈ {25k, 50k, 100k, 200k, 400k}, for the test cities and report the results on these datasets. Auxiliary datasets: For each test city in VS, we set QW and DI to be the GW dataset for the corresponding city. The GW and VS datasets are completely disjoint (they were collected almost a decade apart). The public datasets DT are the set of all the training cities of the GW dataset.
Experiments on GW. Private dataset: We present the results on the complete set of records for the test cities of Miami, Milwaukee and Kansas City, with 27k, 32k and 54k data points, respectively. Auxiliary datasets: For each test city, we set QW and DI to be the VS counterpart dataset for that city. DT contains all the training cities in the GW dataset. None of the test cities, which are considered sensitive data, are included in DT.
Experiments on CABS. Private dataset: Since CABS consists of 217k records within the city of San Francisco only, we treat it as the sensitive test city for publishing. Auxiliary datasets: We set QW and DI to be the GW dataset for San Francisco. DT contains all the training cities in the GW dataset. Once again, collecting auxiliary information from an entirely different dataset ensures no privacy leakage on the considered private dataset.

5.6.1.2 SNH system parameters

We use the GW dataset to train the ParamSelect regression model. For the nine training cities and five values of privacy budget ε, we obtain 45 training samples. We utilize an AutoML pipeline (such as [41, 146]) to find a suitable model from among a wide range of ML algorithms.
The pipeline uses cross-validation to evaluate goodness-of-fit for possible algorithm and hyper-parameter combinations. The final model is an Extremely Randomized Trees (ExtraTrees) regressor [45]. ExtraTrees builds an ensemble of randomized decision trees, similar to a random forest [53], where each tree is trained using the whole learning sample (rather than a bootstrap sample). The model ensembles 150 trees having a maximum depth of 7. For the other system parameters, we observed that their best values for SNH remain stable over various dataset and privacy budget combinations. Sec. 5.6.3.2 and Sec. 5.9.2 of our technical report [159] present this result for the parameter k, and Sec. 5.6.3.4 and Sec. 5.9.2 of our technical report [159] for the model depth. We observed no benefit in using ParamSelect to set these parameters, and merely selected a value that performed well on our public datasets for the system parameter k and the neural network hyper-parameters. The fully connected neural networks contain 20 layers of 80 units each and are trained with the Adam [60] optimizer with learning rate 0.001.
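For reference, a simplified sketch of one per-size network and the weighted loss of Eq. (5.6) in JAX follows; our actual implementation differs in its details, and all helper names here are illustrative.

import jax
import jax.numpy as jnp

def init_mlp(key, depth=20, width=80, in_dim=2):
    # Fully connected network: in_dim -> width x (depth - 1) -> 1.
    sizes = [in_dim] + [width] * (depth - 1) + [1]
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (d_in, d_out)) * jnp.sqrt(2.0 / d_in)
        params.append((w, jnp.zeros(d_out)))
    return params

def f_hat(params, c):
    # Forward pass; c is a 2D query location (lat, lon).
    h = c
    for w, b in params[:-1]:
        h = jax.nn.relu(h @ w + b)
    w, b = params[-1]
    return (h @ w + b)[0]

def loss(params, locations, labels, weights, psi):
    # Weighted squared error of Eq. (5.6): w / max{label, psi} * (pred - label)^2.
    preds = jax.vmap(lambda c: f_hat(params, c))(locations)
    return jnp.sum(weights / jnp.maximum(labels, psi) * (preds - labels) ** 2)

grad_fn = jax.grad(loss)  # plug into an optimizer such as Adam with lr = 0.001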
5.6.1.3 Other experimental settings

Evaluation Metric. We construct query sets of 5,000 RCQs centered at uniformly random positions. Each query has a side length that varies uniformly from 25 meters to 100 meters. We evaluate the relative error for a query q as defined in Sec. 5.2, and set the smoothing factor ψ to 0.1% of the dataset cardinality n, as in [167, 28, 108].
Baselines. We evaluate our proposed SNH approach in comparison to state-of-the-art DP solutions: PrivTree [167], Uniform Grid (UG) [108], Adaptive Grid (AG) [108] and the Data and Workload Aware Algorithm (DAWA) [67]. Brief summaries of each method are provided in Sec. 5.7. DAWA requires the input data to be represented over a discrete 1D domain, which can be obtained by applying a Hilbert transformation. To this end, we discretize the domain of each dataset into a uniform grid with 2²⁰ cells, following the work of [67, 167]. DAWA also uses the workload query set, QW, as specified in Sec. 5.6.1.1. For PrivTree, we set its fanout to 4, following [167]. We also considered Hierarchical methods in 2D (HB2D) [109, 47] and QuadTree [28], but the results were far worse than those of the above approaches and thus are not reported (we report the results of all the baselines in Sec. 5.9.2 of our technical report [159]). As an additional baseline, we modify STHoles [18], a non-private workload-aware algorithm, to satisfy DP. STHoles builds nested buckets in regions where the workload requires finer granularity. We incorporate differential privacy by (1) adding the required sanitization noise to the frequency counts in STHoles' buckets and (2) implementing the algorithm so that it avoids asking overlapping queries of the database, to minimize the magnitude of noise added. Details of our DP-compliant adaptation of STHoles are available in Appendix 5.9.3 of our technical report [159], and our implementation is publicly available at [157]. Similar to DAWA and SNH, STHoles uses the workload query set, QW, as specified in Sec. 5.6.1.1.
Implementation. All algorithms were implemented in Python, and executed on a Linux machine with an Intel i9-9980XE CPU, 128GB RAM and an RTX 2080 Ti GPU. Neural networks are implemented in JAX [17]. Given this setup, SNH took up to 20 minutes to train in our experiments, depending on the value of ρ. The average query time of SNH is 329µs and a model takes 4 MB of space. We publicly release the source code at [158].
Default Values. Unless otherwise stated, we present the results on the medium population density city, Milwaukee (VS), with data cardinality n = 100k. The privacy budget ε is set to 0.2.

Figure 5.6: Impact of privacy budget: VS, SPD-VS and CABS datasets ((a) Kansas City (VS), (b) Milwaukee (VS), (c) Miami (VS), (d) Milwaukee (SPD VS), (e) SF (CABS); relative error vs. ε for SNH, AG, UG, PrivTree, DAWA and STHoles)
Figure 5.7: Impact of privacy budget: GW dataset ((a) Kansas City (GW), (b) Milwaukee (GW), (c) Miami (GW))
Figure 5.8: Impact of data and query size ((a) impact of n, (b) impact of query size)
Figure 5.9: Study of modeling choice ((a) n=25,000, (b) n=100,000, (c) n=400,000; SNH, PGM@ParamSelect, IDENTITY@ParamSelect)
Figure 5.10: Impact of uniformity assumption (SNH,k=1; SNH,k=8; SNH+QS,k=1; SNH+QS,k=8)
Figure 5.11: Impact of ρ and ParamSelect (relative error vs. ρ (m) for ε=0.05, 0.2, 0.8)
Figure 5.12: SNH learns patterns on GMM dataset of 16 components. Color shows number of data points.
Figure 5.13: Impact of data skewness (ε = 0.2) ((a) noisy observations, (b) observations with uniformity; SNH (s1), SNH (s2), SNH (s3), No Learning)

5.6.2 Comparison with Baselines

Impact of privacy budget. Figs. 5.6 and 5.7 present the error of SNH and competitors when varying ε for the test datasets VS, SPD-VS, CABS and GW. Recall that a smaller ε means stronger privacy protection. For our proprietary datasets, VS and SPD-VS, we observe that SNH outperforms the state-of-the-art by up to 50% at all privacy levels (Fig. 5.6 (a)-(d)). This shows that SNH is effective in utilizing machine learning and publicly available data to improve the accuracy of privately releasing proprietary datasets. Fig. 5.6 (e) and Fig. 5.7 show that SNH also outperforms for the CABS and GW datasets in almost all settings, the advantage of SNH being more pronounced for smaller ε values.
Stricter privacy regimes are particularly important for location data, since such datasets are often released at multiple time instances, with a smaller privacy budget per release.
Impact of data cardinality. Fig. 5.8 (a) shows the impact of data cardinality on relative error for Milwaukee (VS). For all algorithms, the accuracy improves as data cardinality increases. This is a direct consequence of the signal-to-noise ratio improving as cell counts are less impacted by DP noise. SNH consistently outperforms competitor approaches across a wide range of data cardinality settings.
Impact of query size. We evaluate the impact of query size on accuracy by considering test queries of four different sizes in Milwaukee (VS). Fig. 5.8 (b) shows that the error for all the algorithms increases when the query size grows, with SNH outperforming the baselines at all sizes. There are two competing effects when increasing query size: on the one hand, each query is less affected by noise, since actual counts are larger; on the other hand, the error from more grid cells is aggregated in a single answer. The second effect is stronger, so the overall error steadily increases with query size.

5.6.3 Ablation Study for SNH

5.6.3.1 Modeling choices

Recall that SNH first creates a uniform grid, with granularity decided by ParamSelect. It then performs data augmentation and learning using the data collected on top of the grid. Next, we study the importance of each component of SNH to its overall performance. We create two new baselines to show how our choice of using neural networks to learn the patterns in the data improves performance. The first, called IDENTITY@ParamSelect, is an ablation of SNH that utilizes only the uniform grid created by SNH at data collection. The second baseline, called PGM@ParamSelect, employs Private Probabilistic Graph Models (PGM) [80], a learning algorithm specifically designed for high-dimensional categorical data. We extend PGM to 2D spatial datasets by feeding it a DP uniform grid at the granularity selected by ParamSelect.
Fig. 5.9 (a) shows SNH outperforming both of these baselines. SNH outperforming IDENTITY shows the benefit of learning, since both SNH and IDENTITY use the same grid for data collection, but SNH learns neural networks using data generated from the grid, while IDENTITY directly uses the grid to answer queries. This benefit diminishes when the privacy budget and the data cardinality increase (note that both n and ε are in log scale), where a simple uniform grid chosen at the correct granularity outperforms all existing methods (comparing Fig. 5.9 (b) with Fig. 5.6 (b) shows that IDENTITY@ParamSelect outperforms the state-of-the-art for ε = 0.4 and 0.8). For such ranges of privacy budget and data cardinality, ParamSelect recommends a very fine grid granularity. Thus, the uniformity error incurred by IDENTITY@ParamSelect becomes lower than that introduced by the modelling choices of SNH and PGM. This also shows the importance of a good granularity selection algorithm, as UG in Fig. 5.6 performs worse than IDENTITY@ParamSelect for larger ε.

5.6.3.2 Balancing Uniformity Errors

We discuss how the use of the uniformity assumption at different stages of SNH impacts accuracy. Recall from Sec. 5.4.3 that the value of k balances the use of the uniformity assumption pre- and post-learning. We empirically study how the uniformity assumption pre- and post-learning influences SNH's accuracy by varying k.
Furthermore, we study how removing the uniformity assumption post-learning and replacing it with a neural network affects accuracy. Specifically, we consider a variant of SNH where we train the neural networks to also take as input the query size. Each neural network is still responsible for a particular range of query sizes, [r_l, r_u], where we use data augmentation to create query samples with different query sizes falling in [r_l, r_u]. Instead of scaling the output of the trained neural networks, now each neural network also takes the query size as an input, and thus the answer to a query is just the forward pass of the neural network. We call this variant SNH with query size, or SNH+QS.
Fig. 5.10 shows that, first, removing the uniformity assumption post-learning has almost no impact on accuracy when k is large. However, for a small value of k, it provides more stable accuracy. Note that when k = 1, SNH trains only one neural network, for query size r*, and answers queries of size r by scaling the output of the neural network by (r/r*)². The error is expected to be lower when ρ and r* have similar values, since there will be less uniformity error when performing data augmentation. This aspect is captured in Fig. 5.10, where at ε = 0.2, r* and ρ have almost the same values and thus the error is the lowest. Sec. 5.9.2 of our technical report [159] evaluates the impact of k more comprehensively.

5.6.3.3 ParamSelect and ρ

Fig. 5.11 shows the performance of SNH with varying cell width ρ at multiple values of ε. A coarser grid first improves accuracy by improving the signal-to-noise ratio at each cell, but a grid that is too coarse hampers accuracy by reducing the number of samples extracted for training SNH. This creates a U-shaped trend, which shifts to smaller values of ρ for larger values of ε, as the lower DP noise impacts the cell counts less aggressively. The red line in Fig. 5.11 labelled SNH shows the result of SNH at the granularity chosen by ParamSelect. SNH performing close to the best possible shows that ParamSelect finds an advantageous cell width for SNH.

5.6.3.4 SNH Learning Ability in Non-Uniform Datasets

We study the ability of neural networks to learn patterns from skewed datasets through imprecise observations, where the imprecision is due to noise or the uniformity assumption.
Setup. We synthesize 100k points from a Gaussian Mixture Model (GMM) [114] with 16 components. The means of the components are placed uniformly over the data space. All components are equally weighted and have the covariance matrix I × σ², where I is the identity matrix. GMMs allow controlling data skewness via the parameter σ. We partition the data space into a grid of 200×200 cells and report σ in terms of the number of cells. The query set, Q, consists of queries asking for the number of points inside each cell. Fig. 5.12(a) plots the true answers to this query set when σ = 7.
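The synthetic data for this setup can be generated in a few lines. The NumPy sketch below is illustrative only; in particular, it reads "placed uniformly over the data space" as a regular 4×4 lattice of component means, which is an assumption on our part.

import numpy as np

def sample_gmm(n=100_000, n_components=16, sigma=7.0, grid=200,
               rng=np.random.default_rng(0)):
    # Equally weighted components; means on a regular lattice over the grid.
    side = int(np.sqrt(n_components))
    step = grid / side
    means = np.array([(step * (i + 0.5), step * (j + 0.5))
                      for i in range(side) for j in range(side)])
    comp = rng.integers(0, n_components, size=n)
    # Covariance I * sigma^2 means independent Gaussian noise per axis.
    return means[comp] + rng.normal(0.0, sigma, size=(n, 2))

points = sample_gmm()
true_answers, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                    bins=200, range=[[0, 200], [0, 200]])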
Learning from Noisy Observations. We consider two scenarios. First, we obtain the DP answers, Ã, to the queries in Q by adding noise to the true answers. We call this algorithm No Learning. For ε = 0.05, Fig. 5.12 (b1) shows the noisy answers reported by No Learning. Comparing Figs. 5.12 (a) and 5.12 (b1), we observe that the sanitization noise severely distorts the existing patterns in the data. Second, we train a neural network using only the noisy answers shown in Fig. 5.12 (b1); that is, the inputs to the neural network are the queries in Q and the training labels are the answers in Ã. After training, we ask the same queries, Q. The result in Fig. 5.12 (c1) shows the output of the neural network. SNH has a strong ability to recover the underlying patterns of GMMs from even highly distorted observations. Additional visualizations for several values of ε and σ can be found in Sec. 5.9.2 of our technical report [159].
Next, we compare the error in the neural network predictions to that in the noisy answers it was trained with. The latter is represented by the line labelled 'No Learning' in Fig. 5.13 (a) and is the error in Ã. The lines labeled SNH show the error of SNH at varying model sizes (s1, s2 and s3 correspond to models with depth 5, 10 and 20 and width 15, 25 and 80, respectively) on the same query set. When σ is large, the data is closer to being uniformly distributed and there are fewer patterns to learn, whereas when σ is small, the data becomes more skewed towards the mean of each GMM component. The results in Fig. 5.13 (a) show that when data is skewed, SNH is especially capable of extracting the patterns present in the data, utilizing them to boost accuracy. However, when the data is uniform-like, SNH performs similarly to 'No Learning', as there are few patterns to be learned. Lastly, by varying the model size (lines s1, s2 and s3), we show that it is beneficial to use a larger neural network for more skewed datasets. A larger network exhibits stronger representation power and hence captures the skewness better.
Learning from Observations with Uniformity Error. We generate the training data by purposefully inducing uniformity error when answering the queries in our training set, Q. We first superimpose a coarse partitioning of 20×20 blocks over the original 200×200 cell grid, with each block covering exactly 100 cells. To answer the queries in Q, we first obtain the true answer for each block, and then divide that value by 100 to obtain the answer for each cell within the block (assuming uniformity within the block). The result is shown in Fig. 5.12 (b2). Note that the queries that fall within the same block (in the 20×20 grid) all receive the same answer, due to the uniformity assumption. Next, we train a neural network with the queries in Q (corresponding to the cells in the 200×200 grid). The result in Fig. 5.12 (c2) shows that the neural network smoothens the observations and brings them closer to the true answers. In Fig. 5.13 (b) we evaluate the effect of increasing skewness (i.e., decreasing σ): "No Learning" yields larger errors, whereas SNH, through learning, keeps the error steady across different skewness levels.

5.7 Related Work

Privacy-preserving machine learning. A learned model can leak information about the data it was trained on [125, 52]. Recent efforts have developed differentially private versions of ML algorithms, e.g., empirical risk minimization [21, 59] and deep neural networks [120, 1]. For DP sanitization, existing approaches add noise to the output of the trained model [144], add a random regularization term to the objective function [21, 59], or add noise to the gradient of the loss function during training [1]. Our approach is different in that we sanitize the training data before learning. Furthermore, the work of [1] achieves (ε, δ)-DP [90, 36, 3], a weaker privacy guarantee.
Answering RCQs. In the one-dimensional case, the data-independent Hierarchical method [47] uses a strategy consisting of hierarchically structured range queries, typically arranged as a tree.
Similar methods (e.g., HB [109]) differ in their approach to determining the tree's branching factor and allocating appropriate budget to each of its levels. Data-dependent techniques, on the other hand, exploit the redundancy in real-world datasets to boost the accuracy of histograms. The main idea is to first lossily compress the data. For example, EFPA [4] applies the Discrete Fourier Transform, whereas DAWA [67] uses dynamic programming to compute the least-cost partitioning. The compressed data is then sanitized, for example, directly with Laplace noise [4] or with a greedy algorithm that tunes the privacy budget to an expected workload [67]. While some approaches, such as DAWA and HB, extend to 2D naturally, others specialize in answering spatial range queries. Uniform Grid (UG) [108] partitions the domain into an m × m grid and releases a noisy count for each cell. The value of m is chosen in a data-dependent way, based on dataset cardinality. Adaptive Grid (AG) [108] builds a two-level hierarchy: the top-level partitioning utilizes a granularity coarser than UG. For each bucket of the top-level partition, a second partition is chosen in a data-adaptive way, using a finer granularity for regions with a larger count. QuadTree [28] first generates a quadtree, and then employs the Laplace mechanism to inject noise into the point count of each node. Range count queries are answered via a top-down traversal of the tree. PrivTree [167] is another hierarchical method that allows variable node depth in the indexing tree (as opposed to the fixed tree heights in AG, QuadTree and HB). It utilizes the Sparse Vector Technique [73] to determine a cell's density prior to splitting the node.
The case of high-dimensional data was addressed by [79, 145, 166]. The most accurate algorithm in this class is the High-Dimensional Matrix Mechanism (HDMM) [79], which represents queries and data as vectors, and uses optimization and inference techniques to answer RCQs. PrivBayes [166] is a mechanism that privately learns a Bayesian network over the data and generates a synthetic dataset which can consistently answer workload queries. Due to its use of sampling to estimate the data distribution, it is a poor fit for skewed spatial datasets. Most similar to our work is PGM [80], which utilizes Probabilistic Graphical Models to measure a compact representation of the data distribution, while minimizing a loss function. Data projections over user-specified subgroups of attributes are sanitized and used to learn the model parameters. PGM is best used in the inference stage of privacy mechanisms (such as HDMM and PrivBayes) that can already capture a good model of the data.
Private parameter tuning. Determining the system parameters of a private data representation must also be DP-compliant. Several approaches utilize the data themselves to tune system parameters such as the depth of a hierarchical structure (e.g., in QuadTree or HB) or the spatial partition size (e.g., in k-d trees), without privacy considerations [47]. Using public datasets to tune system parameters is a better strategy [21]. Our strategy to determine a good cell width for a differentially private grid is similar to that of UG [108]. However, our proposed strategy for parameter selection vastly improves the generalization ability over UG [108] by exploiting additional dataset features and their non-linear relationships.

5.8 Conclusion

We proposed SNH: a novel method for answering range count queries on location datasets while preserving differential privacy.
To address the shortcomings of existing methods (i.e., over-reliance on the uniformity assumption and noisy local information when answering queries), SNH utilizes the power of neural networks to learn patterns from location datasets. We proposed a two-stage learning process: first, noisy training data is collected from the database while preserving differential privacy; second, models are trained using this sanitized dataset, after a data augmentation step. In addition, we devised effective machine learning strategies for tuning system parameters using only public data. Our results show SNH outperforms the state-of-the-art on a broad set of input data with diverse characteristics. In future work, we plan to extend SNH to releasing high-dimensional user trajectory datasets.

5.9 Appendix

5.9.1 DP Proof and Security Discussion

Proof of Theorem 5. SNH, represented as the mechanism M, is the composition of the mechanisms M_Φ and M_f. Furthermore, M_f can be written as the composition of the data collection mechanism, denoted M_D, which outputs the data collection grid, and a function h that performs the arbitrary transformations on this grid during data augmentation and training. That is, M(D | ε, SR, 𝒟) = h(M_D(D | ε, SR, M_Φ(𝒟, SR)), 𝒟), where 𝒟 denotes the public auxiliary datasets. M_Φ(𝒟, SR) is the ParamSelect mechanism that predicts the system parameters utilizing only the public information 𝒟 and SR. Thus, it does not access private records in D and, consequently, does not consume privacy budget. Note that the ParamSelect mechanism does use the size of the private dataset for prediction, which we assume is publicly available; if it is not, an estimate can be obtained by spending a negligible privacy budget. Next, M_D is called, which creates a grid of cell width ρ, where ρ is the output of M_Φ, on the spatial extent SR. For each cell in the grid, it then accesses the database to obtain the number of records in the cell and adds Lap(1/ε) noise to the true count. Thus, a noisy count for each cell is obtained with ε-DP. Furthermore, since the cells do not overlap, the parallel composition theorem of DP applies, and the computation of noisy counts for all the cells is still ε-DP. Finally, the transformation h is applied to the output of M_D, which, due to the post-processing property of DP, does not consume any privacy budget. Thus, the mechanism M, which is the composition of M_Φ, M_D and h, is ε-differentially private. □

Security Discussion. DP has different requirements and guarantees compared to alternative security models such as encryption. With encryption, one protects the data values of an individual (i.e., the locations visited by a person), whereas the presence of an individual in the data is known (either a real identity or a pseudo-identity). In the context of cryptography, leaking the distribution of visited locations is not permitted. In contrast, DP allows statistical information (including the density distribution) to be released, as long as an adversary cannot pinpoint the presence of a targeted individual in the data. The purpose of SNH is to publish DP-compliant density statistics while protecting against individual presence inference. In this context, density information is actually needed by the application (e.g., identifying hotspots), and leakage of DP-sanitized density information is desired and permitted. Moreover, due to the robustness of DP to side-channel information, this privacy guarantee is independent of the public information available in 𝒟.
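As an illustration of the data-collection mechanism M_D analyzed in the proof above, the following is a minimal sketch (assuming numpy and a toy 2D point set; not the SNH implementation). Because the grid cells are disjoint, one Lap(1/ε) noise draw per cell releases the whole grid with ε-DP by parallel composition:

```python
import numpy as np

def collect_noisy_grid(points, spatial_extent, cell_width, eps, rng):
    """Sketch of M_D: one Laplace-noised count per grid cell. Each point falls
    in exactly one cell (sensitivity 1 per cell), so by parallel composition
    releasing all noisy counts together satisfies eps-DP."""
    (x0, y0), (x1, y1) = spatial_extent
    nx = int(np.ceil((x1 - x0) / cell_width))
    ny = int(np.ceil((y1 - y0) / cell_width))
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=[nx, ny], range=[[x0, x1], [y0, y1]],
    )
    return counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

rng = np.random.default_rng(0)
pts = rng.normal(loc=[0.5, 0.5], scale=0.1, size=(10_000, 2))  # toy dataset
noisy = collect_noisy_grid(pts, ((0.0, 0.0), (1.0, 1.0)), 0.05, eps=0.2, rng=rng)
```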
" Figure 5.14: Milwaukee (VS) ϵ = 0.2, n = 100k ε Figure 5.15: Replacing uniformity error with noise 5.9.2 Complementary Experimental Results Comparison against all baselines We compare our method against all existing baselines in Figure 5.14 (note the log scale). To the best of our knowledge, the figure contains all differentially private algorithms applicable to 2D location datasets. Existing methods are predominantly domain partitioning methods that utilize traditional data structures. For instance, DPCube[32] exploits a kd-tree structure, QuadTree[28] uses a full quadtree, HB[109] invokes a hierarchical tree with variable fanout, Privtree[167] also uses a hierarchical tree but without height constraints, UG [108] is a single level grid and AG[108] is a two level grid. A detailed description of each method is available in Section 5.7. Missing from the excerpt is DPCube [145], which in particular is a method best sutiable to high-dimensional data. DPCube searches for dense ‘subcubes’ of the datacube representation to release privately. A part of the privacy budget is used to obtain noisy counts using Laplace mechanism over a straightforward partitioning, which is then improved to a standard kd-tree. Fresh noisy counts for the partitions are obtaining with the remaining budget and a final inference step resolves inconsistencies between the two sets of counts, and improves accuracy. Methods that perform worse than Unifrom Grid (UG), have been omitted in our experiments in Section 5.6 due to their poor performance. ε Figure 5.16: Study of ParamSelect k Figure 5.17: Impact of k ε ε ε Figure 5.18: Impact of model depth Data Augmentation: Uniformity error or Large Scale Noise In this section, we present our empirical results motivating our design principle “P2: Spatial Data Augmentation through Partitioning”. Recall that, as discussed in Sec. 5.3.2, neural networks perform best when trained with queries of different sizes (as shown in experiments in Sec. 5.6.3.2, Figs 5.17 and 5.10). However, queries of different sizes may overlap. Hence, due to DP constraints, answering such queries can either be done by introducing more noise or more uniformity error due to sequential composition property of DP (see Sec. 5.2.1). Here, we present our results that show a considerable advantage in adding noise once and collecting more training data through data augmentation (and thereby using uniformity assumption) compared with adding more noise but avoiding uniformity assumption. To substantiate this claim we design an experiment (in Fig. 5.15), where for any location we generate training queries with 8 different sizes, creating 8 overlapping queries per location. Lines labelled “SNH No Unif.” and “SNH” both use the same query set for training, however the answers (i.e., labels) to the training queries are generated differently. “SNH No Unif.” answers all queries directly from the database records (and thus more noise is added per query due to sequential decomposition of DP, but avoids completely uniformity assumption). On the other hand, SNH as presented in the chapter and (discussed in Sec. 5.3.3) first uses a grid for data collection and then answers queries based on the grid (so it incurs uniformity error, but adds less noise per query than “SNH No Unif.”). The result shows that it is better to use uniformity assumption than to increase noise, justifying its use in data augmentation. However note that the uniformity error is introduced in the training set before learning, and mitigated through learning. 
Benefit of ParamSelect. ParamSelect selects the best grid granularity ρ for SNH. An existing alternative for setting the grid granularity is the guideline of UG [108], which, by making assumptions about the query and data distribution, analytically formulates the error of using a uniform grid. It then proposes creating an m × m grid, setting m = √(nε/c) for a constant c empirically set to c = 10. We call SNH with grid granularity chosen this way SNH@UG. We compare this method with SNH (referred to as SNH@ParamSelect to emphasize the use of ParamSelect to set ρ). We compare the error in the ρ predicted by ParamSelect to that of the UG guideline. To do so, we first empirically find ρ∗, the cell width at which SNH achieves the highest accuracy. Then we calculate the mean absolute error (MAE), |ρ − ρ∗|, of the cell width ρ suggested by either UG or ParamSelect. Averaged across several privacy budgets, ParamSelect achieves an MAE of 3.3 m, while UG results in an MAE of 281.3 m. That is, UG recommends a cell width far from the optimal one. Fig. 5.16 shows how cell width impacts the accuracy of SNH. We observe a significant difference between SNH@UG and SNH@ParamSelect, establishing the benefits of ParamSelect. Overall, the results of this ablation study, and of the ablation study in Sec. 5.6.3.2, show that both good modelling choices and system parameter selection are imperative in order to achieve high accuracy.

System parameters analysis. Impact of k. Fig. 5.17 shows the impact of k on the accuracy of the models. The result shows that for large values of ε, increasing k can substantially improve performance. Fig. 5.17 also shows the need for access to queries of multiple sizes during training, as this is required when k > 1.

Impact of Model Depth. We study how the neural network architecture impacts SNH’s performance in Fig. 5.18. Specifically, we vary the depth (i.e., the number of layers) of the network. Increasing model depth slightly improves the accuracy of SNH, owing to the better expressive power of deeper networks. However, networks that are too deep quickly lose accuracy, as the gradients diminish dramatically when propagated backward through a very deep network during training. Furthermore, larger ε values benefit more from the increase in depth, as more complex patterns can be captured in the data when it is less noisy.

Further GMM Visualizations. We extend the discussion of Sec. 5.6.3.4 and visualize, in various settings, the ability of neural networks to reduce errors by learning from imprecise observations. We study this behavior for ε = 0.05 (i.e., the high-privacy regime) in Figures 5.19, 5.21 and 5.23, and for ε = 0.2 (i.e., the low-privacy regime) in Figures 5.20, 5.22 and 5.24, for different values of the standard deviation, σ, of the GMM components. SNH is especially capable in the low-privacy regime and when the data are heavily skewed or non-uniform, justifying its use on location datasets that exhibit similarly skewed distributions. To conclude, given a set of imprecise observations, by fitting a neural network to all such observations simultaneously, we obtain a neural network with lower error than in the observations themselves.
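The concluding observation, that fitting a single network to many imprecise observations yields lower error than the observations themselves, can be reproduced with a minimal, self-contained sketch (scikit-learn is used here purely for illustration; this is not the SNH code):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy 1-D analogue: a smooth, skewed signal observed through Laplace noise.
x = np.linspace(0, 1, 400).reshape(-1, 1)
true_y = np.exp(-((x[:, 0] - 0.5) ** 2) / 0.02)               # smooth signal
noisy_y = true_y + rng.laplace(scale=0.3, size=true_y.shape)  # imprecise observations

# Fit one network to all noisy observations simultaneously.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
model.fit(x, noisy_y)
pred = model.predict(x)

print(np.abs(noisy_y - true_y).mean())  # error of the observations themselves
print(np.abs(pred - true_y).mean())     # typically smaller after fitting
```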
Figure 5.19: ε = 0.05, σ = 14
Figure 5.20: ε = 0.2, σ = 14
Figure 5.21: ε = 0.05, σ = 7
Figure 5.22: ε = 0.2, σ = 7
Figure 5.23: ε = 0.05, σ = 3.5
Figure 5.24: ε = 0.2, σ = 3.5

5.9.3 Differentially Private STHoles Implementation

We describe the general structure of STHoles histograms and the specific modifications that we make to achieve DP-compliance and good utility for answering RCQs. STHoles [18] is a histogram construction technique that exploits the query workload. It generates a domain partitioning in the form of nested buckets assembled as a tree structure. In contrast to traditional domain partitioning methods, STHoles allows buckets to overlap by permitting inclusion relationships between ancestor nodes of the tree structure, i.e., some buckets can be completely included inside others. We omit the details of the histogram’s construction and instead refer the reader to [18]. Our implementation is publicly available at [157].

Our DP-compliant STHoles implementation makes two adjustments to the original STHoles algorithm to allow for better accuracy when accounting for privacy. First, we allow the algorithm to use unlimited memory, so that it does not need to merge any of the buckets to reduce memory usage. This not only avoids incurring the merge penalty (discussed in the paper [18]) but also lowers the privacy budget consumption, since we avoid calculating merge penalties that would require budget-consuming accesses to D. Second, we separate the process of computing the frequency counts for each bucket from the process of building the nested bucket structure. That is, we first build the bucket structure based on the query workload and then calculate the frequency counts within each bucket. This separation significantly reduces the privacy budget consumption, since it allows us to avoid asking overlapping queries of the database; thus, the final privacy budget accounting can be done with the parallel composition theorem alone. Next, we present how we build the buckets and calculate the frequency counts in more detail.

First, we generate the nested bucket structure using the query workload QW (Algorithm 9). In a modification of the original algorithm, in this step we do not calculate database-related statistics, such as the number of records in each bucket b ∈ H_SR, as that would necessitate spending scarce privacy budget. For the same reason, we also skip the step that merges buckets together based on a penalty calculated from database records. From the privacy analysis perspective, the query workload is public and using the information therein incurs no privacy leakage. Hence, Algorithm 9 does not use any privacy budget. In the second step (Algorithm 10), we generate sanitized frequency counts for STHoles’ buckets in the data structure. For each bucket, we query the database for the number of records that fall within its extent, sanitizing these counts using the Laplace mechanism (see Section 5.2.1 for details of the mechanism).

Algorithm 9 STHoles Domain Partitioning
Input: Query workload QW for the spatial region SR
Output: Domain partitioning θ
1: procedure BuildPartitioning(QW, SR)
2:   H_SR ← Initialize histogram with a fixed-size root bucket of spatial extent SR
3:   for all q ∈ QW do
4:     Identify b ∈ H_SR that have q ∩ b ≠ ∅
5:     Shrink candidate holes according to Sec. 4.2.1 of [18]
6:     Add new holes as buckets to histogram H_SR
7:   return H_SR

Algorithm 10 DP-compliant Sanitization of STHoles
Input: Private dataset D, buckets b ∈ H_SR, privacy budget ε
Output: DP-compliant STHoles model θ_STHoles
1: procedure SanitizeHistogram(H_SR, D, ε)
2:   for all b ∈ H_SR do
3:     Set the frequency of b to f̄(b) (i.e., true count + Lap(1/ε))
4:   return θ_STHoles

5.9.4 ParamSelect Feature Engineering and Feature Selection

We present experimental results supporting that we have carefully selected features for ParamSelect that accurately capture the privacy-utility trade-off across spatial datasets and allow for reliable system parameter estimation. Since the training data comprises public datasets 𝒟, the feature extraction process is a typical ML problem. Our feature extraction process follows two steps: (I) feature engineering, where we transform raw data into a number of features that better represent the dataset for learning our regression model, and (II) feature selection, where we select a subset of the engineered features that provides reliable accuracy across datasets.

Feature Engineering. We engineered various features according to relations (such as epsilon-scale exchangeability) well studied in the literature [46] and proposed novel features to capture the data distribution in location datasets. While the data-independent features were straightforward, the data region specific features posed a challenge, since they need to summarize location datasets while capturing the differences in the pattern of the originating location signal (e.g., cell phone location signals vs. user check-ins in geo-social networks) and the differences in skewness between cities (e.g., the dense sprawl of New York vs. the sparse expanses of Kansas City). We generated the following features: (1) population density (POP_D), calculated as the number of residents per square mile (as reported by the US Census); (2) entropy profile (h_D), which computes, over a flattened grid representation of the region, the Shannon entropy of the probabilities of the counts in each cell; (3) average nearest neighbor distance (ANN_D), which averages the distance to the nearest neighbor over all users in the city; and (4) signal-to-noise ratio (SNR_D), which evaluates how many cells in an overlaid grid have enough signal (in terms of the number of user counts in a cell) not to be obliterated by DP noise (average noise is 2/ε when sampled from the distribution Lap(1/ε)).

Feature Selection. The proposed features are filtered through a feature selection process that evaluates the accuracy achieved by each candidate feature subset across different datasets. This step finds a subset of the engineered features that helps the model generalize across datasets. The selection process is conducted on a validation set (J-K cross-validation folds [82] in our case, with J = 3 and K = 5). We utilize an iterative feature selection technique that incrementally adds features one at a time and evaluates the subset’s validation performance, ignoring features that do not contribute. In Table 5.3 we report the validation performance (relative error) for the evaluated feature subsets.

Feature Set ϕ | Relative Error of regressor Φ on cross-validation set
ϕ(n) | 0.312
ϕ(n, ε) | 0.237
ϕ(n, ε, 1/nε, √(1/nε)) | 0.193
ϕ(n, ε, 1/nε, √(1/nε), POP_D) | 0.207
ϕ(n, ε, 1/nε, √(1/nε), ANN_D) | 0.225
ϕ(n, ε, 1/nε, √(1/nε), SNR_D) | 0.187
ϕ(n, ε, 1/nε, √(1/nε), h_D) | 0.151
Table 5.3: Validation set error of ParamSelect in predicting ρ
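As a rough illustration of how two of the data region specific features could be computed from a grid of cell counts (a sketch with hypothetical function names and toy data, not the ParamSelect implementation):

```python
import numpy as np

def entropy_profile(cell_counts):
    """Sketch of h_D: Shannon entropy of the cell-count probabilities over a
    flattened grid representation of the region."""
    counts = cell_counts.ravel().astype(float)
    p = counts / counts.sum()
    p = p[p > 0]                         # 0 * log(0) is taken as 0
    return -(p * np.log2(p)).sum()

def signal_to_noise(cell_counts, eps, noise_magnitude=2.0):
    """Sketch of SNR_D: fraction of grid cells whose count exceeds the
    reference DP noise magnitude (the text uses 2/eps for Lap(1/eps))."""
    threshold = noise_magnitude / eps
    return (cell_counts > threshold).mean()

rng = np.random.default_rng(0)
grid = rng.poisson(lam=3.0, size=(200, 200))   # hypothetical cell counts
print(entropy_profile(grid), signal_to_noise(grid, eps=0.2))
```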
The proposed data region specific feature, the entropy h_D, is the most valuable for ParamSelect (relative error of 0.151). In brief, the features used in ParamSelect are highly effective. In Section 5.9.2, we show that ParamSelect, with its use of our feature extraction function, vastly improves generalization ability over the existing method for system parameter selection by exploiting the additional dataset features and, with the use of ML, their non-linear relationships.

We conclude with a discussion of potential future work pertaining to the datasets used in the ParamSelect module. Recall that the data region specific features (such as h_D) are obtained from a proxy dataset. This comprises public-domain auxiliary information that is, at a very high level, similar to our private dataset. In our empirical evaluation we use data sources that were collected a decade apart. While not included in the evaluation, we found that static datasets, such as the positions of points of interest in a city, also perform well. Other DP-compliant public releases of location datasets, such as those from the “Facebook Data For Good” initiative, are also viable. Nevertheless, for regions where an auxiliary source of public information is unavailable, the data-independent features can be utilized to good effect (relative error of 0.193 for the feature set ϕ(n, ε, 1/nε, √(1/nε))).

Chapter 6
A Neural Database for Queries on Incomplete Relational Data

6.1 Introduction

Real-world databases are often incomplete [94, 70, 37, 137, 49, 130, 93]. One reason is data collection cost. To know housing prices in an area, collecting information for every house is costly, if not impossible (the US Census spends $1.505 billion yearly on door-to-door data collection [110]), but Airbnb already provides a sample for free [54] (the dataset is a sample because it only contains Airbnb prices and not prices from other housing sources). Another reason is privacy. Studies show lower response rates to survey questions regarding sensitive attributes, e.g., income [119, 115]. A landlord may provide their demographic information in a survey, but is less likely to list their properties and prices. Another reason is data integration across databases with schema mismatch [49, 58, 20, 19]. Two different agencies may track housing prices in two different regions. One region may track both housing and landlord information while the other only stores housing information. After integrating the databases, landlord information will be incomplete.

In all such scenarios, some records are entirely missing from the datasets. Given a dataset, one often knows whether data is incomplete by comparing aggregate statistics [94, 93, 70, 37] (e.g., Census population counts), by inspecting the mismatch of records within the database [137] (e.g., when an individual’s record does not appear in certain tables but exists in others), or through knowledge of schema mismatches [49, 20, 19] known during data integration. Meanwhile, OLAP applications require answering aggregate queries on such incomplete datasets, yielding inaccurate answers. Consider the example of the average housing price in an area. Richer landlords may be less willing to share the cost of their houses, leading to an overall underestimation of the housing prices in the region. In this chapter, we assume the underlying incomplete data is stored in a relational database, where some records are entirely missing from some tables.
We focus on answering aggregate queries, that is, SQL queries that ask for an aggregation of some attribute, optionally with WHERE, JOIN and GROUP BY clauses on other attributes. Relational datasets cover many (if not all) of the discussed applications. If data is missing due to data integration across databases, the data is likely already from an OLTP system and in a relational format. If one wants to use public data sources (e.g., information about a city) in a query, such information can also be added as a table to a relational database. A relational setting allows for a systematic study of answering aggregate queries on incomplete datasets. In such a setting, a table is systematically missing some of its records.

Recent work studies answering queries on incomplete datasets [94, 164, 49]. The only existing approach for relational datasets, ReStore [49], generates new data to complete the existing database based on existing foreign key relationships. ReStore’s data generation step can be seen as an extension of data imputation methods (which impute missing attributes [154, 143, 112, 25]) to impute entire missing records. For instance, given a complete table of landlords but an incomplete table of apartments, ReStore [49] generates synthetic apartments for landlords whose apartments are missing. However, synthetic data generation is challenging. (1) The model needs to learn fine-grained, record-level information from an often small and biased training set. (2) Real-world datasets often contain missing attributes, based on which generating synthetic data can be inaccurate. For example, a landlord’s gender might be missing for some landlords, making it more difficult to accurately create synthetic apartments for them. (3) Generating data that respects foreign key relationships is challenging, since multiple foreign key relationships per table are possible. In [49], since only one foreign key relationship (or path [49]) is used to generate data, information available in other tables that could potentially improve accuracy is effectively ignored.

We propose a paradigm shift from generating synthetic data to learning a model that directly estimates the query answers. Such a model takes queries as input and directly outputs the query answer, bypassing the data generation step. This approach avoids the above shortcomings: (1) since the goal is answering aggregate queries, such a model learns the aggregate information of interest without being weighed down by record-level details, (2) which also makes it less sensitive to missing attributes; and (3) since it does not generate new data, foreign key relationships and the relational structure impose no constraints. Nonetheless, accurately learning the query answers is non-trivial. An approach that learns to mimic observed query answers will fail, since the model learns the wrong answers from the biased observed query answers.

We introduce NeuroComplete, an approach that utilizes query embeddings and neural networks to accurately estimate query answers. NeuroComplete learns to answer queries in three steps. First, it generates a set of training queries for which accurate answers can be computed given the incomplete dataset. Intuitively, any query that is “restricted” such that its answer only depends on the data in the incomplete database can be answered accurately. Next, NeuroComplete extracts a set of features for each of these queries.
Each feature corresponds to the contextual information available about the query answers in the database, and is computed based on how related a database record is to the query. Finally, NeuroComplete trains a neural network in a supervised learning fashion to learn a mapping from the embedding space (i.e., query features) to query answers. The learned model then generates accurate answers to new queries at test time, exploiting the generalizability of the learned model in the embedding space.

Our experimental results on real-world datasets show that NeuroComplete provides up to 4x and 10x reductions in error for AVG and COUNT queries, respectively, compared with the state-of-the-art, ReStore [49]. The amount of data required for accurate answers depends on how biased the observed data is. Our results show that NeuroComplete provides accurate answers when 5% (or more) of the data is available and the data is less biased, while 40% of the data needs to be observed in more biased settings. Specifically, our contributions are as follows.

• We present NeuroComplete, a query modeling approach that estimates query answers on incomplete databases without synthesizing new data.
• NeuroComplete is the first approach that uses generalization in the query embedding space as an effective method to address data bias and incompleteness.
• We present novel training set generation and query embedding techniques to train a model whose query answers generalize to the complete database.
• Our experiments on real-world datasets show that NeuroComplete provides up to 4x and 10x reductions in error for AVG and COUNT queries, respectively, compared with the state-of-the-art, ReStore [49].

Figure 6.1: Running Example of Apartments Dataset

6.2 Definitions and Overview

Aggregate Queries on a Relational Database. Consider a relational database, D, with k tables, T_1, ..., T_k. Foreign key relationships connect (some of) the tables. Each table has a primary key, which we assume to be a column named id that uniquely identifies the rows within the table. We consider analytical queries, q, on this database. Informally, q asks for an aggregation of an attribute in some table, where the records in the table are filtered based on some predicate. Formally, q consists of an aggregation function, AGG_q, on an attribute M_q of a table T_{i_q}, where M_q is called the measure attribute. It furthermore consists of a predicate function P_q(D) that, when applied to D, returns a subset of T_{i_q}. We call the set of rows that satisfy a predicate the matching rows of the predicate. Such a query can be represented as a SQL statement that asks for an aggregation of some attribute, with WHERE and optionally JOIN and GROUP BY clauses on other attributes. The answer to the query q is AGG_q(P_q(D).M_q). We define the query function f(q) as f(q) = AGG(P(D).M). We drop the dependence of AGG, P and M on q when it is understood from the context. The predicate P can be based on the attributes in T_i or T_j, for j ≠ i, and applied to T_i through a JOIN of the tables. To simplify the discussion, we do not consider the GROUP BY clause for now; in Sec. 6.5.1, we show how it can be incorporated into queries. Our experiments include queries with GROUP BY and JOIN clauses (see Sec. 6.6.1).

We focus on aggregate queries. Our approach estimates query answers by learning patterns of the query answers. Answering non-aggregate queries requires memorizing specific data points, and thus cannot be supported by our approach. We use Fig. 6.1 as our running example.
The figure shows a database of apartments, their landlords, and the zip codes of the apartments. An analytical query on this database can ask for the average rent of apartments whose landlord is female.

Incomplete Database. We consider the case when we only have access to a subset of records, T̄_i, of the table T_i for some i ∈ {1, ..., k} (tables T_j, j ≠ i, being incomplete is discussed in Sec. 6.5). We refer to table T_i as incomplete or partially observed and refer to the tables T_j, j ≠ i, as complete or fully observed. We let the incomplete database D̄ be the database consisting of T̄_i and T_j for all j ≠ i. We often refer to D (respectively, T) as the true database (resp., true table) and to D̄ (resp., T̄) as the observed database (resp., observed table). Finally, we define the observed query function, f̄(q), as f̄(q) = AGG(P(D̄).M). We consider the case when the observed database is a biased sample of the true database, i.e., E_{D̄∼D}[f̄(q)] ≠ f(q). Thus, the error in answering queries on the observed database is not only due to the variance in sampling, but also due to its bias. We denote by n = |T_i| and n̄ = |T̄_i| the sizes of the true table and the observed table, respectively. In our running example, we assume the apartments table is incomplete, where the missing data records are marked with a different colour in Fig. 6.1. Answering the average rent query on the observed database will lead to incorrect answers.

Problem Definition. The goal of this chapter is, given the observed database D̄, to answer a query q so that the answer is similar to f(q). However, performing the query on the observed database D̄ provides the inaccurate answer f̄(q). Using D̄, we train a model f̂(·; θ) that takes the query as an input and outputs an estimate of its answer. The model is trained given only D̄, but its answer is expected to be similar to performing queries on D. The asked queries can have arbitrary predicates (our approach makes no assumption on the form of the predicates; in practice, we have evaluated our approach on common predicates with equality and inequality conditions across multiple attributes), a fixed aggregation function AGG and a fixed measure attribute M (different models can be learned for different AGG and M values, as discussed in Sec. 6.5). Let Q be the set of all such queries from a query workload. Formally, we study

Problem 3. Given access only to an observed incomplete database D̄, train a model, f̂, so that (1/|Q|) Σ_{q∈Q} |f̂(q; θ) − f(q)| is minimized, where f is the query function corresponding to the complete database D.

In our running example, the goal is to train a model that can utilize the observed database to answer queries that ask for AVG(rent) (for any query predicate) more accurately than merely calculating the answer on the observed database.

System Setup. We follow the setup of [49] and ask the users to (1) annotate tables with missing records and (2) annotate rows that have complete foreign key relationships, where for such rows the foreign keys are not missing. If data incompleteness is due to schema mismatch [49, 58, 20] during data integration (e.g., because a table that exists in one database does not exist in another), such annotations are known and do not add any manual overhead. In our running example, we can mark landlords stored in the LA dataset as having complete foreign key
relationships (recall that the LA dataset contained both landlord and apartment tables, and thus the foreign key relationships in the LA database are complete, while the NY database only contained a landlord table, so its rows do not have complete foreign key relationships). Furthermore, such annotations can be provided by inspecting available aggregate statistics [94, 70, 37]. For instance, if the number of users in an area is lower than the available Census population, the records in that area will be incomplete. Finally, the incompleteness can be known from the means of data collection, e.g., the collected dataset might cover only a certain region (such as the Foursquare dataset collected in New York and Tokyo [147]), so one can readily infer data incompleteness (see the case study in Sec. 6.6.7). For ease of discussion, for now, we also assume that the size, n, of the true table T_i is known. We relax this assumption in Sec. 6.5.2.

6.2.1 NeuroComplete Framework

NeuroComplete embeds queries into a space Z and trains a model, f̂, from Z to query answers. To do so, NeuroComplete defines an embedding function ρ that takes a query q as an input and outputs an embedding z. To answer any query q, we first find z = ρ(q) and then provide the estimate f̂(z; θ) for the query answer. The input to the neural network is a query embedding (detailed in Sec. 6.4), which represents the query in terms of the observed information related to the query. Intuitively, the embedding function ρ (formally defined in Sec. 6.4) aggregates the observed database rows based on how related they are to the query, to represent the query in terms of such relevant information. This process is shown in Fig. 6.2.

Figure 6.2: NeuroComplete Framework

During training, NeuroComplete (1) creates a set, Q, of queries for the purpose of training, (2) uses the embedding function, ρ, to find the query embedding for the queries in Q, and (3) uses the queries together with their answers (computed on the observed database) to train a neural network f̂ in a supervised learning setting. The neural network learns a mapping from the embedding space to query answers. To answer a query, NeuroComplete first finds its query embedding and performs a forward pass of the trained neural network with the embedding as its input to provide an estimate of the query answer. Because the database is incomplete, it is non-trivial to generate a training set with accurate labels or to define an embedding function that allows for the desired model generalizability, challenges that are addressed in the remainder of this chapter. We first use Alg. 11 to concretely present the NeuroComplete framework.

Algorithm 11 NeuroComplete Framework
Input: Observed database D̄, query function f̄, training size s
Output: Neural network f̂
1: procedure TrainNeuroComplete(D̄, f̄, s)
2:   Q ← GenerateQueries(D̄, s)  ▷ Generate training set
3:   Z ← {z_i = ρ(q_i, D̄), 1 ≤ i ≤ s}  ▷ Create embeddings
4:   Y ← {y_i = f̄(q_i), 1 ≤ i ≤ s}
5:   Initialize the parameters, θ, of f̂(·; θ)
6:   repeat
7:     Sample a set of indexes, I, of size up to |Y|
8:     Update θ in the direction −∇_θ (1/|I|) Σ_{i∈I} (f̂(z_i; θ) − y_i)²
9:   until convergence
10:  return f̂

Input: Test-time query q* on database D̄
Output: Estimated answer for query q*
1: procedure UseNeuroComplete(q*, D̄)
2:   z* ← ρ(q*, D̄)
3:   if q* is count-sensitive then
4:     return (n/n̄) × f̂(z*; θ)
5:   else
6:     return f̂(z*; θ)
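A minimal, self-contained sketch of the training procedure in Alg. 11 is shown below. The functions generate_queries, rho and f_bar are hypothetical stand-ins for the components defined in Secs. 6.3 and 6.4, and scikit-learn replaces the JAX model of our implementation for brevity:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def generate_queries(s):          # stand-in for GenerateQueries(D_bar, s)
    return rng.uniform(0, 1, size=(s, 2))    # a "query" = a random predicate

def rho(q):                       # stand-in embedding: here, the query itself
    return q

def f_bar(q):                     # stand-in observed query function (labels)
    return 10 * q[0] + 5 * q[1]

s = 1000
Q = generate_queries(s)
Z = np.array([rho(q) for q in Q])             # line 3: create embeddings
Y = np.array([f_bar(q) for q in Q])           # line 4: training labels

# Lines 5-9: supervised training with mean squared loss (smaller than the
# 10-layer, width-60 network described later, purely for brevity).
f_hat = MLPRegressor(hidden_layer_sizes=(60,) * 4, max_iter=2000, random_state=0)
f_hat.fit(Z, Y)

# UseNeuroComplete for a count-insensitive query q*.
q_star = np.array([0.3, 0.7])
print(f_hat.predict(rho(q_star).reshape(1, -1))[0])
```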
Secs. 6.3 and 6.4, respectively, present training set generation and query embedding in detail, and Sec. 6.5 discusses the final NeuroComplete system. Our discussion makes a distinction between count-sensitive and count-insensitive aggregations. Count-sensitive aggregations are aggregation functions where the scale of the answers changes with the size of the database. COUNT and SUM belong to this category because the answers to such queries increase with data size. On the other hand, count-insensitive aggregation functions are those where the scale of the answer does not depend on the number of data points, e.g., AVG and MEDIAN. We make this distinction to improve our modeling because, when answering count-sensitive queries, one needs to take into account the size of the database, while count-insensitive queries can be answered without explicitly accounting for database size.

NeuroComplete Training. TrainNeuroComplete in Alg. 11 shows the NeuroComplete training procedure. Line 2 corresponds to training set generation, where a set of queries, Q, is created for the purpose of model training. The function GenerateQueries(D̄, s) takes the observed database D̄ as an input and generates queries for the purpose of training. We present how to define this query generation function for accurate training on incomplete databases in Sec. 6.3. After training set generation, line 3 creates query embeddings for the generated training set using the embedding function ρ. We present the embedding function in Sec. 6.4. Finally, lines 4-9 correspond to model training, where the training labels are calculated and a neural network is trained using stochastic gradient descent with a mean squared error loss. Line 7 samples a set of indexes I to generate the current training batch, i.e., the indexes of the queries used in the current batch. That is, after sampling I, the current training batch is {(z_i, y_i) : i ∈ I}.

Algorithm 12 Training Query Generation
Input: The observed database D̄ and training size s
Output: A query set, Q
1: procedure GenerateQueries(D̄, s)
2:   Q ← ∅
3:   I ← set of ids of rows in T̄_i
4:   for i ← 1 to s do
5:     A ← a randomly selected attribute from T_i
6:     v ← a value in the range of A
7:     op ← one of ≤, ≥ or =
8:     q ← "SELECT AGG(M) FROM T_i WHERE A op v"
9:     q += " AND T_i.id IN I"
10:    Q.append(q)
11:  return Q

Answering Queries. After the model is trained, for a test query q, we first find its embedding by calling the embedding function ρ and then perform a forward pass of the trained model with the embedding as input. If the query is count-insensitive, the estimate for the query answer is the output of the model. Otherwise, the query answer is scaled by the ratio of the true data size to the observed data size to account for the scale of the answers.

6.3 Training Set Creation

This step generates the training queries. Since the observed database is incomplete, the answers to most queries on the observed database will be inaccurate, and training a model using such queries can lead to an inaccurate model. Consider a training query q. If P_q(D) contains rows in T but not in T̄, then f̄(q) ≠ f(q), and thus the training label created for query q will be wrong. The challenge is creating queries for which we can calculate correct training labels.

Restricted Queries. Our main insight is to learn from restricted queries. We define restricted queries as queries whose answers are the same in both D̄ and D.
Intuitively, if we restrict the database to the observed database, the answer to a restricted query does not change. Formally, define Q_r = {q ∈ Q : f̄(q) = f(q)}. Most real-world queries are not restricted. For instance, in our running example (Fig. 6.1), the query of AVG(rent) of apartments whose landlord is female is not restricted (its answer on the observed database is different from its answer on the true database). However, the query of AVG(rent) of apartments whose id is equal to 1 or 2 is a restricted query (since apartment ids 1 and 2 are in the observed database, and therefore the correct answer can be evaluated using only the observed database). Training labels created based on restricted queries are accurate, so learning from restricted queries creates a model that learns an accurate mapping from queries to their true answers. However, it is difficult to verify whether a given query belongs to Q_r without access to D. Nonetheless, it is easy to generate restricted queries. Given any query q, we can create a restricted query q′ by adding a conjunctive clause to the predicate of q. Let I be the set of id values of the rows in the observed incomplete table T̄_i. We can create a conjunction between the predicate of q and the statement T_i.id IN I. Since primary keys are unique, such a query will only match records whose id is in I and that are thus in T̄_i.

Example. In our running example (Fig. 6.1), consider the query of AVG(rent) of apartments whose landlord is female. Performing this query on the observed database results in a wrong answer, because the apartment with id=7 matches the predicate but is not in the observed database. Nonetheless, we can turn this query into a restricted query. The query of AVG(rent) of apartments whose landlord is female and whose apartment.id is one of 1, 2, 3 or 4 is a restricted query and can be answered accurately from the data.

Query Generation. Any query can be turned into a restricted query, so the query generation process can use any existing query. For instance, if a query workload is available, each query in the workload can first be restricted to the observed database and then used for training. In the absence of a query workload, our query generation process creates synthetic predicates by randomly picking an attribute, a value for the attribute, and an operation among ≤, ≥ and =. The generated query is then modified to be restricted to the observed database. This process is shown in Alg. 12, where for a desired number of queries s, the algorithm defines a predicate in lines 5-8. In line 7, we use ‘=’ for categorical attributes and ‘≤’ or ‘≥’ for numerical attributes. Finally, line 9 turns the query into a restricted query by ensuring that it only matches the records in the observed database. We note that more sophisticated query generation approaches are possible, such as [168], or extending Alg. 12 to generate more predicate clauses per query or queries containing joins. Nonetheless, we observed this query generation process to be sufficient. In fact, due to our embedding approach described in Sec. 6.4, we expect the complexity of the WHERE clauses used for training not to have a significant impact on the accuracy of the learned model. This is because our query embedding only depends on the distribution of the rows matching the query, and not on the complexity of finding those matching rows.

6.4 Query Embedding

We discuss the query embedding function ρ.
We first present the approach in a two-table setting (i.e., assuming the database has only two tables, one fully observed and one with missing records) in Secs. 6.4.1-6.4.3. For ease of notation, in the two-table setup, we call the table T_i that contains missing records T (with T̄ the observed subset of T), and we refer to the complete table in the database as O (i.e., all records in O are observed). During query embedding, we have access to O and T̄, but not T. Thus, the incomplete (or observed) database contains tables O and T̄. The goal is to answer queries on T (which we do not have access to) using the information available in O and T̄. We discuss the multi-table setting in Sec. 6.4.4.

Figure 6.3: Query Embedding Example

6.4.1 Overview

Query embeddings are created based on the observed database (we do not have access to the complete database). To do so, we utilize the rows in the fully observed table O (and not the incomplete table T̄). This is done to avoid biases in the incomplete table T̄ affecting our query embedding. In this section, we present an overview of this approach. To better illustrate the main concepts, here we assume we have access to the complete database. We discuss, in detail, how we generate query embeddings while having access only to the observed database in Secs. 6.4.2 and 6.4.3.

We define the query embedding as a summary of the rows in O that are relevant to the query q. We propose a two-step process, where we (1) find, for each row in O, its row relevance (RR), a weight that quantifies how related the row is to the query q, and (2) aggregate the rows in O based on the calculated row relevance, to represent q in terms of the rows of O. An example, assuming access to the complete database, is shown in Fig. 6.3. For the apartment table and a given query, we calculate the row relevance of the records in the Landlord table; thus the query embedding is based on the records in the Landlord table and uses its schema (even though the query asks for apartment rent information).

Row Relevance. The row relevance (RR) of a row in O to a query q captures how related the row is to the query answer. Let T_q be the set of matching rows in T for the query q. We define, for a row in O with O.id = i for an integer i, its row relevance α_i to be α_i = COUNT(σ_{O.id=i}(T_q ▷◁ O)). The above expression defines the weight of the i-th row as the number of times the row appears when O is joined with the matching rows T_q. Intuitively, if α_i is large, the i-th row of O has a strong relationship to the set of rows that match the query. If α_i is zero, deleting the i-th row and its related rows in T (i.e., a cascading delete) would have no impact on the query q; thus, the i-th row should not impact the representation of q. In practice, we cannot calculate row relevance exactly, because we do not have access to the complete database. We discuss in Sec. 6.4.2 how row relevance is calculated in practice. Fig. 6.3 shows how the row relevance values are calculated in our running example (based on the complete database). We see that for the query shown in Fig. 6.3, the row relevance of the landlords with ids 2 and 3 is 0, while the landlord with id 1 has an RR of 3. Intuitively, removing the landlords with ids 2 and 3 does not change the query answer (and thus RR=0), while the landlord with id 1 has a significant impact on the query answer (hence the larger RR).

Row Aggregation. To summarize the information in O that relates to the query q, we perform a weighted aggregation of the values in O, weighted according to their RR values.
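The two-step process can be made concrete with a small pandas sketch on toy tables loosely mirroring the running example (all table contents and names here are illustrative, not the actual NeuroComplete code):

```python
import pandas as pd

# Toy fully observed table O (landlords) and matching rows T_q (apartments).
landlord = pd.DataFrame({"id": [1, 2, 3], "gender": ["F", "M", "F"],
                         "income": [2, 5, 4]})
apartment = pd.DataFrame({"id": [1, 2, 3, 4], "landlord_id": [1, 1, 1, 3],
                          "rent": [700, 900, 800, 600]})

# Matching rows T_q for a hypothetical query "AVG(rent) WHERE rent >= 700".
T_q = apartment[apartment["rent"] >= 700]

# Step 1 - row relevance: alpha_i = COUNT(sigma_{O.id=i}(T_q JOIN O)).
joined = T_q.merge(landlord, left_on="landlord_id", right_on="id")
alpha = joined.groupby("id_y").size().reindex(landlord["id"], fill_value=0)

# Step 2 - row aggregation: RR-weighted average of O's one-hot encoded features
# (the count-insensitive case; count-sensitive uses a normalized weighted sum).
feats = pd.get_dummies(landlord.set_index("id")[["gender", "income"]]).astype(float)
embedding = feats.mul(alpha.values, axis=0).sum() / alpha.sum()
print(embedding)
```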
Fig. 6.3 shows how the rows are aggregated to create the final query embedding in our example. The embedding contains the weighted average of the income of the landlords (i.e., (2 × 3 + 4 × 1)/(3 + 1) = 2.5) and the distribution of the gender of the landlords (in this case, they are all female).

Figure 6.4: Row Relevance Calculation

6.4.2 Row Relevance Calculation

The row relevance of a record in O is defined as COUNT(σ_{O.id=i}(T_q ▷◁ O)), where i is the id of the record in O. In practice, we do not have access to the true database, but only to the observed database. Because we only see T̄, we will not know all the records in T_q ▷◁ O and cannot directly evaluate their row relevance. Instead, we estimate the row relevance when it cannot be evaluated exactly based on the observed data. To do so, we divide the rows into two sets: (1) rows with Known Row Relevance (KRR rows), which are rows for which the row relevance can be accurately calculated on the observed data, and (2) rows with Unknown Row Relevance (URR rows), which are rows for which the row relevance cannot be calculated on the observed data.

Known and Unknown Row Relevance. More formally, KRR rows are defined as rows for which σ_{O.id=i}(T_q ▷◁ O) = σ_{O.id=i}(T̄_q ▷◁ O), and URR rows are the remainder of the table. Given that we do not have access to T, we cannot evaluate whether a row is KRR by checking the definition. Here, we describe two conditions used to decide whether a row is KRR.

Condition 1. q is a restricted query. If q is restricted, by definition, T_q ▷◁ O and T̄_q ▷◁ O are the same. Thus, the row relevance of all rows in O can be exactly calculated.

Condition 2. O_i has complete foreign key relationships. By definition, if O_i (the row in O with id=i) has complete foreign key relationships, then σ_{O.id=i}(T ▷◁ O) = σ_{O.id=i}(T̄ ▷◁ O). This implies that σ_{O.id=i}(T_q ▷◁ O) = σ_{O.id=i}(T̄_q ▷◁ O), since T_q and T̄_q are subsets of T and T̄, respectively.

Condition 1 implies that for training queries all rows are KRR, so the row relevance is exactly calculated based on observed data. Condition 2 means that at test time, for some records we can exactly calculate the row relevance, but for others we need to estimate it. This process is described below.

Row Relevance Calculation for KRR. Row relevance calculation for KRR rows is straightforward: we calculate it exactly by evaluating the expression COUNT(σ_{O.id=i}(T_q ▷◁ O)). For example, in Fig. 6.4, this expression can be exactly calculated for the landlords with ids 1 and 2. We see that landlord 1 appears three times and landlord 2 appears zero times in T_q ▷◁ O, so their RRs are 3 and 0, respectively.

Row Relevance Calculation for URR. We learn to estimate the row relevance of URR rows using the calculated row relevance of KRR rows. For a query, let O_KRR be the set of KRR rows in O, Y_KRR their calculated row relevance, and O_URR the URR rows. We train a neural network in a supervised learning fashion, where O_KRR are the training features and Y_KRR the training labels. We call this model the row relevance model, to distinguish it from the model that is trained to predict query answers (i.e., in Alg. 11). After training the row relevance model, a forward pass of the model estimates the row relevance of URR rows. Fig. 6.4 shows the row relevance calculation in our running example. (1) Row relevance is calculated for the two KRR rows. Then (2) each KRR row is used as a training sample to train a neural network that estimates row relevance. The model takes gender and income as input and outputs an estimated RR.
After the model is trained, (3) we input the gender and income of the URR rows into the model and (4) obtain RR estimates for the URR rows. Fig. 6.4 shows that the model estimates the RR of landlord 3 to be 1 (while the true RR is 0) and the RR of landlord 4 to be 2 (while the true RR is 1).

6.4.3 Row Aggregation

We aggregate the rows in O according to the row relevance values. If categorical attributes are present in O, we one-hot encode them before aggregation. We aggregate rows differently for count-insensitive aggregation functions (e.g., AVG, MEDIAN, STD) and count-sensitive aggregation functions (e.g., COUNT, SUM). Count-insensitive aggregations are aggregation functions where the scale of the answers does not change with the size of the database; thus, the embedding does not need to contain information about the number of matching rows. On the other hand, for count-sensitive aggregation functions, the embedding needs to contain information about the number of matching rows to allow the model to adjust to the scale of the answers.

For count-insensitive aggregations, we use the weighted average of the features in O as the query embedding, where the weights are based on the row relevance values. For count-sensitive aggregations, we use the weighted sum of the features in O, normalized by n̄ if the queries are restricted or by n if they are not. By incorporating the total row relevance values in count-sensitive aggregations, we allow the embedding to contain information about the number of matching rows. At the same time, we normalize the embedding by the table size to ensure the number of matching rows is considered as a proportion of the table size. This creates an embedding that adjusts to the data size while also containing information about the number of rows matching a query.

Figure 6.5: Multi-table query embedding

Row aggregation creates a semantically meaningful summary of the matching rows in O. For numerical values, the summary is the sum or average of the values. For categorical columns (that are one-hot encoded), the summary shows the distribution of the categories existing in the rows.

6.4.4 Multiple Tables and Final Embedding Algorithm

Our approach extends readily to multiple tables by considering each table separately. Given that the incomplete table is T_i, we iterate over the tables in the database and, for every table T_j, j ≠ i, consider the (T_i, T_j) pair. For every pair, we repeat the same algorithm as before, which yields a query embedding based on the table T_j. Finally, the embeddings based on each T_j are concatenated together to provide the final query embedding. Fig. 6.5 shows the process for our running example, now with all three tables. We first find an embedding using the landlord table, as discussed before. Next, the same process is repeated for the zip code table, to obtain a zip code embedding. The two embeddings are then concatenated to create the single embedding vector shown in the figure.

Final Algorithm. Alg. 13 presents the final query embedding algorithm.
The algorithm iterates over the tables and calculates each per-table embedding by finding row relevance and then performing row aggregation. Finally, all the embeddings are concatenated ([x_1, ..., x_n] denotes the concatenation of x_1, ..., x_n) to create the final query embedding.

Algorithm 13 Complete Query Embedding Algorithm
Input: A query q on observed database D̄
Output: Query embedding
1: procedure ρ(q, D̄)
2:   for all tables T_j in {T_1, ..., T_k} \ {T_i} do
3:     for all KRR rows with id=x in T_j do
4:       α_x ← COUNT(σ_{T_j.id=x}(P_q(D̄) ▷◁ T_j))
5:     if any URR row exists in T_j then
6:       ĝ(·; θ) ← trained row relevance model
7:       for all URR rows with id=x in T_j do
8:         α_x ← ĝ(σ_{T_j.id=x}(T_j); θ)
9:     z_j ← Σ_x α_x σ_{T_j.id=x}(T_j)  ▷ Column-wise sum
10:    if AGG is not count-sensitive then
11:      z_j ← z_j / Σ_x α_x
12:    else
13:      if q is a restricted query then
14:        z_j ← z_j / n̄
15:      else
16:        z_j ← z_j / n
17:  return [z_1 z_2 ... z_k]

Performing Joins and Choosing Tables. The notion of a join in the algorithm is overloaded when referring to tables without explicit foreign key relationships with each other. We call it a join between two tables if there exists a non-empty set of foreign key relationships connecting T_i and T_j. We can limit the number of tables used to generate the embedding based on the length of the path (i.e., the number of foreign key relationships connecting T_i and T_j). That is, we can consider only the set of tables that are joinable with T_i through at most a limited number of other tables. This can be beneficial because, often, the longer the join path is, the less relevant the table will be to the information in T_i. Overall, we let τ be the number of fully observed tables used to create the embedding.

Embedding Time Complexity. For each fully observed table O (among the τ used for embedding in total), the algorithm goes over the rows in O_KRR ▷◁ T_q to calculate the row relevance of the KRR rows. Let O′ = O_URR ∪ (O_KRR \ O_KRR ▷◁ T_q), where O_KRR \ O_KRR ▷◁ T_q are the KRR rows that do not match the predicate of q (so their row relevance is 0). The algorithm then goes over the rows in O′, where for KRR rows in O′ it sets the row relevance to zero, while for URR rows in O′ it performs a forward pass of the row relevance model. Assuming training a row relevance model takes time t_T, a model forward pass takes time t_F, and finding the result of the join O_KRR ▷◁ T_q takes time t_J, the embedding computation takes O(t_T + t_F × |O_URR| + |O_KRR| + |O_KRR ▷◁ T_q| + t_J). This process is repeated τ times, each time for a different fully observed table O. We perform the process in parallel across the τ tables. In our experiments, this process takes 4-15 seconds across all settings (see Sec. 6.6.5), which is comparable to performing queries on the true (much larger) database, where the cost of performing joins is higher.

6.5 End-to-End System and Discussion

6.5.1 End-To-End System

Setup. The NeuroComplete setup requires minimal effort: (1) annotate tables with missing records and (2) annotate rows for which complete foreign key relationships are available. As discussed in Sec. 6.2, such information is often readily available as a result of database integration processes. In this setup, NeuroComplete accompanies a relational database system for tables with missing data.

Supported Queries. The query answering process follows Alg. 11, where a model is first trained and then used to answer the query. A NeuroComplete model is trained to answer queries with an aggregation AGG of an attribute T.M, where M is an attribute in table T.
Thus, after a NeuroComplete model is trained, it can answer queries with any predicate that ask for AGG(T.M). Such queries can contain JOIN or GROUP BY clauses as well as any SQL predicates (in fact, NeuroComplete supports general predicates, as defined in Sec. 6.2, e.g., arbitrary polygons). NeuroComplete supports GROUP BY by iteratively estimating the query answer for each group in the GROUP BY, adding the group membership as a predicate to the query.

6.5.2 Further Considerations

Efficiency Considerations. Recall that to answer a query, we first obtain a query embedding (where we utilize row relevance models) and perform a forward pass of the NeuroComplete model to obtain the query answer estimate. For efficient querying, we train NeuroComplete models in a pre-processing step and use them at query time. A single NeuroComplete model answers all queries for a fixed measure attribute and aggregation function. When the measure attribute and/or aggregation function differs across queries, multiple models may need to be trained. We decide which queries to build a model for based on the incomplete tables and the query workload. We build a NeuroComplete model for queries in the workload whose measure attribute is in an incomplete table. NeuroComplete models are small (less than 1 MB in all our experiments), and storing several models based on the workload is practical. As discussed in Sec. 6.4.4, query embedding (including row relevance model training) is fast and is done at query time.

More Missing Data. In a database we may have (1) multiple tables with missing records, or (2) some records may contain missing attributes. For case (1), our approach can be used without modification if, in addition to T_i, any other table T_j, j ≠ i, is also incomplete. Nonetheless, given that NeuroComplete relies on the T_j tables for query embedding, enough information needs to be available in those tables to allow for accurate predictions. In practice, especially when systematic bias exists in multiple tables, one can choose to exclude tables with missing records from being used in the embedding of queries. For case (2), we need to ensure that row aggregation supports missing values. This is achieved by simply ignoring the missing values when performing row aggregation.

True Data Size, n. So far, we have assumed the true data size n, used to scale NeuroComplete answers for count-sensitive queries, is known. In practice, this is often true: such information may be publicly available (e.g., we know the population of an area based on census data), data owners may be willing to share such aggregate information (e.g., a house rental agency may release the number of apartments it has in an area but not the detailed apartment information), or it may be known from domain knowledge (e.g., a rental agency may be able to estimate the number of apartments it has even if there is no detailed record of the apartments in the database). If n is not known, we can estimate it using methods similar to those in [49, 94]. We observed that [49] estimates the true table size accurately, so we use their method to estimate the true table size. Note that estimating the true table size does not require generating accurate synthetic records; it only requires correctly estimating how many records are missing. Thus, if the true data size is not known, estimating it is added as an extra step to the NeuroComplete system.
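The following sketch summarizes the query-answering conventions of this section: GROUP BY queries are decomposed into per-group predicate queries (Sec. 6.5.1), and count-sensitive answers are rescaled by the true-to-observed size ratio (with n possibly estimated, as discussed above). All names and the stand-in model are illustrative:

```python
# A minimal sketch (illustrative names, not the thesis code) of GROUP BY
# decomposition and count-sensitive scaling for a trained model wrapper.
def answer_group_by(groups, base_predicate, estimate, agg, n_true, n_obs):
    """estimate(predicate) stands in for a forward pass f_hat(rho(q)) of a
    trained NeuroComplete model for the query with the given predicate."""
    answers = {}
    for g in groups:
        # Group membership is added as one more conjunct to the predicate.
        pred = base_predicate + [("group", "=", g)]
        y = estimate(pred)
        if agg in ("COUNT", "SUM"):      # count-sensitive: rescale by n/n_bar
            y *= n_true / n_obs
        answers[g] = y
    return answers

# Usage with a stand-in model: AVG(rent) GROUP BY room_type (AVG: no rescale).
fake_model = lambda pred: 800.0 + 50.0 * len(pred)
print(answer_group_by(["studio", "1br"], [("rent", ">=", 0)],
                      fake_model, "AVG", n_true=10_000, n_obs=4_000))
```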
6.5.2 Further Considerations

Efficiency Considerations. Recall that to answer a query, we first obtain a query embedding (where we utilize row relevance models) and then perform a forward pass of NeuroComplete to obtain the query answer estimate. For efficient querying, we train NeuroComplete models in a pre-processing step and use them at query time. A single NeuroComplete model answers all queries for a fixed measure attribute and aggregation function. When the measure attribute and/or aggregation function changes across queries, multiple models may need to be trained. We decide which queries to build a model for based on the incomplete tables and the query workload: we build a NeuroComplete model for queries in the workload whose measure attribute is in an incomplete table. NeuroComplete models are small (less than 1 MB in all our experiments), and storing several models based on the workload is practical. As discussed in Sec. 6.4.4, query embedding (including row relevance model training) is fast and is done at query time.

More Missing Data. In a database, we may have (1) multiple tables with missing records, or (2) some records may contain missing attributes. For case (1), our approach can be used without modification if, in addition to $T_i$, any other table $T_j$, $j \neq i$, is also incomplete. Nonetheless, given that NeuroComplete relies on the $T_j$ tables for query embedding, enough information needs to be available in those tables to allow for accurate predictions. In practice, especially when systematic bias exists in multiple tables, one can choose to exclude tables with missing records from being used in the embedding of other queries. For case (2), we need to ensure that row aggregation supports missing values. This is achieved by simply ignoring the missing values when performing row aggregation.

True Data Size, n. So far, we have assumed that the true data size n, used to scale NeuroComplete answers for count-sensitive queries, is known. In practice, this is often true: such information may be publicly available (e.g., we know the population of an area based on census data), data owners may be willing to share such aggregate information (e.g., a house rental agency may release the number of apartments they have in an area but not the detailed apartment information), or it may be known from domain knowledge (e.g., a rental agency may be able to estimate the number of apartments they have even if there is no detailed record of the apartments in the database). If n is not known, we can estimate it using methods similar to those in [49, 94]. We observed that [49] estimates the true table size accurately, so we use their method for our estimation of the true table size. Note that estimating the true table size does not require generating accurate synthetic records; it only requires correctly estimating how many records are missing. Thus, if the true data size is not known, estimating it is added as an extra step to the NeuroComplete system.

Figure 6.6: Dataset information [49]

6.6 Empirical Study

6.6.1 Experimental Setup

Our experimental setup largely follows [49]. Each experiment uses a real-world dataset. We remove a set of records to obtain a biased subset, which is provided to the algorithms to answer a set of queries. The goal is to answer the queries accurately. Experiments were performed on a machine with Ubuntu 18.04 LTS, an Intel i9-9980XE CPU (3GHz), 128GB RAM and a GeForce RTX 2080 Ti NVIDIA GPU.

Complete datasets. We use two real datasets, Housing and Movies, whose schema and size are shown in Fig. 6.6 (image from [49]). Housing contains information about different Airbnb listings (such as the apartment type, its neighbourhood and landlord) and is obtained from [54]. Movies contains information about movies listed on IMDB (such as their genre, production year, their directors and actors, and the company that made them) and is obtained from [83]. We use the datasets as pre-processed by [49].

Incomplete dataset generation. The incomplete dataset generation is done as follows. First, we pick a table, as the incomplete table, and an attribute from the table, as the biased attribute. For a keep rate parameter x, we keep x% of the total records in the incomplete table, i.e., $|\bar{T}| = x \times |T|$. We select this subset $\bar{T}$ based on a bias factor parameter, $b \in [0, 1]$. To choose the records, we (1) sort T based on the biased attribute and select the top $|T| \times x \times b$ records (i.e., records with the highest biased attribute value), and (2) select $|T| \times x \times (1 - b)$ records from the remaining records of T (i.e., from records not selected in step (1)) uniformly at random. If b = 1, the sample is completely biased, and if b = 0 the sample is unbiased. Based on the above procedure, we create 2 setups for each dataset, as shown in Table 6.1.

Setup  Dataset  Incomplete Table  Biased Attribute
H1     Housing  Apartment         Price
H2     Housing  Landlord          Response rate
M1     Movies   Movie             Production year
M2     Movies   Director          Birth year
Table 6.1: Incomplete dataset generation setup
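A minimal pandas sketch of this biased subsampling procedure (function and column names are our own, for illustration):

```python
import pandas as pd

def biased_subset(T: pd.DataFrame, biased_attr: str, keep_rate: float, b: float,
                  seed: int = 0) -> pd.DataFrame:
    """Keep a keep_rate fraction of T's rows with bias factor b (Sec. 6.6.1).

    Step (1): take the top keep_rate*b fraction by the biased attribute.
    Step (2): sample the remaining keep_rate*(1-b) fraction uniformly at random
              from the rows not selected in step (1).
    """
    n_keep = int(len(T) * keep_rate)
    n_top = int(n_keep * b)
    top = T.nlargest(n_top, biased_attr)              # highest biased-attribute values
    rest = T.drop(top.index).sample(n=n_keep - n_top, random_state=seed)
    return pd.concat([top, rest])
```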
Test Queries. We consider test queries with COUNT and AVG aggregation functions and with JOIN, GROUP BY and/or WHERE clauses. None of the test queries are restricted queries, and thus test queries do not overlap with our training queries, all of which are restricted queries. For AVG queries, to be able to study the impact of bias on query answers, we let the measure attribute be the same as the biased attribute for each setup (e.g., queries in H1 all ask for AVG(price)). We use the same GROUP BY and/or WHERE clauses as [49]. Each query has a GROUP BY and/or WHERE clause on a subset of the columns shown in Table 6.2. For example, an AVG query in H1 asks for AVG(price) WHERE room_type=1. The query involves a JOIN if the WHERE/GROUP BY is on a column from a different table than the measure attribute. For COUNT queries, we report results on predicates on the biased attribute. This is to isolate the impact of bias on query answers, as otherwise a query answer can be unaffected by our sampling procedure.

Setup  Predicate and GROUP BY attributes
H1     Apt.room_type, Apt.price, LL.host_since, Apt.property_type, Apt.accommodates
H2     LL.host_since, LL.response_time, LL.response_rate, Apt.room_type
M1     Movie.genre, Movie.production_year, Director.birth_country
M2     Director.gender, Director.birth_year
Table 6.2: Testing query predicate and GROUP BY attributes

Metrics. As discussed above, each setup consists of a set of test AVG and test COUNT queries Q. For AVG queries, we report the mean absolute error (MAE), calculated as $\frac{1}{|Q|}\sum_{q \in Q} |f(q) - y|$, where $y$ is the estimated answer. As discussed in Sec. 6.5.1, GROUP BY queries are treated as multiple queries, each with a WHERE clause corresponding to a group membership. For COUNT queries, to evaluate whether a method de-biases the results (rather than just scaling up the answers), we compare the MAE in normalized counts. That is, if the estimated size of T is $\hat{n}$ and the size of $\bar{T}$ is $\bar{n}$, then we report $\frac{1}{|Q|}\sum_{q \in Q} \left| \frac{f(q)}{n} - \frac{y}{\bar{n}} \right|$. For NeuroComplete, we set $\bar{n}$ to be the same as in ReStore. We train NeuroComplete with 5 different random initializations and report the average and standard deviation of the MAEs across runs. Compared with [49], we use absolute error instead of relative error due to its robustness when the ground truth is close to zero, and we do not present bias reduction since bias reduction is only applicable to methods that generate synthetic data.
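For concreteness, the two error metrics can be computed as in the following sketch (variable names are ours):

```python
import numpy as np

def mae(true_answers, est_answers):
    """Mean absolute error over a query set Q (AVG queries)."""
    t, e = np.asarray(true_answers), np.asarray(est_answers)
    return np.mean(np.abs(t - e))

def normalized_count_mae(true_counts, est_counts, n, n_bar):
    """MAE of normalized counts: compare f(q)/n against y/n_bar (COUNT queries),
    so that de-biasing is measured rather than mere scaling."""
    t = np.asarray(true_counts) / n        # proportions under the true data size
    e = np.asarray(est_counts) / n_bar     # proportions under the observed size
    return np.mean(np.abs(t - e))
```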
Baselines. We compare NeuroComplete with the state of the art, ReStore [49]. We used their implementation from [113]. ReStore trains a model to generate more data to complete the dataset and answers queries on the completed dataset. For ReStore, we spent a week on parameter tuning, performing an extensive parameter search for each setup. For each setting, we ran the model with the various possible modeling choices (SSAR vs. AR) and various completion paths, evaluated it on the test set, and chose the result with the best test-set performance. This ensures that ReStore's model hyperparameters are set as well as possible, but is an unrealistic evaluation (showing better performance than possible in practice, since in practice we do not know the ground truth for test-set queries). Therefore, we call it ReStore+ as a reminder of this unfair advantage. We also use Sample as a baseline, which answers queries based only on the observed samples.

NeuroComplete Implementation. We implemented NeuroComplete in Python and JAX (code available at [161]). The model is a 10-layer fully connected neural network with width 60 in each layer, trained with the mean squared error loss function (as shown in Alg. 11, line 8) and the Adam optimizer. Training consists of 1,000 iterations, and the model with the smallest training error is used to answer the test queries. Row relevance models have the same architecture as above. We use between 1,000 and 2,000 training samples across the settings.

Figure 6.7: Results for H1 AVG Queries
Figure 6.8: Results for H2 AVG Queries
Figure 6.9: Results for M1 AVG Queries
Figure 6.10: Results for M2 AVG Queries
Figure 6.11: Results for H1 COUNT Queries
Figure 6.12: Results for H2 COUNT Queries
Figure 6.13: Results for M1 COUNT Queries
Figure 6.14: Results for M2 COUNT Queries

6.6.2 Comparison Results

Results for AVG. Figs. 6.7-6.10 compare NeuroComplete with the other methods across settings for AVG queries. Each figure shows, for a setting, how the error changes for different keep rates and bias factors. For NeuroComplete, the shaded area shows one standard deviation above/below the error, where the standard deviation is over 5 training runs. We observe that NeuroComplete outperforms the baselines across settings in almost all cases, improving the accuracy of the state of the art by up to a factor of 4. For AVG queries, NeuroComplete provides large improvements on the Housing dataset, while the methods are comparable on the Movies dataset. Furthermore, NeuroComplete is most effective when the bias factor is less than 1 and the keep rate is less than 80%. When the bias factor is 1, NeuroComplete does not see enough variation in query answers during training to be able to accurately extrapolate to unseen queries. On the other hand, when the keep rate is 80%, Sample itself is very accurate, and inherent modeling errors do not allow for much improvement of NeuroComplete over the observed values.

Interestingly, for bias less than 1, NeuroComplete's error is only marginally impacted by changes in keep rate. For instance, Fig. 6.7 (a) shows NeuroComplete's error changes from 40 at 5% to 20 at 80% keep rate, compared with Sample and ReStore+, whose error changes from 150 to 20 over the same range of keep rates. This is because NeuroComplete, unlike ReStore+, does not directly use the observed data points for training (i.e., NeuroComplete's training size is the same independent of the keep rate). Instead, NeuroComplete relies on the generalizability of learning from the observed query embeddings. Thus, the results in Figs. 6.7-6.10 suggest that generalization in the query embedding space is robust to the number of observed data points. We also see that, in the cases where NeuroComplete's error is not affected by an increase in keep rate (e.g., Fig. 6.9 (a) or Fig. 6.10 (a)), NeuroComplete's standard deviation goes down as the keep rate increases. That is, more data often increases the generalization robustness of NeuroComplete, reducing the reliance on initialization.

Results for COUNT. Figs. 6.11-6.14 show the results for COUNT queries. Similar to AVG, NeuroComplete improves the accuracy by multiple factors across settings. Compared with AVG, NeuroComplete is able to improve the accuracy for the COUNT aggregation function even at a bias factor of 1. Compared with ReStore+, NeuroComplete is always better, by up to a factor of 10. Our results show that ReStore+ often has a larger error than Sample. To understand this result, recall that ReStore+ generates new records. In fact, in most reported settings, the total number of records in the database synthesized by ReStore+ closely matches the true number of records. Nonetheless, the distribution of attribute values (measured by our error metric) is further from the ground truth than in the observed database. For instance, in the M2 setup (Fig. 6.14), we observed that ReStore+ generates many new records to match the number of records in the ground truth. However, almost none of the newly generated records match the query predicate (while most of the true records do in fact match the predicate). That is, even though the number of records that match the predicate in ReStore+ is closer to the ground truth compared with Sample, the number of records that match the predicate as a proportion of the data size is further from the ground truth compared with Sample. Our error metric measures the latter, which we believe to be more important (as it measures the distribution of the records irrespective of data size). Finally, for a high bias factor or a low keep rate, NeuroComplete has a higher standard deviation, i.e., not all random neural network initializations converge to a good minimum. This shows the difficulty of generalization when training queries come from a different distribution than test queries.

Figure 6.15: (a) and (b): visualizing training and test distributions. (c): Avg. distance to the nearest training query from test queries.

6.6.3 Training vs. Test Query Distribution Analysis

We analyze the impact of the training distribution on NeuroComplete's accuracy.
We compare two settings: low bias, defined as keep rate=0.8 and bias factor=0.6, and high bias, defined as keep rate=0.05 and bias factor=1. In both settings, even though the observed data size differs, NeuroComplete creates the same number of training queries. However, the training queries are embedded differently, resulting in different embedding distributions used for training. This impacts the accuracy of NeuroComplete, since answering test queries depends on how well the model generalizes in the embedding space to the unseen test query distribution. To investigate this, Figs. 6.15 (a) and (b) show the training and test query embeddings for AVG queries in the H1 and M1 settings. We use t-SNE [132] for visualization, which uses neighborhood graphs for dimensionality reduction to allow visualizing the structure of the high-dimensional space. In this experiment, to isolate the impact of the embedding distribution, row relevance for test queries is calculated based on the complete dataset (i.e., assuming a perfectly accurate row relevance model), so that the test query embedding is not affected by the data bias.

Figs. 6.15 (a) and (b) visually show that the test embedding distribution is more similar to the training embedding distribution in the low bias setting than in the high bias setting. Fig. 6.15 (c) quantifies this similarity. It plots dist. NTS, defined as the average distance from test samples to the nearest training sample. That is, for the test set Q and each test query $q \in Q$, let $d_q$ be the distance from q to q's nearest training query, and define dist. NTS $= \frac{1}{|Q|}\sum_{q} d_q$. We use Euclidean distance in the original embedding space (without dimensionality reduction). The lower dist. NTS, the more similar the training and test query embeddings are. Fig. 6.15 (c) shows that, across datasets, test queries are more similar to training queries in the low bias setting. This justifies the results in Figs. 6.7-6.14, where the increase in error from the low bias to the high bias setting can be attributed to the increase in distance between the training and test embedding distributions. As this distance increases, generalization becomes more difficult, and thus accuracy decreases.
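dist. NTS can be computed directly from the two embedding sets; a short NumPy sketch:

```python
import numpy as np

def dist_nts(test_emb, train_emb):
    """Average distance from each test query embedding to its nearest
    training query embedding (dist. NTS, Sec. 6.6.3), using Euclidean
    distance in the original embedding space."""
    test_emb = np.asarray(test_emb)    # shape (|Q|, d)
    train_emb = np.asarray(train_emb)  # shape (m, d)
    # pairwise Euclidean distances, then min over training queries
    diffs = test_emb[:, None, :] - train_emb[None, :, :]
    d = np.sqrt((diffs ** 2).sum(-1))  # shape (|Q|, m)
    return d.min(axis=1).mean()
```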
Figure 6.16: Robustness to Missing Attributes
Figure 6.17: Query Time
Figure 6.18: Training size and duration
Figure 6.19: Comparison of Sampling and Learning

6.6.4 Multiple Incomplete Tables

We evaluate NeuroComplete when there is more missing data beyond a single incomplete table. We introduce missing attributes in tables that were assumed to be complete in the previous experiments. Here, experiments are in the H1 setting, where Landlord was previously assumed to be complete. For every Landlord attribute and for each record, we remove its value with a probability, dr, referred to as the drop rate. In Fig. 6.16, we vary dr for COUNT (Fig. 6.16 (a)) and AVG (Fig. 6.16 (b)) queries to study its impact on the performance of the models. We observe that this parameter has little impact when the keep rate is 20% or 80%, showing the robustness of our approach to missing values. At keep rate 5% for AVG, the error increases as the drop rate increases, while for COUNT queries the error first increases and then decreases. This result suggests NeuroComplete is less robust to missing attributes when the observed data is too small.

6.6.5 Scalability and Efficiency Analysis

Query Time. Fig. 6.17 studies the query time of the various algorithms across two different settings and for different observed data sizes (each observed data size corresponds to a specific keep rate). We see that Sample is the fastest algorithm, as it performs no processing besides answering queries on the observed data. NeuroComplete's query time varies between 4 and 15 seconds across settings. Row relevance model training accounts for most of the query time, where models are trained for a fixed number of iterations. The difference in query time across settings is due to the difference in the dimensionality of the embedding space: H1, which has the highest embedding dimensionality, takes the longest. Compared with ReStore+, we see that ReStore+ answers queries faster in setting H1 but slower in M1. This is because ReStore+ synthesizes data when it receives a query, and how much data it needs to generate depends on the complexity of the relational schema. As a result, it becomes slower in M1, which has a more complex schema than H1.

Training Time. Fig. 6.18 (a) shows the impact of training time on NeuroComplete's error in the H1 setup with bias factor 0.8 at various keep rates (kr). The lines show the error for the different keep rates. Fig. 6.18 (a) shows the average accuracy over 5 different runs, and the shaded area is the standard deviation of the model error across runs. Overall, the results show that the models fit within a few seconds of training, and more epochs, especially for smaller keep rates, cause over-fitting. Furthermore, we see that the standard deviation is larger for smaller keep rates, where the model performance is more sensitive to initialization.

6.6.6 Number of Training Samples

Fig. 6.18 (b) shows how the model accuracy changes with the number of training samples in the H1 setup with bias factor 0.8 at various keep rates (kr), where the error drops as more training samples are used. Interestingly, even 75 training samples at keep rate 80% perform better than 2,025 training samples at keep rate 5%. This shows that, even with a small number of samples, the model can adjust to the scale of the required answers, thus providing reasonable estimates.

6.6.7 Case-Study: Estimating AVG Visit Duration

We present an example where real-world contextual information is represented in relational format and used to answer queries on data that is not complete.

Dataset. We use a dataset of location reports of individuals (i.e., latitude and longitude of user locations), which contains the time duration users spent at different locations in a city. Each record is a tuple of the format (lat., lon., duration). Furthermore, each city is divided into various neighbourhoods. The goal is to answer queries asking for the AVG time spent in a neighbourhood by users (i.e., a range predicate on lat. and lon., with duration as the measure attribute). We also have a dataset of Points-of-Interest (POIs), where each record consists of lat., lon. and POI information. This information can be placed in a relational database with three tables: Visits, Neighbourhoods and POI. The Visit table has schema (lat, lon, duration, neighbourhood id), the POI table has schema (lat, lon, POI information, neighbourhood id) and the Neighbourhood table has schema (id, neighbourhood information).
For the Visit table, we use the Veraset (VS) dataset, a proprietary dataset that contains anonymized location reports of cell phones across the US, collected by Veraset [133], a data-as-a-service company. Each location report contains an anonymized id, a timestamp and the latitude and longitude of the location. We performed stay point detection [153] on this dataset (to, e.g., remove location signals when a person is driving), extracted location visits where a user spent at least 15 minutes, and recorded the duration of each visit. 527,932 location visits in downtown Houston were thus extracted to form the dataset used in our experiments, which contains three columns: latitude, longitude and visit duration. For the POI table, we use SafeGraph Places [117], a publicly available dataset containing POI information in the US. For the Neighbourhood table, we partition the city of Houston with a 20x20 grid, yielding 400 neighbourhoods. We use the grid location for each neighbourhood, without other neighbourhood-specific information. 352 neighbourhoods had at least one visit in the VS dataset, and we kept only those neighbourhoods.

Incomplete dataset generation. We let the Visit table be the incomplete table. We assume we have visit data for some neighbourhoods and no data for others. For a parameter x, we randomly sample a set of x neighbourhoods, keep all visits that fall in those neighbourhoods, and remove the visits for all other neighbourhoods to generate our observed database. We expect to see such geographical bias in data collection in practice. Many datasets are only available for a single area (e.g., Foursquare [147] covers New York and Tokyo, and the CABS dataset is only available for San Francisco [105]). Furthermore, for data collected from mobile apps, there is a bias based on who uses the app, which translates into location (e.g., older people may not use the app, so there will be less data for areas with an older population).

Results. Given x neighbourhoods with data, we apply NeuroComplete to find the AVG visit duration for the neighbourhoods without data. We compare this with the alternative of collecting data for the neighbourhoods without data to answer queries. Fig. 6.19 (a) depicts these two alternatives. The two lines in Fig. 6.19 (a) are not directly comparable, and they are plotted with different y-axes (NeuroComplete: left axis; Sampling: right axis). Sample size refers to the number of new points sampled for neighbourhoods without data, used to obtain an estimate for the query in those neighbourhoods. Fig. 6.19 (a) shows that, given an error tolerance level, one has two alternatives for answering the query. For instance, for an error of 75 mins, one can either sample around 2,000 points from the neighbourhood in question, or use 130 other neighbourhoods' information and train NeuroComplete to obtain an estimate. Fig. 6.19 (b) shows this trade-off from the perspective of accuracy improvement per point sampled. If one has information on 30 neighbourhoods, one needs to sample at least 2,000 points for a new neighbourhood to obtain an estimate better than what NeuroComplete provides using the known 30 neighbourhoods.

6.7 Related Work

There has been recent effort on answering queries over incomplete datasets [49, 94, 154, 143, 112, 25]. Data imputation approaches [49, 154, 143, 112, 25] use the observed data to estimate the missing values. Except for ReStore [49], these works only consider missing attribute values and are not applicable to our setting, where entire records are missing.
ReStore [49] utilizes foreign key relationships to synthesize new data records, and the synthetically generated data is added to the database. After data generation, the query is answered as in a typical relational database. NeuroComplete learns to directly predict query answers and is fundamentally different from such a data generation approach. Specifically, NeuroComplete learns a model that takes queries as input and outputs query answers; in contrast, ReStore learns the probability distribution of the data in order to synthesize new data. To do so, NeuroComplete designs novel query embedding and training data generation steps that allow the model's query answers to generalize to the complete dataset. Our experiments show up to an order of magnitude accuracy gain of NeuroComplete over ReStore, showing the benefits of this approach. NeuroComplete is also related to [94, 164], but [94] only considers a single-table setting and requires aggregate information to answer queries, and [164] considers answering spread queries on incomplete spatiotemporal datasets. Furthermore, [26, 65] study the impact of incompleteness on query results, which is orthogonal to our work.

Moreover, our work is related to uncertain and probabilistic databases, where attribute values or their presence in the database is uncertain [2, 127, 38, 31]. However, unlike NeuroComplete, such approaches cannot handle missing records directly and require manual insertion and annotation of records with probabilities, which is challenging since such information is often not available.

6.8 Conclusion

We proposed NeuroComplete, the first query modeling approach for answering queries on incomplete data. By restricting queries to the observed database, NeuroComplete generates training queries whose correct answers can be computed from the incomplete database. It uses row relevance to create query embeddings based on a summary of the information in the database relevant to the query. Experiments show that NeuroComplete answers queries more accurately than the state of the art. Future work includes using our query embedding for complete databases and considering more robust training approaches (e.g., dropout).

Part III
Theoretical Analysis of Learned Database Operations

Chapter 7
Overview of Theoretical Analysis

Our discussion so far has centered on how to build well-optimized, practical learned database systems for efficient and accurate query answering under real-world constraints. Here, we focus on developing a theoretical understanding of learned database operations. We broadly study various learned database operations (specifically indexing, sorting, cardinality estimation, and range aggregate query answering). Our discussion is centered around two questions.

Providing Performance Guarantees for a Model Choice. We study how to provide performance guarantees for learned database operations for specific modeling choices, where we show bounds on the time and space complexity of such learned database operations on static (i.e., fixed) datasets (Chapter 8) and dynamic datasets (i.e., when new data records can be inserted) in the presence of distribution shift (Chapter 9). The theoretical analysis here develops statistical tools to analyze the performance of learned database operations, showing why and when they perform better than non-learned methods.
Our analysis first focuses on indexing in Chapter 8, and we extend the discussion to sorting and cardinality estimation in Chapter 9, developing the distribution learnability analysis framework. The framework allows analyzing different database operations on dynamic datasets and under data distribution shift.

Understanding Required Modeling Choices. We study what modeling choices are needed to perform database operations well, where we specifically show lower bounds on the model size required to perform various database operations to a desired accuracy (Chapter 10). Our analysis develops information-theoretic tools to show lower bounds on the model size needed to perform database operations using any model. We specifically study indexing, cardinality estimation and range-sum estimation (i.e., range-aggregate queries with the SUM aggregation function).

Chapter 8
Theoretical Performance Guarantees for Static Learned Indexes

8.1 Introduction

It has been experimentally observed, but with little theoretical backing, that the problem of finding an element in an array has very efficient learned solutions [44, 62, 40, 33]. In this fundamental data management problem, the goal is to find, given a query, the elements in the dataset that match the query (e.g., find the student with grade=q, for a number q, where "grade=q" is the query on a dataset of students). Assuming the query is on a single attribute (e.g., we filter students only based on grade), and that the data is sorted based on this attribute, binary search finds the answer in O(log n) for an ordered dataset with n records. Experimental results, however, show that learning a model (called a learned index [62]) that predicts the location of the query in the array can provide accurate estimates of the query answers orders of magnitude faster than binary search (and other non-learned approaches). The goal of this chapter is to present a theoretical grounding for such empirical observations.

More specifically, we are interested in answering exact match and range queries over a sorted array A. Exact match queries ask for the elements of A exactly equal to the query q (e.g., grade=q), while range queries ask for elements that match a range [q, q′] (e.g., grade is between q and q′). Both queries can be answered by finding the index of the largest element in A that is smaller than or equal to q, which we call the rank of q, rank(q). Range queries require the extra step, after obtaining rank(q), of scanning the array sequentially from q up to q′ to obtain all results. The efficiency of methods answering range and exact match queries thus depends on the efficiency of computing rank(q), which is the operation analyzed in the rest of this chapter.

In the worst case, and without further assumptions on the data, binary search finds rank(q) optimally, in O(log n) operations. Materializations of the binary search tree and variations of it, e.g., B-Tree [13] and CSS-trees [111], utilize caching and hardware properties to improve performance in practice, but the theoretical number of operations remains O(log n) (we consider data in memory, not external storage). On the other hand, learned indexes have been empirically shown to outperform non-learned methods by orders of magnitude. Such approaches learn a model that predicts rank(q). At query time, a model inference provides an estimate of rank(q), and a local search is performed around the estimate to find the exact index.
An example is shown in Fig. 8.1, where for the query 13, the model returns index 3, while the correct index is 5. Then, assuming the maximum model error is ϵ, a binary search on ϵ elements of A within the model prediction (i.e., the purple sub-array in Fig. 8.1) finds the correct answer. The success of learned models is attributed to exploiting patterns in the observed data to learn a small model that accurately estimates the rank of a query in the array.

Figure 8.1: A learned index used to solve the rank problem.

However, to date, no theoretical result has justified their superior practical performance. [40] shows a worst-case bound of O(log n) on query time, the same as traditional methods, but experimentally shows orders of magnitude difference. The only existing result that shows any theoretical benefit to learned indexing is [39], which shows constant-factor better space utilization while achieving O(log n) query time under some assumptions on the data distribution. The question remains whether theoretical differences, beyond merely constant factors, exist between learned and traditional approaches. We answer this question affirmatively. We show that:

1. Using the same space overhead as traditional indexes (e.g., a B-tree), and under mild assumptions on the data distribution, a learned index can answer queries in O(log log n) operations on expectation, a significant and asymptotic improvement over the O(log n) of traditional indexes;

2. With the slightly higher but still near-linear space consumption $O(n^{1+\epsilon})$, for any ϵ > 0, a learned index can achieve O(1) expected query time; and

3. Under stronger assumptions on the data distribution, O(log log n) expected query time is also possible with O(1) space overhead (O(1) space overhead is similar to performing binary search without building any auxiliary data structure).

We present experiments showing these asymptotic bounds are achieved in practice.

These results show order-of-magnitude benefits in terms of expected query time, where the expectation is over the sampling of the data, and not worst-case query time (which, unsurprisingly, is O(log n) in all cases). Intuitively, this means that although there may exist data instances where a learned index is as slow as binary search, for many data instances (and on expectation), it is fast and sub-logarithmic. Analyzing expected query time allows us to incorporate properties of the data distribution. Our results hold assuming certain distributional properties: the query times in (1) and (2) are achieved assuming a bounded p.d.f. of the data distribution ((1) also assumes a non-zero p.d.f.), while (3) assumes the c.d.f. of the data distribution is efficiently computable. Overall, data distribution had previously been hypothesized to be an important factor in the performance of a learned index (e.g., [62]). This chapter shows how such properties can be used to analyze the performance of a learned index.

8.2 Preliminaries and Related Work

8.2.1 Problem Definition

Setup. We are given an array $A \subseteq D^n$, consisting of n elements, where $D \subseteq \mathbb{R}$ is the domain of the elements. Unless otherwise stated, assume D = [0, 1]; we discuss extensions to other bounded or unbounded domains in Sec. 8.3.4. A is sorted in ascending order, where $a_i$ refers to the i-th element in this sorted order and A[i : j] denotes the sorted subarray containing $\{a_i, ..., a_j\}$.
We assume A is created by sampling n i.i.d. random variables and then sorting them, where the random variables follow a continuous distribution χ, with p.d.f. $f_\chi$ and c.d.f. $F_\chi$. We use the notation $A \sim \chi$ to describe this sampling procedure.

Rank Problem. Our goal is to answer the rank problem: given the array A and a query q, return the index $i^* = \sum_{i=1}^{n} \mathbb{I}_{A[i] \le q}$, where $\mathbb{I}$ is the indicator function. $i^*$ is the index of the largest element no greater than q, and is 0 if no such element exists. Furthermore, if $q \in A$, q will be at index $i^* + 1$. We define the rank function of an array A as $r_A(q) = \sum_{i=1}^{n} \mathbb{I}_{A[i] \le q}$. The rank function takes a query as input and outputs the answer to the rank problem. We drop the dependence on A if it is clear from context and simply use r(q). The rank problem is therefore the problem of designing a computationally efficient method to evaluate the function r(q).

Let $\hat{R}_A(q; \theta)$ be a function approximator, with parameters θ, that correctly evaluates r(q). The parameters θ of $\hat{R}_A$ are found in a preprocessing step and are used to perform inference at query time. Let $T(\hat{R}_A, q)$ be the number of operations performed by $\hat{R}_A$ to answer the query q, and let $S(\hat{R}_A)$ be the space overhead of $\hat{R}_A$, i.e., the number of bits required to store θ (note that $S(\hat{R}_A)$ does not include the storage required for the data itself, but only considers the overhead of indexing). We study the expected query time of any query q, $E_{A \sim \chi}[T(\hat{R}_A, q)]$, and the expected space overhead, $E_{A \sim \chi}[S(\hat{R}_A)]$. In our analysis of space overhead, we assume integers are stored using their binary representation, so that k integers that are at most M are stored in O(k log M) bits (i.e., assuming no compression).

Learned indexing. A learned indexing approach solves the rank problem as follows. A function approximator (e.g., a neural network or a piecewise linear approximator) $\hat{r}_A(q; \theta)$ is first learned that approximates r up to an error ϵ, i.e., $\|\hat{r}_A - r\|_\infty \le \epsilon$. Then, in a step called error correction, another function, h(i, q), takes the estimate $i = \hat{r}_A(q; \theta)$ and corrects the error, typically by performing a binary search (or an exponential search when ϵ is not known a priori [33]) on the array A. That is, given that the estimate $\hat{r}_A$ is within ϵ of the true index of q in A, a binary search on the 2ϵ elements of A that are within ϵ of $\hat{r}_A(q; \theta)$ finds the correct answer. Letting $\hat{R}_A(q; \theta) = h(\hat{r}_A(q, \theta), q)$, we obtain that, for any function approximator $\hat{r}_A$ with non-zero error ϵ, we can obtain an exact function with expected query time $E_{A \sim \chi}[T(\hat{r}_A, q)] + O(\log \epsilon)$ and space overhead $E_{A \sim \chi}[S(\hat{r}_A)]$, since binary search requires no additional storage space. In this chapter, we show the existence of function approximators $\hat{R}_A$ that achieve sub-logarithmic query time with various space overheads.
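The model-plus-error-correction lookup just described can be made concrete with a short sketch (a minimal illustration, not any specific index from the literature; the linear model below is an arbitrary placeholder):

```python
import bisect

def learned_rank(A, q, model, eps):
    """Rank lookup: model estimate, then error correction via binary search
    restricted to the ~2*eps positions around the estimate (Sec. 8.2.1)."""
    est = int(model(q))
    lo = max(0, est - eps)
    hi = min(len(A), est + eps + 1)
    # number of elements <= q; the true rank is guaranteed to lie in [lo, hi]
    return bisect.bisect_right(A, q, lo, hi)

# Example with a crude linear model on the array from Fig. 8.1:
A = [2, 4, 10, 11, 12, 17, 21]
model = lambda q: q * len(A) / 21          # hypothetical rank estimator
print(learned_rank(A, 13, model, eps=2))   # rank of 13 in A is 5
```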
8.2.2 Related Work

Learned indexes. The only existing work theoretically studying a learned index is [39]. It shows that, under assumptions on the gaps between the keys in the array, as n → ∞ and almost surely, one can achieve logarithmic query time with a learned index with a constant-factor improvement in space consumption over non-learned indexes. We significantly strengthen this result, showing sub-logarithmic expected query time under various space overheads. Our assumptions are on the data distribution itself, which is more natural than assumptions on the gaps, and our results hold for any n (and not only as n → ∞).

Though scant in theory, learned indexes have been extensively utilized in practice, and various modeling choices have been proposed under different settings, e.g., [44, 62, 40, 33], to name a few. Our results use a hierarchical model architecture, similar to the Recursive Model Index (RMI) [62], and piecewise approximation, similar to the Piecewise Geometric Model index (PGM) [40], to construct function approximators with sub-logarithmic query time.

Non-Learned Methods. Binary search trees, B-Trees [13] and many other variants [111, 66, 12] exist that solve the problem in O(log n) query time, which is the best possible in the worst case in the comparison-based model [87]. The space overhead for such indexes is O(n log n) bits, as they have O(n) nodes and each node can be stored in O(log n) bits. We also note in passing that if we limit the domain of elements to a finite integer universe and do not consider range queries, various other time/space trade-offs are possible [100], e.g., using hashing [43].

8.3 Asymptotic Behaviour of Learned Indexing

8.3.1 Constant Time and Near-Linear Space

We first consider the case of constant query time.

Theorem 6. Suppose the p.d.f., $f_\chi(x)$, is bounded, i.e., $f_\chi(x) \le \rho$ for all $x \in D$, where $\rho < \infty$. There exists a learned index with space overhead $O(\rho^{1+\epsilon} n^{1+\epsilon})$, for any ϵ > 0, with expected query time of $O(\log \frac{1}{\epsilon})$ operations for any query. ρ is a constant independent of n, and for any constant ϵ, asymptotically in n, the space overhead is $O(n^{1+\epsilon})$ and the expected query time is O(1).

The theorem shows the surprising result that we can in fact achieve constant query time with a learned index of size $O(n^{1+\epsilon})$. Although the space overhead is near-linear, this overhead is asymptotically larger than the overhead of traditional indexes (with overhead O(n log n)), and thus the query time complexities are not directly comparable. Interestingly, the function approximator that achieves the bound in Theorem 6 is a simple piecewise constant function approximator, which can be seen as a special case of the PGM model that uses piecewise linear approximation [40]. Our function approximator is constructed by uniformly dividing the space into k intervals and, for each interval, finding the constant that best approximates the rank function in that interval. Such a function approximator is shown as $\hat{r}_A(q; \theta)$ in Fig. 8.2 for k = 5.

Figure 8.2: Approximation with a piecewise constant function
Figure 8.3: Approximation with the c.d.f.

Obtaining constant query time requires such a function approximator to have constant error. It is, however, non-obvious why and when only $O(n^{1+\epsilon})$ pieces will be sufficient on expectation to achieve constant error. In fact, for the worst case (and not the expected case), for a heavily skewed dataset, achieving constant error would require an arbitrarily large k, as noted by [62]. However, Theorem 6 shows that, as long as the p.d.f. of the data distribution is bounded, $O(n^{1+\epsilon})$ pieces will be sufficient for constant query time on expectation. Intuitively, the bound on the p.d.f. is used to argue that the number of data points sampled in a small region is not too large, which is in turn used to bound the error of the function approximation. Finally, the dependence on ρ in Theorem 6 is expected, as the performance of learned indexes depends on dataset characteristics. ρ captures such data dependencies, showing that they affect the space overhead only by a constant factor. From a practical perspective, our experiments in Sec. 8.5.2 show that for many commonly used real-world benchmarks for learned indexes, the trends predicted by Theorem 6 hold with ρ = 1. However, Sec. 8.5.2 also shows that for datasets where learned indexes are known to perform poorly, we observe large values of ρ. Thus, ρ can be used to explain why and when learned indexes perform well or poorly in practice.

8.3.2 Log-Logarithmic Time and Constant Space

Requiring constant query time, as in the previous theorem, can be too restrictive. Allowing for slightly larger query time, we have the following result.
ρ captures such data dependencies, showing that such data dependencies only affect space overhead by a constant factor. From a practical perspective, our experiments in Sec. 8.5.2 show that for many commonly used real-world benchmarks for learned indexes, trends predicted by Theorem 6 hold with ρ = 1. However, Sec. 8.5.2 also shows that for datasets where learned indexes are known to perform poorly, we observe large values of ρ. Thus, ρ can be used to explain why and when learned indexes perform well or poorly in practice. 8.3.2 Log-Logarithmic Time and Constant Space Requiring constant query time, as in the previous theorem, can be too restrictive. Allowing for slightly larger query time, we have the following result. 187 Level 1 Model 1.1 piecewise model Model 2.1 Model 2.2 Model 3.1 Model 3.2 Model 3.3 Model 3.4 Model 3.5 Level 2 Level query Estimated position Figure 8.4: RMI of height log log n with piecewise constant models Theorem 7. Suppose c.d.f of data distribution Fχ(x) can be evaluated exactly with O(1) operations and O(1)space overhead. There exists a learned index with space overhead O(1), where for any query q, the expected query time is O(log log n) operations. The result shows that we can obtain O(log log n) query time if the c.d.f of the data distribution is easy to compute. This is the case for the uniform distribution (whose c.d.f is a straight line), or more generally any distribution with piece-wise polynomial c.d.f. In this regime, we only utilize constant space, and thus our bound is comparable with performing a binary search on the array, which takes O(log n) operations, showing that the learned approach enjoys an order of magnitude theoretical benefit. Our model of the rank function is n × Fχ, where Fχ is the c.d.f of the data distribution. As Fig. 8.3 shows, our search algorithm proceeds recursively, at each iteration reducing the search space by around √ n. Intuitively, the √ n is due to the Dvoretzky-Kiefer-Wolfowitz (DKW) bound [77], which is used to show that with high probability the answer to a query, q is within √ n of nFχ(q). Reducing the search space, s, by roughly √ s at every level by recursively applying DKW, we obtain the total search time of O(log log n) (note that binary search only reduces the search space by a factor of 2 at every iteration). 188 8.3.3 Log-Logarithmic Time and Quasi-Linear Space Finally, we show that the requirement of Theorem 7 on the c.d.f. is not necessary to achieve O(log log n) query time, provided quasi-linear space overhead is allowed. The following theorem shows that a learned index can achieve O(log log n) query time under mild assumptions on the data distribution and utilizing quasi-linear space. Theorem 8. Suppose p.d.f of data distribution fχ(x) is bounded and more than zero, i.e., ρ1 ≤ fχ(x) ≤ ρ2 for all x ∈ D, where ρ1 > 0 and ρ2 < ∞. There exists a learned index with expected query time equal to O(log log n) operations and space overhead O( ρ2 ρ1 n log n), for any query. Specifically, ρ2 ρ1 is a constant independent of n, so that, asymptotically in n, space overhead is O(n log n). This regime takes space similar to data size, and is where most traditional indexing approaches lie, e.g., binary trees and B-trees, where they need O(n log n) storage (the log n is due to the number of bits needed to store each node content) and achieve O(log n) query time. The learned index that achieves the bound in Theorem 8 is an instance of the Recursive Model Index (RMI) [62]. 
Such a learned index defines a hierarchy of models, as shown in Fig. 8.4. Each model is used to pick a model in the next level of the tree until a model in the leaf level is reached, whose prediction is the estimated position of the query in the array. Unlike the RMI of [62], its height and the size of the model within each node are not constant, but are set based on the data size.

Figure 8.4: RMI of height log log n with piecewise constant models

Intuitively, the hierarchy of models is a materialization of a search tree based on the recursive search used to prove Theorem 7. At any level of the tree, if the search space is s elements (originally, s = n), a model is used to reduce the search space to roughly $\sqrt{s}$. It is, however, non-trivial why and when such a model should exist across all levels, and how large the model should be. We use the relationship between the rank function and the c.d.f. (through the DKW bound), and the properties of the data distribution, to show that a model of size around $\sqrt{s}$ is sufficient with high probability. Note that models at lower levels of the hierarchy approximate the rank function only over subsets of the array, but with increasingly higher accuracy. A challenge is to show that such an approximability result holds across all models and all subsets of the array, which is why a lower bound on the p.d.f. is needed in this theorem. Similar to ρ in Theorem 6, $\rho_1$ and $\rho_2$ capture data characteristics in Theorem 8, showing constant-factor dependencies of the model size. Our experiments in Sec. 8.5.2 show that for most commonly used real-world benchmarks for learned indexes, the trends predicted by Theorem 8 hold with $\frac{\rho_2}{\rho_1} = 1$. However, Sec. 8.5.2 also shows that for datasets where learned indexes are known to perform poorly, $\frac{\rho_2}{\rho_1}$ is large, so that $\frac{\rho_2}{\rho_1}$ can be used to explain why and when learned indexes perform well or poorly in practice.

8.3.4 Distributions with Other Domains

So far, our results assume that the domain of the data distribution is [0, 1]. The results can be extended to distributions with other bounded domains, [r, s] for $r, s \in \mathbb{R}$, r < s, by standardizing χ as $\frac{\chi - r}{s - r}$. This transformation scales the p.d.f. of χ by s − r. Note that scaling the p.d.f. does not affect Theorem 8, since both $\rho_1$ and $\rho_2$ are scaled by s − r, yielding the same ratio $\frac{\rho_2}{\rho_1}$. On the other hand, ρ in Theorem 6 is scaled by s − r. Overall, a bounded domain holds in many scenarios, as the data can come from some bounded phenomenon, e.g., age, grade, or data over a fixed period of time. Next, we extend our results to distributions with unbounded domains.

Lemma 10. Suppose a learned index, $\hat{R}$, achieves expected query time t(n) and space overhead s(n) on distributions with domain [0, 1] and bounded (and non-zero) p.d.f. There exists a learned index, $\hat{R}'$, with expected query time t(n) + 1 and space overhead O(s(n) log n) on any sub-exponential distribution with bounded (and non-zero) p.d.f.

Combining Lemma 10 with Theorems 6 and 8, our results cover various well-known distributions, e.g., the Gaussian, squared-Gaussian and exponential distributions. The proof of Lemma 10 builds the known learned index for bounded domains on log n different bounded intervals. This achieves the desired outcome due to the tail behavior of sub-exponential distributions (i.e., distributions with tails at most as heavy as the exponential; see [136] for a definition). The tail behaviour allows us to, roughly speaking, assume that the domain of the function is O(log n), because observing points outside this range is unlikely. We note that other distributions with unbounded domains can also be analyzed similarly based on their tail behaviour, with heavier tails leading to higher space consumption.

8.4 Proofs

The proofs of the theorems are all constructive. The PCA Index (Sec. 8.4.1) proves Theorem 6, the RDS algorithm proves Theorem 7 and the RDA Index proves Theorem 8. Without loss of generality, we assume the bounded domain D is [0, 1]. The proof for the unbounded domain case (i.e., Lemma 10) is deferred to Appendix 8.7. Proofs of the technical lemmas stated throughout this section can also be found in Appendix 8.7.

8.4.1 Proof of Theorem 6: PCA Index

We present and analyze the Piece-wise Constant Approximator (PCA) Index, which proves Theorem 6.
We note that other distributions with unbounded domain can also be similarly analyzed based on their tail behaviour, with heavier tails leading to higher space consumption. 8.4 Proofs Proofs of the theorems are all constructive. PCA Index (Sec. 8.4.1) proves Theorem 6, RDS algorithm proves Theorem 7 and RDA Index proves Theorem 8. Without loss of generality, we assume the bounded domain D is [0, 1]. The proof for the unbounded domain case (i.e., Lemma 10) is deferred to Appendix 8.7. Proof of technical lemmas stated throughout this section can also be found in Appendix 8.7. 8.4.1 Proof of Theorem 6: PCA Index We present and analyze Piece-wise Constant Approximator (PCA) Index that proves Theorem 6. 191 8.4.1.1 Approximating Rank Function We show how to approximate the rank function r with a function approximator rˆ. To achieve constant query time, approximation error should be a constant independent of n with high probability, and we also should be able to evaluate rˆ in constant time. Lemma 11 shows these properties hold for a piece-wise constant approximation to r. Such a function is presented in Alg. 14 (and an example was shown in Fig. 8.2). Alg. 14 uniformly divides the function domain into k intervals, so that the i-th constant piece is responsible for the interval Ii = [i × 1 k ,(i + 1) × 1 k ]. Since r(q) is a non-decreasing function, the constant with the lowest infinity norm error approximating r over Ii is 1 2 (r( i k ) + r( i+1 k )) (line 6). Let rˆk be the function returned by PCF(A, k, 0, 1). Lemma 11. Under the conditions of Theorem 6 and for k ≥ n 1+ϵρ 1+ ϵ 2 , the error of rˆk is bounded as P(∥rˆk − r∥∞ ≥ 2 ϵ + 1) ≤ 1 n . Proof of Lemma 11. Let ei = supx∈Ii |rˆ(x; θ) − r(x)| be the maximum error in the i-th piece of rˆ. ei can be bounded by the number of points sampled in Ii as follows. Proposition 2. Let si = |{j|aj ∈ Ii}| be the number of points in A that are in Ii . We have ei ≤ si Using Prop. 2, we have ∥rˆ − r∥∞ ≤ maxi∈{1,...,k} si . Prop. 2 is a simple fact that relates approximation error to statistical properties of data distribution. Define smax = maxi∈{1,...,k} si and observe that smax is a random variable denoting the maximum number of points sampled per interval, across k equi-length intervals. The following lemma shows that we can bound smax with a constant and with probability 1 n , as long as k is near-linear in n. 192 Algorithm 14 PCA Index Construction Input: A sorted array A, number of pieces k, approximation domain lower and upper bounds l and u Output: Piecewise constant approximation of r over [l, u] 1: procedure PCF(A, k, l, u) 2: P ← array of length k storing the pieces 3: α ← (u−l) k 4: δ ← 0 5: for i ← 0 to k do 6: P[i] ← 1 2 (rA(l + αi) + rA(l + α(i + 1)) 7: δcurr ← 1 2 (rA(l + α(i + 1)) − rA(l + αi)) 8: δ ← max{δ, δcurr} return P, δ Lemma 12. For any c with c ≥ 3, and if k ≥ n 1+ 2 c−1 ρ 1+ 1 c−1 we have P(smax ≥ c) ≤ 1 n . Setting c = 2 ϵ + 1, we see k ≥ n 1+ϵρ 1+ ϵ 2 holds, so that Lemma 12 together with Prop. 2 prove Lemma 11. 8.4.1.2 Index Construction and Querying Let k = ⌈n 1+ ϵ 2 ρ 1+ ϵ 4 ⌉. We use PCF(A, k, 0, 1) to obtain rˆk and δ, where δ is the maximum observed approximation error. As Alg. 14 shows, rˆk can be stored as an array, P, with k elements. To perform a query, the interval, i, a query falls into is calculated as i = ⌊qk⌋ and the constant responsible for that interval, P[i], returns the estimate. Given maximum error δ, we perform a binary search on the subarray A[l : u], for l = P[i] − δ and u = P[i] + δ to obtain the answer. 
8.4.1.3 Complexity Analysis

P has $O(n^{1+\frac{\epsilon}{2}})$ entries, and each can be stored in $O(n^{\frac{\epsilon}{2}})$ bits. Thus, the total space complexity is $O(n^{1+\epsilon})$. Regarding query time, the number of operations needed to evaluate $\hat{r}_k$ is constant. Thus, the total query time of the learned index is O(log δ). Lemma 11 bounds δ, so that the query time for any query is at most $\log(\frac{4}{\epsilon} + 1)$ with probability at least $1 - \frac{1}{n}$, and at most log n with probability at most $\frac{1}{n}$. Consequently, the expected query time is at most $O(\log(\frac{4}{\epsilon} + 1) \times (1 - \frac{1}{n}) + \log n \times \frac{1}{n})$, which is O(1) for any constant ϵ > 0.

8.4.2 Proof of Theorem 7: RDS Algorithm

We present and analyze the Recursive Distribution Search (RDS) Algorithm, which proves Theorem 7.

8.4.2.1 Approximating the Rank Function

We approximate the rank function using the c.d.f. of the data distribution, which the conditions of Theorem 7 imply is easy to compute. As noted by [62], $rank(q) = nF_n(q)$, where $F_n$ is the empirical c.d.f. Using this together with the DKW bound [77], we can establish that rank(q) is within error $\sqrt{n}$ of $nF_\chi$ with high probability. However, an error of $\sqrt{n}$ is too large: error correction to find rank(q) would require $O(\log \sqrt{n}) = O(\log n)$ operations. Instead, we recursively improve our estimate by utilizing information that becomes available from observing elements in the array. After observing two elements, $a_i$ and $a_j$, in A (i < j), we update our knowledge of the distribution of the elements in A[i + 1 : j − 1] as follows. Define

$F_\chi^{i,j}(x) = \frac{F_\chi(x) - F_\chi(a_i)}{F_\chi(a_j) - F_\chi(a_i)}$.

Informally, any element X in A[i + 1 : j − 1] is a random variable sampled from χ, and knowing the values of $a_i$ and $a_j$ implies that $X \in [a_i, a_j]$, so that the conditional c.d.f. of X is $P_{X \sim \chi}(X \le x \mid a_i \le X \le a_j) = F_\chi^{i,j}(x)$. We then use the DKW bound to show that $F_\chi^{i,j}$ is a good estimate of the rank function of the subarray A[i + 1 : j − 1], defining the rank function of the subarray A[i + 1 : j − 1] as $r^{i,j}(q) = \sum_{z=i+1}^{j-1} \mathbb{I}_{a_z \le q}$. Formally, the following lemma shows that, given the observations $A[i] = a_i$ and $A[j] = a_j$, the elements of A[i + 1 : j − 1] are i.i.d. random variables with the conditional c.d.f. $F_\chi^{i,j}(x)$, and uses the DKW bound to bound the error of approximating the conditional rank function with the conditional c.d.f.

Lemma 13. Consider two indexes i, j, where 1 ≤ i < j ≤ n and $a_i < a_j$. Let k = j − i − 1. For k ≥ 2, we have

$P(\sup_x |r^{i,j}(x) - kF_\chi^{i,j}(x)| \ge \sqrt{0.5k \log\log k}) \le \frac{1}{\log k}$.

Algorithm 15 Recursive Distribution Search Algorithm
Input: A sorted array A of size n searched from index i to j, a query q
Output: Rank of q in A[i : j]
1: procedure Search(A, q, i, j)
2:   k ← j − i − 1
3:   if k < 25 then
4:     return i − 1 + BinarySearch(A, q, i, j)
5:   if $a_i > q$ then return 0
6:   if $a_i = q$ then return 1
7:   if $a_j \le q$ then return j − i + 1
8:   $\hat{i} \leftarrow i + 1 + k \times F_\chi^{i,j}(q)$
9:   $r \leftarrow \sqrt{0.5k \log\log k}$
10:  $l \leftarrow \lfloor \hat{i} - r \rfloor$
11:  $u \leftarrow \lceil \hat{i} + r \rceil$
12:  if $a_l > q$ or $a_u < q$ then
13:    return i − 1 + BinarySearch(A, q, i, j)
14:  return l − 1 + Search(A, q, l, u)

8.4.2.2 Querying

We use Lemma 13 to recursively search the array. At every iteration, the search is over a subarray A[i : j] (initially, i = 1 and j = n). We observe the values of $a_i$ and $a_j$ and use Lemma 13 to estimate which subarray is likely to contain the answer to the query. This process is shown in Alg. 15. In lines 5-7, the algorithm observes $a_i$ and $a_j$ and attempts to answer the query based on those two observations alone. If it cannot, lines 8-11 use Lemma 13 and the observed values of $a_i$ and $a_j$ to estimate which subarray contains the answer. Line 12 then checks whether the estimated subarray is correct, i.e., whether the query does fall inside the estimated subarray. If the estimate is correct, the algorithm recursively searches the subarray. Otherwise, the algorithm exits and performs a binary search on the current subarray. Finally, line 3 exits when the size of the (sub)array is too small. The constant 25 is chosen for convenience of analysis (see Sec. 8.4.2.3).
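A minimal Python rendering of Alg. 15 (illustrative and 0-based rather than the 1-based pseudocode; `cdf` is the assumed O(1)-computable $F_\chi$, and distinct keys drawn from a continuous distribution are assumed):

```python
import bisect, math, random

def rds_rank(A, q, cdf, lo=0, hi=None):
    """Alg. 15 sketch: rank of q within A[lo:hi], i.e., the number of
    elements <= q in that subarray."""
    if hi is None:
        hi = len(A)
    k = hi - lo
    if k < 25:                                  # small subarray: binary search
        return bisect.bisect_right(A, q, lo, hi) - lo
    if q < A[lo]:
        return 0
    if q >= A[hi - 1]:
        return k
    # conditional c.d.f. of the k-2 elements strictly between A[lo] and A[hi-1]
    f = (cdf(q) - cdf(A[lo])) / (cdf(A[hi - 1]) - cdf(A[lo]))
    est = lo + 1 + (k - 2) * f
    r = math.sqrt(0.5 * (k - 2) * math.log(math.log(k - 2)))
    l = max(lo, math.floor(est - r))
    u = min(hi - 1, math.ceil(est + r))
    if A[l] > q or A[u] < q:                    # estimate missed: fall back
        return bisect.bisect_right(A, q, lo, hi) - lo
    return (l - lo) + rds_rank(A, q, cdf, l, u + 1)

# Uniform data on [0, 1]: the c.d.f. is the identity.
A = sorted(random.random() for _ in range(10**5))
assert rds_rank(A, 0.3, lambda x: x) == bisect.bisect_right(A, 0.3)
```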
Line 12 then checks if the estimated subarray is correct, i.e., if the query does fall inside the estimated subarray. If the estimate is correct, the algorithm recursively searches the subarray. Otherwise, the algorithm exits and performs binary search on the current subarray. Finally, line 3 exits when the size of the dataset is too small. The constant 25 is chosen for convenience of analysis (see Sec. 8.4.2.3). 8.4.2.3 Complexity Analysis To prove Theorem 7, it is sufficient to show that expected query time of Alg. 15 is O(log log n) for any query. The algorithm recursively proceeds. At each recursion level, the algorithm performs a constant number of operations unless it exits to perform a binary search. Let the depth of recursion be h and let ki be the size of the subarray at the i-th level of recursion (so that binary search at i-th level takes O(log ki)). Let Bi denote the event that the algorithm exits to perform binary search at the i-th iteration. Thus, for any query q, the expected number of operations is EA∼χ[T(ˆr, q)] = X h i=1 c1 + c2P(Bi , B¯ i−1, ....B¯ 1) log ki for constants c1 and c2. Note that P(Bi , B¯ i−1, ....B¯ 1) ≤ P(Bi |B¯ i−1, ....B¯ 1), where P(Bi |B¯ i−1, ....B¯ 1) is the probability that the algorithm reaches i-th level of recursion and exits. By Lemma 13, this probability bounded by 1 log ki . Thus EA∼χ[T(ˆr, q)] is O(h). To analyze the depth of recursion, recall that at the last level, the size of the array is at most 25. Furthermore, at every iteration the size of the array is reduced to at most 2 √ 0.5n log log n+2. For n ≥ 25, 2 √ 0.5n log log n+2 ≤ n 3 4 , so that the size of the array at the i-th recursions is at most n ( 3 4 ) i and the depth of recursion is O(log log n). Thus, the expected total time is O(log log n) . 196 8.4.3 Proof of Theorem 8: RDA Index We present and analyze Recursive Distribution Approximator (RDA) Index that proves Theorem 8. 8.4.3.1 Approximating Rank Function We use ideas from Theorems 6 and 7 to approximate the rank function. We use Alg. 15 as a blueprint, but instead of the c.d.f, we use a piecewise constant approximation to the rank function. If we can efficiently approximate the rank function for subarray A[i − 1 : j + 1], r i,j , to within accuracy O( √ k log log k) where k = j − i − 1, we can merely replace line 8 of Alg. 15 with our function approximator and still enjoy the O(log log n) query time. Indeed, the following lemma shows that this is possible using the piecewise approximation of Alg. 14 and under mild assumptions on the data distribution. Let rˆ i,j t be the function returned by PCF(A[i + 1 : j − 1], t, ai , aj ) with t pieces. Lemma 14. Consider two indexes i, j, where 1 ≤ i < j ≤ n and ai < aj . Let k = j − i − 1. For k ≥ 2, under the conditions of Theorem 8 and for t ≥ ρ2 ρ1 √ k we have P(∥r i,j − rˆ i,j t ∥∞ ≥ ( p 0.5 log log k + 1)√ k) ≤ 1 log k . Proof of Lemma 14. Alg. 14 finds the piecewise constant approximator to r i,j with t pieces with the smallest infinity norm error. Thus, we only need to show the existence of an approximation with t pieces that satisfies conditions of the lemma. To do so, we use the relationship between r i,j and the conditional c.d.f. 
Intuitively, Lemma 13 shows that r i,j and the conditional c.d.f are similar 197 Algorithm 16 RDA Index Construction Input: A sorted array A of size n sampled from a distribution χ with CDF Fχ, a query q Output: The root node of the learned index 1: procedure BuildTree(A, i, j) 2: k ← j − i + 1 ▷ size of A[i : j] 3: if k ≤ 61 then 4: return Leaf node with content A[i : j] 5: r, ϵ ˆ ← PCF(A[i : j], ⌈ ρ2 ρ1 √ k⌉, ai , aj ) 6: k ′ ← ⌈2 √ k(1 + √ 0.5 log log k) + 2⌉ 7: if ϵ > k ′ 2 then 8: return Leaf node with content A[i : j] 9: C ← array of size ⌈ k k ′ ⌉ containing children 10: for z ← 0 to ⌈ k k ′ ⌉ do 11: C[z] ← BuildTree(A, zk′ , (z + 2)k ′ ) 12: return Non-leaf node with children C and model rˆ with max_err k ′ 2 to each other and thus, if we can approximate conditional c.d.f well, we can also approximate r i,j . Formally, by triangle inequality and for any function approximator rˆ we have ∥r i,j − rˆ∥∞ ≤ ∥r i,j − kFi,j χ ∥∞ + ∥kFi,j χ − rˆ∥∞. (8.1) Combining this with Lemma 13 we obtain P(∥r i,j − rˆ∥∞ ≥ p 0.5k log log k + ∥kFi,j χ − rˆ∥∞) ≤ 1 log k . Finally, Lemma 15 stated below shows how we can approximate the conditional c.d.f and completes the proof. Lemma 15. Under the conditions of Lemma 14, there exists a piecewise constant function approximator, rˆ, with ρ2 ρ1 √ k pieces such that ∥rˆ − kFi,j χ ∥∞ ≤ √ k. 198 10 2 10 3 10 4 10 5 10 6 10 7 n 1.0 1.5 2.0 No. Operations (a) Time Complexity 10 2 10 3 10 4 10 5 10 6 10 7 n 10 3 10 5 10 7 10 9 Index Size (b) Space Complexity ( =0.1) ( =0.01) Uniform = 0.1 = 0.01 Figure 8.5: Constant Query and Near-Linear Space 10 2 10 3 10 4 10 5 10 6 10 7 n 10 15 20 No. Operations RDS log2(n) Figure 8.6: LogLogarithmic Query and Constant Space 10 2 10 3 10 4 10 5 10 6 10 7 n 10 15 20 No. Operations (a) Time Complexity 10 2 10 3 10 4 10 5 10 6 10 7 n 10 3 10 5 10 7 Index Size (b) Space Complexity RDA log2(n) n Figure 8.7: Log-Logarithmic Query and Quasi-Linear Space 8.4.3.2 Index Construction and Querying Lemma 14 is an analog of Lemma 13, showing a function approximator enjoys similar properties as the c.d.f. However, different function approximators are needed for every subarray (for c.d.f.s we merely needed to scale and shift them differently for different subarrays). Given that there are O(n 2 ) different subarrays, a naive implementation that creates a function approximator for each subarray takes space quadratic in data size. Instead, we only approximate the conditional rank function for certain sub-arrays while still retaining the O( √ k log log k) error bound per subarray. Construction. Note that r(q) = 0 only if q < a1, so we can filter this case out and assume r(q) ∈ {1, ..., n}. RDA is a tree, shown in Fig. 8.4, where each node is associated with a model. When querying the index, we traverse the tree from the root, and at each node, we use the node’s model to choose the next node to traverse. Traversing down the tree narrows down the possible answers to r(q). We say that a node N covers a range SN , if we have r(q) ∈ SN for any query, q, that traverses the tree and reaches N. We call |SN | node N’s coverage size. Coverage size is the size of search space left to search after reaching a node. The root node, N, covers {1, ..., n} with coverage size n and the coverage size decreases as we traverse down the tree. 
Leaf nodes have coverage size independent of $n$ with high probability, so that finding $r(q)$ takes constant time after reaching a leaf. Each leaf node stores the subarray corresponding to the range it covers as its content.

Algorithm 17 RDA Index Querying
Input: The root node, N, of a learned index, a query q
Output: Rank of query q
1: procedure Query(N, q)
2:   if N is a leaf node then
3:     return BinarySearch(N.content)
4:   $\hat{i} \leftarrow$ N.model(q)
5:   $k \leftarrow$ N.max_err
6:   $z \leftarrow \lfloor\frac{\hat{i}-k}{2k}\rfloor$
7:   return Query(N.children[z], q)

RDA is built by calling BuildTree(A, 1, n), as presented in Alg. 16. BuildTree(A, i, j) returns the root node, $N$, of a tree, where $N$ covers $\{i, \dots, j\}$. If the coverage size of $N$ is smaller than some prespecified constant (line 3, analogous to line 3 in Alg. 15), the algorithm turns $N$ into a leaf node. Otherwise, in line 5 it uses Lemma 14 to create the model $\hat{r}$ for $N$, where $\hat{r}$ approximates $r^{i-1,j+1}$ (recall that $r^{i,j}$ is the rank function for the subarray $A[i+1:j-1]$). If the error of $\hat{r}$ is larger than predicted by Lemma 14, the algorithm turns $N$ into a leaf node and discards the model (this is analogous to line 12 in Alg. 15). Finally, for $k'$ as in line 6, the algorithm recursively builds $\lceil\frac{k}{k'}\rceil$ children for $N$. Each child has a coverage size of $2k'$ and the ranges are spread at $k'$ intervals (line 11). This ensures that the set $\hat{R} = \{\hat{r}-\epsilon, \hat{r}-\epsilon+1, \dots, \hat{r}+\epsilon\}$ (with $|\hat{R}| \le k'$ ensured by line 7) is a subset of the range covered by one of $N$'s children. Furthermore, for any query $q$, $\epsilon$ is the maximum error of $\hat{r}$, so $r(q) \in \hat{R}$. Thus, the construction ensures that for any query $q$ that reaches $N$, $r(q)$ is in the range covered by one of the children of $N$.

Performing Queries. As Alg. 17 shows, to traverse the tree for a query $q$ from a node $N$, we find the child of $N$ whose covered range contains $r(q)$. When $\hat{i}$ is N.model's estimate with maximum error $k$, $z = \lfloor\frac{\hat{i}-k}{2k}\rfloor$ gives the index of the child whose range covers $\{\lfloor\frac{\hat{i}-k}{2k}\rfloor 2k, \dots, (\lfloor\frac{\hat{i}-k}{2k}\rfloor+2)2k\}$, which contains $\{\hat{i}-k, \dots, \hat{i}+k\}$ as a subset and therefore contains $r(q)$. Thus, the child at index $z$ is recursively searched.

[Figure 8.8: Constant Query Time on Real Datasets. Panels: (a) ρ = 0.5, (b) ρ = 1, (c) ρ = 5, (d) ρ = 10; No. Operations vs. n for datasets WL, IOT, BK, FB, OSM, WK.]

[Figure 8.9: Near-Linear Space on Real Datasets. Index Size vs. n for ρ ∈ {0.5, 1, 5, 10}, compared with n.]

8.4.3.3 Complexity Analysis

The query time analysis is very similar to the analysis in Sec. 8.4.2.3 and is thus deferred to Appendix 8.7. Here, we show the space overhead complexity. All nodes at a given tree level have the same coverage size. If the coverage size of nodes at level $i$ is $z_i$, then the number of pieces used for approximation per node is $O(\frac{\rho_2}{\rho_1}\sqrt{z_i})$ and the total number of nodes at level $i$ is at most $O(\frac{n}{z_i})$. Thus, the total number of pieces used at level $i$ is $c\frac{\rho_2}{\rho_1}\frac{n}{\sqrt{z_i}}$ for some constant $c$. Note that if the coverage size at level $i$ is $k$, the coverage size at level $i+1$ is $4\sqrt{k}(1+\sqrt{0.5\log\log k})$, which is more than $k^{1/2}$. Thus, $z_i \ge n^{(1/2)^i}$ and $c\frac{\rho_2}{\rho_1}\frac{n}{\sqrt{z_i}} \le c\frac{\rho_2}{\rho_1} n\, n^{-(1/2)^{i+1}}$. The total number of pieces is therefore at most $c\frac{\rho_2}{\rho_1} n \sum_{i=0}^{c'\log\log n} n^{-(1/2)^{i+1}} \le 3c\frac{\rho_2}{\rho_1} n$ for some constant $c'$. Each piece has magnitude at most $n$ and can be written in $O(\log n)$ bits, so the total overhead is $O(\frac{\rho_2}{\rho_1} n\log n)$ bits.
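To make the traversal concrete, the following is a minimal Python sketch of the query procedure of Alg. 17. The Node class, its field names, and the handling of leaf rank offsets are illustrative assumptions rather than the exact implementation evaluated in Sec. 8.5.

```python
import bisect

class Node:
    """Illustrative RDA node; fields mirror Alg. 16/17 (hypothetical layout)."""
    def __init__(self, children=None, model=None, max_err=None, content=None, offset=0):
        self.children = children  # non-leaf: list of child nodes
        self.model = model        # non-leaf: callable q -> estimated rank
        self.max_err = max_err    # non-leaf: error bound k'/2 from construction
        self.content = content    # leaf: the constant-size subarray it covers
        self.offset = offset      # leaf: rank of the first element of content

def query(node, q):
    """Return the rank of q, following Alg. 17."""
    if node.children is None:
        # leaf: binary search within the constant-size content
        return node.offset + bisect.bisect_left(node.content, q)
    i_hat = node.model(q)              # line 4: estimated rank
    k = node.max_err                   # line 5: maximum model error
    z = int((i_hat - k) // (2 * k))    # line 6: child whose range covers r(q)
    return query(node.children[z], q)
```

The key invariant is the one argued above: the child at index z covers $\{\hat{i}-k, \dots, \hat{i}+k\}$, so the true rank can never leave the traversed subtree.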
8.5 Experiments

We empirically validate our theoretical results on synthetic and real datasets (specified in each experiment).

[Figure 8.10: Log-Logarithmic Query on Real Datasets. Panels: (a) ρ₂/ρ₁ = 0.5, (b) ρ₂/ρ₁ = 1, (c) ρ₂/ρ₁ = 10, (d) ρ₂/ρ₁ = 20; No. Operations vs. n for datasets WL, IOT, BK, FB, OSM, WK, compared with log₂(n).]

[Figure 8.11: Quasi-Linear Space on Real Datasets. Index Size vs. n for ρ₂/ρ₁ ∈ {0.5, 1, 10, 20}, compared with n.]

For each experiment, we report index size and number of operations. Index size is the number of integers stored by each method. Number of operations is the total number of memory operations performed by the algorithm and is used as a proxy for the total number of instructions performed by the CPU. The two metrics differ by a constant factor in our algorithms (our methods perform a constant number of operations between memory accesses), but the latter is compiler dependent and difficult to compute. To report the number of operations, we randomly sample a set $Q$ of 1000 queries and a set $\mathcal{A}$ of 100 different arrays from the distribution $\chi$. Let $n_{q,A}$ be the number of operations for a query $q \in Q$ on an array $A \in \mathcal{A}$. We report $\max_{q\in Q}\frac{1}{|\mathcal{A}|}\sum_{A\in\mathcal{A}} n_{q,A}$, which is the maximum (across queries) of the average (across arrays) number of operations; a short sketch of this computation follows.
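For concreteness, here is a minimal Python sketch of the reported metric; sample_array, sample_queries and count_operations are hypothetical stand-ins for the experiment harness, not part of the actual code base.

```python
def reported_operations(sample_array, sample_queries, count_operations,
                        n_queries=1000, n_arrays=100):
    """Maximum (across queries) of the average (across arrays) operation count."""
    queries = sample_queries(n_queries)                  # the set Q
    arrays = [sample_array() for _ in range(n_arrays)]   # the set A, drawn from chi
    return max(
        sum(count_operations(A, q) for A in arrays) / len(arrays)
        for q in queries
    )
```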
8.5.1 Results on Synthetic Datasets

Constant Query Time and Near-Linear Space. We show that the construction presented in Sec. 8.4.1 achieves the bound of Theorem 6. We consider the Uniform and two Gaussian (with σ = 0.1 and σ = 0.01) distributions. We vary the Gaussian standard deviation to show the impact of the bound on the p.d.f. (as required by Theorem 6). The Uniform p.d.f. has bound 1, and the bound on a Gaussian p.d.f. with standard deviation σ is $\frac{1}{\sigma\sqrt{2\pi}}$. We present results for ϵ = 0.1 and ϵ = 0.01, where ϵ is the space complexity parameter in Theorem 6.

Fig. 8.5 shows the results. It corroborates Theorem 6: Fig. 8.5 (a) shows constant query time achieved with the near-linear space shown in Fig. 8.5 (b). We also see that for larger ϵ, query time actually decreases, suggesting our bound on query time is less tight for larger ϵ. Furthermore, recall that the PCA Index scales the number of pieces by $\rho^{1+\frac{\epsilon}{2}}$ to provide the same bound on query time for all distributions (where ρ is the bound on the p.d.f.). We see an artifact of this in Fig. 8.5 (b), where index size increases as ρ increases.

Log-Logarithmic Query Time and Constant Space. We show that the construction presented in Sec. 8.4.2 achieves the bound of Theorem 7. The theorem applies to distributions with efficiently computable c.d.f.s, so we consider distributions over [0, 1] with $F_\chi(x) = x^t$ for $t \in \{1, 4, 16\}$. At $t = 1$ we have the uniform distribution, and for larger $t$ the distribution becomes more skewed. Fig. 8.6 corroborates the log-logarithmic bound of Theorem 7. Moreover, the results look identical across distributions (multiple lines are overlaid on top of each other in the figure), showing similar performance for distributions with different skew levels.

Log-Logarithmic Query Time and Quasi-Linear Space. We show that the construction presented in Sec. 8.4.3 achieves the bound of Theorem 8. We consider the Uniform and two Gaussian (σ = 0.1 and σ = 0.01) distributions. The results are presented in Fig. 8.7. They corroborate Theorem 8: Fig. 8.7 (a) shows log-logarithmic query time achieved with the quasi-linear space shown in Fig. 8.7 (b). As in the previous case, the results look identical across distributions (multiple lines are overlaid on top of each other in the figure). Comparing Fig. 8.7 (a) and Fig. 8.6, we observe that using a piecewise function approximator achieves results similar to using the c.d.f. for rank function approximation.

8.5.2 Results on Real Datasets

Setup. On real datasets, we do not have access to the data distribution and thus do not know the value of ρ in Theorem 6 or ρ₁ and ρ₂ in Theorem 8. Thus, for each dataset, we perform the experiments for multiple values of ρ and ρ₂/ρ₁ to see at what values the trends predicted by the theorems emerge. Since we do not have access to the c.d.f., Theorem 7 is not applicable. We use 6 real datasets commonly used for benchmarking learned indexes. For each real dataset, we sample n data points uniformly at random, for different values of n, from the original dataset, and queries are generated uniformly at random from the data range. The datasets are WL and IOT from [40, 62, 44] and BK, FB, OSM, WK from [76], described next. WL: Web Logs dataset containing 714M timestamps of requests to a web server. IOT: timestamps of 26M events recorded by IoT sensors installed throughout an academic building. BK: popularity of 200M books from Amazon, where each key represents the popularity of a particular book. FB: 200M randomly sampled Facebook user IDs, where each key uniquely identifies a user. OSM: cell IDs of 800M locations from Open Street Map, where each key represents an embedded location. WK: timestamps of 200M edits from Wikipedia, where each key represents the time an edit was committed.

Results. Figs. 8.8 and 8.9 show the time and space complexity of the PCA algorithm (Theorem 6) on the real datasets for various values of ρ. Note that the value of ρ affects the number of pieces used, as described by Lemma 11. Furthermore, Figs. 8.10 and 8.11 show the time and space complexity of the RDA algorithm (Theorem 8) on the real datasets for various ratios ρ₂/ρ₁. Note that the value of ρ₂/ρ₁ affects the number of pieces used per node, as described by Lemma 14.

For all datasets except OSM, the trends described by Theorems 6 and 8 hold for values of ρ and ρ₂/ρ₁ as small as 1. This shows that our theoretical results hold on real datasets, and that the distribution-dependent factors, ρ and ρ₂/ρ₁, are typically small in practice. However, on the OSM dataset the values of ρ and ρ₂/ρ₁ may be as large as 10 and 20, respectively. In fact, [76] shows that non-learned methods outperform learned methods on this dataset. As such, our results provide a possible explanation (large values of ρ) for learned methods not performing as well on this dataset. Indeed, OSM is a one-dimensional projection of a spatial dataset using Hilbert curves (see [76]), which distorts the spatial structure of the data and can thus lead to sharp changes in the c.d.f. (and therefore a large ρ).

8.6 Conclusion

We theoretically showed and empirically verified that a learned index can achieve sub-logarithmic expected query time under various storage overheads and mild assumptions on the data distribution. All our proofs are constructive, using piecewise and hierarchical models that are common in practice. Our results provide evidence for why learned indexes perform better than traditional indexes in practice.
Future work includes relaxing assumptions on data distribution, finding necessary conditions for sub-logarithmic query time, and analyzing the trade-offs of different modeling approaches.

8.7 Proofs

Proof of Lemma 10. Assume w.l.o.g. that the sub-exponential distribution is centered. Let $Z_l$ be the event that any point in the array is larger than $l$ or smaller than $-l$ for a positive number $l$. Since the distribution is sub-exponential, by a union bound, $P(Z_l) \le 2ne^{-lK}$ for some constant $K$. To have $2ne^{-lK} \le \frac{1}{\log n}$ we need $2n\log n \le e^{lK}$, i.e., $l \ge \frac{1}{K}\log(2n\log n)$. So let $l = \lceil\frac{2}{K}\log 2n\rceil$, and we have $P(Z_l) \le \frac{1}{\log n}$. Now, to construct the $\hat{R}'$ index, we first check if any point in $A$ is larger than $l$ or smaller than $-l$. If so, we do not build the index and only do binary search. Otherwise, we create $2l$ instances of the $\hat{R}$ index, with the $i$-th index covering the range $[-l+i, -l+i+1]$. Note that the interval $[-l+i, -l+i+1]$ has length 1, so that scaling and translating the distribution to the interval [0, 1] does not impact the p.d.f. of the distribution. Queries use one of the learned models to find the answer. Thus, the query time is $O(\log n)$ with probability $\frac{1}{\log n}$, and it is $t(n)$ with probability $1 - \frac{1}{\log n}$, which in expectation is $O(t(n))$. The space overhead is now $O(s(n)\log n)$.

Proof of Lemma 13. Recall that $X_1, \dots, X_n$ are i.i.d. random variables sampled from $\chi$. Furthermore, the array $A$ is a reordering of the random variables so that $A[i] = X_{s_i}$ for some index $s_i$. That is, each element $A[i]$ is itself a random variable and equal to one of $X_1, \dots, X_n$; $A[i] = a_i$ is a random event. For $k = j-i-1 \ge 2$, let $\{X_{r_1}, \dots, X_{r_k}\} \subseteq \{X_1, \dots, X_n\}$ be the elements of the subarray $A[i+1:j-1]$, which we denote by $X_{r_z} \in A[i+1:j-1]$ for $1 \le z \le k$. Note that $X_{r_1}, \dots, X_{r_k}$ is not sorted, but the subarray $A[i+1:j-1]$ is the random variables $X_{r_1}, \dots, X_{r_k}$ in sorted order.

For any random variable $X_{r_z}$, $z \in \{1, \dots, k\}$, we first obtain its conditional c.d.f. given the observations $A[i] = a_i$ and $A[j] = a_j$. The conditional c.d.f. can be written as

$$P(X_{r_z} < x \mid A[i] = a_i, A[j] = a_j). \quad (8.2)$$

Let $\bar{X} = \{X_1, \dots, X_n\} \setminus \{X_{r_z}\}$. Given that $X_{r_z} \in A[i+1:j-1]$, the event $A[i] = a_i, A[j] = a_j$ is equivalent to the conjunction of the following events: (i) at most $i-1$ of the r.v.s in $\bar{X}$ are less than $a_i$, (ii) at least $i$ of the r.v.s in $\bar{X}$ are less than or equal to $a_i$, (iii) at most $j-2$ of the r.v.s in $\bar{X}$ are less than $a_j$, (iv) at least $j-1$ of the r.v.s in $\bar{X}$ are less than or equal to $a_j$, and (v) $a_i \le X_{r_z} \le a_j$. This is because (i) and (ii) imply $A[i] = a_i$, while (iii)-(v) imply $A[j] = a_j$. Conversely, $A[i] = a_i$ and $X_{r_z} \in A[i+1:j-1]$ imply (i) and (ii); $A[j] = a_j$ and $X_{r_z} \in A[i+1:j-1]$ imply (iii) and (iv); and $A[i] = a_i$, $A[j] = a_j$, $X_{r_z} \in A[i+1:j-1]$ imply (v). Now, denote by $\phi(\bar{X})$ the event described by (i)-(iv), so that Eq. 8.2 is $P(X_{r_z} < x \mid \phi(\bar{X}), a_i \le X_{r_z} \le a_j)$. $X_{r_z}$ is independent from all r.v.s in $\bar{X}$, so that Eq. 8.2 simplifies to

$$P(X_{r_z} < x \mid a_i \le X_{r_z} \le a_j) = F^{i,j}_\chi(x)$$

for all $X_{r_z} \in \{X_{r_1}, \dots, X_{r_k}\}$. Thus, the r.v.s in $\{X_{r_1}, \dots, X_{r_k}\}$ have the conditional c.d.f. $F^{i,j}_\chi(x)$.

A similar argument shows that the r.v.s in any subset of $X_{r_1}, \dots, X_{r_k}$ are independent, given $A[i] = a_i, A[j] = a_j$. Specifically, let $\tilde{X} \subseteq \{X_{r_1}, \dots, X_{r_k}\}$ with $|\tilde{X}| \ge 2$. Then, the joint c.d.f. of the r.v.s in $\tilde{X}$ can be written as

$$P(\forall X \in \tilde{X}: X < x \mid A[i] = a_i, A[j] = a_j). \quad (8.3)$$

Similar to before, define $\bar{X} = \{X_1, \dots, X_n\} \setminus \tilde{X}$.
Given that $\forall X \in \tilde{X}: X \in A[i+1:j-1]$, the event $A[i] = a_i, A[j] = a_j$ is equivalent to the conjunction of the following events: (i) at most $i-1$ of the r.v.s in $\bar{X}$ are less than $a_i$, (ii) at least $i$ of the r.v.s in $\bar{X}$ are less than or equal to $a_i$, (iii) at most $j-1-|\tilde{X}|$ of the r.v.s in $\bar{X}$ are less than $a_j$, (iv) at least $j-|\tilde{X}|$ of the r.v.s in $\bar{X}$ are less than or equal to $a_j$, and (v) $\forall X \in \tilde{X}$, we have $a_i \le X \le a_j$. Now let $\phi(\bar{X})$ be the event described by (i)-(iv), so that Eq. 8.3 can be written as $P(\forall X \in \tilde{X}: X < x \mid \phi(\bar{X}), \forall X \in \tilde{X}: a_i \le X \le a_j)$. All r.v.s in $\tilde{X}$ are independent from all r.v.s in $\bar{X}$, so that Eq. 8.3 simplifies to $P(\forall X \in \tilde{X}: X < x \mid \forall X \in \tilde{X}: a_i \le X \le a_j)$. Finally, all r.v.s in $\tilde{X}$ are also independent from each other, so we obtain that Eq. 8.3 is equal to

$$\prod_{X \in \tilde{X}} P(X < x \mid a_i \le X \le a_j), \quad (8.4)$$

proving the independence of the r.v.s in $\tilde{X}$ conditioned on $A[i] = a_i, A[j] = a_j$.

To summarize, we have shown that the r.v.s $X_{r_1}, \dots, X_{r_k}$, conditioned on $A[i] = a_i, A[j] = a_j$, are $k$ i.i.d. random variables with the c.d.f. $F^{i,j}_\chi(x)$. Moreover, $\frac{1}{k}r^{i,j}_A(x)$ is the empirical c.d.f. of the $k$ r.v.s. By the DKW bound [77], for $t \ge 0$ we have

$$P\left(\sup_x \left|\frac{1}{k}r^{i,j}_A(x) - F^{i,j}_\chi(x)\right| \ge \frac{t}{\sqrt{k}}\right) \le 2e^{-2t^2}.$$

Rearranging and substituting $t = \sqrt{0.5\log\log k}$ proves the lemma.

Proof of Lemma 15. Let $k = j-i-1$. Divide the range $[a_i, a_j]$ into $t$ uniformly spaced pieces, so that the $z$-th piece approximates $kF^{i,j}_\chi$ over $I_z = [a_i + z\frac{a_j-a_i}{t}, a_i + (z+1)\frac{a_j-a_i}{t}]$, which is an interval of length $\frac{a_j-a_i}{t}$. Let $P(x)$ be the constant in the Taylor expansion of $kF^{i,j}_\chi$ around some point in $I_z$. By Taylor's remainder theorem,

$$\sup_{x\in I_z} |P(x) - kF^{i,j}_\chi(x)| \le k \times \frac{f_\chi(c)}{F_\chi(a_j) - F_\chi(a_i)} \times \frac{a_j - a_i}{t} \quad (8.5)$$

for some $c \in I_z$, where we have used the fact that the derivative of the c.d.f. is the p.d.f., and that any two points in $I_z$ are at most $\frac{a_j-a_i}{t}$ apart. By the mean value theorem, there exists a $c' \in [a_i, a_j]$ so that $\frac{F_\chi(a_j) - F_\chi(a_i)}{a_j - a_i} = f_\chi(c')$. This together with Eq. 8.5 yields

$$\sup_{x\in I_z} |P(x) - kF^{i,j}_\chi(x)| \le k \times \frac{f_\chi(c)}{f_\chi(c')} \times \frac{1}{t} \le \frac{\rho_2}{\rho_1}\frac{k}{t}.$$

Since this holds for every piece, setting $t \ge \frac{\rho_2}{\rho_1}\sqrt{k}$ ensures $\frac{\rho_2}{\rho_1}\frac{k}{t} \le \sqrt{k}$, so that $\|P - kF^{i,j}_\chi\|_\infty \le \sqrt{k}$.

Proof of Query Time Complexity of Theorem 8. The algorithm traverses the tree recursively. At each recursion level, the algorithm performs a constant number of operations unless it performs a binary search. Let the depth of recursion be $h$ and let $k_i$ be the coverage size of the node at the $i$-th level of recursion (so that binary search at the $i$-th level takes $O(\log k_i)$). Let $B_i$ denote the event that the algorithm performs binary search at the $i$-th iteration. Thus, for any query $q$, the expected number of operations is

$$\mathbb{E}_{A\sim\chi}[T(\hat{r}, q)] = \sum_{i=1}^{h}\left(c_1 + c_2\, P(B_i, \bar{B}_{i-1}, \dots, \bar{B}_1)\log k_i\right)$$

for constants $c_1$ and $c_2$. Note that $P(B_i, \bar{B}_{i-1}, \dots, \bar{B}_1) \le P(B_i \mid \bar{B}_{i-1}, \dots, \bar{B}_1)$, where $P(B_i \mid \bar{B}_{i-1}, \dots, \bar{B}_1)$ is the probability that the algorithm performs binary search given that it reaches the $i$-th level of the tree. By Lemma 14, this probability is bounded by $\frac{1}{\log k_i}$. Thus $\mathbb{E}_{A\sim\chi}[T(\hat{r}, q)]$ is $O(h)$. To analyze the depth of recursion, recall that at the last level, the size of the array is at most 61. Furthermore, at every iteration the size of the array is reduced to at most $4\sqrt{n}(1+\sqrt{0.5\log\log n})$. For $n \ge 61$, $4\sqrt{n}(1+\sqrt{0.5\log\log n}) \le n^c$ for some constant $c < 1$, so that the size of the array at the $i$-th recursion is at most $n^{c^i}$ and the depth of recursion is $O(\log\log n)$. Thus, the expected total time is $O(\log\log n)$.

Proof of Prop. 2. Note that $r_A(x)$ is a non-decreasing step function, where each step has size 1.
Let $s_i$ be the number of steps of $r_A(x)$ in the interval $I_i$. Therefore,

$$|r_A(x) - r_A(x')| \le s_i \quad (8.6)$$

for any $x, x' \in I_i$. Therefore, for $x \in I_i$, substituting $\hat{r}(x;\theta) = r_A(i \times \frac{1}{k})$ into Eq. 8.6 we get

$$e_i \le s_i. \quad (8.7)$$

Furthermore, points of discontinuity (i.e., steps) of $r_A(x)$ occur when $x = A[j]$ for $j \in [n]$. Therefore, $s_i = |\{j \mid A[j] \in I_i\}|$. That is, $s_i$ is equal to the number of points in $A$ that are sampled in the range $I_i$.

Proof of Lemma 12. Specifically, we bound the probability that $s_{max} \ge c$ for some constant $c$. In other words, we bound the probability of the event, $E$, that any interval has more than $c$ points sampled in it, for any $c \ge 3$. Let $\delta_i = F_\chi(\frac{i+1}{k}) - F_\chi(\frac{i}{k})$ be the probability that a point falls inside $I_i$, so that $\delta_i^c$ is the probability that a given set of $c$ sampled points all fall inside $I_i$. Taking the union bound over all possible subsets of size $c$, we get that the probability, $p_i$, that the $i$-th interval contains $c$ or more points is at most

$$p_i \le \binom{n}{c}\delta_i^c \le \frac{(en)^c}{c^c}\delta_i^c,$$

where the second inequality follows from Stirling's approximation. By the mean value theorem, there exists an $x_0 \in [\frac{i}{k}, \frac{i+1}{k}]$ such that $F_\chi(\frac{i+1}{k}) - F_\chi(\frac{i}{k}) = f_\chi(x_0)\frac{1}{k}$. Therefore, $\delta_i \le \frac{\rho}{k}$. Thus, by the union bound,

$$P(E) \le \sum_{i=1}^{k}\frac{(en)^c}{c^c}\left(\frac{\rho}{k}\right)^c = k\frac{(en)^c}{c^c}\left(\frac{\rho}{k}\right)^c. \quad (8.8)$$

Now set $k \ge n^{1+\frac{2}{c-1}}\rho^{1+\frac{1}{c-1}}$ and substitute into Eq. 8.8 to obtain $P(E) \le \frac{1}{n}$.

Chapter 9
Theoretical Guarantees for Dynamic Learned Database Operations

9.1 Introduction

Given a fixed dataset, learned database operations (machine learning models learned to perform database operations such as indexing, cardinality estimation and sorting) have been shown to outperform non-learned methods, providing speed-ups and space savings both empirically [62, 61, 63] and, for the case of indexing, theoretically [163, 39]. For dynamic datasets (e.g., when new points can be inserted into the dataset), significant empirical benefits are also often observed when using learned methods. However, an important caveat accompanying these results is that, especially when the data distribution changes, models' performance may deteriorate after new insertions [33, 89, 138], possibly to worse than non-learned methods [141]. This, combined with the lack of a theoretical understanding of the behavior of learned models as datasets change, poses a critical hurdle to their deployment in practice. It is theoretically unclear why and when learned models outperform non-learned methods, and, until this thesis, no theoretical work showed any advantage to using learned methods on dynamic datasets and under distribution shift.

The goal of this chapter is to theoretically understand the capabilities of learned models for database operations, show why and when they outperform non-learned alternatives, and provide theoretical guarantees on their performance. We specifically study learned solutions for three fundamental database operations: indexing, cardinality estimation and sorting. Our main focus is the study of learned indexing and cardinality estimation in the presence of insertions from a possibly changing data distribution, while we also study learned sorting (in the static scenario) to show the broader applicability of our developed theoretical tools. In all cases, a learned model, $\hat{f}(x;\theta)$, is used to replace a specific data operation, $f_D(x)$, that takes an input $x$ and calculates a desired answer from the dataset $D$.
For cardinality estimation, $f_D(x)$ returns the number of points in the database $D$ that match the query $x$, and for indexing $f_D(x)$ returns the true location of $x$ in a sorted array. The model $\hat{f}(x;\theta)$ is trained to approximate $f_D(x)$, and an accurate approximation leads to efficiency gains when using the model (e.g., for learned indexing, if $\hat{f}(x;\theta)$ gives an accurate estimate of the location of $x$ in a sorted array, a local search around the estimated location efficiently finds the exact location). In the presence of insertions, the ground truth $f_D(x)$ changes as the dataset changes (e.g., the cardinality of some queries increases as new points are inserted). Thus, as more points are inserted (not only due to distribution shift, but exacerbated by it), the accuracy of $\hat{f}(x;\theta)$ worsens. A common solution is to periodically retrain $\hat{f}$ to ensure consistent accuracy. This, however, increases insertion cost when insertions trigger a (computationally expensive) model retraining.

Theoretically, the relationship between accuracy change and new data insertion has not been well understood, leading to a lack of meaningful theoretical guarantees for learned methods in the presence of insertions. The only existing guarantees are by the PGM index [40], which achieves a worst-case insertion time of $O(\log n)$ with worst-case query time of $O(\log^2 n)$. Although experimental results show PGM often outperforms B-trees in practice [40, 141], the theoretical guarantees are worse than those of a B-tree (which supports both insertions and queries in $O(\log n)$). Such theoretical guarantees neither meaningfully characterize the index's performance in practice nor show why and when the learned model performs better (or worse) than B-trees.

In this chapter, we present the first known theoretical characterization of the performance of learned models for indexing and cardinality estimation in the presence of insertions, painting a thorough picture of why and when they outperform non-learned alternatives for these fundamental database operations. Our analysis develops the notion of distribution learnability, a characteristic of data distributions that helps quantify learned database operations' performance for data from such distributions. Using this notion, our results are distribution dependent (as one expects bounds on learned operations should be), without making unnecessary assumptions about the data distribution. Our developed theoretical framework builds a foundation for the analysis of learned database operations in the future. To show its broader applicability, we present a theoretical analysis of learned sorting, showing its theoretical characteristics and proving why and when it outperforms non-learned methods.

| Learned Operation | Query Complexity | Insertion Complexity | Space Complexity |
| Indexing | $T^X_n \log\log n + \log \delta n$ | $T^X_n \log\log n + \log \delta n + B^X_n \log^2\log n$ | $n\log n$ † |
| CE, $d$-dim, $\epsilon = \Omega(\sqrt{n})$ | $T^X_n$ | $\max\{\frac{\delta n}{\epsilon}, 1\}B^X_n$ | $S^X_n$ |
| CE, 1-dim | $T^X_{\epsilon^2} + \log n$ | $\max\{\delta\epsilon, 1\}B^X_{\epsilon^2} + \log n$ | $\frac{n}{\epsilon^2}S^X_{\epsilon^2} + \frac{n}{\epsilon^2}\log n$ |
| Sorting | $T^X_{\sqrt{n}}\, n\log\log n$ † | n/a | $S^X_{\sqrt{n}} + n\log n$ |
| Sorting, appx. known dist. | $T^X n$ | n/a | $S^X$ |

Table 9.1: Summary of results for data sampled from a distribution learnable class X (CE: cardinality estimation; for the sorting rows, the query column reports the time to sort the static array; †: for simplicity assuming $S^X_n$, $B^X_n$ are at most linear in data size, see Theorem 13 and Theorem 17 for the general cases).

9.1.1 Summary of Results

Table 9.1 summarizes our results in the following setting. Suppose $n$ data points are sampled independently from distributions $\chi_1, \dots, \chi_n$, and let the distribution class $X = \{\chi_1, \dots, \chi_n\}$.
The points are inserted one by one into a dataset. $D_i$ denotes the dataset after $i$ insertions. Our goal is to efficiently and accurately answer cardinality estimation and indexing queries on $D_i$ for any $i$, i.e., as new points are being inserted. We denote distribution shift by $\delta \in [0, 1]$ (defined based on, and often equal to, the total variation distance), where $\delta = 0$ means no distribution shift. Table 9.1 also contains results for sorting, where the goal is to sort the fixed array $D_n$, and the reported results are the time and space complexity of doing so. For sorting only, we assume the samples are i.i.d. All results are expected complexities, with the expectation over the sampling of the data, and the insertion complexity is amortized over $n$ insertions. To obtain our results, we develop a novel theoretical framework, dubbed distribution learnability. We provide an informal discussion of the framework before discussing the results.

Distribution Learnability. At a high level, distribution learnability means we can model a data distribution well. This notion allows us to state our results in the form "if we can model a data distribution well, learned database operations will perform well on data from that distribution". Then, if one indeed proves that "we can model the data distribution χ well", our result immediately implies "learned database operations will perform well on data coming from χ". Crucially, our Theorem 12 shows that pure function approximation results (independent of the application of learned databases) imply distribution learnability, enabling us to utilize function approximation results to show the benefits of learned database operations.

More concretely (but still informally), we say a distribution class $X$ is distribution learnable with parameters $T^X_n$, $S^X_n$, $B^X_n$ if, given a set of observations, $D_n$, from distributions in $X$, there exists a learning algorithm that returns an accurate model, $\hat{f}$, of the distributions in $X$, such that $\hat{f}$ can be evaluated in $T^X_n$ operations and takes space at most $S^X_n$ to store. Furthermore, the learning algorithm takes time $n \times B^X_n$ to learn $\hat{f}$, where $B^X_n$ is the amortized training time. The notion is related to statistical estimation, but we also utilize it to characterize the time and space complexity of modeling.

The results in Table 9.1 are stated for data sampled from any distribution learnable class $X$. For illustration, we summarize the results for two specific distribution classes: (1) distributions, $X_\rho$, with p.d.f. bounded between $0 < \rho_1$ and $\rho_2 < \infty$, and (2) distributions, $X_c$, where the data distribution is known and probabilities of events can be calculated efficiently (e.g., the distribution is known to be uniform or piecewise polynomial). The first case formulates a realistic scenario for the data distribution (experimentally shown by [163]), while the second case presents a best-case scenario for the learned models, showing what is possible in favorable circumstances. Lemma 16 shows $X_\rho$ and (trivially) $X_c$ are distribution learnable, deriving the corresponding values of $T^X_n$, $S^X_n$ and $B^X_n$ (see Table 9.2 for the exact values). Next, we discuss Table 9.1 for $X_\rho$ and $X_c$, where we substitute the values of $T^X_n$, $S^X_n$ and $B^X_n$ from Lemma 16 and discuss the resulting complexities.

Indexing. After substituting the complexities in the first row of Table 9.1, we obtain that for $X_\rho$ and $X_c$, query and insertion complexities are $O(\log\log n + \log(\delta n))$.
To understand this result, consider the simple scenario with $\delta = 0$, where inserted items are sampled from a fixed distribution, and thus frequent model updates are not necessary. The result shows that a learned model performs insertions and queries in $O(\log\log n)$, showing its superiority over non-learned methods that perform queries and insertions in $O(\log n)$. Nonetheless, when there is a distribution shift, model performance worsens. In the worst case, when $\delta = 1$, we see no advantage to using learned models over non-learned methods. This is not surprising, since learned models use current observations to make predictions about the future, and if the future distribution is drastically different, one should not be able to gain from using the current observations.

Cardinality Estimation. First, consider the second row in Table 9.1, showing the performance of learned models for cardinality estimation in high dimensions when an error of at least $\sqrt{n}$ is acceptable. Substituting the complexities in this row, for $X_c$, we obtain that learned models perform insertions and queries in $O(1)$ time and space in this setting. This is significant, given that a non-learned method such as sampling (and more broadly ϵ-approximations [84]), even in this accuracy regime, needs space exponential in dimensionality [140]. Nonetheless, modeling in high dimensions is difficult, and consequently this result requires the accuracy to be at least $\sqrt{n}$. Moreover, even for $\epsilon \ge \sqrt{n}$ but for the more general distribution class $X_\rho$, our results show that learned methods will also take space exponential in dimensionality (which is broadly unavoidable, even for neural networks [102], without further assumptions). We also mention that $\sqrt{n}$ has a statistical significance (see Sec. 9.3 for discussion) and appears throughout our analysis.

Second, we show that in one dimension (the third row of Table 9.1), learned models perform cardinality estimation effectively, where for $X_c$ a learned model can perform queries and insertions in $O(\log n)$ while taking space $O(\frac{n}{\epsilon^2}\log n)$. This result also shows that a learned approach outperforms the non-learned (and worst-case optimal) method discussed in [140] that takes space $O(\frac{n}{\epsilon}\log(\epsilon n))$ to answer queries.

Sorting. Substituting the complexities in the fourth row of Table 9.1, we obtain $O(n\log\log n)$ time complexity for $X_\rho$, using a method that is a variation of [63] that learns to sort through sampling. Our framework applies to this method because its study needs to consider the generalization of a model learned from samples (similar to how models need to generalize to a new dataset after insertions). Moreover, the last row of Table 9.1 shows that, if we (approximately) know the data distribution, and the distribution can be efficiently evaluated and stored, we can sort an array in $O(n)$ ($T^X$ is independent of data size), showing the benefits of using the data distribution to perform database operations.

To conclude, our results in Table 9.1 are more general than the two distribution classes discussed above. A major contribution of this chapter is developing the distribution learnability framework that allows us to orthogonally study the two problems of modeling a data distribution (the modeling problem) and of how learned models can be used to perform database operations with theoretical guarantees (the model utilization problem).
Table 9.1 summarizes our contributions to the latter problem, while our results connecting distribution learnability to function approximation concepts (Theorem 12) are our contribution to the former. The rest of this chapter discusses our developed framework and results in more detail; for the sake of space, formal discussion and proofs are deferred to the appendix.

9.2 Preliminaries

9.2.1 Problem Setting

Setup. We study performing database operations on a possibly changing $d$-dimensional dataset. We consider either the setting where $n$ data points are inserted one by one (the dynamic setting), or the setting where we are given a fixed set of $n$ data points (the static setting). We define $D_i$ as the dataset after $i$ insertions, and the final dataset, $D_n$, is often denoted as $D$. We study the indexing, cardinality estimation and sorting operations. For indexing, the goal is to build an index to store and find items in a 1-dimensional dataset. The index supports insertions and queries. That is, after $i$ insertions, for any $i$, we can retrieve items from the dataset $D_i$, where the query is either an exact match query or a range query. For cardinality estimation, the dataset is $d$-dimensional, and we support insertions and axis-parallel queries. That is, after $i$ insertions, for any $i$, we would like to estimate the number of items in the dataset $D_i$ that match a query $q$, where $q$ defines an axis-parallel hyper-rectangle. Finally, the goal of sorting is to sort a fixed 1-dimensional array, $D$, of size $n$. Indexing and sorting always return exact results (i.e., the array has to be fully sorted after the operation), while cardinality estimation accepts an error of $\epsilon$ in the query answer estimates.

Data Distribution and Distribution Shift. We consider the case where the $i$-th data point is sampled independently from a distribution $\chi_i$, and denote by $D \sim \chi$ this sampling procedure, where $\chi = \{\chi_1, \dots, \chi_n\}$. We say $D$ was sampled from a distribution class $X$ if $\chi_i \in X$ for all $i$. We use total variation to quantify distribution shift. We say $D$ was sampled from a distribution $\chi$ with distribution shift $\delta$ when $\max_{\chi_i,\chi_j\in\chi}\|\chi_i - \chi_j\|_{TV} \le \delta$, where $\|\chi_i - \chi_j\|_{TV}$ denotes the total variation (TV) distance between $\chi_i$ and $\chi_j$. We also define the total variation of a distribution set $\chi$ as $TV(\chi) = \sup_{\chi_i,\chi_j\in\chi}\|\chi_i - \chi_j\|_{TV}$. TV is a number between 0 and 1, with $\delta = 1$ the maximum distribution shift and $\delta = 0$ the case with no distribution shift.

Problem Definition. We study the performance of learned models when performing the above data operations. Assume an algorithm takes at most $T_I(D)$ operations to perform the $n$ insertions from a dataset $D$, at most $T_Q(D)$ operations to perform any query, and has $S(D)$ space overhead (excluding the space to store the data). We study the amortized expected insertion time, defined as $\frac{1}{n}\mathbb{E}_{D\sim\chi}T_I(D)$, the expected query time, $\mathbb{E}_{D\sim\chi}T_Q(D)$, and the storage space, $\mathbb{E}_{D\sim\chi}S(D)$.

9.2.2 Learned Database Operations

Operation Functions. Let $f_D(x)$ be an operation function, defined as a function that takes an input $x$ and outputs the answer, calculated from the database $D$, for some desired operation. In this chapter, $f_D$ is either the cardinality function, $c_D(x)$, which takes a query, $x$, as input and outputs the number of points in $D$ that match $x$, or the rank function, $r_D(x)$, which takes a 1-dimensional query as input and returns the number of elements in $D$ smaller than $x$. The rank function is used in sorting and indexing, because $r_D(x)$ is the index of $x$ if $D$ were stored in a sorted array; a short illustrative sketch of these two functions follows.
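The sketch below is a minimal Python illustration of the two operation functions on a concrete dataset; the NumPy-based implementation and the inclusive rectangle convention are assumptions for illustration only.

```python
import numpy as np

def rank(D, x):
    """r_D(x): number of elements of the 1-d dataset D smaller than x."""
    return int(np.sum(np.asarray(D) < x))

def cardinality(D, q_min, q_max):
    """c_D(q): number of points of the d-dimensional dataset D inside the
    axis-parallel rectangle with corners q_min and q_max (inclusive)."""
    D = np.asarray(D)
    inside = np.all((D >= np.asarray(q_min)) & (D <= np.asarray(q_max)), axis=1)
    return int(inside.sum())

print(rank([0.2, 0.5, 0.9], 0.6))                                 # 2
print(cardinality([[0.1, 0.3], [0.7, 0.8]], [0, 0], [0.5, 0.5]))  # 1
```

A learned operation replaces these exact computations with a model $\hat{f}(x;\theta)$ that approximates them.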
We use the notation $f_D \in \{r_D, c_D\}$ (or $f \in \{r, c\}$) to refer to both functions, $r_D$ and $c_D$ (for instance, $f_D \ge 0$ is equivalent to the two independent statements $r_D \ge 0$ and $c_D \ge 0$). We also define the distribution operation function, $f_\chi$, for an operation $f$, defined as $f_\chi(x) = \mathbb{E}_{D\sim\chi}[f_D(x)]$ when $D$ is sampled from a distribution $\chi$. Note that distribution operation functions depend only on the data distribution (and not on the observed dataset). For instance, $\frac{1}{n}r_\chi$ is the c.d.f. of the data distribution $\chi$ if $D$ is sampled i.i.d. from $\chi$, and similarly $\frac{1}{n}c_\chi(x)$ is the probability that a sample from $\chi$ falls in the axis-parallel rectangle defined by $x$. We call $\mathbb{E}_{D\sim\chi}[c_D(x)]$ the distribution cardinality function.

Learned Database Operations with Insertions. Learned database operations learn a model $\hat{f}$ that approximates $f_D$ well, and use the learned model to obtain an estimate of the operation output (for sorting and indexing, a refinement step ensures exact results, through either a local binary search or lightweight sorting). However, as new data points are inserted and the dataset changes, the ground-truth answers to operations change, thereby increasing the model error. Note that model answers are scaled to the current data size (i.e., if $\hat{f}$ was trained on a dataset of size $i$ and tested on a dataset of size $j$, we report $\frac{j}{i}\hat{f}$ as answers), but this does not stop the error from increasing. Thus, to guarantee that the error stays below a threshold, one needs to update the models as the dataset changes, which is often done by periodically retraining them. Model retraining contributes to the insertion cost in the database. To keep the insertion cost low, one needs to minimize retraining frequency. Meanwhile, infrequent retraining increases error (and, for indexing, query time). Finding a suitable balance between insertion time, accuracy and query time is the subject of much of our theoretical study.

9.3 Analysis through Distribution Learnability

Our goal is to ensure that a model $\hat{f}$ trained to perform an operation $f$, $f \in \{r, c\}$, has bounded error. We first discuss a lower bound on the error of models in the presence of insertions, which motivates our analysis framework.

Lower Bound on Model Generalization. Consider a model $\hat{f}$, trained after the $i$-th insertion using the dataset $D_i$. Assume the model is not retrained after $k$ further insertions, so that $\hat{f}$ is used to answer queries for the dataset $D_j$, $j = i + k$. The following result shows a lower bound on the expected maximum generalization error of the model on dataset $D_j$, defined as $\sup_x \mathbb{E}_{D\sim\chi}\left[\left|\frac{j}{i}\hat{f}(x) - f_{D_j}(x)\right|\right]$.

Theorem 9. Consider any model $\hat{f}$ trained after the $i$-th insertion and on dataset $D_i$. For any integer $j > i + 2$, after performing $k = j - i$ new insertions we have

$$\sup_x \mathbb{E}_{D_j\sim\chi}\left[\left|\frac{j}{i}\hat{f}(x) - f_{D_j}(x)\right|\right] \ge \frac{\sqrt{k}}{4},$$

when $D_j$ is i.i.d. from any continuous distribution $\chi$.

Theorem 9 states that the expected error of a single fixed model, no matter how good the model is when it is trained, will, after $k$ insertions, increase to $\Omega(\sqrt{k})$ on some input. Consequently, to achieve an error of at most $\epsilon$, we have to retrain the model at least every $(4\epsilon)^2$ insertions. For any constant error $\epsilon$, this implies that $\frac{n}{(4\epsilon)^2} = O(n)$ model retrainings are needed when inserting $n$ data points. Model retraining for many practical choices costs $O(n)$ (to go over the data at least once), so that the amortized insertion cost, i.e., the insertion cost per insertion, must be at least linear in $n$. This is significantly larger than non-learned methods; e.g., for indexing, B-trees support insertions in $O(\log n)$.
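The effect behind Theorem 9 is easy to observe numerically. The simulation below is an illustration (not part of the formal development): it freezes an exact rank model for $D_i$, inserts $k$ points from the same uniform distribution, and reports the maximum error of the scaled model answers against the $\sqrt{k}/4$ lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)
i = 1_000_000
D = np.sort(rng.uniform(size=i))     # D_i: the initial dataset
grid = np.linspace(0, 1, 2001)       # evaluation points
f_hat = np.searchsorted(D, grid)     # frozen model: exact ranks on D_i

for k in [1_000, 10_000, 100_000]:
    Dj = np.sort(np.concatenate([D, rng.uniform(size=k)]))  # D_j after k insertions
    j = len(Dj)
    # scaled model answer (j/i) * f_hat vs. the new ground-truth ranks
    err = np.max(np.abs((j / i) * f_hat - np.searchsorted(Dj, grid)))
    print(f"k={k}: max error {err:.0f}, sqrt(k)/4 = {np.sqrt(k) / 4:.0f}")
```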
Nonetheless, the $\sqrt{k}$ barrier (and consequently a heavy insertion cost) can be avoided, as is often done in practice, by partial retraining. A common example is arranging a set of models in a tree structure and retraining parts of the tree structure as new data is inserted. This avoids a full retraining every $O(\epsilon^2)$ insertions, and instead makes smaller necessary adjustments throughout that are cheap to make. Thus, Theorem 9 provides a theoretical justification for many practical design choices in existing work [33, 165, 44] that partition the space and train multiple models, utilizing data structures built around multiple models to perform operations.

Analysis Framework Overview. In light of Theorem 9 and existing practical modeling choices that use a set of models to perform an operation, analyzing database operations in the presence of insertions can be divided into two components: (1) how well a model can learn a set of observations (the modeling problem), and (2) how a set of models can be used to perform operations in the presence of insertions (the model utilization problem). Our framework allows studying the two separately, as discussed next.

9.3.1 The Modeling Problem

Our analysis is divided into studying the problem of modeling and the problem of model utilization. We introduce the notion of distribution learnability to abstract away the modeling problem when studying the utilization problem. Roughly speaking, if a distribution class is distribution learnable, we can use observations from the class to model their distribution operation functions well. In other words, if a distribution class is distribution learnable, we have a solution to the modeling problem, and thus we can focus on the model utilization problem. Meanwhile, the modeling problem is reduced to showing distribution learnability. In this section, we define distribution learnability and discuss how we can prove that a distribution class is distribution learnable.

9.3.1.1 Defining Distribution Learnability

A distribution class is distribution learnable if there exists an algorithm that returns a good model of the data distribution given an observed dataset. Formally,

Definition 3. A distribution class $X$ is said to be distribution learnable for an operation $f$, $f \in \{r, c\}$, with parameters $T^X_n$, $S^X_n$ and $B^X_n$, if for any $\chi \subseteq X$ there exists an algorithm that takes a set of observations, $D \sim \chi$, of size $n$ as input and returns a model $\hat{f}$ such that:

• (Accuracy) If $D$ is sampled from $\chi$, for some $\chi \subseteq X$, we have that
$$P_{D\sim\chi}\left[\|f_\chi - \hat{f}\|_\infty \ge \epsilon\right] \le \kappa_1 e^{-\kappa_2\left(\frac{\epsilon}{\sqrt{n}}-1\right)^2}$$
for any $\epsilon \ge \sqrt{n}$ and universal constants $\kappa_1, \kappa_2 > 0$;

• (Inference Complexity) It takes $T^X_n$ operations to evaluate $\hat{f}$, and space $S^X_n$ to store it; and

• (Training Complexity) Each call to the algorithm costs $B^X_n$ amortized operations.

That a distribution class is distribution learnable for an operation $f$ means that observations from the distribution class can be used to model the expected value of $f$ to a desired accuracy, and that the distribution-dependent parameters $T^X_n$, $S^X_n$ and $B^X_n$ characterize the computational complexity of the modeling (the amortized number of operations is the total number of operations divided by $n$, so $nB^X_n$ is the total number of operations). We make two remarks regarding the definition.

Remark 10. The accuracy requirement for distribution learnability is defined so that, with high probability, the model error is at most $O(\sqrt{n})$.
This is due to Theorem 9, which shows the expected generalization error after $n$ insertions will be $\Omega(\sqrt{n})$ as the dataset changes because of insertions, irrespective of modeling accuracy. Thus, having modeling error lower than $O(\sqrt{n})$ will not improve the generalization error, but will increase inference complexity (larger models will be needed to improve accuracy). Meanwhile, due to the inherent $\Omega(\sqrt{n})$ error, an extra modeling error of $\sqrt{n}$ only increases the generalization error by a constant factor, thus not changing any of our results asymptotically.

Remark 11. Distribution learnability for the distribution class $X$ is defined so that we can characterize the computational complexity of modeling data from $X$. Such characterization is important because different modeling choices are beneficial for different distributions. For instance, a linear model may be sufficient to model data from a uniform distribution but not from a Gaussian distribution. The definition allows us to distinguish simple distribution classes where we can create models that are fast to evaluate (e.g., linear models for the uniform distribution) from more complex distribution classes that may need models with higher runtime and space complexity (e.g., neural networks for complex distributions). This is done through the parameters $T^X_n$, $S^X_n$ and $B^X_n$.

9.3.1.2 Proving Distribution Learnability

For a distribution class, $X$, to be distribution learnable for an operation $f$, we need to be able to model the distribution operation functions in that class using some model class $F$. Intuitively, $F$ needs to have enough representation power to model distributions in $X$, and we need to be able to effectively optimize over $F$ to find good representations given an input (i.e., $F$ is optimizable). The following theorem shows that if these two properties hold, then the distribution class is indeed distribution learnable. For the sake of space, we only state our result here informally (the formal statement is in Sec. 9.7.2), as a formal statement requires making representation power and optimizability concrete, which diverts from our main discussion.

Theorem 12 (Informal). Assume a function class, $F$, is optimizable, and that $F$ has enough representation power to represent $G$. Let $X$ be a distribution class with $f_\chi \in G$ for all $\chi \in X$, for an operation $f$. Then, $X$ is distribution learnable for operation $f$.

| Distribution class | $T^X_n$ | $S^X_n$ | $nB^X_n$ |
| $X_\rho$ | 1 | $\rho\sqrt{n}\log n$ | $\rho\sqrt{n}\log n$ |
| $X_l$ | $\log l$ | $l\log n$ | $n\log n$ |
| $X_c$ | 1 | 1 | 1 |

Table 9.2: Asymptotic complexities of some distribution learnable classes for the rank function, defined in Lemma 16.

Theorem 12 can be broadly used to translate function approximation results to distribution learnability. For instance, Taylor's theorem shows that infinitely differentiable functions can be approximated by polynomials to arbitrary accuracy (i.e., polynomials have enough representation power to represent infinitely differentiable functions), and the exchange algorithm [107] shows that we can find the best polynomial approximating a function (i.e., shows optimizability for polynomials). These together with Theorem 12 imply that distributions with infinitely differentiable operation functions are distribution learnable. Nonetheless, the time complexity of function approximation is important when deciding what function class to choose for modeling purposes in database applications.
For instance, the exchange algorithm, although it converges, can take too long to find polynomials that model functions with a desired accuracy [107]. Our next result uses Theorem 12 to show distribution learnability using piecewise linear and piecewise constant models that offer better time/space complexity. We first discuss learnability for the rank operation.

Lemma 16. Let $X_\rho$ be the set of distributions with p.d.f. bounded by $\rho$, $X_l$ the set of distributions with piecewise linear c.d.f. with at most $l$ pieces, and $X_c$ a class of distributions whose c.d.f.s can be stored and evaluated in constant time. $X_\rho$, $X_l$ and $X_c$ are distribution learnable for the rank operation with the parameters shown in Table 9.2.

Lemma 16 presents results for multiple distribution classes. $X_\rho$ formulates a realistic scenario (experimentally shown by [163]). $X_c$ captures the ideal scenario for learned models, where the data distribution is easy to model, and is included to show a best-case scenario for our results when using learned models. Piecewise linear models have been used for the purpose of indexing [40, 44], and $X_l$ is included to study their theoretical properties for the distribution class where they are well suited. Next, consider distribution learnability for the cardinality operation.

Lemma 17. Let $X_\rho$ be the set of distributions for which the distribution cardinality function has gradient bounded by $\rho$, and let $X_c$ be a countable set of $c$ distributions for which the distribution cardinality function can be stored and evaluated in constant time. $X_\rho$ and $X_c$ are distribution learnable for cardinality estimation, where the same parameters as in Table 9.2 hold for $X_c$. For $X_\rho$, we have $B^X_n$ and $S^X_n$ as $O(\sqrt{2d}(\rho\sqrt{n})^{2d}\log n)$, while $T^X_n = O(1)$.

As before, we have included $X_c$ to show a best-case scenario for learned models. Nonetheless, cardinality estimation is a problem in high dimensions where modeling is difficult. The exponential behavior in Lemma 17 for $X_\rho$ is required for various modeling choices, including neural networks [102, 151]. To reduce complexity, stricter assumptions on the data distribution are often justified. For example, attributes may be correlated and only fall in a small part of the space. A common assumption when using neural networks is that data is supported on a low-dimensional manifold [106], which, together with results showing that neural networks can approximate data on low-dimensional manifolds well [23], yields that neural networks can avoid space complexity exponential in dimensionality. This is an active area of research orthogonal to our work, and our results show how learned database operations can benefit from such approximation-theoretic results as they become available.

9.3.2 The Model Utilization Problem

Our results in Sec. 9.4 thoroughly discuss how learned models can perform different database operations for distribution learnable classes. Here, we provide a brief overview of the general methodology and state the required definitions. Typical methods used in practice to perform database operations partition the domain and model different parts of the domain separately. Each partition can be denoted by a set $R$ of the space it covers. The model in each partition can be seen as a model of the conditional distribution of the data, where the original data distribution is conditioned on the set $R$. As such, to effectively model the data in a partition, we need to be able to model the conditional distribution for the partition.
This means that not only the original data distribution, but also the conditional data distributions, need to be distribution learnable. We formalize our notion of conditional distribution to be able to formalize this statement. Let $\mathcal{R}$ be a set s.t. $\mathcal{R} \subset 2^{\mathcal{D}}$ (i.e., $\mathcal{R}$ is a set of subsets of the data domain $\mathcal{D}$). Then, for any $R \in \mathcal{R}$ with $P_{X\sim\chi}(X \in R) > 0$, we define $\chi|R$ as the data distribution with c.d.f. $F_{\chi|R}(x) = P_{X\sim\chi}(X \le x \mid X \in R)$. In this chapter, unless otherwise stated, $\mathcal{R}$ is the set of axis-parallel rectangles, where $R = (r_{min}, r_{max})$ with $r_{min}, r_{max} \in [0, 1]^d$ defining two corners of the hyper-rectangle. We define the normalized conditional distribution, $\bar{\chi}|R$, as the distribution with c.d.f. $F_{\bar{\chi}|R}(x) = F_{\chi|R}((r_{max} - r_{min})x + r_{min})$. The normalization scales the domain of the conditional distribution back to $[0, 1]^d$ and helps standardize our modeling discussion. We define the closure of a distribution class $X$, denoted by $\bar{X}$, as the set $\{\bar{\chi}|R,\ \forall \chi \in X, R \in \mathcal{R}\}$. That is, $\bar{X}$ contains not only $X$ but all the other distributions obtained from distributions $\chi \in X$ conditioned on sets $R \in \mathcal{R}$. Often, we need the distribution class $\bar{X}$, and not only $X$, to be distribution learnable. $\bar{X}$ and $X$ can be (but are not necessarily) the same set. An example is the uniform distribution, where conditioning the distribution on any interval yields another uniform distribution over that interval.

9.4 Results

9.4.1 Indexing Dynamic Data

We show the following result for dynamic indexing.

Theorem 13. Suppose $D \sim \chi$ for $\chi \subseteq X$ for some distribution class $X$ with $TV(\chi) \le \delta$, and that $\bar{X}$ is distribution learnable. There exists a learned index into which the $n$ data points of $D$ can be inserted in $O(T^X_n\log\log n + \log(\delta\sqrt{n}) + B^X_n\log^2\log n)$ expected amortized time, that can be queried in $O(T^X_n\log\log n + \log(\delta\sqrt{n}))$ expected time, and that takes space $O\left(n\log n + \sum_{i=0}^{\log\log n} n^{1-(\frac{2}{9})^i}\, S^X_{n^{(2/9)^i}}\right)$.

The term $T^X_n\log\log n$ is due to making $\log\log n$ calls to the distribution model, and $B^X_n$ roughly reflects the need to rebuild a model every $n$ insertions. For example, without distribution shift (i.e., $\delta = 0$), one can answer queries and perform insertions with $O(\log\log n)$ model calls, while every $n$ insertions incur an extra $B^X_n$ cost for model rebuilding. Distribution shift increases both insertion and query time by $O(\log(\delta\sqrt{n}))$. In the worst case, with $\delta = 1$, we recover the traditional $O(\log n)$ insertion and query time. That is, our results show no gain from modeling when distribution shift is too severe. This is as expected: if the data distribution changes too much, one cannot use the current knowledge of the data distribution to locate future elements. By systematically handling the distribution shift, we show that a learned method can provide robustness in such scenarios.

The data structure that achieves the bound is a tree structure with a distribution model used in each node to find the node's child to traverse given a query or insertion. The structure can be thought of as a special case of Alex [33], with specific tree height, fanout and split mechanism to ensure the desired guarantees. All elements are stored at leaf nodes, and the traversal to the leaf nodes is similar to B-trees but uses learned models to choose the child; a minimal sketch of this traversal follows.
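The following Python sketch illustrates the idea of model-based child selection in such a tree; the node layout, the quantile-based routing, and the omission of error handling, splits and retraining are all simplifying assumptions, not the construction used in the proof.

```python
import bisect

class ModelNode:
    """Illustrative tree node: a per-node model of the (conditional) c.d.f."""
    def __init__(self, cdf_model=None, children=None, keys=None):
        self.cdf_model = cdf_model  # internal: callable x -> estimated c.d.f. in [0, 1]
        self.children = children    # internal: list of child nodes
        self.keys = keys            # leaf: sorted keys stored at this leaf

def search(node, x):
    """Route to a leaf using model predictions instead of key comparisons."""
    while node.children is not None:
        z = int(node.cdf_model(x) * len(node.children))
        z = min(max(z, 0), len(node.children) - 1)  # clamp the predicted child index
        node = node.children[z]
    return bisect.bisect_left(node.keys, x)  # final search within the leaf
```

In the actual construction, each node's model error is controlled via distribution learnability, and nodes are split and periodically rebuilt as insertions arrive, which is the source of the $B^X_n$ terms in Theorem 13.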
Using Lemma 16, we can specialize Theorem 13 for specific distribution classes.

Corollary 14. Let $X_{\rho_1,\rho_2}$ be the class of distributions with p.d.f. $g$ bounded as $0 < \rho_1 \le g(x) \le \rho_2 < \infty$ for any $x$. Suppose $D \sim \chi$ for $\chi \subseteq X$ for some distribution class $X \subseteq X_{\rho_1,\rho_2}$ with $TV(\chi) \le \delta$. There exists a learned index that supports insertions in $O(\log\log n + \log(\delta\sqrt{n}))$ expected amortized time and queries in $O(\log\log n + \log(\delta\sqrt{n}))$ expected time, and takes space $O(\frac{\rho_2}{\rho_1}n\log n)$.

Corollary 14 shows a learned index that performs insertions and answers queries in $O(\log\log n + \log(\delta n))$, while non-learned methods take $O(\log n)$. Thus, when distribution shift is not severe, a learned method can outperform non-learned methods, while large distribution shift ($\delta = 1$) leads to the same bounds as non-learned methods. Corollary 14 strictly generalizes the results of [163] to the setting with insertions and data distribution change.

9.4.2 Cardinality Estimation

For cardinality estimation, designing learned models that answer queries with arbitrary accuracy is more challenging due to the high dimensionality of the problem. The curse of dimensionality is a well-understood phenomenon for non-learned methods, leading to approaches that take space exponential in dimensionality [27, 140]. We first show that this is not the case when using learned models if an error of $\Omega(\sqrt{n})$ is tolerable.

Theorem 15. Suppose $D \sim \chi$ for $\chi \subseteq X$ for a distribution learnable class $X$ with $TV(\chi) \le \delta$. There exists a learned cardinality estimator that answers queries with expected error $\epsilon$, for any $\epsilon = \Omega(\sqrt{n})$, supports insertions in $O(\max\{\frac{\delta\sqrt{n}}{\epsilon}, 1\}B^X_n)$ and queries in $O(T^X_n)$, and takes space $O(S^X_n)$.

Theorem 15 states that we can use a distribution model to answer queries for any expected error $\Omega(\sqrt{n})$. Consequently, when we can effectively model a data distribution, we can answer queries to accuracy at least $\sqrt{n}$ without an exponential space blowup. Comparing this with random sampling, and more broadly ϵ-approximations, which need at least $\sqrt{n}\log^{d-1}(\sqrt{n})$ data samples to answer queries with accuracy $\sqrt{n}$ [140, 78], we see a clear advantage for learned models over such non-learned methods in this accuracy regime.

Theorem 15 uses a single distribution model that is periodically retrained as insertions arrive. The frequency of retraining depends on the distribution shift. If $\delta \le \frac{1}{\sqrt{n}}$, the error caused by distribution shift is on a similar scale as the error due to randomness. Thus, the distribution shift does not significantly affect insertion time. On the other hand, in the worst case when $\delta = 1$, we need to retrain the model every $\epsilon$ insertions, which can be significant depending on the retraining cost.

An error of $\Omega(\sqrt{n})$ is not necessarily too large. Indeed, the expected answer for a fixed query with probability $p$ is $n \times p$, so the error relative to the expected query answer is $O(\frac{\sqrt{n}}{np}) = O(\frac{1}{\sqrt{n}})$ for constant $p$, which goes to zero as data size increases. Nonetheless, one may wish to answer queries more accurately. Below, we discuss how to achieve this in one dimension. Appendix 9.7.3.5 presents Lemma 18, which shows how the ideas in one dimension can be extended to high dimensions but nevertheless only achieves space complexity exponential in dimensionality, similar to non-learned methods.
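A minimal sketch of an estimator in the spirit of Theorem 15 is below: one model of the distribution cardinality function, answers scaled by the current data size, retrained on a fixed schedule. The histogram "model" and the retraining schedule are illustrative stand-ins; the theorem's construction uses a distribution-learnable model class and a shift-dependent schedule.

```python
import numpy as np

class SingleModelCE:
    """Illustrative d-dimensional estimator: answer = n * model probability."""
    def __init__(self, data, retrain_every, bins=8):
        self.bins, self.retrain_every = bins, retrain_every
        self.buffer, self.pending = [tuple(x) for x in data], 0
        self._fit()

    def _fit(self):
        pts = np.asarray(self.buffer)
        self.hist, self.edges = np.histogramdd(
            pts, bins=self.bins, range=[(0.0, 1.0)] * pts.shape[1])
        self.hist /= self.hist.sum()  # stand-in model of the data distribution

    def insert(self, x):
        self.buffer.append(tuple(x))
        self.pending += 1
        if self.pending >= self.retrain_every:  # periodic retraining
            self._fit()
            self.pending = 0

    def query(self, lo, hi):
        # estimated probability mass of the rectangle, scaled by data size
        centers = np.meshgrid(*[0.5 * (e[:-1] + e[1:]) for e in self.edges],
                              indexing="ij")
        inside = np.ones_like(self.hist, dtype=bool)
        for dim, c in enumerate(centers):
            inside &= (c >= lo[dim]) & (c <= hi[dim])
        return len(self.buffer) * float(self.hist[inside].sum())
```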
Arbitrary Accuracy in One Dimension. In one dimension, we show the following is possible using learned models.

Theorem 16. Suppose $D \sim \chi$ for $\chi \subseteq X$ for a distribution class $X$ with $TV(\bar{\chi}) \le \delta$, and that $X$ is distribution learnable. There exists a learned cardinality estimator that answers queries with expected error $\epsilon$, for any $\epsilon > 0$, supports insertions in $O(\max\{\delta\epsilon, 1\}B^X_{\epsilon^2} + \log n)$ and queries in $O(\log n + T^X_{\epsilon^2})$, and takes space $O(\frac{n}{\epsilon^2}S^X_{\epsilon^2} + \frac{n}{\epsilon^2}\log n)$.

Theorem 16 shows that we can effectively answer queries to any accuracy in one dimension using learned models. Importantly, the result shows that if the data distribution can be modeled space-efficiently (e.g., whenever $S^X_{\epsilon^2} \le \log n$), then a learned approach outperforms the non-learned (and worst-case optimal) method discussed in [140] that takes space $O(\frac{n}{\epsilon}\log n)$ to answer queries with accuracy $\epsilon$.

The learned model that achieves the bound in Theorem 16 uses a combination of materialized answers and model estimates to answer queries. Given that a model can be accurate to at best $\sqrt{n}$ if the dataset contains $n$ points, the algorithm divides the data domain into $\frac{n}{\epsilon^2}$ intervals, each containing $\epsilon^2$ points, so that a model for each interval has accuracy $\epsilon$. Meanwhile, the algorithm materializes query answers that span multiple intervals so that errors do not accumulate when answering such queries. The materialization is done through a B-tree-like structure, where each node stores the exact number of points inserted into it. Because, in our construction, we build several models, each for a subset of the space, it is not enough that the total variation between the distributions is bounded; the total variation after conditioning must also be bounded ($\bar{\chi}$ is the closure of $\chi$ under conditioning, defined in Sec. 9.3.2). A sketch of the interval-partitioning idea follows.
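The sketch below illustrates the static core of this construction in Python: exact prefix counts materialized at interval boundaries plus a per-interval estimate. Linear interpolation stands in for the per-interval learned model, and insertions and the B-tree maintenance are omitted; these are simplifying assumptions, not the proof's construction.

```python
import numpy as np

class IntervalCE:
    """Illustrative 1-d estimator: n/eps^2 intervals with materialized boundary counts."""
    def __init__(self, data, eps):
        self.data = np.sort(np.asarray(data, dtype=float))
        m = max(1, len(self.data) // max(1, int(eps ** 2)))    # number of intervals
        self.bounds = np.linspace(0.0, 1.0, m + 1)
        self.prefix = np.searchsorted(self.data, self.bounds)  # exact materialized ranks

    def _rank(self, x):
        z = int(np.clip(np.searchsorted(self.bounds, x), 1, len(self.bounds) - 1))
        lo, hi = self.bounds[z - 1], self.bounds[z]
        frac = (x - lo) / (hi - lo)  # per-interval "model" (here: linear interpolation)
        return self.prefix[z - 1] + frac * (self.prefix[z] - self.prefix[z - 1])

    def cardinality(self, a, b):
        """Estimated number of points in [a, b): error stays local to two intervals."""
        return self._rank(b) - self._rank(a)
```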
This is independent of the training time, $B^X_n$, because, due to sampling, training is done on much smaller arrays than the original data. Thus, Theorem 17 provides a time complexity for sorting similar to Theorem 13, both showing that for efficient modeling choices, one can sort an array with $O(n\log\log n)$ model calls.

The algorithm that achieves the bound in Theorem 17 is similar to [63]: it first samples a subset of the original array, uses it to build a distribution model, and uses the model to sort the original array. Due to modeling errors, the resulting attempt using the model yields only a partially sorted array. Unlike [63], which uses insertion sort to fully sort the partially sorted array, we use a merge-sort-like approach to recursively sort the array. This is because, to reduce the asymptotic complexity below $O(n\log n)$, the sample needs to be of size $o(n)$. However, the generalization error of a model trained on a sample of size $o(n)$ would be too large to allow insertion sort to be effective. Partitioning the partially sorted array and recursively sorting each portion allows us to sort the array while performing $O(T^X_{\sqrt{n}}\,n\log\log n)$ operations.

Finally, the lower bound of Theorem 9 does not apply to sorting, and a natural question is whether it is possible to do better than $O(n\log\log n)$. The following result shows that, under stronger assumptions on the data distribution, this is possible.

Theorem 18. Suppose an array consists of $n$ points sampled i.i.d. from a distribution $\chi$, and assume we have a model $\hat{r}$ s.t. $\|\hat{r} - r_\chi\|_\infty \le \epsilon_0$, which can be evaluated in time $T^\chi$ and takes space $S^\chi$. There exists an algorithm that sorts the array in $O(T^\chi n + n\log\epsilon_0)$ time, taking space $O(S^\chi + n)$.

The theorem shows that if we know the data distribution very accurately, then we can sort the data very efficiently. $T^\chi$ can be $O(1)$, e.g., if the data c.d.f. were a polynomial, so that we can sort an array in $O(n)$. This is because the data distribution provides a good indicator of the location of each item in the sorted array. The algorithm that achieves this can be seen as a special case of merge sort where, instead of dividing the array into 2 parts, we divide the array, using $\hat{r}$, into up to $n$ groups, and sort each group separately.

The difference between Theorems 18 and 17 is how accurate a model of the data distribution we have access to. Theorem 17 effectively assumes that the data distribution can only be modeled to accuracy $O(\sqrt{n})$, which is too large to allow fixing model errors in sorting with a single pass over the array. On the other hand, Theorem 18 assumes the model is correct to within a constant accuracy. As a result, a single pass over the partially sorted array fixes any potential inversions and yields the $O(n)$ complexity. Nonetheless, knowing the data distribution to a constant accuracy can be impractical, because it requires further knowledge about the data distribution beyond merely observing its samples.

9.5 Related Work

A large and growing body of work has focused on using machine learning to speed up database operations, among them learned indexing [44, 62, 40, 33], learned cardinality estimation [61, 142, 55, 149, 148, 72, 88] and learned sorting [63]. Most existing work focuses on improving modeling choices, considering options such as neural networks [165, 61, 62], piecewise linear approximation [40], sum-product networks [51] and density estimators [75].
Existing results show significant empirical benefits on static datasets, while performance often deteriorates on dynamic datasets and in the presence of distribution shift [138, 141]. Our theoretical results help explain such observations and provide a theoretical framework for analyzing the operations under different modeling choices.

On the theory side, no existing study meaningfully characterizes the performance of learned models in the dynamic setting or studies learned sorting. In the static setting, [163, 39] study the query time of learned indexing. [39] shows learned models can provide constant-factor improvements under an assumption on the distribution of the gaps between observations, and [163] shows a learned model can answer queries in $O(\log\log n)$ query time if the p.d.f. of the data distribution is non-zero and bounded. Our results strictly generalize the latter to the dynamic setting, in the presence of insertions from a possibly changing distribution, and also show that, more generally, $O(T^X_n\log\log n + \log(\delta n))$ query time is possible for any distribution learnable class $X$. Moreover, [165] presents a special case of our Theorem 15 for cardinality estimation on static datasets for distributions where the operation distribution function is Lipschitz continuous. Our result strictly generalizes [165] to the dynamic setting with distribution change and any distribution learnable class $X$. Orthogonal to our work, [55, 6] study the number of training samples needed to achieve a desired accuracy for different database operations.

9.6 Conclusion

We have presented a thorough theoretical analysis of learned indexing and cardinality estimation in the presence of insertions from a possibly changing data distribution. Our results characterize learned models' performance and show when and why they can outperform their non-learned counterparts. We have developed the distribution learnability analysis framework, which provides a systematic tool for analyzing learned database operations. Our results enhance our understanding of learned database operations and provide much-needed theoretical guarantees on their performance for robust practical deployment. We believe our theoretical tools will pave the way for a broader theoretical understanding of various learned methods. Future work includes incorporating deletions, considering query distribution, analyzing other database operations, and better understanding distribution learnability for real-world datasets.

9.7 Appendix

9.7.1 Formalized Setup and Operations

We are interested in performing database operations on a possibly changing dataset. We assume data records are $d$-dimensional points with attributes in the range $[0, 1]$ (otherwise, the data domain can be scaled and shifted to this range). We consider either the setting where $n$ data points are inserted one by one into the dataset, or the setting where we are given a fixed set of $n$ data points. We refer to the former as the dynamic setting and the latter as the static setting. We define $D_i \in [0,1]^{i\times d}$ as the dataset consisting of the $i$ records inserted so far, in $d$ dimensions with each attribute in the range $[0,1]$, where $i$ and $d$ are integers greater than or equal to 1. $D_n$ is the dataset after the last insertion and is often denoted as $D$. $D_{i:j}$ denotes the dataset of points inserted after the $i$-th insertion up to the $j$-th (i.e., $D_j \setminus D_i$). We use $D_i$ to refer to the $i$-th record of a dataset (which is a $d$-dimensional vector) and $D_{i,j}$ to refer to the $j$-th element of $D_i$.
If $d = 1$ (i.e., $D$ is 1-dimensional), then $D_i$ is the $i$-th element of $D$ (and is not a vector). We study the following database operations.

Indexing. The goal is to use an index to store and find items in a 1-dimensional dataset. The index supports insertions and queries. $n$ items are inserted into the index one by one. After inserting $k$ items, for any $1 \le k \le n$, we would like to retrieve items from the dataset based on a query $q \in [0,1]$. The query is either an exact match query or a range query. An exact match query returns the point in the database that exactly matches the query $q$ (or NULL if there is none), while a range query $[q, q']$ returns all the elements in the dataset that fall in the range $[q, q']$, for $q, q' \in [0,1]$.

Cardinality Estimation. Used often for query optimization, the goal is to find how many records in the dataset match a range query, where the query specifies lower and upper bound conditions on the values of each attribute. Specifically, the query predicate $q = (c_1, ..., c_d, r_1, ..., r_d)$ specifies the condition that the $i$-th attribute is in the interval $[c_i, c_i + r_i]$, for $c_i, r_i \in [0,1]$. Data records can be inserted into the dataset one by one. After the insertion of the $k$-th item, for any $1 \le k \le n$, we would like to obtain an estimate of the cardinality of query $q$. We expect the answers to be within error $\epsilon$ of the true answers. That is, if $c(q)$ is the true cardinality of $q$ and $\hat{c}(q)$ is an estimate, we expect $|c(q) - \hat{c}(q)| \le \epsilon$. This guarantee has to hold throughout, as new elements are inserted into the dataset.

Sorting. The goal is to sort a fixed array of size $n$. That is, we are given a one-dimensional array, $D$, and the goal is to return an array, $D'$, which has the same elements as $D$ but ordered so that $D'_i \le D'_{i+1}$. Unlike indexing and cardinality estimation, sorting assumes a fixed given array that needs to be sorted. Although indexing can often be used to sort an array (e.g., inserting elements one by one into a binary tree sorts a fixed array), we study the problem of sorting more broadly and explore other learned solutions beyond indexing (e.g., analogous to how merge sort can also be used to sort an array).
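To make the operation functions concrete, the following minimal Python snippet (purely illustrative; the brute-force scan here is the labeling procedure that a learned estimator would replace) computes the exact cardinality of an axis-aligned range query $q = (c_1, ..., c_d, r_1, ..., r_d)$:

```python
import numpy as np

def cardinality(D, c, r):
    """Number of records of D with c[i] <= D[:, i] <= c[i] + r[i] for all i."""
    D, c, r = np.asarray(D), np.asarray(c), np.asarray(r)
    match = np.all((D >= c) & (D <= c + r), axis=1)
    return int(match.sum())

D = np.random.default_rng(1).random((1000, 2))     # n = 1000, d = 2
print(cardinality(D, c=[0.2, 0.2], r=[0.3, 0.3]))  # the answer a model approximates
```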
9.7.2 Distribution Learnability Through Function Approximation

We first formalize representation power and optimizability, and then present a formal statement of Theorem 12.

Representation Power. Consider using a function class $F$ to approximate another function class $G$ (e.g., neural networks to approximate real-valued functions). Consider some hyperparameter, $\vartheta$, that controls the representation power and inference complexity in $F$, and denote by $F_\vartheta$ the subset of $F$ with hyperparameter $\vartheta$. For instance, $\vartheta$ can be the number of learnable parameters of a neural network, the maximum degree of a polynomial, or the number of pieces in a piecewise approximation. In all such cases, larger $\vartheta$ implies better representation power but also higher inference time and/or space complexity. Assume we have access to a representation complexity function $\alpha_{F\to G}(\epsilon)$ that, given a maximum error $\epsilon$, returns the smallest value of $\vartheta$ such that for any $g \in G$ there exists an $f \in F_\vartheta$ with $\|f - g\|_\infty \le \epsilon$. The function $\alpha_{F\to G}(\epsilon)$ determines the required model complexity of $F$, in terms of $\vartheta$, to represent all elements of $G$ with error at most $\epsilon$. For instance, such a function for neural networks approximating real-valued functions gives the minimum number of neural network parameters needed to approximate all real-valued functions to error at most $\epsilon$ with a neural network. We say that a function class $F$ has the representation power to model $G$ if there exists a representation complexity function $\alpha_{F\to G}(\epsilon)$ for all $\epsilon > 0$. Finally, let $\tau_{F\to G}(\epsilon)$ and $\sigma_{F\to G}(\epsilon)$ respectively be the maximum time and space complexity of performing a model forward pass for functions in $F_\vartheta$ with $\vartheta = \alpha_{F\to G}(\epsilon)$.

Optimizability. We say a function class $F$ is optimizable with an algorithm $A$ if, given any function $h$ and a hyperparameter value $\vartheta$, $A(h, \vartheta)$ returns an approximately optimal representation of $h$ in $F_\vartheta$. Formally, for $\hat{h} = A(h, \vartheta)$ and $h^* = \arg\min_{h' \in F_\vartheta}\|h - h'\|_\infty$, we require $\|h^* - h\|_\infty \ge \kappa\|\hat{h} - h\|_\infty$ for a constant $\kappa \le 1$. Let $\beta(\vartheta)$ be the maximum time complexity of $A$.

We note that although optimizability as broadly defined above is sufficient to show distribution learnability, it is not necessary. Here, we discuss two qualifications of the definition that make proving optimizability simpler, specifically for database operations. First, it is only necessary to have optimizability for $h \in f_D$ for all possible $D$ and for the desired operation function $f$ (since we only use $A$ to model operation functions). This can simplify the optimizability requirement depending on the operation function considered. For example, when showing optimizability for rank operations, we only need an $A$ that returns approximately optimal estimates for input functions that are non-decreasing (since all rank functions are non-decreasing). Second, when $A$ is used on $h = f_{D_n}$, we can allow an additive error of $O(\frac{1}{\sqrt{n}})$. That is, we only need to show $\frac{1}{\kappa}\|h^* - h\|_\infty + \frac{\kappa'}{\sqrt{n}} \ge \|\hat{h} - h\|_\infty$ for universal constants $\kappa' \ge 0$ and $\kappa \le 1$.

Theorem 19. Assume a function class, $F$, is optimizable with an algorithm $A$, and that $F$ has enough representation power to represent $G$. Let $X$ be a distribution class with $f_\chi \in G$ for all $\chi \in X$. Then, $X$ is distribution learnable with $T^X_n = \tau_{F\to G}(\frac{1}{\sqrt{n}})$, $S^X_n = \sigma_{F\to G}(\frac{1}{\sqrt{n}})$, and $B^X_n = \beta(\alpha_{F\to G}(\frac{1}{\sqrt{n}}))$.

9.7.3 Proofs

The high-level idea behind most of our theoretical results is to use the relationship between query answers and distribution properties. Many statistical tools have been developed that relate the properties of an observed dataset to the data distribution (e.g., studying the relationship between the sample mean and the distribution mean). In statistics, such tools have been used to describe the population using observed samples. We often use such tools to do the opposite, that is, to use the properties of the data distribution to describe the observed samples. Indeed, that is the intuition behind learned database operations: if the data distribution can be efficiently modeled, then it can be used to efficiently answer queries about the observed samples (i.e., the database). Our proposed distribution learnability framework allows us to state this more formally. It allows us to assume that we can indeed model the data distribution efficiently; the analysis can then focus on utilizing statistical tools to characterize the relationship between the observed sample and the data distribution. However, a main challenge in the case of learned database operations is to balance accuracy and efficiency. Thus, our theoretical study includes designing data structures and algorithms that can utilize modeling capabilities while performing operations as efficiently as possible.

9.7.3.1 Lemma 9

We would like to bound $E_{D\sim\chi}[\|\hat{f} - f_{D_j}\|]$.
Note that both $\hat{f}$ and $f_{D_j}$ are random variables (since $\hat{f}$ depends on $f_{D_i}$). First, consider

$f_{D_j}(q) = \frac{1}{j}\sum_{k\in[j]} I_{D_k\in q} = \frac{i}{j}f_{D_i}(q) + \frac{j-i}{j}f_{D_{i:j}}(q)$,

so that the error is

$E_{D\sim\chi}\big[\|\hat{f} - \tfrac{i}{j}f_{D_i}(q) - \tfrac{j-i}{j}f_{D_{i:j}}(q)\|\big] = E_{D_i\sim\chi}\big[E_{D_{i:j}\sim\chi}[\|\hat{f} - \tfrac{i}{j}f_{D_i}(q) - \tfrac{j-i}{j}f_{D_{i:j}}(q)\| \mid D_i]\big]$.

Now consider $E_{D_{i:j}\sim\chi}[\|\hat{f} - \tfrac{i}{j}f_{D_i}(q) - \tfrac{j-i}{j}f_{D_{i:j}}(q)\| \mid D_i]$. Given $D_i$, $\hat{f} - \tfrac{i}{j}f_{D_i}(q)$ is a fixed quantity. Furthermore, recall that $\arg\min_c E[|X - c|] = \mathrm{Med}(X)$ for any random variable $X$, where $\mathrm{Med}(X)$ is a median of $X$ [139]. Therefore, for any query,

$E_{D_{i:j}\sim\chi}[\|\hat{f} - \tfrac{i}{j}f_{D_i}(q) - \tfrac{j-i}{j}f_{D_{i:j}}(q)\| \mid D_i] \ge \tfrac{j-i}{j}\,E_{D_{i:j}\sim\chi}[|\mathrm{Med}(f_{D_{i:j}}(q)) - f_{D_{i:j}}(q)|]$.

Observe that $(j-i)f_{D_{i:j}}(q) \sim \mathrm{Binomial}(j-i,\,P_{p\sim\chi}(I_{p\in q}))$, and consider any query such that $(j-i)P_{p\sim\chi}(I_{p\in q})$ is an integer; such a query exists as long as the c.d.f. of the distribution is continuous. For such queries, we have $\mathrm{Med}(X) = (j-i)P_{p\sim\chi}(I_{p\in q})$, since the mean and median of a binomial distribution are equal when the mean is an integer [57]. Let $p_q = P_{p\sim\chi}(I_{p\in q})$. Using the bound on the binomial mean absolute deviation in [14], we have, when $j-i \ge 2$ and for any query s.t. $\frac{1}{j-i} \le p_q \le 1 - \frac{1}{j-i}$,

$\frac{\sqrt{(j-i)p_q(1-p_q)}}{\sqrt{2}} \le E_{D_{i:j}\sim\chi}[|(j-i)p_q - (j-i)f_{D_{i:j}}(q)|]$.

Moreover, setting $p_q = \frac{\lfloor (j-i)/2\rfloor}{j-i}$, we have $\frac{\sqrt{(j-i)p_q(1-p_q)}}{\sqrt{2}} \ge \frac{\sqrt{j-i}}{4}$.

9.7.3.2 Theorem 12 (formally Theorem 19)

First, we use the algorithm $A$ (which exists due to optimizability) to construct the algorithm in Definition 3 as $\hat{f} = \frac{1}{n}A(f_D, \alpha_{F\to G}(\frac{1}{\sqrt{n}}))$ for $f \in \{r, c\}$, given an input dataset, $D$, of size $n$. Let $\vartheta = \alpha_{F\to G}(\frac{1}{\sqrt{n}})$. Since $F$ has enough representation power to represent $G$, and since by assumption $\{f_\chi,\ \chi \in X\} \subseteq G$, for any $\chi$ there exists $\hat{f}_\chi \in F_\vartheta$ s.t. $\|\hat{f}_\chi - f_\chi\|_\infty \le \frac{1}{\sqrt{n}}$. Furthermore, since we find $\hat{f}$ approximately optimally, we have $|\hat{f}(x) - \frac{1}{n}f_D(x)| \le \frac{1}{\kappa}|\hat{f}_\chi(x) - \frac{1}{n}f_D(x)| + \frac{\kappa'}{\sqrt{n}}$. Now, to analyze the accuracy of $\hat{f}$, observe that for any input $x$ we have

$|\hat{f}(x) - \tfrac{1}{n}f_D(x)| \le \tfrac{1}{\kappa}|\hat{f}_\chi(x) - \tfrac{1}{n}f_D(x)| + \tfrac{\kappa'}{\sqrt{n}} \le \tfrac{1}{\kappa}|\hat{f}_\chi(x) - f_\chi(x)| + \tfrac{1}{\kappa}|f_\chi(x) - \tfrac{1}{n}f_D(x)| + \tfrac{\kappa'}{\sqrt{n}} \le \tfrac{1}{\kappa\sqrt{n}} + \tfrac{1}{\kappa}|f_\chi(x) - \tfrac{1}{n}f_D(x)| + \tfrac{\kappa'}{\sqrt{n}}$.

We also have $|\hat{f}(x) - f_\chi(x)| \le |\hat{f}(x) - \frac{1}{n}f_D(x)| + |f_\chi(x) - \frac{1}{n}f_D(x)|$, so that

$n|\hat{f}(x) - f_\chi(x)| \le \frac{\sqrt{n}}{\kappa} + \frac{2}{\kappa}|nf_\chi(x) - f_D(x)| + \sqrt{n}\kappa'$.

By Hoeffding's inequality, we have

$P(|nf_\chi(x) - f_D(x)| \ge \epsilon') \le e^{-2(\frac{\epsilon'}{\sqrt{n}})^2}$,    (9.1)

so that

$P\big(n|f_\chi(x) - \hat{f}(x)| \ge \tfrac{2}{\kappa}\epsilon' + \sqrt{n}(\tfrac{1}{\kappa} + \kappa')\big) \le e^{-2(\frac{\epsilon'}{\sqrt{n}})^2}$,    (9.2)

and therefore, for some universal constant $\kappa_2$ and $\epsilon = \Omega(\sqrt{n})$,

$P(n|f_\chi(x) - \hat{f}(x)| \ge \epsilon) \le e^{-\kappa_2(\frac{\epsilon}{\sqrt{n}}-1)^2}$.    (9.3)

9.7.3.3 Lemma 16

For each distribution class, we show optimizability and representation power of some function class $F$ that can be used to model the distribution class, which, combined with Theorem 19, shows the desired result for both Lemmas 16 and 17. Then, for each class, we discuss the modeling complexities.

Distribution learnability for $X_\rho$. Let $F$ be the class of piecewise constant functions with uniformly spaced pieces, and let $G$ be the class of real-valued differentiable functions $[0,1]^d \to \mathbb{R}$ with gradient bounded by $\rho$. Consider the number of pieces used for approximation as the hyperparameter.

Optimizability. Given the number of pieces, the approximation with the minimum infinity norm error places, in each interval, a constant at the mid-point of the maximum and minimum values of the function over the interval. That is, for an interval $I \subseteq [0,1]^d$, the constant approximating $g$ over $I$ with the lowest infinity norm error is $\frac{1}{2}(\min_{x\in I} g(x) + \max_{x\in I} g(x))$.
Note that this approximation has error at most $\max_{x\in I} g(x) - \frac{1}{2}(\min_{x\in I} g(x) + \max_{x\in I} g(x)) = \frac{1}{2}(\max_{x\in I} g(x) - \min_{x\in I} g(x))$. For efficiency purposes, instead of the optimal solution, we let the constant for the piece responsible for $I$ be $g(p)$ for some $p \in I$. Note that for all $x \in I$, $|g(p) - g(x)| \le |\max_{x\in I} g(x) - \min_{x\in I} g(x)|$, so that this construction gives us a $\frac{1}{2}$-approximation of the optimal solution.

Representation Power. Define $\alpha_{F\to G}(\epsilon) = \frac{\sqrt{d}\rho}{\epsilon}$. We show that for any $g \in G$ and any $\epsilon > 0$, there is a function $\hat{f} \in F_{\alpha(\epsilon)}$ s.t. $\|\hat{f} - g\|_\infty \le \epsilon$; this function is the optimal solution as constructed above. To see why the error is at most $\epsilon$, consider a partition over $I$ whose $j$-th dimension is $[p_{j,i}, p_{j,i+1}]$, where $p_{j,i+1} - p_{j,i} = \frac{\epsilon}{\sqrt{d}\rho}$, and let $x_1$ and $x_2$ be the two points in $I$ that, respectively, achieve the minimum and maximum of $g$ in $I$. For any point $x \in I$, our function approximator answers $\hat{f}(x) = \frac{1}{2}(g(x_1) + g(x_2))$. We have

$|\hat{f}(x) - g(x)| = |\tfrac{1}{2}(g(x_1) + g(x_2)) - g(x)| \le \max\{g(x_2) - g(x),\ g(x) - g(x_1)\} \le \|g'(x)\|_2\|x_2 - x\|_2 \le \rho\big(\sqrt{d}\cdot\tfrac{\epsilon}{\sqrt{d}\rho}\big) = \epsilon$.

Model Complexity. The inference time, $T^X_n$, is constant, independent of the number of pieces used. The space complexity is the number of pieces multiplied by the space needed to store each constant. Given that $g$ takes integer values between 0 and $n$, the model can be stored in $S^X_n = O(\sqrt{d}(\rho\sqrt{n})^d\log n)$. Finally, for the rank operation, the algorithm that outputs the function approximation makes $\rho\sqrt{n}$ calls to $g$, so that building the approximation can be done in $O(\rho\sqrt{n}\log n)$, assuming the data is sorted (so that each call to $g$ takes $O(\log n)$). For cardinality estimation there are $O(\sqrt{d}(\rho\sqrt{n})^d\log n)$ calls to the cardinality function, where each call takes $O(n)$ in the worst case (this can be optimized by building high-dimensional indexes). Thus, in this case, $B^X_n = O(\sqrt{d}(\rho\sqrt{n})^d\log n)$.

Distribution learnability for $X_l$. Let $F$ and $G$ be the class of piecewise linear functions with at most $l$ pieces (with pieces not necessarily uniformly spaced). Trivially, $F$ has enough representation power to represent $G$; thus, it remains to show optimizability and model complexity.

Optimizability. The PLA algorithm, $P(\epsilon)$, of [91], used in the PGM index [40], finds the piecewise linear approximation with the smallest number of pieces for a given error $\epsilon$. Here, we want the opposite: given a number of pieces, find the piecewise linear approximation with the smallest error. Note that $\epsilon$ lies in the range 0 to 1, so we can perform a binary search on the values of $\epsilon$, calling $P(\epsilon)$ for each candidate until we find the smallest $\epsilon$ with $|P(\epsilon)| \le l$. Since suboptimality of $O(\frac{1}{\sqrt{n}})$ in $\epsilon$ is allowed, we can discretize $[0,1]$ into $\sqrt{n}$ values and binary search only over this discrete set, which takes $O(\log\sqrt{n})$ calls to $P(\epsilon)$; each call takes $O(n)$ operations [40] on a sorted array, so that $F$ is optimizable with an algorithm running in $O(n\log n)$.
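As a sketch, the reduction from "fewest pieces for a given error" to "smallest error for a given number of pieces" can be written as follows; `pla_min_pieces` is a stand-in for the optimal PLA routine $P(\epsilon)$ of [91], which we do not reimplement here.

```python
import math

def smallest_error_for_pieces(xs, ys, l, pla_min_pieces, n):
    """Binary-search the discretized error grid {i / sqrt(n)} for the smallest
    eps with |P(eps)| <= l pieces, as in the optimizability argument.
    `pla_min_pieces(xs, ys, eps)` is assumed to return the optimal piece list."""
    grid = int(math.isqrt(n))        # sqrt(n) candidate error values
    lo, hi = 0, grid                 # eps = hi / grid is always feasible
    while lo < hi:
        mid = (lo + hi) // 2
        eps = mid / grid
        if len(pla_min_pieces(xs, ys, eps)) <= l:
            hi = mid                 # feasible: try a smaller error
        else:
            lo = mid + 1
    return hi / grid                 # O(log sqrt(n)) calls to P(eps)
```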
Model Complexity. The learning time, $nB^X_n = O(n\log n)$, is as discussed above. The algorithm always returns $l$ pieces, which can be evaluated in $T^X_n = O(\log l)$ time. Note that each linear piece can be adjusted to cover an interval starting and ending at points in the dataset (so the interval can be stored as pointers to the corresponding dataset items). Moreover, the beginning and end of each line can be adjusted to be an integer (since the rank function only returns integers), similar to [40], so that the lines can be stored in $S^X_n = O(l\log n)$.

Distribution Learnability for $X_c$. Trivially, the class $F$ containing the distribution operation function has enough approximation power for $X_c$ and is optimizable with $S^X_n$, $T^X_n$, $B^X_n$ all $O(1)$.

9.7.3.4 Learned Indexing

Index Operations. The index supports two operations, Query and Insert, which are presented in Algs. 18 and 19. Recall that $A$ is an algorithm defined in Definition 3 and exists due to distribution learnability of $X$.

Algorithm 18 Dynamic Learned Index Query
Input: A query q and a node N
Output: The location of q in the index rooted at N
1: procedure Query(q, N)
2:   if N is a leaf node then
3:     return BinarySearch(q, N.content)
4:   î ← N.f̂(q)
5:   i ← ExpSearch(q, î, N.content)
6:   j ← BinarySearch(q, N.children[i])
7:   return Query(q, N.children[i][j])

Algorithm 19 Dynamic Learned Index Insertion
Input: New element, p, to be inserted in the tree rooted at N
Output: Index with p inserted
1: procedure Insert(p, N)
2:   N.counter++
3:   if N.children is NULL then
4:     InsertContent(p, N.content)
5:     return
6:   î ← N.f̂(p)
7:   i ← ExpSearch(p, î, N.content)
8:   j ← BinarySearch(p, N.children[i])
9:   Insert(p, N.children[i][j])
10:  if N.counter = N.max_points then
11:    A ← the sorted array in the index rooted at N
12:    if N has no parent then
13:      return Rebuild(A)
14:    P ← parent of N
15:    i_p ← index of N in P.children
16:    Remove N from P.children[i_p]
17:    N1 ← Rebuild(A[: N.max_points/2])
18:    N2 ← Rebuild(A[N.max_points/2 :])
19:    Insert N1 and N2 into P.children[i_p]

The index builds a tree structure similar to [33], with each node containing a model and a set of children. An overview of the tree architecture is shown in Fig. 9.1. Each node can be seen to cover a subarray of the original indexed array. If the covered subarray has $k$ elements, then the node has $\sqrt{k}$ children, and the subarray is divided equally among them (so each child covers $\sqrt{k}$ elements). The root node covers all $n$ elements of the array and therefore has $\sqrt{n}$ children. Thus, the number of children per node decreases as we go down the tree. A node has no children if its covered subarray is smaller than some constant $c$.

Algorithm 20 Procedure for Rebuilding the Root
Input: A sorted array, A
Output: A learned index rooted at a new node N
1: procedure Rebuild(A)
2:   N ← new node
3:   k ← |A|
4:   if k ≤ κ then
5:     N.content = A
6:     return N
7:   N.max_points = 2k
8:   N.f̂ = A(A)
9:   N.content = A[:: √k]
10:  for i in √k do
11:    N_c ← Rebuild(A[i√k : (i + 1)√k])
12:    N.children.append(N_c)
13:  return N

[Figure 9.1: Structure of the learned dynamic index. Levels 0, 1, 2, ... of the tree over the indexed array; each node covers a subarray, and leaf nodes have constant size.]

Moreover, shown as black elements in the figure, each parent node stores the minimum value of the subarray covered by each of its children, in an array called the node's content. Thus, the node's content can be used to traverse the tree. When a node's model predicts which child to traverse, the node first checks the prediction against its content to make sure the correct child is chosen; this is done by performing an exponential search on the node's content. During insertions, each node keeps a counter of the number of points inserted through it.
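Before turning to the split policy, a minimal Python sketch may help fix ideas; it shows the $\sqrt{k}$-fanout construction of Alg. 20 (with plain slicing in place of the learned model builder $A$) and the exponential search used to correct model estimates. All names here are illustrative, not the dissertation's code.

```python
import bisect
import math

def build(A, kappa=16):
    """Recursively build a node over sorted array A with ~sqrt(|A|) children,
    each covering ~sqrt(|A|) elements (mirroring Alg. 20). A rank model fit
    by the builder A would be attached here; this sketch omits it."""
    k = len(A)
    if k <= kappa:
        return {"leaf": True, "content": A}
    f = int(math.isqrt(k))                       # fanout ~ sqrt(k)
    content = A[::f]                             # min value of each child
    children = [build(A[i * f:(i + 1) * f], kappa) for i in range(len(content))]
    return {"leaf": False, "content": content, "children": children}

def exp_search(content, q, i_hat):
    """Exponential search around a (model) estimate i_hat; returns the largest
    index i with content[i] <= q, or -1 if q precedes all separators.
    The cost is logarithmic in the model's prediction error."""
    i_hat = max(0, min(i_hat, len(content) - 1))
    lo, hi, step = i_hat, i_hat, 1
    while lo > 0 and content[lo] > q:            # expand left, doubling steps
        lo, step = max(0, lo - step), step * 2
    step = 1
    while hi < len(content) - 1 and content[hi + 1] <= q:  # expand right
        hi, step = min(len(content) - 1, hi + step), step * 2
    return bisect.bisect_right(content, q, lo, hi + 1) - 1
```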
If a node is at level $i$, then at most $k^{2^{-i}}$ elements are allowed in the node for $i > 0$, where $k$ is the size of the dataset at the time of construction of the current root node (root nodes are periodically rebuilt). If the number of insertions reaches $k^{2^{-i}}$, the node splits. When a node splits, the subarray it covers is split into two, and an entirely new subtree is built for each half of the subarray, rebuilding all models. To avoid splits affecting parent nodes, as Fig. 9.2 shows, the newly created node is appended to the list of children (we discuss how exactly this is done below). Finally, the root is rebuilt every time its size doubles.

[Figure 9.2: An insertion causing a split in the index: (a) before insertion, (b) after insertion.]

To support the splitting discussed above, the children are arranged in a two-dimensional array, N.children. We refer to the children pointed to in N.children[i] as children in the i-th child slot. Each child slot contains a sorted list of at least one, but a variable number, of children, where the list is kept sorted using binary trees. To find which node to traverse, we first find the correct node slot with the help of the learned model, and then use the binary tree in the node slot to find the correct child. Leaf nodes have content storing the data. The index keeps a counter at each node and periodically rebuilds the tree rooted at a node. Thus, the two operations are performed as follows.

Query. Performing queries on an index with root node N is similar to performing queries with a B-tree, where nodes are recursively traversed until reaching a leaf node. The only difference is how we decide which child to search. For each non-leaf node, a model first estimates which child to search. Then, the model estimate is corrected by performing a local exponential search. Since a child might have been split, the search then uses the binary tree at the correct node slot to find the correct child.

Insertions. Insertions first traverse the tree similarly to queries, with the addition that a counter in each node is incremented when a new element is inserted into that node. If the counter of a node passes the maximum size of the node, the node is split into two, and the parent metadata for the corresponding node slot is updated. The only exception is the root node, which does not split but instead triggers a full rebuild of the entire tree.

Query Time. Queries are performed by recursively searching the nodes, with a single node queried per level. Consider the number of operations performed at the $i$-th level. Each level performs a model inference, an exponential search on the node's content, and a binary search on the node's extension. We consider each separately.

Model Inference. A model at the $i$-th level is built on at most $n^{2^{-i}}$ elements. Thus, the inference time is $O(T^X_{n^{2^{-i}}})$.

Exponential Search. The time complexity of the exponential search step depends on the accuracy of the model estimate. We show that, in expectation, the time complexity is constant. Note that neither the content of each node nor its model is modified by insertions, unless the node is rebuilt. Thus, we only need to show the statement for the moment right after the node's construction. Assume the node is constructed for some array $A$ with $|A| = k$ for some integer $k$, that the $i$-th element of $A$ is a sample originally obtained from the distribution $\chi_i$, and let $R = [l, u]$ be the range into which the elements of $A$ have fallen, as passed to the algorithm.
The elements of $A$ are independent samples from the conditional distributions $\chi_1 \mid R, ..., \chi_k \mid R$, respectively. Let $\chi_R = \{\chi_1 \mid R, ..., \chi_k \mid R\}$. Applying the extension of the DKW bound to independent but non-identical random variables [126, Chapter 25.1], we have

$P(\|k r_{\chi_R} - r_A\|_\infty \ge \sqrt{\tfrac{k}{2}}\,\epsilon) \le 2e^{-\epsilon^2+1}$.    (9.4)

Furthermore, for $\hat{r}$, the model of $r_{\chi_R} = \frac{1}{k}\sum_{i=1}^{k} r_{\chi_i\mid R}$ obtained from $A$, by the accuracy requirement of distribution learnability we have, for any $\epsilon \ge \sqrt{\kappa_2}(\frac{1}{\sqrt{k}} - 1)$,

$P(\|k r_{\chi_R} - \hat{r}\|_\infty \ge \sqrt{k}(\tfrac{\epsilon}{\sqrt{\kappa_2}} + 1)) \le \kappa_1 e^{-\epsilon^2}$.    (9.5)

By a union bound on Ineq. 9.4 and 9.5, and the triangle inequality, for any $\epsilon \ge \sqrt{k}$ we have

$P(\|\hat{r} - r_A\|_\infty \ge \epsilon) \le \kappa_1' e^{-\kappa_2'(\frac{\epsilon}{\sqrt{k}}-1)^2}$,    (9.6)

where $\kappa_1' = 2e + \kappa_1$ and $\kappa_2' = \frac{2\kappa_2}{(\sqrt{2}+\sqrt{\kappa_2})^2}$.

Now let $N(q)$ be the number of operations performed by the exponential search for a query $q$. Recall that the content of the node is $A_c = A[::\sqrt{k}]$, and that we use $\hat{r}_{A_c} = \lceil \frac{1}{\sqrt{k}}\hat{r}(q)\rceil$ as the location at which to start searching for $q$ in $A_c$ with exponential search. The true location of $q$ in $A_c$ is $r_{A_c}(q) = \lceil \frac{1}{\sqrt{k}}r_A(q)\rceil$. Observe that, for any $q$,

$\|r_A - \hat{r}\|_\infty > (|\hat{r}_{A_c}(q) - r_{A_c}(q)| - 2)\sqrt{k}$,    (9.7)

and it is easy to see that for exponential search we have

$|\hat{r}_{A_c}(q) - r_{A_c}(q)| \ge 2^{\frac{N(q)}{2}-1}$.    (9.8)

Combining these, we get, for any query $q$, $\|r_A - \hat{r}\|_\infty > (2^{\frac{N(q)}{2}-1} - 2)\sqrt{k}$, so that, for any $i$, $N(q) \ge i$ implies $\|r_A - \hat{r}\|_\infty > (2^{\frac{i}{2}-1} - 2)\sqrt{k}$. Thus, we have

$E_{A\sim\chi}[N(q)] = \sum_{i=1}^{2\log k} P_{A\sim\chi}(N(q) \ge i) \le \sum_{i=1}^{2\log k} P_{A\sim\chi}(\|r_A - \hat{r}\|_\infty > (2^{\frac{i}{2}-1} - 2)\sqrt{k}) \le 5 + \kappa_1'\sum_{i=6}^{2\log k} e^{-\kappa_2'(2^{\frac{i}{2}-1}-3)^2} = O(1)$.

Thus, the expected time spent performing the exponential search is $O(1)$.

Searching the Node's Extension. Next, we study the expected time for searching the additional list added to the nodes. Let $L$ be the size of the list. Note that the lists are created for all nodes except the root. A non-root node at level $i$ has capacity $c\,n^{2^{-i}}$; it splits every $c\,n^{2^{-i}}$ insertions into the parent, with each split adding an element to the parent's extension list. Furthermore, the node gets rebuilt after every $k = n^{2^{-(i-1)}}$ insertions into its parent. Thus, if $k_N$ elements out of $n^{2^{-(i-1)}}$ get inserted into a node slot $N$, the number of splits for that slot is $\lfloor \frac{k_N}{c\,n^{2^{-i}}}\rfloor$.

Assume the $k$ new insertions into the parent, $N_p$, of $N$ since the last rebuild of the parent were drawn from r.v.s with distributions $\chi_1, ..., \chi_k$. Given that they fall in $N_p$, their conditional distribution is $\chi_1 \mid R, ..., \chi_k \mid R$ for $R$ the interval for which the node $N_p$ was built. Let $\chi_R' = \{\chi_1 \mid R, ..., \chi_k \mid R\}$, and let $\chi_R$ be the original distribution from which the model in the parent node was created. Let $A'$ be the set of $k$ insertions and let $A$ be the set of points from which the parent of $N$ was built. The number of insertions, out of the $k$ new insertions, into the $j$-th node slot is $r_{A'}(N_j) - r_{A'}(N_{j-1})$. To study this quantity, recall that the $j$-th slot was created so that

$r_A(N_j) - r_A(N_{j-1}) = n^{2^{-i}}$.    (9.9)

Thus, we first relate $r_A$ and $r_{A'}$. The elements of $A$ and $A'$ are independent, so applying the extension of the DKW bound to independent but non-identical random variables [126, Chapter 25.1] to both $A$ and $A'$, we have

$P(\|k r_{\chi_R'} - r_{A'}\|_\infty \ge \sqrt{\tfrac{k}{2}}\,\epsilon) \le 2e^{-\epsilon^2+1}$,    (9.10)

$P(\|k r_{\chi_R} - r_A\|_\infty \ge \sqrt{\tfrac{k}{2}}\,\epsilon) \le 2e^{-\epsilon^2+1}$.    (9.11)

Moreover, assuming $TV(\chi_i \mid R, \chi_j \mid R) \le \delta$ for all $i, j$, we have $\|r_{\chi_R} - r_{\chi_R'}\|_\infty \le \delta$. Combining this with Ineq. 9.10 and 9.11 and using the triangle inequality, we have

$P(\|r_{A'} - r_A\|_\infty \ge \sqrt{2k}\,\epsilon + \delta k) \le 4e^{-\epsilon^2+1}$.    (9.12)

Finally, combining Ineq.
9.12 with Eq. 9.9, and recalling that $k = n^{2^{-i+1}}$, implies that

$P\big(r_{A'}(N_j) - r_{A'}(N_{j-1}) \ge n^{2^{-i}}(1 + 2\sqrt{2}\epsilon) + 2\delta n^{2^{-i+1}}\big) \le 4e^{-\epsilon^2+1}$,

and therefore

$P\big(\lfloor \tfrac{r_{A'}(N_j) - r_{A'}(N_{j-1})}{n^{2^{-i}}}\rfloor \ge 2\sqrt{2}\epsilon + 2\delta n^{2^{-i}}\big) \le 4e^{-\epsilon^2+1}$,

where $\lfloor \frac{r_{A'}(N_j) - r_{A'}(N_{j-1})}{n^{2^{-i}}}\rfloor$ is the number of splits of the $j$-th node slot. Let $S_j$ denote this random variable. We have

$E[S_j] = \sum_{m=0}^{k} P(S_j \ge m) \le 2\delta n^{2^{-i}} + \sum_{m=0}^{k-2\delta n^{2^{-i}}} P(S_j \ge m + 2\delta n^{2^{-i}}) \le 2\delta n^{2^{-i}} + \sum_{m=0}^{k-2\delta n^{2^{-i}}} 4e^{-\frac{m^2}{8}+1} = O(\delta n^{2^{-i}})$.

Finally, we are interested in $E[\log(S_j)] \le \log(E[S_j]) = O(\log(n^{2^{-i}}\delta))$.

Total Query Time. Thus, the expected time to search a node at the $i$-th level to find its children is $O(T^X_{n^{2^{-i}}} + \log(n^{2^{-i}}\delta))$, and the total time to search the tree is $O(\sum_{i=1}^{\log\log n} T^X_{n^{2^{-i}}} + \log(n^{2^{-i}}\delta_i))$. We can bound this as $O(T^X_n\log\log n + \log(\bar{\delta}n))$, where $\bar{\delta} = \min\{\delta, \delta_c^{\log\log n}\}$.

Insertion Time. Note that the insertion time is equal to the query time plus the total cost of rebuilds; we next calculate the cost of rebuilds. Let $T(N)$ be the amortized cost of inserting $N$ elements into a tree that currently has $N$ elements and was just rebuilt at its root, so that $N\,T(N)$ is the total insertion cost for the $N$ elements. Note that the amortized cost of all $n$ insertions, starting from a tree with 1 element, is at most $\frac{1}{n}\sum_{i=0}^{\log n}\frac{n}{2^i}T(\frac{n}{2^i}) \le T(n)\sum \frac{1}{2^i} = O(T(n))$. Thus, we only need to study $T(n)$. When inserting $n$ elements into a tree that currently has $n$ elements and was just rebuilt at its root, the height of the tree remains constant throughout the insertions. Furthermore, at the $i$-th level, $i \ge 0$, there will be at most $\frac{n}{n^{2^{-i}}}$ rebuilds, and each rebuild costs

$\sum_{j=i}^{\log\log n} \frac{2n^{2^{-i}}}{n^{2^{-j}}}\,B^X_{2n^{2^{-j}}}$.

Thus, the amortized cost of all rebuilds is

$\frac{1}{n}\sum_{i=0}^{\log\log n}\frac{n}{n^{2^{-i}}}\sum_{j=i}^{\log\log n}\frac{2n^{2^{-i}}}{n^{2^{-j}}}B^X_{2n^{2^{-j}}} = O\big(\sum_{i=0}^{\log\log n}(i+1)\frac{B^X_{2n^{2^{-i}}}}{n^{2^{-i}}}\big)$.

Thus, the total cost of insertions is $O(T^X_n\log\log n + \log(\bar{\delta}n) + \frac{B^X_n}{n}\log^2\log n)$.

Space Overhead. After $n$ insertions, we have $\log\log n$ levels. Right after the root is rebuilt, level $i$ has at most $\frac{n}{n^{2^{-i}}}$ models. If $n$ further insertions are performed, each level will have at most $\frac{n}{n^{2^{-i}}}$ new models. Thus, the total size of the models at level $i$ is at most $\frac{2n}{n^{2^{-i}}}S^X_{n^{2^{-i}}}$, and the total size of all models is $O(n\sum_{i=0}^{\log\log n}\frac{S^X_{n^{2^{-i}}}}{n^{2^{-i}}})$. Furthermore, the total number of nodes in the tree is $O(n)$, and for each node we store a pointer to it, its lower and upper bounds, and a counter, which can be done in $O(\log n)$ bits. Thus, the total space consumption is $O(n(\log n + \sum_{i=0}^{\log\log n}\frac{S^X_{n^{2^{-i}}}}{n^{2^{-i}}}))$.

Proof of Corollary 14. Observe that, for any distribution in $X$, $\bar{\chi}\mid R$ for any interval $R$ has p.d.f. at most $\frac{\rho_2}{\rho_1}$. According to Lemma 16, distributions with p.d.f. at most $\frac{\rho_2}{\rho_1}$ are distribution learnable. Substituting the complexities proves the corollary.

9.7.3.5 Cardinality Estimation

High Dimensions (Theorem 15)

Construction. By distribution learnability, we have an algorithm $A$ that builds a model, which we use for estimation. To prove the theorem, we use $A$ and periodically rebuild the model to answer queries. Specifically, $A$ is called every $k$ insertions, where, if $\delta \ge \frac{2\kappa}{\sqrt{n}}$, $k = \frac{\phi-\kappa}{2\kappa\delta}\sqrt{n}$, and, if $\delta \le \frac{2\kappa}{\sqrt{n}}$, $k = n \times \min\{(\frac{\phi-\kappa}{\kappa(1+2\kappa)})^2, 1\}$. That is, if there are currently $n$ points inserted and we insert $n'$ new points, for $n' < k$, the algorithm answers queries as $(n + n')\hat{c}$. However, when $n' = k$, the algorithm rebuilds the model and starts answering queries using the new model.
Query Time and Space Consumption. We use a single model with no additional data structure, so the query time is $O(T^X_n)$ and the space complexity is $O(S^X_n)$.

Insertion Complexity. To analyze the cost of insertions, first consider $T(n)$, the total number of operations when we insert $n$ new elements into a data structure that already has $n$ elements. Consider the two cases $\delta \ge \frac{2\kappa}{\sqrt{n}}$ and $\delta \le \frac{2\kappa}{\sqrt{n}}$. In the first case, we rebuild the model every $\frac{\phi-\kappa}{2\kappa\delta}\sqrt{n}$ insertions, so that there are at most $\frac{n}{\frac{\phi-\kappa}{2\kappa\delta}\sqrt{n}} = \frac{2\kappa\delta}{\phi-\kappa}\sqrt{n}$ rebuilds. In the second case, we rebuild the model every $\rho n$ insertions, so that the total number of rebuilds is $\frac{1}{\rho}$. Ensuring that $\phi \ge \kappa + 1$, we have $\frac{1}{\rho} \le \kappa^2(2\kappa+1)^2$. In either case, each rebuild costs $O(B^X_{2n})$, and apart from rebuilds, insertions take constant time. Thus, if $\delta \ge \frac{2\kappa}{\sqrt{n}}$, $T(n) = O(n + \frac{\delta}{\phi}\sqrt{n}\,B^X_{2n})$, and if $\delta \le \frac{2\kappa}{\sqrt{n}}$, $T(n) = O(n + B^X_{2n})$. Next, to analyze the total runtime of starting from 0 elements and inserting $n$ new elements, the amortized insertion cost is $\frac{1}{n}\sum_{i=1}^{\log n}T(\frac{n}{2^i})$. If $\delta \ge \frac{2\kappa}{\sqrt{n}}$, this is $O(\frac{1}{n}\sum_{i=1}^{\log n}(\frac{n}{2^i} + \frac{\delta}{\phi}\sqrt{\frac{n}{2^i}}\,B^X_{\frac{2n}{2^i}})) = O(\frac{1}{n}(n + \frac{\delta}{\phi}\sqrt{n}\,B^X_{2n}))$, and if $\delta \le \frac{2\kappa}{\sqrt{n}}$, this is $O(\frac{1}{n}(n + B^X_{2n}))$. Thus, the amortized insertion cost is $O(\max\{\frac{\delta}{\phi\sqrt{n}}, \frac{1}{n}\}B^X_n)$.

Accuracy. We show that if $\delta \ge \frac{2\kappa}{\sqrt{n}}$, rebuilding the model every $\frac{\phi-\kappa}{2\kappa\delta}\sqrt{n}$ insertions, and, if $\delta \le \frac{2\kappa}{\sqrt{n}}$, rebuilding the model every $\rho n$ insertions for $\rho = \min\{(\frac{\phi-\kappa}{\kappa(1+2\kappa)})^2, 1\}$, is sufficient to answer queries with error at most $\phi\sqrt{n}$, whenever $\phi \ge \kappa + 1$.

Assume a model was built using dataset $D_i$, i.e., after $i$ insertions. We study the error in answering after $k$ new insertions, so that the goal is to answer queries on $D_j$, $j = i + k$. Let $\chi$ and $\chi'$ be the distributions such that $D_i \sim \chi$ and $D_{i:j} \sim \chi'$. Consider a model, $\hat{c}$, built on $D_i$ using $A$, so that

$P(i\|\hat{c} - c_\chi\|_\infty \ge \sqrt{i}(\tfrac{\epsilon}{\sqrt{\kappa_2}} + 1)) \le \kappa_1 e^{-\epsilon^2}$.    (9.13)

We are interested in

$\|j\hat{c} - c_{D_j}\| = \|i\hat{c} + k\hat{c} - c_{D_i} - c_{D_{i:j}}\| \le \|i\hat{c} - c_{D_i}\| + \|k\hat{c} - c_{D_{i:j}}\|$.    (9.14)

For the first term, by Hoeffding's inequality we have

$P(|i c_\chi(q) - c_{D_i}(q)| \ge \sqrt{i}\,\epsilon) \le e^{-2\epsilon^2}$,    (9.15)

which, combined with Ineq. 9.13, gives

$P\big(|i\hat{c}(q) - c_{D_i}(q)| \ge \sqrt{i}((1 + \tfrac{1}{\sqrt{\kappa_2}})\epsilon + 1)\big) \le (1+\kappa_1)e^{-2\epsilon^2}$.    (9.16)

For the second term, again by Hoeffding's inequality, we have

$P(|k c_{\chi'}(q) - c_{D_{i:j}}(q)| \ge \sqrt{k}\,\epsilon) \le e^{-2\epsilon^2}$.    (9.17)

We also have $\|c_\chi - c_{\chi'}\|_\infty \le \delta$, which, combined with Ineq. 9.17, gives

$P(|k c_\chi(q) - c_{D_{i:j}}(q)| \ge \sqrt{k}\,\epsilon + k\delta) \le e^{-2\epsilon^2}$,

and therefore, using Ineq. 9.13, we have

$P\big(|k\hat{c}(q) - c_{D_{i:j}}(q)| \ge \sqrt{k}\,\epsilon + \tfrac{k}{\sqrt{i}}(\tfrac{\epsilon}{\sqrt{\kappa_2}} + 1) + k\delta\big) \le (\kappa_1+1)e^{-2\epsilon^2}$.    (9.18)

Combining Ineq. 9.14, 9.16 and 9.18, we have

$P\big(|j\hat{c}(q) - c_{D_j}(q)| \ge (\sqrt{i} + \tfrac{\sqrt{i}}{\sqrt{\kappa_2}} + \sqrt{k} + \tfrac{k}{\sqrt{i\kappa_2}})\epsilon + \tfrac{k}{\sqrt{i\kappa_2}} + \sqrt{i} + k\delta\big) \le 2(\kappa_1+1)e^{-2\epsilon^2}$.

As such, we have $E\big[\frac{|j\hat{c}(q) - c_{D_j}(q)| - (\frac{k}{\sqrt{i\kappa_2}} + \sqrt{i} + k\delta)}{\sqrt{i} + \frac{\sqrt{i}}{\sqrt{\kappa_2}} + \sqrt{k} + \frac{k}{\sqrt{i\kappa_2}}}\big] \le \kappa_3$ for some universal constant $\kappa_3$, so that

$E[|j\hat{c}(q) - c_{D_j}(q)|] \le \frac{k}{\sqrt{i\kappa_2}} + \sqrt{i} + k\delta + \kappa_3\sqrt{i} + \frac{\kappa_3\sqrt{i}}{\sqrt{\kappa_2}} + \kappa_3\sqrt{k} + \frac{\kappa_3 k}{\sqrt{i\kappa_2}}$.

Assuming $k \le i$, we have $E[|j\hat{c}(q) - c_{D_j}(q)|] \le \kappa(\sqrt{i} + \sqrt{k} + k\delta)$ for some universal constant $\kappa$. Now, if $\delta \ge \frac{2\kappa}{\sqrt{i}}$ we let $k = \frac{\phi-\kappa}{2\kappa\delta}\sqrt{i}$, and otherwise we set $k = \rho i$ for $\rho = \min\{(\frac{\phi-\kappa}{\kappa(1+2\kappa)})^2, 1\}$.

First, consider the case where $\delta \ge \frac{1}{\sqrt{j}}$. Consider the error $\epsilon = \phi\sqrt{j} \ge \phi\sqrt{i}$; it suffices to show that the error is at most $\phi\sqrt{i}$. Let $k = \frac{\phi-\kappa}{2\kappa\delta}\sqrt{i}$. Substituting this in, we want to show $\sqrt{\frac{\kappa(\phi-\kappa)}{2\delta}\sqrt{i}} - \frac{\phi-\kappa}{2}\sqrt{i} \le 0$.
Indeed, for $\delta \ge \frac{2\kappa}{\sqrt{i}}$, we have $\frac{\kappa(\phi-\kappa)}{2\delta}\sqrt{i} \le \frac{\phi-\kappa}{4}i$, so that

$\sqrt{\frac{\kappa(\phi-\kappa)}{2\delta}\sqrt{i}} - \frac{\phi-\kappa}{2}\sqrt{i} \le \frac{1}{2}\sqrt{(\phi-\kappa)i} - \frac{\phi-\kappa}{2}\sqrt{i} = \frac{1}{2}\sqrt{(\phi-\kappa)i}\,(1 - \sqrt{\phi-\kappa}) \le 0$,

which proves $E[|j\hat{c}(q) - c_{D_j}(q)|] \le \phi\sqrt{j}$ whenever $\delta \ge \frac{2\kappa}{\sqrt{i}}$ and $\phi - \kappa \ge 1$.

If $\delta \le \frac{2\kappa}{\sqrt{i}}$, we set $k = \rho i$ for $\rho = \min\{(\frac{\phi-\kappa}{\kappa(1+2\kappa)})^2, 1\}$. We have

$\kappa(\sqrt{i} + \sqrt{\rho i} + \rho i\delta) \le \kappa\sqrt{i}(1 + \sqrt{\rho} + 2\sqrt{\rho}\kappa)$,

so that we need to ensure $1 + \sqrt{\rho}(1 + 2\kappa) \le \frac{\phi}{\kappa}$. Observe that $\sqrt{\rho} \le \frac{\phi-\kappa}{\kappa(1+2\kappa)}$ implies the above, so that setting $\rho$ as above proves the result in this case.

One Dimension (Theorem 16)

Construction. We build a B-tree-like data structure in which, in addition to its content, each node keeps a counter of the number of elements inserted into it. After each insertion, if a leaf node has more than $k$ elements, the node is split into two, for $k = \frac{\epsilon^2}{4(\kappa+1)^2}$. Thus, leaf nodes cover between $\frac{k}{2}$ and $k$ elements, while the rest of the tree has its own fanout $B$. Leaf nodes do not store the elements associated with them, but instead build models to answer queries. To answer a query, the tree is traversed similarly to typical range-query answering with a B-tree. However, if a node is fully covered by the query range, then the number of insertions into the node is used to answer the query. Otherwise, the node is recursively searched until reaching a leaf node. At most 2 of the leaf nodes reached are partially covered by the query range. Finally, the model of each leaf is constructed using the construction in Theorem 15 with $\phi = \kappa + 1$.

Query Time and Space Consumption. Each query takes $O(\log n + 2T^X_k)$, where $\log n$ is due to the tree traversal and $2T^X_k$ accounts for the two model inferences needed. Moreover, the total space consumption is $O(\frac{n}{k}S^X_k + \frac{n}{k}\log n)$.

Insertion Complexity. Each leaf node starts with $\frac{k}{2}$ elements and is split when it reaches $k$ elements. Inserting $\frac{k}{2}$ elements into a node costs $O(k + \max\{\delta\sqrt{k}, 1\}B^X_k)$. Given $n$ insertions, we have $\frac{2n}{k}$ batches of $\frac{k}{2}$ insertions into the nodes, so that the amortized cost of rebuilds is $O(\frac{1}{n}\cdot\frac{n}{k}(k + \max\{\delta\sqrt{k}, 1\}B^X_k)) = O(\max\{\frac{\delta}{\sqrt{k}}, \frac{1}{k}\}B^X_k)$. Furthermore, traversing the tree nodes costs $O(\log n)$ per insertion, so that the amortized insertion cost is $O(\max\{\frac{\delta}{\sqrt{k}}, \frac{1}{k}\}B^X_k + \log n) = O(\max\{\frac{\delta}{\epsilon}, \frac{1}{\epsilon^2}\}B^X_{\epsilon^2} + \log n)$.

Accuracy. Setting $\phi = \kappa + 1$ in Theorem 15 and having $k \le \frac{\epsilon^2}{4(1+\kappa)^2}$ ensures that the expected error of each model is at most $\phi\sqrt{k} = \frac{\epsilon}{2}$. Since each query is answered by making two model calls, the total expected error for answering queries is at most $\epsilon$, as required.
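A minimal Python sketch of this counter-based structure is given below, using a flat array of intervals instead of a tree and exact per-interval counts; the two partially covered boundary intervals are answered by exact search, which stands in for the leaf models of accuracy $\epsilon/2$. The names and the static setting are illustrative simplifications.

```python
import bisect
import numpy as np

def build_counting_index(points, k):
    """Sort points and split them into intervals of k points each; interval
    boundaries play the role of the tree, and the exact count k per interval
    plays the role of the node counters."""
    pts = np.sort(np.asarray(points))
    assert len(pts) % k == 0, "sketch assumes n divisible by k"
    return pts, pts[::k]

def range_count(pts, bounds, k, lo, hi):
    """Assumes pts[0] <= lo <= hi <= pts[-1]. Fully covered intervals are
    answered exactly from the counters; the two boundary intervals would be
    answered by their leaf models (exact search stands in for them here)."""
    i = bisect.bisect_right(bounds, lo) - 1   # interval containing lo
    j = bisect.bisect_right(bounds, hi) - 1   # interval containing hi
    if i == j:                                # one leaf: one model call
        seg = pts[i * k:(i + 1) * k]
        return int(np.sum((seg >= lo) & (seg <= hi)))
    full = (j - i - 1) * k                    # counters of covered intervals
    left = pts[i * k:(i + 1) * k]
    right = pts[j * k:(j + 1) * k]
    return int(full + np.sum(left >= lo) + np.sum(right <= hi))
```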
High Dimensions with Arbitrary Accuracy. Here we also discuss how models can be used to answer queries to arbitrary accuracy in high dimensions. We note that, as seen here, building data structures to answer queries in high dimensions is difficult. We discuss this result only in the static setting.

Lemma 18. There exists a learned model that can answer cardinality estimation queries with error up to $\epsilon$ with query time $O(T^X_{(\frac{\epsilon}{2d})^2} + (\frac{4d^2}{\epsilon^2})^d\,\Pi_i k_i)$ and space complexity $O(\frac{4d^2 n}{\epsilon^2}S^X_{(\frac{\epsilon}{2d})^2} + (\frac{4d^2 n}{\epsilon^2})^d)$, where $k_i$ is the cardinality of the query in the $i$-th dimension, i.e., the number of points that would match the query if only the $i$-th dimension were considered.

Construction. Assume we would like to achieve accuracy $\epsilon$. We build a grid and materialize the exact result in each cell. For query parts that partially overlap a cell, we also build models to answer queries. Thus, a query is decomposed into parts that fully contain cells and parts that do not, the latter being answered by models. Split the $i$-th dimension into $k = \frac{4d^2 n}{\epsilon^2}$ partitions, each containing $(\frac{\epsilon}{2d})^2$ points. Let $S[i] = \{s^i_1, ..., s^i_k\}$ be the partition points in the $i$-th dimension; that is, for all $j$ we have, for $P[i,j] = \{p \in D,\ s^i_j \le p_i < s^i_{j+1}\}$, $|P[i,j]| = (\frac{\epsilon}{2d})^2$. Using Theorem 15 to build a model for each set of points in $P$, the expected error of each model is $O(\frac{\epsilon}{2d})$. The models are stored in $M$, with $M[i][j]$ denoting the model corresponding to the $j$-th partition in the $i$-th dimension.

Algorithm 21 Cardinality Estimation with Grid
Input: Query q, dimension to refine, i, set of models, M, and set of partition points, S
Output: Estimate of the cardinality of q
1: procedure Query(q, i, S, M)
2:   if i = 0 then
3:     return use grid to answer q
4:   i_l ← index of q[i][0] in S[i]
5:   i_u ← index of q[i][1] in S[i]
6:   if i_u = i_l then
7:     return M[i][i_l](q)
8:   q_u ← q
9:   q_u[i][0] ← S[i_u]
10:  q_l ← q
11:  q_l[i][1] ← S[i_l + 1]
12:  if i_u = i_l + 1 then
13:    return M[i][i_l](q_l) + M[i][i_u](q_u)
14:  q[i][0] ← S[i_l + 1]
15:  q[i][1] ← S[i_u]
16:  return Query(q, i − 1, S, M) + M[i][i_l](q_l) + M[i][i_u](q_u)

To answer a query, we first decompose it into $2d+1$ queries. $2d$ of the queries are answered by models, which reduces the original query to one aligned with the facets of the grid cells. Then, the grid cells are used to answer the final query, and the answer is combined with the model estimates to form the final estimate. This is presented in Alg. 21. The decomposition is done by recursively moving the upper and lower facets of the query hyperrectangle in the $d$-th dimension to align with the grid cells. Thus, in the $d$-th dimension, if the closest grid partition points, respectively larger and smaller than $q[d][0]$ and $q[d][1]$ (the lower and upper bounds of the query in the $d$-th dimension), are $s_i$ and $s_j$, we decompose the query into three queries, $q_1$, $q_2$ and $q_3$, all the same as $q$ except that $q_1[d][1] = s_i$, $q_2[d][0] = s_j$, and $q_3[d][0] = s_i$, $q_3[d][1] = s_j$. Then, learned models are used to answer $q_1$ and $q_2$, while $q_3$ is further recursively decomposed along its $(d-1)$-th dimension (and, after full decomposition, is answered using the grid). Note that $q_1$ and $q_2$ can be answered using models because, by the grid construction, they fall in a part of the space with at most $(\frac{\epsilon}{2d})^2$ points.

Accuracy. Grid cells are exact and, as discussed above, each model is built on a dataset of size at most $(\frac{\epsilon}{2d})^2$, so that it has expected error $O(\frac{\epsilon}{2d})$. Thus, combining the errors of the $2d$ model-answered queries, the total model error is $\epsilon$, as desired.

Query Time and Space Complexity. There are $2d$ model calls, each costing $T^X_{(\frac{\epsilon}{2d})^2}$. Furthermore, if the $i$-th dimension of the query covers $k_i$ points, then at most $\frac{4d^2 k_i}{\epsilon^2}$ partitions in the $i$-th dimension intersect the query, so that the total number of cells traversed is $(\frac{4d^2}{\epsilon^2})^d\,\Pi_i k_i$. Thus, the total query time is $O(T^X_{(\frac{\epsilon}{2d})^2} + (\frac{4d^2}{\epsilon^2})^d\,\Pi_i k_i)$.
Furthermore, the total cost of storing the models is $\frac{4d^2 n}{\epsilon^2}S^X_{(\frac{\epsilon}{2d})^2}$, and the cost of the grid is $(\frac{4d^2 n}{\epsilon^2})^d$. Thus, the total space complexity is $O(\frac{4d^2 n}{\epsilon^2}S^X_{(\frac{\epsilon}{2d})^2} + (\frac{4d^2 n}{\epsilon^2})^d)$.

9.7.3.6 Sorting Using a Learned Model (Theorem 17)

Algorithm. The algorithm is presented in Alg. 22. A sample of the array is first drawn, and a model is built from the sample. Then, using the model, the array is split into $n^{\frac{1}{5}}$ buckets where, as we show theoretically, with high probability and based on the accuracy of the model, merging the buckets can be done in linear time (because there is limited overlap between the buckets) and no bucket will be too big. Indeed, we first check that these two properties hold (otherwise the algorithm aborts and reverts to merge sort), and then proceed to merge the buckets.

Algorithm 22 Learned Sorting
Input: An array A of length n to be sorted
Output: The sorted array
1: procedure Sort(A)
2:   if n ≤ κ then
3:     return MergeSort(A)
4:   S ← random sample of A of size √n
5:   S ← MergeSort(S)
6:   k ← n^(1/5)
7:   f̂ ← A(S)
8:   B ← array of k empty buckets
9:   B_min, B_max ← arrays tracking min/max of B[i], ∀i
10:  for i in n do
11:    B[⌊k f̂(A[i])/n⌋].append(A[i])
12:    Update B_min, B_max for bucket ⌊k f̂(A[i])/n⌋
13:  for i in k − 2 do
14:    if B_max[i] > B_min[i + (2κ + 1)] then
15:      return MergeSort(A)
16:  for i in k do
17:    if |B[i]| ≥ (2κ + 1)n^(4/5) then
18:      return MergeSort(A)
19:    else
20:      B[i] ← Sort(B[i])
21:  return Merge(B)  ▷ Alg. 23

Algorithm 23 Merge Step
Input: An array of sorted buckets
Output: Buckets merged into a sorted array
1: procedure Merge(B)
2:   A_s ← empty array of size n
3:   A_s[1 : len(B[1])] ← B[1]
4:   m ← len(B[1])                ▷ size of the merged prefix
5:   for b ← 2 to k do            ▷ merge A_s[1 : m] with B[b]
6:     j ← m                      ▷ iterator for A_s
7:     i ← len(B[b])              ▷ iterator for B[b]
8:     while i > 0 do             ▷ fill positions j + i from the right
9:       if j > 0 and A_s[j] > B[b][i] then
10:        A_s[j + i] ← A_s[j]; j ← j − 1
11:      else
12:        A_s[j + i] ← B[b][i]; i ← i − 1
13:    m ← m + len(B[b])
14:  return A_s

Correctness. If all the created buckets are sorted, the merge step simply merges them and thus correctly returns a sorted array. At the base case, merge sort is used, so the buckets are sorted correctly. Thus, by this invariant, the algorithm is correct.
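Since the backward merge of Alg. 23 is the step that is easiest to get wrong, a direct Python rendering may be useful; this is an illustrative reimplementation, not the dissertation's code. Each bucket is merged into the prefix merged so far by scanning both from the right, so the work per merge is proportional to the overlap between buckets.

```python
def merge_buckets(buckets):
    """Merge sorted buckets into one sorted array, Alg. 23 style."""
    out = list(buckets[0])
    for b in buckets[1:]:
        j = len(out) - 1          # last element of the merged prefix
        i = len(b) - 1            # last element of the incoming bucket
        out.extend(b)             # grow; the tail is overwritten below
        w = len(out) - 1          # write position, moving right to left
        while i >= 0 and j >= 0:
            if b[i] >= out[j]:
                out[w] = b[i]; i -= 1
            else:
                out[w] = out[j]; j -= 1
            w -= 1
        while i >= 0:             # prefix exhausted: copy bucket remainder
            out[w] = b[i]; i -= 1; w -= 1
    return out

print(merge_buckets([[1, 4], [2, 3], [5]]))  # [1, 2, 3, 4, 5]
```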
Time Complexity. Consider sorting an array $A \sim \chi$ of size $n$. We take a subset, $S$, of size $\sqrt{n}$ from $A$ without examining the elements, to preserve the i.i.d. assumption. We sort it and use the algorithm $A$ to obtain a model $\hat{r}$. We have

$P(\|\hat{r} - r_\chi\|_\infty \ge \sqrt{n}\,\epsilon_1) \le \kappa_1 e^{-\kappa_2(\epsilon_1 n^{-\frac{1}{4}}-1)^2}$.    (9.19)

Note that $\|\hat{r} - r_A\|_\infty \le \|\hat{r} - r_\chi\|_\infty + \|r_\chi - r_A\|_\infty$, and, by the DKW inequality,

$P(\|r_A - r_\chi\|_\infty \ge \epsilon_3) \le 2e^{-2(\frac{\epsilon_3}{\sqrt{n}})^2}$.

Setting $\epsilon_1 = n^{\frac{1}{4}}(\sqrt{\frac{\log\log n}{\kappa_2}} + 1)$ and $\epsilon_3 = \sqrt{\frac{n\log\log n}{2}}$, we have

$P(\|\hat{r} - r_\chi\|_\infty \ge \sqrt{n}\,\epsilon_1\ \text{or}\ \|r_A - r_\chi\|_\infty \ge \epsilon_3) \le \kappa_1 e^{-\log\log n} + 2e^{-\log\log n} = \frac{\kappa_1+2}{\log n}$.

Thus, $P(\|\hat{r} - r_A\|_\infty \ge \sqrt{n}\,\epsilon_1 + \epsilon_3) \le \frac{\kappa_1+2}{\log n}$, and hence, whenever $\log\log n \ge 2$ and for $\kappa = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{\kappa_2}}$,

$P(\|\hat{r} - r_A\|_\infty \ge \kappa n^{\frac{3}{4}}\sqrt{\log\log n}) \le \frac{\kappa_1+2}{\log n}$.

To simplify, observe that $n^{3/4}\sqrt{\log\log n} \le n^{4/5}$ for $n \ge e$, so that

$P(\|\hat{r} - r_A\|_\infty \ge \kappa n^{\frac{4}{5}}) \le \frac{\kappa_1+2}{\log n}$.

Recall that we use $k$ buckets, and consider the $i$-th bucket. The elements, $x$, assigned to it must have $\frac{i}{k} \le \frac{1}{n}\hat{r}(x) < \frac{i+1}{k}$. Combining this with the above, whenever $\|\hat{r} - r_A\|_\infty \le \kappa n^{\frac{4}{5}}$ holds, the elements $x$ in the $i$-th bucket must have $\frac{1}{n}r_A(x) \ge \frac{i}{k} - \kappa n^{-\frac{1}{5}}$ and $\frac{1}{n}r_A(x) \le \frac{i+1}{k} + \kappa n^{-\frac{1}{5}}$. Therefore, we must have $r_A(x) \in [\frac{in}{k} - \kappa n^{\frac{4}{5}},\ \frac{(i+1)n}{k} + \kappa n^{\frac{4}{5}}]$. There are at most $\frac{n}{k} + 2\kappa n^{\frac{4}{5}}$ elements in this set. Setting $k = n^{\frac{1}{5}}$, whenever $\|\hat{r} - r_A\|_\infty \le \kappa n^{\frac{4}{5}}$ holds, all buckets have at most $(2\kappa+1)n^{\frac{4}{5}}$ elements. Thus, the probability that some bucket has more than $(2\kappa+1)n^{\frac{4}{5}}$ elements is at most $\frac{\kappa_1+2}{\log n}$.

Furthermore, the largest element in the $i$-th bucket or before has $r_A(x) < (i+1+\kappa)n^{\frac{4}{5}}$, and the smallest element in the $(i+2\kappa+1)$-th bucket or after has $r_A(x) \ge (i+1+\kappa)n^{\frac{4}{5}}$, so that the contents of the buckets up to $i$ are less than the contents of the buckets from $i+2\kappa+1$ onwards. Thus, whenever $\|\hat{r} - r_A\|_\infty \le \kappa n^{\frac{4}{5}}$ holds, if all the buckets are sorted, then the algorithm takes at most $\sum_{i=1}^{k}\sum_{j=0}^{2\kappa+1}|B_{i-j}|$ operations to merge the sorted array, which is $O(n)$, where $|B_i|$ is the number of elements in the $i$-th bucket.

Finally, let $T(n)$ be the expected number of operations it takes to sort an array of $n$ elements sampled i.i.d. from a distribution learnable class. Recall that we fall back to merge sort if $\|\hat{r} - r_A\|_\infty \le \kappa n^{\frac{4}{5}}$ does not hold, and recursively sort the buckets if it does. Thus, whenever we recursively sort, each bucket has at most $(2\kappa+1)n^{\frac{4}{5}}$ i.i.d. elements, distributed according to a conditional distribution of the original distribution. Thus, we have

$T(n) \le O(\sqrt{n}B^X_{\sqrt{n}} + T^X_{\sqrt{n}}n) + P(\text{merge sort})\,n\log n + P(\text{recursive sort})\,n^{\frac{1}{5}}T((2\kappa+1)n^{\frac{4}{5}})$
$\le O(\sqrt{n}B^X_{\sqrt{n}} + T^X_{\sqrt{n}}n) + \frac{\kappa_1+2}{\log n}n\log n + n^{\frac{1}{5}}T((2\kappa+1)n^{\frac{4}{5}})$
$= O\big(T^X_{\sqrt{n}}\,n\log\log n + \sqrt{n}B^X_{\sqrt{n}} + \sum_{i=0}^{\log\log n} n^{1-\frac{1}{2}(\frac{4}{5})^i}B^X_{n^{\frac{1}{2}(\frac{4}{5})^i}}\big)$.

Space Complexity. First, observe that we only create one model at a time, so the maximum space used for modeling is $S^X_{\sqrt{n}} + \sqrt{n}\log n$. Moreover, the depth of the recursion is at most $O(\log\log n)$, and the overhead of storing $B$ is dominated by the first level of the recursion, whose overhead is $O(n^{\frac{1}{5}}\log n + n\log n)$, giving an overall space overhead of $O(S^X_{\sqrt{n}} + n\log n)$.

Sorting Using the Data Distribution (Theorem 18)

Alg. 24 shows how to sort an array given an approximate model, $\hat{r}$, of the data distribution. The algorithm is very similar to Alg. 22, but uses $n$ buckets and sorts each bucket using merge sort (with no recursive bucketing).

Algorithm 24 Sorting Using a Distribution Model
Input: An unsorted array A of size n
Output: A sorted array
1: procedure Sort(A)
2:   if n ≤ 10 then
3:     return MergeSort(A)
4:   A′ ← new array, A′[i] initialized as a linked list, ∀i
5:   for i ← 1 to n do
6:     i′ ← ⌈r̂(A[i])⌉
7:     A′[i′].append(A[i])
8:   for each non-empty A′[i] do A′[i] ← MergeSort(A′[i])
9:   return Merge(A′)  ▷ Alg. 23

Correctness. The algorithm creates buckets and sorts them independently. The sorting is done by merge sort, so it is correct, and merging the sorted buckets thus creates a sorted array.
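For concreteness, a small Python rendering of Alg. 24's idea follows; the model $\hat{r}$ is assumed given (below, the exact c.d.f. of the uniform distribution serves as a stand-in), and Python's built-in sort replaces merge sort within the buckets. With an approximate $\hat{r}$, the grouped sorting and merging analyzed next would be applied instead.

```python
import random

def cdf_bucket_sort(A, r_hat):
    """Place each x near position r_hat(x) (an approximation of n * F(x)),
    then sort each bucket; with an accurate model the buckets stay small
    and the total work approaches O(n)."""
    n = len(A)
    buckets = [[] for _ in range(n + 1)]
    for x in A:
        i = min(max(int(r_hat(x)), 0), n)   # predicted rank, clamped
        buckets[i].append(x)
    out = []
    for b in buckets:
        b.sort()                            # merge sort in the dissertation
        out.extend(b)
    return out

A = [random.random() for _ in range(10_000)]
assert cdf_bucket_sort(A, lambda x: len(A) * x) == sorted(A)  # uniform c.d.f.
```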
Running Time. Let $T(A)$ be the number of operations the algorithm performs on an array $A$, and let $T(n) = E_{A\sim\chi^n}[T(A)]$ be the expected running time of the algorithm on an input of size $n$. First, assume we use $\lfloor n r_\chi(x)\rfloor$ to map an element $x$ to a location in the array $S$. After the mapping, we study the expected time to sort the elements in $S[i:j]$, where $j = i + k$ and $k > 0$. Note that the probability that an element is mapped to a location in $[i:j]$, i.e., $P_{x\sim\chi}(\frac{i}{n} \le r_\chi(x) < \frac{j}{n})$, is $\frac{k}{n}$. Let $N_{i:j} = |S[i:j]|$ be the number of elements mapped to $S[i:j]$. We have $P_{A\sim\chi^n}(N_{i:j} = z) = C(n,z)(\frac{k}{n})^z(1-\frac{k}{n})^{n-z}$. Thus,

$E_{A\sim\chi^n}[N_{i:j}\log N_{i:j}] = \sum_{z=1}^{n} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z = \sum_{z=1}^{4ek} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z + \sum_{z=4ek}^{n} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z$.

For the first part of the summation, we have

$\sum_{z=1}^{4ek} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z \le \log(4ek)\sum_{z=1}^{4ek} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z \le k\log(4ek)$.

For the second part, we have

$\sum_{z=4ek}^{n} C(n,z)(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z \le \sum_{z=4ek}^{n}\frac{1}{\sqrt{z}}(\tfrac{en}{z})^z(\tfrac{k}{n})^z(1-\tfrac{k}{n})^{n-z}\,z\log z = \sum_{z=4ek}^{n}(\tfrac{ekn}{z(n-k)})^z(1-\tfrac{k}{n})^{n}\sqrt{z}\log z \le \sum_{z=4ek}^{n}(\tfrac{2ek}{z})^z\sqrt{z}\log z \le \sum_{z=4ek}^{n}(\tfrac{1}{2})^z z \le 2$,

so that $E_{A\sim\chi^n}[N_{i:j}\log N_{i:j}] \le (j-i)\log(4e(j-i)) + 2$.

Now recall that we use $\lfloor\hat{r}\rfloor$, with error $\|\hat{r} - n r_\chi\|_\infty \le \epsilon$, to map the elements to an array $S'$. Thus, if $\lfloor n r_\chi(x)\rfloor = j$ for any $x$, then $\hat{r}(x) \in \{j - \lfloor\epsilon\rfloor, ..., j + \lceil\epsilon\rceil\}$. Let $\bar{N}_{i:j}$ be the number of elements mapped to positions $[i:j]$ using $\hat{r}$. Note that $\bar{N}_{i:j} \le N_{i-\epsilon:j+\epsilon}$. Thus, dividing $S'$ into groups of size $\epsilon$ and sorting each separately, the total cost of sorting the groups is

$\sum_{j=0}^{n/\epsilon} E_{A\sim\chi^n}[\bar{N}_{j\epsilon:(j+1)\epsilon}\log \bar{N}_{j\epsilon:(j+1)\epsilon}] \le \sum_{j=0}^{n/\epsilon} E_{A\sim\chi^n}[N_{(j-1)\epsilon:(j+2)\epsilon}\log N_{(j-1)\epsilon:(j+2)\epsilon}] \le \frac{n}{\epsilon}\,O(\epsilon\log\epsilon) = O(n\log\epsilon)$.

Finally, note that $r_\chi$ is a non-decreasing function. Therefore, if $x \le y$, we have $\lfloor n r_\chi(x)\rfloor \le \lfloor n r_\chi(y)\rfloor$. Given that $\|\hat{r} - n r_\chi\|_\infty \le \epsilon$, if $x \le y$ then $\hat{r}(y) \ge \hat{r}(x) - 2\epsilon$. Consequently, if $x$ is mapped to the $i$-th group by $\hat{r}$, all elements in the $j$-th group with $j < i - 2$ are less than $x$. This means that, to merge the sorted groups, we start with the first group and iteratively merge the next group with the merged array so far. Performing each merge from the ends of the two sorted arrays (as in Alg. 23), each merge costs at most $3\epsilon$, so the total cost of merging all $\frac{n}{\epsilon}$ sorted groups is $O(\frac{n}{\epsilon}\cdot\epsilon) = O(n)$.

Putting everything together, when each model call costs $T^\chi$, the expected time complexity of the algorithm is $O(nT^\chi + n\log\epsilon)$. Moreover, the space overhead of the algorithm is $O(n\log n + S^\chi)$, where the $\log n$ factor is for keeping a pointer to each element of the array (instead of copying the elements).

Chapter 10
Required Model Size for Learned Database Operations

10.1 Introduction

Recent empirical results show that learned models perform many fundamental database operations (e.g., indexing, cardinality estimation) more efficiently than non-learned methods, providing significant speed-ups and space savings [44, 62, 40, 165, 61]. Nevertheless, the lack of theoretical guarantees on their performance poses a significant hurdle to their practical deployment, especially since the non-learned alternatives often provide the required theoretical guarantees [5, 100, 48, 13]. Such guarantees are needed to ensure the reliability of the learned operations across all databases at deployment time, that is, to ensure consistent performance of a learned model on databases where it has not been evaluated a priori. Thus, a theoretical guarantee, similar to existing worst-case bounds for non-learned methods, is needed for learned models, i.e., a guarantee that a learned operation will achieve the desired accuracy level on all possible databases. Providing such a guarantee depends on how large the learned model is (e.g., the number of parameters of a neural network), the desired accuracy level, and the size and dimensionality of the underlying databases.
This chapter takes the first step towards a theoretical understanding of the relationship between these factors for three key database operations, offering theoretical bounds on the required model size to achieve a desired accuracy on all possible databases of a given size and dimensionality when using learned models to perform the operations. Specifically, the three operations studied in this chapter are (1) indexing: finding an item in an array; (2) cardinality estimation: estimating how many records in a database match a query; and (3) range-sum estimation: estimating the aggregate value of an attribute over the records that match a query. We focus on numerical datasets and consider axis-aligned range queries for cardinality and range-sum estimation (i.e., queries that ask for the intersection of ranges across dimensions).

Typical learned approaches to the above database operations take a function approximation view of the operations. Let f(q) be a function that takes a query, q, as input and outputs the answer to the query calculated from the database. For instance, in the case of cardinality estimation, f(q) is the number of records in the dataset that match the query q (and f(q) can be similarly defined for indexing and range-sum estimation). At training time, a model, f̂(q; θ) (e.g., a neural network), is trained to approximate f. Training is done using supervised learning, where training labels are collected for different queries by performing the queries on the database using an existing method (e.g., for cardinality estimation, by iterating over the database and counting how many records match a query). At test time, the models are used to obtain estimates directly (e.g., by performing a forward pass of a neural network), providing f̂(q; θ) as an estimate of the answer to a query q. For indexing, where the exact location of the query in the array is needed (not an estimated location returned by the model), a local search around the model estimate is performed to find the exact answer.

Such learned approaches are currently state of the art, with experimental results showing significantly faster query times and lower storage space when using learned methods compared with non-learned methods for indexing [62, 40, 33], cardinality estimation [61, 89] and range-sum estimation [165].

Table 10.1: Our bounds on required model size in terms of data size, $n$, dimensionality, $d$, tolerable error, $\epsilon$, and domain size, $u$. Each column shows the result when $\epsilon$ is the tolerable error for the specified error scenario. ✗: no non-trivial bound possible.

| Database Operation | Worst-Case Error | Average-Case Error (Uniform Dist.) | Average-Case Error (Arbitrary Dist.) |
|---|---|---|---|
| Indexing | $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u}{n})$ (Theorem 20) | $(\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon})$ (Theorem 21) | $(\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon})$ (Theorem 25) |
| Cardinality Estimation | $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u^d}{n})$ (Theorem 20) | $(\sqrt{n}-2)\log_2(1+\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon^d}-\frac{1}{\sqrt{n}})$ (Theorem 22) | ✗ (Lemma 20) |
| Range-Sum Estimation | $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u^d}{n})$ (Theorem 20) | $(\sqrt{n}-2)\log_2(1+\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon^d}-\frac{1}{\sqrt{n}})$ (Corollary 23) | ✗ (Lemma 20) |

Furthermore, recent results also show theoretical advantages to using learned models [163, 39, 165]; most significantly, [163] shows the existence of a learned index that can achieve an expected query time of $O(\log\log n)$ under mild assumptions on the data distribution, asymptotically better than the traditional $O(\log n)$ of non-learned methods such as binary search.
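As a concrete instance of the supervised training procedure described above, the following Python snippet fits a small neural network f̂(q; θ) to the cardinality function of a 2-dimensional dataset; the architecture, library, and sample counts are arbitrary illustrative choices, not the setup used in our experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
D = rng.random((5000, 2))                       # the database: n = 5000, d = 2

def f(q):                                       # true cardinality of query q
    c, r = q[:2], q[2:]
    return np.sum(np.all((D >= c) & (D <= c + r), axis=1))

Q = rng.random((2000, 4))                       # training queries (c1, c2, r1, r2)
y = np.array([f(q) for q in Q])                 # labels computed from the database

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(Q, y)
q = np.array([0.1, 0.1, 0.4, 0.4])
print(model.predict([q])[0], f(q))              # model estimate vs. exact answer
```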
However, there has been no theoretical understanding of the required modeling choices, such as the required model size, for the learned approaches to provide an error guarantee across databases. Without any theoretical guidelines, design choices are made through empirical hyperparameter tuning, leading to choices with unknown performance guarantees at deployment time.

10.1.1 Our Results

In this chapter, we present the first known bounds on the model size needed to achieve a desired accuracy when using machine learning to perform indexing, cardinality estimation and range-sum estimation. We provide bounds on the required model size, defined as the smallest possible size for a model to achieve error at most $\epsilon$ on all $d$-dimensional datasets of size $n$. We measure model size in terms of the number of bits required to store a model (which translates to the number of model parameters by considering the storage precision for the parameters). We refer to $\epsilon$ as the tolerable error parameter, which denotes the maximum error that can be tolerated in the system. We thoroughly study the required model size in two different scenarios, namely when considering the worst-case and the average-case error. That is, $\epsilon$ can be provided in terms of the worst-case or average-case error across queries that can be tolerated for all databases (i.e., worst-case across databases). Table 10.1 summarizes our main results, which we further discuss considering the two error scenarios in turn.

First, suppose our goal is to answer all possible queries with error at most $\epsilon$ across all $d$-dimensional datasets of size $n$. The results in the second column of Table 10.1, summarizing our Theorem 20 in Sec. 10.3.1, provide a lower bound on the required model size to achieve this. For example, for indexing, to be able to guarantee error at most $\epsilon$ on all possible queries and datasets of size $n$, one must use a model whose size exceeds $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u}{n})$. Notably, the bounds depend on the domain size $u$, which is the number of possible values the records in the database can take, implicitly assuming a finite data domain. We show in Lemma 19 in Sec. 10.3.1 that this is necessary: no model with finite size can answer queries with a bounded worst-case error on all possible datasets with an infinite domain (this result is mostly of theoretical interest, since data stored in a computer always has a finite domain).

In the second scenario, our goal is to answer queries with average error at most $\epsilon$ on all $d$-dimensional datasets of size $n$. Assuming the queries are uniformly distributed, the third column of Table 10.1, summarizing Theorems 21 and 22 and Corollary 23 in Sec. 10.3.2, presents our lower bounds on the required model size. Our bounds in this scenario show a weaker dependency on data size and the tolerable error parameter compared with the worst-case error scenario and, as expected, suggest a smaller required model size. Interestingly, the bounds do not depend on the domain size and hold when the data domain is the set of real numbers, showing a significant difference between model size requirements in the two scenarios. Thus, our results formally show that robustness guarantees (i.e., guarantees on worst-case error) must come at the expense of larger model sizes. Furthermore, the results in the last column of Table 10.1, summarizing our Theorem 25 and Lemma 20 in Sec. 10.3.3, show that we can extend our results to arbitrary query distributions (compared with the uniform distribution) in the case of indexing without affecting the bounds.
However, for cardinality and range-sum estimation, we show in Lemma 20 that, when relaxing our assumption on the query distribution, one can construct arbitrarily easy distributions to answer queries from, so that no non-trivial lower bound on the model size can be obtained (surprisingly, this is not possible for learned indexing). Finally, though not presented in Table 10.1, for average-case error we complement our lower bounds on the required model size with corresponding upper bounds, showing the tightness of our results. Theorems 21–25 show that our lower bounds are tight up to an $O(\sqrt{n})$ factor, asymptotically in data size.

For practical purposes, our results can be interpreted in two ways. In the first interpretation, given a model size and data size, our results provide a lower bound on the worst-case possible error. This bound shows what error can be guaranteed by a model of a certain size (and how bad the model can get) after it is deployed in practice. This is important because datasets change in practice, and our bound on the error helps quantify whether a model of a given size can guarantee a desired accuracy level when the dataset changes. Experiments in Sec. 10.4 illustrate that this bound on the error is meaningful, showing that models achieve error values close to what the bound suggests. In the second interpretation, our results provide a lower bound on the required model size to achieve a desired accuracy level across datasets. This shows how large the model needs to be to guarantee the desired accuracy, and has significant implications for resource management in database systems. For instance, it helps a cloud service provider decide how many resources it needs to allocate (and calculate the cost) for learned models to be able to guarantee an accuracy level across all its database instances.

Overall, our results are information theoretic, showing that it is not possible for any model to contain enough information to answer queries on all datasets accurately if it contains fewer than the specified number of bits. The bounds are obtained by considering the parameters of a model as a data representation, and showing bounds on the required size of any data representation to achieve a desired accuracy when performing the specific operations. Our proofs provide a novel exploration of the function approximation view of database operations, connecting combinatorial properties of datasets with function approximation concepts. Specifically, we prove novel bounds on the packing and metric entropy of the metric space of database query functions. In Sec. 10.6, we discuss various possible extensions of our results to queries with joins, other aggregation functions such as min/max/avg, and other error metrics not considered in this chapter.

10.2 Preliminaries

Setup. We are given a dataset, $D \in \mathcal{D}^{n\times d}$, i.e., a dataset consisting of $n$ records in $d$ dimensions with each attribute in the data domain $\mathcal{D}$, where $n$ and $d$ are integers greater than or equal to 1. Unless otherwise stated, we assume $\mathcal{D} = [0,1]$ so that $D \in [0,1]^{n\times d}$ (attributes can be scaled to $[0,1]$ if they fall outside the range). We use $D_i$ to refer to the $i$-th record of the dataset (which is a $d$-dimensional vector) and $D_{i,j}$ to refer to the $j$-th element of $D_i$. If $d = 1$ (i.e., $D$ is 1-dimensional), then $D_i$ is the $i$-th element of $D$ (and is not a vector). We study the following database operations.

Indexing. The goal is to find an item in a sorted array.
Formally, consider a 1-dimensional sorted dataset $D$ (i.e., a sorted 1-dimensional array). Given a query $q \in [0,1]$, return the index $i^* = \sum_{i=1}^{n}\mathbb{I}_{D_i\le q}$, where $\mathbb{I}$ is the indicator function. $i^*$ is the index of the largest element no greater than $q$, and is 0 if no such element exists. Furthermore, if $q \in D$, $q$ will be at index $i^*+1$. $i^*$ is referred to as the rank of $q$. Define the rank function of the dataset $D$ as $r_D(q) = \sum_{i=1}^{n}\mathbb{I}_{D_i\le q}$, which takes a query as an input and outputs its rank. We let $Q_r = [0,1]$ be the domain of the rank function.

Cardinality Estimation. Used mainly for query optimization, the goal is to find how many records in the dataset match a range query, where the query specifies lower and upper bound conditions on the values of each attribute. Formally, consider a $d$-dimensional dataset. A query predicate $q = (c_1, ..., c_d, r_1, ..., r_d)$ specifies the condition that the $i$-th attribute is in the interval $[c_i, c_i+r_i]$. Define $\mathbb{I}_{p,q}$ as an indicator function equal to one if a $d$-dimensional point $p = (p_1, ..., p_d)$ matches a query predicate $q = (c_1, ..., c_d, r_1, ..., r_d)$, that is, if $c_j \le p_j \le c_j+r_j$ for all $j \in [d]$ (where $[k]$ is defined as $[k] = \{1, ..., k\}$ for integers $k$). Then, the answer to a cardinality estimation query is the number of points in $D$ that match the query $q$, i.e., $c_D(q) = \sum_{i\in[n]}\mathbb{I}_{D_i,q}$. We refer to $c_D$ as the cardinality function of the dataset $D$, which takes a query as an input and outputs the cardinality of the query. We define $Q_c = \{(c_1,...,c_d,r_1,...,r_d): r_j \in [0,1],\ c_j \in [-r_j, 1-r_j],\ j \in [d]\}$, where the definition ensures $c_j+r_j \in [0,1]$ to avoid asking queries outside of the data domain.

Range-Sum Estimation. The goal is to calculate the aggregate value of an attribute for the records that match a query. Formally, consider a $(d+1)$-dimensional dataset $D$ and a query $q = (c_1, ..., c_d, r_1, ..., r_d)$, where $q$, similar to the case of cardinality estimation, defines lower and upper bounds on the data points. The goal is to return the total value of the $(d+1)$-th attribute of the points in $D$ that match the query $q$, i.e., $s_D(q) = \sum_{i\in[n]}\mathbb{I}_{D_i,q}D_{i,d+1}$. Here, for simplicity, we overload the notation and define $\mathbb{I}_{p,q}$, when the dimensionality of the query and the point do not match, by the condition $c_j \le p_j \le c_j+r_j$ for all $j \in [\min\{d, d'\}]$, where $d$ is the dimensionality of the point $p$ and $d'$ is the dimensionality of the predicate $q$. $s_D$ is called the range-sum function of the dataset $D$, which takes a query as an input and outputs the range-sum of the query. We define the range-sum function domain $Q_s$ to be the same as $Q_c$.

We use the term query function to collectively refer to the rank, cardinality and range-sum functions, and use the notation $f_D \in \{r_D, s_D, c_D\}$ to refer to all three functions (for instance, $f_D \ge 0$ is equivalent to the three independent statements $r_D \ge 0$, $c_D \ge 0$ and $s_D \ge 0$). We drop the dependence on $D$ if it is clear from context and simply use $f(q)$. We also use $Q_f$ to refer to $Q_r$, $Q_c$ and $Q_s$ for $f \in \{r, c, s\}$.

For cardinality and range-sum estimation, often only an estimate of the query result is needed, because many applications (e.g., query optimization and data analytics) prefer a fast estimate over a slow but exact answer. For indexing, although exact answers are needed to locate an element in an array, one can do so through approximation: first, an estimate of the rank function is obtained, and then a local search of the array around the provided estimate (e.g., using exponential or binary search) leads to the exact result.
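As an illustration of the definitions above, the following is a small sketch (in Python with NumPy; the toy dataset and the query layout `(c, r)` mirroring $q = (c_1,...,c_d,r_1,...,r_d)$ are assumptions for the example):

```python
import numpy as np

def rank(D, q):
    """r_D(q): number of elements of the sorted 1-d array D that are <= q."""
    return int(np.sum(D <= q))

def cardinality(D, c, r):
    """c_D(q): number of rows of D with c[j] <= D[i, j] <= c[j] + r[j] for all j."""
    match = np.all((D >= c) & (D <= c + r), axis=1)
    return int(np.sum(match))

def range_sum(D, c, r):
    """s_D(q): sum of the (d+1)-th attribute over rows whose first d attributes match q."""
    match = np.all((D[:, :-1] >= c) & (D[:, :-1] <= c + r), axis=1)
    return float(np.sum(D[match, -1]))
```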
Thus, in all cases, approximating the query function with a desired accuracy is the main component in answering the query, and is the focus of the rest of this chapter.

Learned Database Operations. Learned database operations use machine learning to approximate the database operations as follows. First, during training, a function approximator, $\hat{f}(\cdot;\theta)$, is learned to approximate the function $f$, for $f \in \{r, s, c\}$. This is typically done through supervised learning (although unsupervised approaches are also possible), where, for different queries sampled from $Q_f$, the operations are performed on the database to find the ground-truth answers, and the models are optimized through a mean squared loss. Subsequently, at test time, for a test query $q$, $\hat{f}(q;\theta)$ is used as an estimate of the query answer, obtained by performing a forward pass of the model. In practice, the models used have far fewer parameters than the data size, so that the models do not memorize the data but rather utilize patterns in query answers to perform the operations, leading to the practical gains in query answering.

This procedure can be formally specified as follows (both for supervised and unsupervised approaches). First, a function $\rho(D)$ takes the dataset as an input and generates model parameters $\theta$ (e.g., through training with gradient descent). Then, to answer a query $q$ at test time, the function $\hat{f}(q;\theta)$ takes both the model parameters and the query as input and provides the final query answer (i.e., $\hat{f}$ specifies the model forward pass). From this perspective, the model parameters $\theta$ are a representation of the dataset $D$, and the function $\hat{f}$ only uses this representation to answer queries, without accessing the data itself. We call $\rho$ the training function and $\hat{f}$ the inference function.

The Model Size Problem. An important question is how large the model needs to be to achieve error at most $\epsilon$ on datasets of size $n$. We quantify the model size in terms of the number of bits needed to store the parameters $\theta$. The required model size is formalized as below.

Definition 4 (Required Model Size). Let $f \in \{r, c, s\}$ and consider an error norm $\|.\|$ for functions approximating $f$, and a data domain $\mathcal{D}$. Let $\mathcal{F}_\sigma$ be the set of all possible training and inference function pairs, where the training function generates a parameter set of size at most $\sigma$ bits. Let $\Sigma$ be the smallest $\sigma$ such that there exists $(\rho, \hat{f}) \in \mathcal{F}_\sigma$ with $\|\hat{f}(\cdot;\rho(D)) - f_D\| \le \epsilon$ for all $D \in \mathcal{D}^{n\times d}$. We call $\Sigma$ the required model size to achieve $\|.\|$-error of at most $\epsilon$ in the worst case across all $d$-dimensional datasets of size $n$.

$\Sigma$ is the size of the parameter set passed from the training function to the inference function. Thus, in the above formulation, the training/inference functions can be arbitrarily complex. The goal of this chapter is to present lower bounds on $\Sigma$ in terms of $n$ and $\epsilon$, and depending on the error norm $\|.\|$, which is an important factor impacting the lower bounds. One expects that larger models are needed if the worst-case error over all queries is considered, compared with the average error. Specifically, for $f \in \{r, s, c\}$, we consider the 1-norm error of approximating $f$ with $\hat{f}$ as $\|f-\hat{f}\|_1 = \int_{Q_f}|f-\hat{f}|$, the $\infty$-norm error as $\|f-\hat{f}\|_\infty = \sup_{q\in Q_f}|f(q)-\hat{f}(q)|$, and the $\mu$-norm error as $\|f-\hat{f}\|_\mu = \int_{Q_f}|f-\hat{f}|\,d\mu$, where $\mu$ is a probability measure over $Q_f$.
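As a concrete, if simplistic, instance of this setup, the sketch below plays the role of a training function $\rho$ (fitting a linear model to the rank function) and an inference function $\hat{f}$, and then estimates the resulting $\infty$-norm and 1-norm errors on a dense grid over $Q_r = [0,1]$; the model class, grid resolution and use of `np.polyfit` are illustrative choices, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.sort(rng.uniform(size=1000))           # 1-d dataset, n = 1000

def r_D(q):                                    # true rank function r_D
    return np.searchsorted(D, q, side="right")

def rho(data):                                 # training function: produce theta
    qs = np.linspace(0.0, 1.0, 512)            # training queries sampled from Q_r
    labels = np.searchsorted(data, qs, side="right")
    return np.polyfit(qs, labels, deg=1)       # theta = (slope, intercept)

def f_hat(q, theta):                           # inference function: forward pass
    return np.polyval(theta, q)

theta = rho(D)
qs = np.linspace(0.0, 1.0, 100_001)
errs = np.abs(f_hat(qs, theta) - r_D(qs))
print("inf-norm error ~", errs.max())          # sup over queries
print("1-norm error  ~", errs.mean())          # mean over [0,1] approximates the integral
```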
The $\infty$-norm is also called the worst-case error, the $\mu$-norm error is referred to as the average error with arbitrarily distributed queries, and the 1-norm is referred to as the average error with uniformly distributed queries (note that the volume of the query space is 1 for all the function domains, so that the 1-norm indeed corresponds to the uniform distribution).

10.3 Lower Bounds on Model Size for Database Operations

We present lower bounds on the required model size to be able to provide worst-case and average-case error guarantees. In all cases, our results provide lower bounds for achieving error $\epsilon$ on all datasets. In other words, we show that if the model size is smaller than a specific threshold, then there exists a dataset such that the error is larger than $\epsilon$. As such, our bounds consider the worst-case error across datasets, while considering either the average or the worst-case error across queries. We first present our results considering the worst-case error in Sec. 10.3.1, then present results considering the average-case error in Secs. 10.3.2 and 10.3.3 for uniform and arbitrary query distributions, respectively. Proofs of our results are presented in Sec. 10.8.

10.3.1 Bounds Considering Worst-Case Error

We present our results when considering the worst-case error, or $\infty$-norm, in approximation. First, for the purpose of the following theorem, suppose the datasets are discretized at the unit $\frac{1}{u}$, that is, datasets are from the set $\mathcal{D}_u = \{\frac{i}{u},\ i \in \{0, 1, ..., u\}\}^{n\times d}$ (this reduces the data domain from the set of real numbers in $[0,1]$ to multiples of $\frac{1}{u}$ in $[0,1]$). Define $\Sigma_f^\infty$, for $f \in \{r, c, s\}$, as the required model size to answer queries with $\infty$-norm error at most $\epsilon$ for all datasets in $\mathcal{D}_u$. For instance, $\Sigma_r^\infty$ is the smallest possible model size to be able to approximate the rank function with $\infty$-norm error at most $\epsilon$ on all possible datasets from the data domain $\mathcal{D}_u$.

Theorem 20. For any error $1 \le \epsilon < \frac{n}{2}$: (i) for the case of learned indexing, we have $\Sigma_r^\infty \ge \frac{n}{2\epsilon+1}\log(1+\frac{(2\epsilon+1)u}{n})$; (ii) for the case of learned cardinality estimation, we have $\Sigma_c^\infty \ge \frac{n}{2\epsilon+1}\log(1+\frac{u^d(2\epsilon+1)}{n})$; and (iii) for the case of learned range-sum estimation, we have $\Sigma_s^\infty \ge \frac{n}{2\epsilon+1}\log(1+\frac{u^d(2\epsilon+1)}{n})$.

The theorem provides lower bounds on the required model size to be able to perform database operations with a desired accuracy. For instance, for the case of learned indexing, the theorem states that the model size must be larger than $\frac{n}{2\epsilon+1}\log(1+\frac{(2\epsilon+1)u}{n})$ to be able to guarantee $\infty$-norm error $\epsilon$ on all datasets, or, alternatively, that if the model size is less than $\frac{n}{2\epsilon+1}\log(1+\frac{(2\epsilon+1)u}{n})$, then for any model approximating the rank function there exists a database where the model's $\infty$-norm error is more than $\epsilon$. We see that the required model size is close to linearly dependent on data size and dimensionality, while inversely correlated with the tolerable error parameter. Besides the dependence on data size, dimensionality and error, the bound shows a dependence on $u$, the domain size. An interesting question, then, is whether similar bounds on model size hold if the data domain is not finite. The next lemma shows that the answer is no.

Lemma 19. When $\epsilon < \frac{n}{2}$ and the data domain is $\mathcal{D} = [0,1]$, for any finite size $\sigma$ and any training/inference function pair $(\rho, \hat{f}) \in \mathcal{F}_\sigma$, there exists a dataset $D \in [0,1]^n$ such that $\|\hat{f}(\cdot;\rho(D)) - r_D\|_\infty > \epsilon$.

We remark that Lemma 19 may not be surprising.
Storing real numbers requires infinitely many bits, and, although we are interested in the size of the model required to answer queries (and not the space required to store the data), one might expect the model size to be similar to the space required to store the data. Lemma 19 shows this to be true in the case of worst-case error. However, perhaps more surprisingly, the remainder of our results in the next sections show this is not true when considering the average-case error. As such, we consider the case where $D \in [0,1]^{n\times d}$ for the remainder of this chapter.

10.3.2 Bounds Considering Average-Case Error with Uniform Distribution

In this section, our results consider the average-case error assuming uniformly distributed queries, or the 1-norm approximation error. The average-case error corresponds to the expected performance of the system, another important measure needed for real-world deployments. A bound on the 1-norm error across all possible databases provides a performance guarantee for all possible databases. Our results in this section show how large the model needs to be to provide such guarantees.

10.3.2.1 Learned Indexing

We first present our result showing a lower bound on the required model size for learned indexing.

Theorem 21. Let $\Sigma_r^1$ be the required model size to achieve 1-norm error of at most $\epsilon$ on datasets of size $n$ when approximating the rank function. (i) For any $0 < \epsilon \le \frac{\sqrt{n}}{2}$, we have $\Sigma_r^1 \ge (\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon}-\frac{1}{\sqrt{n}})$. (ii) For any $0 < \epsilon \le n$, we have $\Sigma_r^1 \le n\log_2(e+\frac{e}{\epsilon}+\frac{e}{n})$.

Part (i) of the theorem states that if a model whose size is less than $(\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon}-\frac{1}{\sqrt{n}})$ bits is used to approximate the rank function, then there exists a dataset of size $n$ for which the model results in error larger than $\epsilon$. As expected, the required model size increases both as data size increases and as the error threshold decreases. Furthermore, for a constant error $\epsilon$, the result shows that the required model size is $\Omega(\sqrt{n})$, providing the first known result showing that the growth of model size with data size has to be at least on the order of $\sqrt{n}$. Part (ii) of the theorem shows that the asymptotic dependence on $n$ in the lower bound is tight up to a $\sqrt{n}$ factor, and that to achieve a desired accuracy, model size does not need to increase more than linearly in data size. Overall, Theorem 21 shows that for a constant error, $\Sigma_r^1$ is $\Omega(\sqrt{n})$ and $O(n)$. The proof of part (ii) constructs a model that achieves the bound. The model can be seen as a nearest neighbor encoder, modeling a dataset based on its nearest neighbor in a constructed set of datasets.

Observe that this result, and the rest of our results considering average-case error, do not depend on the domain size (as Theorem 20 did). Thus, a fundamental difference between answering queries accurately in the worst case and in the average case is that in the first scenario the lower bounds depend on the discretization unit, while in the second scenario the model size does not (i.e., Theorems 21–24). Furthermore, our results show that the lower bound in the case of worst-case error has a stronger dependence on the tolerable error parameter compared with when the average error is considered.
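To get a feel for the gap between the two parts of Theorem 21, one can evaluate the bounds numerically; the sketch below (an illustrative calculation, with $\epsilon$ fixed to 1) prints the lower and upper bounds in bits for a few data sizes:

```python
import math

def lower_bits(n, eps):   # Theorem 21 (i): (sqrt(n)-2) * log2(1 + 1/(2 eps) - 1/sqrt(n))
    return (math.sqrt(n) - 2) * math.log2(1 + 1 / (2 * eps) - 1 / math.sqrt(n))

def upper_bits(n, eps):   # Theorem 21 (ii): n * log2(e + e/eps + e/n)
    return n * math.log2(math.e + math.e / eps + math.e / n)

for n in (10**4, 10**6, 10**8):
    print(n, round(lower_bits(n, 1.0)), round(upper_bits(n, 1.0)))
```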
10.3.2.2 Learned Cardinality Estimation

Next, we present an analog of Theorem 21 for the case of cardinality estimation.

Theorem 22. Let $\Sigma_c^1$ be the required model size to achieve 1-norm error at most $\epsilon$ on $d$-dimensional datasets of size $n$ when approximating the cardinality function. (i) For any $0 < \epsilon \le \frac{\sqrt{n}}{4^d}$, we have $\Sigma_c^1 \ge (\sqrt{n}-2)\log_2(1+\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon^d}-\frac{1}{\sqrt{n}})$. (ii) For any $0 < \epsilon \le n$, we have $\Sigma_c^1 \le n\log_2(e(\frac{2(d+1)}{\epsilon})^d n^{d-1}+e-\frac{e}{n})$.

The above theorem shows that, in the case of cardinality estimation, bounds of a similar form to the case of indexing hold; however, the bounds now also depend on data dimensionality. We see that, asymptotically in data size, $\Sigma_c^1$ is $\Omega(d\sqrt{n}\log(\frac{\sqrt{n}}{4^d\epsilon}))$ and $O(dn\log\frac{2dn}{\epsilon})$, where we see a close to linear required dependency on dimensionality, while there is also an additional logarithmic dependency on $n$ compared with the case of indexing.

10.3.2.3 Range-Sum Estimation

Finally, we extend our results to range-sum estimation. For the discussion in this section, let $\Sigma_s^1$ be the required model size to achieve 1-norm error at most $\epsilon$ on $(d+1)$-dimensional datasets of size $n$ when approximating the range-sum function. Recall that we consider $(d+1)$-dimensional datasets here, where query predicates apply to the first $d$ dimensions and the query answers are the aggregation of the $(d+1)$-th dimension, as defined in Sec. 10.2.

To prove a lower bound on $\Sigma_s^1$, observe that range-sum estimation can be seen as a generalization of cardinality estimation. Specifically, answering range-sum queries on a $(d+1)$-dimensional dataset, where the $(d+1)$-th attribute of all the records is set to 1, is equivalent to answering cardinality estimation queries on the $d$-dimensional dataset consisting only of the first $d$ dimensions of the original dataset. Thus, if a model with size less than $\Sigma_s^1$ is able to answer range-sum queries on all datasets with error at most $\epsilon$, then it can also answer cardinality estimation queries with error at most $\epsilon$. This means the lower bound on model size from Theorem 22 (i) translates to range-sum estimation as well. Thus, we have the following result as a corollary to Theorem 22.

Corollary 23. For any $0 < \epsilon \le \frac{\sqrt{n}}{4^d}$, we have $\Sigma_s^1 \ge (\sqrt{n}-2)\log_2(1+\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon^d}-\frac{1}{\sqrt{n}})$.

Next, we show that an upper bound very similar to Theorem 22 (ii) on the required model size also holds for range-sum estimation.

Theorem 24. For any $0 < \epsilon \le n$, we have $\Sigma_s^1 \le n\log_2(e(\frac{2(d+2)}{\epsilon})^{d+1}n^d+e-\frac{e}{n})$.

Observe that the upper bound on $\Sigma_s^1$ is similar to that on $\Sigma_c^1$, but slightly larger, showing a stronger dependence on dimensionality in the case of $\Sigma_s^1$. This reflects the discussion above, that range-sum estimation is a generalization of cardinality estimation. Indeed, the proof of Theorem 24 is a generalization of the proof of Theorem 22 (ii).

10.3.3 Bounds Considering Average-Case Error with Arbitrary Distribution

Next, we discuss extending the results of Sec. 10.3.2 to an arbitrary query distribution. The following theorem shows that this generalization does not impact the bounds in the case of indexing, i.e., the theorem below shows that the same bounds as in Theorem 21 also hold when considering the $\mu$-norm.

Theorem 25. Let $\Sigma_r^\mu$ be the required model size to achieve $\mu$-norm error at most $\epsilon$ on datasets of size $n$ when approximating the rank function, for any continuous probability measure $\mu$ over $[0,1]$. (i) For any $0 < \epsilon \le \frac{\sqrt{n}}{2}$, we have $\Sigma_r^\mu \ge (\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon}-\frac{1}{\sqrt{n}})$. (ii) For any $0 < \epsilon \le n$, we have $\Sigma_r^\mu \le n\log_2(e+\frac{e}{\epsilon}+\frac{e}{n})$.

However, the next lemma shows that the lower bounds do not hold for arbitrary distributions in the case of cardinality and range-sum estimation.

Lemma 20.
For $f \in \{c, s\}$, there exists a query distribution, $\mu$, such that for any error parameter $\epsilon > 0$, we have $\|f_D - f_{D'}\|_\mu \le \epsilon$ for all $D, D' \in [0,1]^{n\times d}$.

The above lemma shows that one can construct a distribution for which the $\mu$-norm difference between all datasets is arbitrarily small. As a result, one can answer queries independently of the observed dataset, and therefore the required model size for achieving any error is 0. The proof of Lemma 20 creates a query distribution consisting only of queries with small ranges, so that the answer to most queries is zero or close to zero for any dataset. Thus, comparing Lemma 20 with Theorem 25, we see that queries having both a lower and an upper bound on the attributes leads to a different theoretical characteristic for cardinality and range-sum estimation compared with indexing.

10.4 Empirical Results

We present experiments comparing our bounds with the error obtained by training different models on datasets sampled from different distributions. We train a linear model and two neural networks with a single hidden layer, where the two neural networks have 10 and 50 model parameters, respectively referred to as NN-S1 and NN-S2. Small neural networks and linear models are common modeling choices for learned database operations [62, 40, 61, 165]. We also present results for using a random sample as a non-learned baseline, referred to as Sample. For direct comparison, the number of samples is set so that Sample takes the same space as the linear model. We consider 1-dimensional datasets sampled from uniform and 2-component Gaussian mixture model (GMM) distributions.

Recall that our theoretical results provide bounds of the form $\Sigma > g(n, \epsilon, d)$, for some function $g$ specified in our theorems. Given a model size, $\sigma$, data size, $n$, and dimensionality, $d$, define $\epsilon^*$ as the largest $\epsilon$ such that $\sigma \le g(n, \epsilon, d)$ holds. For model size $\sigma$, this implies that for any model, there exists a $d$-dimensional dataset of size $n$ where the error of the model is at least $\epsilon^*$. Thus, an interpretation of our theoretical results is that, given a model size, $\sigma$, data size, $n$, and dimensionality, $d$, our results provide a lower bound on the worst-case error across all $d$-dimensional datasets of size $n$ for any model of size $\sigma$, where this lower bound is equal to $\epsilon^*$ as defined above. Our experiments present results following this view of our theoretical bounds.

Our experimental results are presented in Fig. 10.1.

[Figure 10.1: Theoretical Bounds in Practice. Panels: (a) Indexing (Worst-Case), (b) Cardinality Estimation (Worst-Case), (c) Indexing (Average-Case), (d) Cardinality Estimation (Average-Case). Each panel plots maximum or average error against data size ($10^3$ to $10^8$) for Linear, NN-S1, NN-S2 and Sample on uniform and GMM distributions, together with our lower bound.]

The color of the lines/points in the figure corresponds to the specific models, and the points in the figures show either the maximum or the average error across queries, observed after training the specific models on datasets sampled from either GMM or uniform distributions. The solid lines in the figures plot the value of $\epsilon^*$, i.e., the lower bound on the worst-case error across datasets, for different model and data sizes. The results are presented for indexing and cardinality estimation, under 1-norm and $\infty$-norm errors.
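The value of $\epsilon^*$ defined above can be computed numerically by inverting the bound. Below is a small sketch for the worst-case indexing bound of Theorem 20; the bisection routine and the chosen domain size `u` are illustrative assumptions (the bound $g$ is decreasing in $\epsilon$, which makes the bisection valid).

```python
import math

def g(n, eps, u):
    """Worst-case indexing bound of Theorem 20: required bits for error eps."""
    return n / (2 * eps + 1) * math.log2(1 + (2 * eps + 1) * u / n)

def eps_star(sigma, n, u, tol=1e-6):
    """Largest eps in [1, n/2) with sigma <= g(n, eps, u)."""
    lo, hi = 1.0, n / 2
    if g(n, hi, u) >= sigma:      # even the largest error needs > sigma bits
        return hi
    if g(n, lo, u) < sigma:       # sigma bits already suffice at eps = 1
        return 0.0
    while hi - lo > tol:          # invariant: g(lo) >= sigma > g(hi)
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if g(n, mid, u) >= sigma else (lo, mid)
    return lo

# e.g., a 1024-bit model on n = 10**6 records with domain size u = 2**32:
print(eps_star(1024.0, 10**6, 2.0**32))
```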
Our bounds on range-sum estimation are similar to those for cardinality estimation, and are thus omitted. Note that the error is often (much) larger than 1, since the error is based on the absolute difference between model predictions and true query answers. The true query answers can be large, and their values scale with data size, so that the absolute error of the models is also large and increases with data size. We perform no normalization of the error values, to allow direct comparison with our theoretical results.

First, consider our results on the worst-case error, shown in Figs. 10.1 (a) and (b). As expected, the observed error of the different trained models increases with data size, with the models achieving lower error on the uniform distribution compared with a GMM. Furthermore, our theoretical bounds on the model size lie close to the error of the models on GMMs, showing that the bounds are meaningful. In fact, in the case of the linear model, all the observed errors lie below the theoretical lower bound. This implies that, based on our theoretical result, there exists some dataset (on which the models haven't been evaluated in this experiment) whose error lies on or above the theoretical lower bound. This shows a practical benefit of our bound: one can obtain a bound on the error of the model on all possible databases, beyond the datasets on which the model has been empirically evaluated.

Next, consider our results on the average-case error, shown in Figs. 10.1 (c) and (d). Compared with the worst-case scenario, we see that the results show a large gap between the error of the models and our theoretical bounds, especially so for larger model sizes. Our tightness results in Sec. 10.3.2, not plotted here, theoretically quantify how large this gap can be. We also note that, for both worst-case and average-case scenarios, the gap between the observed error and our lower bounds for large models does not necessarily imply that our bounds are looser for larger model sizes. Such a gap can also be due to the models used in practice being wasteful of their storage space at larger model sizes. Our results support the latter hypothesis, since we observe marginal improvement in accuracy as model size increases across both distributions. This is also supported by observations that sparse neural networks can achieve similar accuracy as non-sparse models while using much less space (e.g., [42]), hinting at the suboptimality of fully connected networks.

Finally, Fig. 10.1 shows that sampling performs worse than learned models for the uniform distribution, while it performs similarly for GMMs (Sample should be compared with Linear, as they both have the same size). The latter can be because a linear model is not a good modeling choice for GMMs. Nonetheless, our theoretical bounds suggest (see Sec. 10.5 for a theoretical comparison) that the gap between learned models and sampling will grow as dimensionality increases ($d = 1$ in our experiments).

10.5 Related Work

A large and growing body of work has focused on using machine learning to speed up database operations, among them learned indexing [44, 62, 40, 33], learned cardinality estimation [61, 142, 55, 149, 148, 72, 88] and learned range-sum estimation [165, 51, 75]. Most existing work focuses on improving modeling choices, with various options such as neural networks [165, 61, 62], piece-wise linear approximation [40], sum-product networks [51] and density estimators [75].
In all existing work, modeling choices are based on empirical observations and hyperparameter tuning, with no theoretical understanding of how the model size should be set for a dataset to achieve a desired accuracy. On the theory side, existing results in learned database theory show the existence of models that achieve a desired accuracy [165, 163, 39], providing bounds on the performance of specifically constructed models performing database operations. Among them, [163] shows that a learned model can perform indexing in $O(\log\log n)$ expected query time (i.e., better than the traditional $O(\log n)$). Our results complement such work, showing a lower bound on the required size for any modeling approach to perform the database operations. We note that the bounds in [165, 163, 39] hold either in expectation or with a certain probability, while our bounds are non-probabilistic and consider the worst case across all datasets. Orthogonal to our work, [55, 6] study the number of training samples needed to achieve a desired accuracy for different database operations.

More broadly, non-learned data models, such as samples, histograms and sketches (see, e.g., [27, 29, 122]), are also used to estimate query answers. We discuss existing lower bounds, which are the focus of this chapter, and refer the reader to [27, 29] for a complete treatment of non-learned methods. One approach is using $\epsilon$-approximations, where a subset of the dataset is selected (through random sampling or deterministically) and used for query answering. [140, 78] show that, for cardinality estimation, the size of such a subset has to be at least $\Omega(\frac{n}{\epsilon}\log^{d-1}(\frac{n}{\epsilon}))$ to answer queries with worst-case error at most $\epsilon$. Note that, using the fact that the VC dimension of orthogonal range queries is $2d$ [121], random sampling uses $O((\frac{n}{\epsilon})^2(d+\log\frac{1}{\delta}))$ samples to provide error at most $\epsilon$ with probability $1-\delta$ [84], while [103] provides a deterministic method using a subset of size $O(\frac{n}{\epsilon}\log^{2d}(\frac{n}{\epsilon})\,\mathrm{polylog}(\log(\frac{n}{\epsilon})))$. Comparing these results to our bound in Theorem 20, we observe that $\epsilon$-approximations can be much less space-efficient compared with other modeling choices in high dimensions. This is because $\epsilon$-approximations correspond to a restricted class of models, where the model of the data is simply a subset of the data. Relaxing this restriction, [140] considers a special case of our Theorem 20 (ii), providing a lower bound of $\frac{n}{\epsilon}(\log_2(\frac{n}{\epsilon})+\log(n))$ on the required size when answering cardinality estimation queries in two dimensions and when $u = n$ (recall that $u$ is the discretization factor in Sec. 10.3.1). That lower bound is tighter than our bound in Theorem 20 (ii) but, unlike our bound, which applies to arbitrary dimensionality and granularity, is only applicable to the specialized case of $d = 2$ and $u = n$. In this setting, [140] also presents a non-learned data structure that matches the lower bound. Other lower bounds have been presented in the streaming setting [29, 128], which is orthogonal to our work, as in our setting one has access to the entire dataset during model training.

Finally, although model parameters can be seen as a compressed data representation, bounds on data compression [30, 34] do not apply to our setting, as we are not interested in the error of reconstructing the data, but in the error of answering queries using this representation. Nonetheless, we use information theoretic tools (e.g., packing and metric entropy [136]) also used in studies of other representation systems [34, 102].
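To make the space comparison in the preceding discussion concrete, one can evaluate the two requirements side by side; the constants in the sketch below are illustrative assumptions, since the $\Omega/O$ notation hides them, and the record width of 64 bits per coordinate is likewise assumed.

```python
import math

def eps_approx_bits(n, eps, d, coord_bits=64):
    """Omega((n/eps) log^(d-1)(n/eps)) subset entries, each a d-dim record."""
    m = (n / eps) * math.log(n / eps) ** (d - 1)
    return m * d * coord_bits

def learned_bits(n, eps, d, u=2**32):
    """Theorem 20 lower bound for cardinality estimation, in bits."""
    return n / (2 * eps + 1) * math.log2(1 + (2 * eps + 1) * u**d / n)

n, eps = 10**6, 10.0
for d in (1, 2, 4, 8):   # the epsilon-approximation size blows up with d
    print(d, f"{eps_approx_bits(n, eps, d):.3g}", f"{learned_bits(n, eps, d):.3g}")
```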
10.6 Discussion

Bounding $\log_2$ error. Recall that our results consider the absolute error of prediction. In the case of indexing, one is often interested in $\log_2$ of the error, since that is the runtime of the binary search performed to find the true element after obtaining an estimate from the model. Our worst-case absolute error bound directly translates to a worst-case $\log_2$ error bound (because $\log$ is an increasing function). That is, there exists a dataset such that the worst-case absolute error is $\epsilon$ if and only if there exists a dataset such that the worst-case $\log_2$ error is $\log_2\epsilon$. Thus, to ensure the $\log_2$ error is at most an error parameter $\tau$, we can directly set $\epsilon = 2^\tau$ in the bound presented in Theorem 20 to obtain the bound on the required model size. Regarding the average-case error, the situation is slightly more complex. By Jensen's inequality, the average $\log_2$ error can be smaller than $\log_2$ of the average error. This implies that, to obtain an average $\log_2$ error of $\tau$, the required model size may indeed be smaller than the bounds in Corollary 23 and part (i) of Theorems 21, 22 and 25 suggest if we set $\epsilon = 2^\tau$. Nonetheless, our upper bounds on the required model size (i.e., Theorem 24, Lemma 20 and part (ii) of Theorems 21, 22 and 25) still apply by setting $\epsilon = 2^\tau$.

Cardinality Estimation for Joins. Cardinality estimation is often used to estimate the cardinality of joins, which is important for query optimization. Our bounds are lower bounds for cardinality estimation on a single table. A naive extension of our bounds to cardinality estimation for joins is to apply the bound to the join of the tables: if two tables have respectively $d_1$ and $d_2$ dimensions and their join consists of $n_J$ elements, then we can apply our bounds in Table 10.1 with $d = d_1+d_2$ and $n = n_J$ to obtain a lower bound on the required model size for estimating the cardinality of the join (a small sketch follows below). However, we expect such an approach to overestimate the required model size, as it does not consider the join relationship between the two tables. For instance, $n_J \times d$ may be much larger than $n_1\times d_1+n_2\times d_2$, because of duplicate records created by the join operation. Considering the join relationship, one may be able to provide bounds that depend on the original table sizes and not the join size.
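The following sketch evaluates this naive join extension for illustrative table sizes and dimensionalities (all concrete numbers here are assumptions for the example):

```python
import math

def cardinality_lb_bits(n, d, eps, u):
    """Theorem 20 (ii) lower bound, applied with join-output n and d = d1 + d2."""
    return n / (2 * eps + 1) * math.log2(1 + (2 * eps + 1) * u**d / n)

n1, d1, n2, d2, n_join = 10**5, 4, 10**5, 3, 10**7
print(cardinality_lb_bits(n_join, d1 + d2, eps=10.0, u=2**16))
```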
Other Aggregations. Our results consider count and sum aggregations. Although we expect proof techniques similar to ours to apply to other aggregations such as min/max/avg, we note that studying worst-case bounds for min/max/avg may not be very informative. Intuitively, this is because, for such aggregations, one can create arbitrarily difficult queries that require the model to memorize all the data points. For instance, consider datasets where the $(d+1)$-th dimension, where the aggregation is applied, only takes values 0 or 1, while the other dimensions can take any values in a domain of size $u$. Now avg/min/max for any range query will have an answer between 0 and 1 (for min/max the answer will be exactly 0 or 1). However, unless the model memorizes all the points exactly (which requires a model size close to the data size), its worst-case error can be up to 0.5. This is because queries with a very small range can be constructed that match exactly one point in the database, and unless the model knows the exact value of all the points, it will not be able to provide a correct answer for all such queries. Note that an error of 0.5 is very large for avg/min/max queries, and a model that always predicts 0.5 also obtains worst-case error 0.5. This is not the case for sum/count aggregations, whose query answers range between 0 and $n$, and a model with an absolute error of 0.5 can be considered a very good model for most queries when answering sum/count queries. To summarize, worst-case errors of min/max/avg queries are disproportionately affected by small ranges, while small ranges have a limited impact on the worst-case error of count/sum queries. As such, we expect that for min/max/avg queries one needs to study the error for a special set of queries (e.g., queries with a lower bound on their range, as done in [165], which also makes similar observations) to be able to obtain meaningful bounds. We leave such a study to future work.

10.7 Conclusion

We presented the first known lower bounds on the required model size to achieve a desired accuracy when using machine learning to perform indexing, cardinality estimation and range-sum estimation. We studied the required model size when considering average-case and worst-case error scenarios, showing how the model size needs to change based on accuracy, data size and data dimensionality. Our results highlight differences in model size requirements when considering average-case and worst-case error scenarios and when performing different database operations. Our theoretical results provide necessary conditions for ensuring the reliability of learned models when performing database operations. Future work includes providing tighter bounds for average-case error, bounds based on data characteristics such as data distribution, and studying other database operations.

10.8 Proofs

10.8.1 Intuition

Our proofs study the characteristics of the space of query functions to show the number of bits needed to represent the elements of this space with a desired accuracy. For $f \in \{r, c, s\}$, let $F = \{f_D, D \in [0,1]^{n\times d}\}$ be the set of all possible query functions for $d$-dimensional datasets of size $n$. Then, for a norm $\|.\|$, $M = (F, \|.\|)$ is a metric space. A learned model, $\hat{f}(\cdot;\theta)$, represents the elements of $M$ with its parameters $\theta$. Therefore, $\theta$ needs to be large enough (contain enough bits) to be able to represent all the possible elements of $F$ with a desired accuracy. This in turn depends on how large $M$ is and how far its elements are from each other. Thus, our bounds follow from bounding suitable measures of the size of $M$, specifically the packing entropy and metric entropy of $M$. Bounding the packing entropy and metric entropy of $M$ is non-trivial, especially when considering the 1-norm error, and requires an in-depth analysis of the metric space, relating combinatorial and approximation theoretic concepts.

Intuitively, our proofs of the lower bounds on the required model size proceed as follows. If a model has size $\sigma$ bits, there are exactly $2^\sigma$ different possible model parameter settings. Thus, when there are more than $2^\sigma$ different possible datasets, multiple datasets must be mapped to the same model parameter values (that is, after training, multiple datasets will have the exact same model parameter values). To obtain a bound on the error, our proofs show that some datasets that are mapped to the same parameter values will be too different from each other (i.e., queries on the datasets have answers that are too different), so that the exact same parameter setting cannot lead to answering queries accurately on both datasets.
The proofs do this by constructing $2^\sigma+1$ different datasets such that query answers differ by more than $2\epsilon$ on all of the $2^\sigma+1$ datasets. Thus, two of those datasets must be mapped to the same model parameter values, and the model must have an error of more than $\epsilon$ on at least one of them, which completes the proof. The majority of each proof concerns how this set of $2^\sigma+1$ datasets is constructed. Specifically, the proofs construct a set of datasets where each pair differs in some set of points, $S$. The main technical challenge is showing that, for any two datasets that differ in a set of points $S$, the maximum or average difference between the query answers is at least $2\epsilon$. This last statement is query dependent, and our technical Lemmas 3–8 in the appendix are proven to quantify the difference between the query answers of the datasets based on the properties of the set $S$ and the query type. This is especially difficult in the average-case scenario, as one needs to study how the set $S$ affects all possible queries.

10.8.2 Background

Our proofs are based on the notions of metric and packing entropy, which we briefly describe here. Consider a metric space, $M = (F, \|.\|)$. Let $M(F, \|.\|, \epsilon)$ be the packing number of $M$ and let $N(F, \|.\|, \epsilon)$ be the covering number of $M$. The packing number is the maximum number of non-overlapping balls of radius $\epsilon$ that can be placed in $M$, and the covering number is the minimum number of balls of radius $\epsilon$ that can cover $M$ (see [136] for rigorous definitions). For $M = M(F, \|.\|, \epsilon)$, there exist $f_1, ..., f_M \in F$ s.t. $\|f_i-f_j\| > \epsilon$ for all $i \ne j$, and for $N = N(F, \|.\|, \epsilon)$, there exist $f_1, ..., f_N \in F$ s.t. for all $f \in F$, $\|f-f_i\| \le \epsilon$ for some $i \in [N]$. $\log_2 M(F, \|.\|, \epsilon)$ is called the packing entropy of $M$ and $\log_2 N(F, \|.\|, \epsilon)$ is called the metric entropy of $M$.

Our proofs are based on the following theorem, utilizing the metric and packing entropy of a metric space. Let $\rho_\sigma(f): F \to \{0,1\}^\sigma$ be a function that takes elements of the metric space as input and outputs a bit vector of size $\sigma$, and let $\hat{f}_\sigma: \{0,1\}^\sigma \to F'$ be a function that takes a bit vector of size $\sigma$ as input and outputs elements of a space $F'$, where $\|.\|$ is well-defined on $F' \cup F$.

Theorem 26. Consider a metric space, $M = (F, \|.\|)$. (i) For any $\rho_\sigma$ and $\hat{f}_\sigma$ such that $\|\hat{f}_\sigma(\rho_\sigma(f))-f\| \le \epsilon$ for all $f \in F$, we must have $\sigma \ge \log_2 M(F, \|.\|, 2\epsilon)$. (ii) There exist $\rho_\sigma$ and $\hat{f}_\sigma$ such that $\|\hat{f}_\sigma(\rho_\sigma(f))-f\| \le \epsilon$ for all $f \in F$ with $\sigma \le \lceil\log_2 N(F, \|.\|, \epsilon)\rceil$.

Proof of Part (i). Let $M = M(F, \|.\|, 2\epsilon)$. By definition, there exist $f_1, ..., f_M \in F$ such that $\|f_i-f_j\| > 2\epsilon$ for all $i \ne j$. Now assume, for the purpose of contradiction, that there exist $\rho_\sigma$ and $\hat{f}_\sigma$ with $2^\sigma < M$ s.t. $\|\hat{f}_\sigma(\rho_\sigma(f))-f\| \le \epsilon$ for all $f \in F$. Note that a bit vector of size $\sigma$ can take at most $2^\sigma$ different values. Since $2^\sigma < M$, $\rho_\sigma$ must create an identical output for at least two of $f_1, ..., f_M$. That is, there exist $f_i$ and $f_j$, $i, j \in [M]$, $i \ne j$, s.t. $\rho_\sigma(f_i) = \rho_\sigma(f_j)$. Therefore, $\hat{f}_\sigma(\rho_\sigma(f_i)) = \hat{f}_\sigma(\rho_\sigma(f_j))$. By assumption, the error of approximation is at most $\epsilon$ for both $f_i$ and $f_j$, i.e., $\|\hat{f}_\sigma(\rho_\sigma(f_i))-f_i\| \le \epsilon$ and $\|\hat{f}_\sigma(\rho_\sigma(f_j))-f_j\| \le \epsilon$. Then
$$\|f_j-f_i\| \le \|f_i-\hat{f}_\sigma(\rho_\sigma(f_i))\| + \|\hat{f}_\sigma(\rho_\sigma(f_i))-f_j\| = \|f_i-\hat{f}_\sigma(\rho_\sigma(f_i))\| + \|\hat{f}_\sigma(\rho_\sigma(f_j))-f_j\| \le 2\epsilon,$$
showing $\|f_j-f_i\| \le 2\epsilon$, which is a contradiction. Therefore, we must have $2^\sigma \ge M$, which implies $\sigma \ge \log_2 M(F, \|.\|, 2\epsilon)$, as desired.

Proof of Part (ii). Let $N = N(F, \|.\|, \epsilon)$. There exist $f_1, ..., f_N \in F$ s.t. for all $f \in F$, $\|f-f_i\| \le \epsilon$ for some $i \in [N]$. Then, construct $\rho_\sigma$ as follows.
For any $f \in F$, find $i$ s.t. $\|f-f_i\| \le \epsilon$, and let $\rho_\sigma(f)$ be the binary representation of $i$. Since $i \in [N]$, $\lceil\log_2 N\rceil$ bits are needed to represent $i$, so that $\sigma = \lceil\log_2 N\rceil$. Then, for a binary representation $b$, define $\hat{f}_\sigma(b)$ as a function that finds the integer $i$ with representation $b$ and returns $f_i$. Thus, for any $f \in F$, we have $\|\hat{f}_\sigma(\rho_\sigma(f))-f\| = \|f_i-f\| \le \epsilon$, which completes the proof.

10.8.3 Results with ∞-Norm

10.8.3.1 Proof of Theorem 20

For the purpose of this section, define $\bar{\epsilon} = \lfloor\epsilon\rfloor+1$ for the proof of all three parts.

Proof of Part (i). Let $F = \{r_D, D \in \mathcal{D}_u^n\}$ be the set of all possible rank functions for datasets of size $n$, and consider the metric space $M = (F, \|.\|_\infty)$. We prove a lower bound on $M(F, \|.\|_\infty, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $P_1, P_2 \subseteq \mathcal{D}_u^{\lfloor n/\bar{\epsilon}\rfloor}$, $P_1 \ne P_2$, that is, $P_1$ and $P_2$ are datasets of size $\lfloor\frac{n}{\bar{\epsilon}}\rfloor$ only containing points in $\mathcal{D}_u$. Let $D_1$ be the dataset of size $n$ where each point in $P_1$ is repeated $\bar{\epsilon}$ times (with the remaining $n-\lfloor\frac{n}{\bar{\epsilon}}\rfloor\times\bar{\epsilon}$ elements set equal to one), and similarly define $D_2$. We have that $\|r_{D_1}-r_{D_2}\|_\infty \ge \bar{\epsilon} > \epsilon$. Let $S$ be the set of all possible datasets generated using the above procedure. We have that $S$ is an $\epsilon$-packing of $M$, and thus $|S| \le M(F, \|.\|_\infty, \epsilon)$.

It remains to find $|S|$. Each element of $S$ is created by selecting $\lfloor\frac{n}{\bar{\epsilon}}\rfloor$ elements from the $u+1$ elements of $\mathcal{D}_u$ with repetition. The number of unique ways to perform this selection is
$$C\big(\lfloor\tfrac{n}{\bar{\epsilon}}\rfloor+u, \lfloor\tfrac{n}{\bar{\epsilon}}\rfloor\big) \ge \Big(\frac{\frac{n}{\bar{\epsilon}}+u}{\frac{n}{\bar{\epsilon}}}\Big)^{\frac{n}{\bar{\epsilon}}} \ge \Big(\frac{\frac{n}{\epsilon+1}+u}{\frac{n}{\epsilon+1}}\Big)^{\frac{n}{\epsilon+1}}.$$
Thus, $M(F, \|.\|_\infty, \epsilon) \ge (1+\frac{(\epsilon+1)u}{n})^{\frac{n}{\epsilon+1}}$. Combining this with Theorem 26, we have that any model answering queries with $\infty$-norm error $\epsilon$ must have size at least $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u}{n})$.

Proof of Part (ii). Let $F = \{c_D, D \in \mathcal{D}_u^{n\times d}\}$ be the set of all possible cardinality functions for datasets of size $n$, and consider the metric space $M = (F, \|.\|_\infty)$. We prove a lower bound on $M(F, \|.\|_\infty, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $P_1, P_2 \subseteq \mathcal{D}_u^{\lfloor n/\bar{\epsilon}\rfloor\times d}$, $P_1 \ne P_2$, that is, $P_1$ and $P_2$ are $d$-dimensional datasets of size $\lfloor\frac{n}{\bar{\epsilon}}\rfloor$ only containing values in $\mathcal{D}_u$. Let $D_1$ be the dataset of size $n$ where each point in $P_1$ is repeated $\bar{\epsilon}$ times (with the remaining $n-\lfloor\frac{n}{\bar{\epsilon}}\rfloor\times\bar{\epsilon}$ elements set equal to one), and similarly define $D_2$. First, we show that $\|c_{D_1}-c_{D_2}\|_\infty \ge \bar{\epsilon}$. Let $p$ be a point that appears more times in $D_1$ than in $D_2$, and consider a query $q = (c, r)$ with $c = p-\frac{1}{2u}$ and $r = \frac{1}{u}$. The only point in $P_1\cup P_2$ that matches $q$ is $p$. Since $p$ appears more times in $D_1$ than in $D_2$, and each appearance of the point is repeated $\bar{\epsilon}$ times by definition, we have $|c_{D_1}(q)-c_{D_2}(q)| \ge \bar{\epsilon}$. Thus, we have $\|c_{D_1}-c_{D_2}\|_\infty \ge \bar{\epsilon} > \epsilon$. Let $S$ be the set of all possible datasets generated using the above procedure. We have that $S$ is an $\epsilon$-packing of $M$, and thus $|S| \le M(F, \|.\|_\infty, \epsilon)$.

It remains to find $|S|$. Each element of $S$ is created by selecting $\lfloor\frac{n}{\bar{\epsilon}}\rfloor$ elements from the $(u+1)^d$ elements of $\mathcal{D}_u^d$ with repetition. The number of unique ways to perform this selection is at least
$$C\big(\lfloor\tfrac{n}{\bar{\epsilon}}\rfloor+u^d, \lfloor\tfrac{n}{\bar{\epsilon}}\rfloor\big) \ge \Big(\frac{\frac{n}{\bar{\epsilon}}+u^d}{\frac{n}{\bar{\epsilon}}}\Big)^{\frac{n}{\bar{\epsilon}}} \ge \Big(\frac{\frac{n}{\epsilon+1}+u^d}{\frac{n}{\epsilon+1}}\Big)^{\frac{n}{\epsilon+1}}.$$
Thus, $M(F, \|.\|_\infty, \epsilon) \ge (1+\frac{(\epsilon+1)u^d}{n})^{\frac{n}{\epsilon+1}}$. Combining this with Theorem 26, we have that any model answering queries with $\infty$-norm error $\epsilon$ must have size at least $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u^d}{n})$.

Proof of Part (iii). For the purpose of contradiction, assume there exists a training/inference function pair $(\rho, \hat{f})$ with size less than $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u^d}{n})$ such that for all datasets $D \in \mathcal{D}_u^{n\times(d+1)}$ we have $\|\hat{f}(\cdot;\rho(D))-s_D\|_\infty \le \epsilon$.
We use $(\rho, \hat{f})$ to construct a training/inference function pair $(\rho', \hat{f})$ that answers cardinality estimation queries for any dataset $D \in \mathcal{D}_u^{n\times d}$ with error at most $\epsilon$. Specifically, define $\rho'(D)$ as a function that takes $D \in \mathcal{D}_u^{n\times d}$ as an input, constructs $D' \in \mathcal{D}_u^{n\times(d+1)}$ with $D'[i,j] = D[i,j]$ for $j \in [d]$, $i \in [n]$, and $D'[i,d+1] = 1$, and returns $\rho(D')$. Here, $D'$ is a dataset with its first $d$ dimensions identical to $D$ but with its $(d+1)$-th dimension set to 1 for all data points. Then, we have $\|\hat{f}(\cdot;\rho'(D))-c_D\|_\infty \le \epsilon$ for all $D \in \mathcal{D}_u^{n\times d}$, because by construction $c_D = s_{D'}$, $\hat{f}(\cdot;\rho'(D)) = \hat{f}(\cdot;\rho(D'))$, and $\|\hat{f}(\cdot;\rho(D'))-s_{D'}\|_\infty \le \epsilon$ by assumption. However, $\|\hat{f}(\cdot;\rho'(D))-c_D\|_\infty \le \epsilon$ contradicts Part (ii), and thus we have proven that no training/inference function pair $(\rho, \hat{f})$ with size less than $\frac{n}{2\epsilon+1}\log_2(1+\frac{(2\epsilon+1)u^d}{n})$ exists such that for all datasets $D \in \mathcal{D}_u^{n\times(d+1)}$ we have $\|\hat{f}(\cdot;\rho(D))-s_D\|_\infty \le \epsilon$.

10.8.3.2 Proof of Lemma 19

For any $\rho_\sigma$, $\hat{f}_\sigma$ with finite $\sigma$, we construct a dataset $D$ such that $\|r_D-\hat{f}_\sigma(\rho_\sigma(D))\|_\infty > \frac{n}{2}$. Let $D_p$ be the dataset of size $n$ with the point $p$ repeated $n$ times. For any $k$, consider $\Delta_k = \{D_{\frac{i}{k}}, 0 \le i \le k\}$. Note that for any $D, D' \in \Delta_k$, $\|r_D-r_{D'}\|_\infty = n$. Now, for the purpose of contradiction, assume $\sigma$ bits are sufficient for answering queries with error $\epsilon$ across all datasets of size $n$, and consider any $\rho_\sigma$, $\hat{f}_\sigma$. Let $k = 2^\sigma$, so that $|\Delta_k| = 2^\sigma+1$. Thus, there exist $D, D' \in \Delta_k$ s.t. $\rho_\sigma(D) = \rho_\sigma(D')$, so that $\hat{f}_\sigma(\rho_\sigma(D)) = \hat{f}_\sigma(\rho_\sigma(D'))$. Now assume the error on either $D$ or $D'$ is less than $\frac{n}{2}$, for otherwise the proof is complete. Therefore, w.l.o.g., we have $\|\hat{f}_\sigma(\rho_\sigma(D))-r_D\|_\infty < \frac{n}{2}$. We have that
$$n = \|r_D-r_{D'}\|_\infty = \|(r_D-\hat{f}_\sigma(\rho_\sigma(D)))-(r_{D'}-\hat{f}_\sigma(\rho_\sigma(D')))\|_\infty < \frac{n}{2}+\|r_{D'}-\hat{f}_\sigma(\rho_\sigma(D'))\|_\infty,$$
so that $\|r_{D'}-\hat{f}_\sigma(\rho_\sigma(D'))\|_\infty > \frac{n}{2}$. Thus, for any $\rho_\sigma$, $\hat{f}_\sigma$ with finite $\sigma$, there exists a dataset $D$ such that $\|r_D-\hat{f}_\sigma(\rho_\sigma(D))\|_\infty > \frac{n}{2}$, which exceeds $\epsilon$ since $\epsilon < \frac{n}{2}$.

10.8.4 Results with 1-Norm

10.8.4.1 Proof of Theorem 21

We first present the following lemma, whose proof can be found in Appendix 10.8.6.

Lemma 21. Let $D, D' \in [0,1]^n$ be datasets in sorted order. Then, $\|r_D-r_{D'}\|_1 = \sum_{i\in[n]}|D[i]-D'[i]|$.

Note that $\sum_{i\in[n]}|D[i]-D'[i]| = \|D-D'\|_1$, so that $\|r_D-r_{D'}\|_1 = \|D-D'\|_1$. The lemma shows that the 1-norm distance between rank functions has a closed-form solution that can be calculated from the difference between the points in the datasets. We use this lemma throughout for analyzing the 1-norm error.
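As a quick sanity check of Lemma 21 (an illustrative verification, not part of the proof), the sketch below compares a numerical estimate of $\int_0^1|r_D(q)-r_{D'}(q)|\,dq$ against $\|D-D'\|_1$ for random sorted datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
D  = np.sort(rng.uniform(size=16))
Dp = np.sort(rng.uniform(size=16))

qs = np.linspace(0.0, 1.0, 200_001)
rank_gap = np.abs(np.searchsorted(D, qs, side="right")
                  - np.searchsorted(Dp, qs, side="right"))
print(rank_gap.mean())           # mean over a uniform grid on [0,1] ~ the integral
print(np.sum(np.abs(D - Dp)))    # ||D - D'||_1; agrees up to grid error
```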
Let $F = \{r_D, D \in [0,1]^n\}$ be the set of all possible rank functions for datasets of size $n$, and consider the metric space $M = (F, \|.\|_1)$.

Proof of Theorem 21 (i). We prove a lower bound on $M(F, \|.\|_1, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $\mathcal{P} = \{\frac{i}{\lceil k/\epsilon\rceil-1}, 0 \le i \le \lceil\frac{k}{\epsilon}\rceil-1\}$ for an integer $k$ specified later. Let $P, P' \in \mathcal{P}^{\lfloor n/k\rfloor}$, $P \ne P'$, that is, $P$ and $P'$ are datasets of size $\lfloor\frac{n}{k}\rfloor$ only containing points in $\mathcal{P}$. Let $D$ be the dataset of size $n$ where each point in $P$ is repeated $k$ times (with the remaining $n-\lfloor\frac{n}{k}\rfloor\times k$ elements set equal to one), and similarly define $D'$ using $P'$, and consider $D$ and $D'$ in sorted order. We show that $\|r_D-r_{D'}\|_1 > \epsilon$. Observe that $D$ and $D'$ differ in at least $k$ points, so that $D[i] \ne D'[i]$ for at least $k$ different values of $i$. Using this observation and Lemma 21, we have
$$\|r_D-r_{D'}\|_1 = \|D-D'\|_1 \ge \frac{1}{\lceil\frac{k}{\epsilon}\rceil-1}\times k > \frac{k}{k/\epsilon} = \epsilon.$$
Thus, for any two different datasets $D, D'$ generated by the above procedure, we have $\|r_D-r_{D'}\|_1 > \epsilon$. Let $S$ be the set of all datasets generated this way. We have that $S$ is an $\epsilon$-packing of $M$, and thus $|S| \le M(F, \|.\|_1, \epsilon)$.

It remains to find $|S|$. Each element of $S$ is created by selecting $\lfloor\frac{n}{k}\rfloor$ elements from the $\lceil\frac{k}{\epsilon}\rceil$ elements of $\mathcal{P}$ with repetition. The number of unique ways to perform this selection is $C(\lfloor\frac{n}{k}\rfloor+\lceil\frac{k}{\epsilon}\rceil-1, \lfloor\frac{n}{k}\rfloor)$. Let $k = \lceil\sqrt{n}\rceil$, so that
$$C\Big(\Big\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\Big\rfloor+\Big\lceil\frac{\lceil\sqrt{n}\rceil}{\epsilon}\Big\rceil-1,\ \Big\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\Big\rfloor\Big) \ge \Bigg(\frac{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor+\lceil\frac{\lceil\sqrt{n}\rceil}{\epsilon}\rceil-1}{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor}\Bigg)^{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor} \ge \Bigg(\frac{\sqrt{n}+\frac{\sqrt{n}}{\epsilon}-1}{\sqrt{n}}\Bigg)^{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor} = \Big(1+\frac{1}{\epsilon}-\frac{1}{\sqrt{n}}\Big)^{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor} \ge \Big(1+\frac{1}{\epsilon}-\frac{1}{\sqrt{n}}\Big)^{\sqrt{n}-2}.$$
Thus, $M(F, \|.\|_1, \epsilon) \ge (1+\frac{1}{\epsilon}-\frac{1}{\sqrt{n}})^{\sqrt{n}-2}$. Combining this with Theorem 26, we have that any model answering queries with 1-norm error $\epsilon$ must have size at least $(\sqrt{n}-2)\log_2(1+\frac{1}{2\epsilon}-\frac{1}{\sqrt{n}})$.

Proof of Theorem 21 (ii). We prove an upper bound on the metric entropy $N(F, \|.\|_1, \epsilon)$. Let $\bar{\mathcal{D}} = \{\frac{i}{\lceil n/\epsilon\rceil}, 0 \le i \le \lceil\frac{n}{\epsilon}\rceil\}$, and define $\bar{F} = \{r_D, D \in \bar{\mathcal{D}}^n\}$. We show that $\bar{F}$ is an $\epsilon$-cover of $F$, so that its size provides an upper bound on the covering number of $F$. Specifically, for any $D \in [0,1]^n$, we show that there exists $r_{\bar{D}} \in \bar{F}$ s.t. $\|r_D-r_{\bar{D}}\|_1 \le \epsilon$. Consider $\bar{D}$ such that $\bar{D}[i] = \frac{\lfloor\lceil n/\epsilon\rceil\times D[i]\rfloor}{\lceil n/\epsilon\rceil}$ for all $i$. Such a $\bar{D}$ exists in $\bar{\mathcal{D}}^n$, as all its points belong to $\bar{\mathcal{D}}$: $\lfloor\lceil\frac{n}{\epsilon}\rceil\times D[i]\rfloor$ is an integer between 0 and $\lceil\frac{n}{\epsilon}\rceil$ for $D[i] \in [0,1]$. Furthermore, using Lemma 21,
$$\|r_D-r_{\bar{D}}\|_1 = \sum_{i\in[n]}|D[i]-\bar{D}[i]| = \sum_{i\in[n]}\Big|D[i]-\frac{\lfloor\lceil n/\epsilon\rceil\times D[i]\rfloor}{\lceil n/\epsilon\rceil}\Big| < \sum_{i\in[n]}\frac{1}{\lceil n/\epsilon\rceil} = \frac{n}{\lceil n/\epsilon\rceil} \le \epsilon.$$
Therefore, $\bar{F}$ is an $\epsilon$-cover of $F$. It remains to calculate the size of $\bar{F}$, which is the number of possible ways to select $n$ elements from the $\lceil\frac{n}{\epsilon}\rceil+1$ elements of $\bar{\mathcal{D}}$ with repetition. That is,
$$|\bar{F}| = C\big(n+\lceil\tfrac{n}{\epsilon}\rceil, n\big) \le \Big(e\,\frac{n+\lceil\frac{n}{\epsilon}\rceil}{n}\Big)^n \le \Big(e\,\frac{n+\frac{n}{\epsilon}+1}{n}\Big)^n = \Big(e+\frac{e}{\epsilon}+\frac{e}{n}\Big)^n,$$
so that $N(F, \|.\|_1, \epsilon) \le (e+\frac{e}{\epsilon}+\frac{e}{n})^n$. Combining this with Theorem 26, we obtain that $n\log_2(e+\frac{e}{\epsilon}+\frac{e}{n})$ is an upper bound on the required model size.

10.8.4.2 Proof of Theorem 22

We first present the following lemmas, whose proofs can be found in Appendix 10.8.6. These lemmas serve as substitutes for Lemma 21 in the case of cardinality estimation, since we do not have such a closed-form general statement for the 1-norm difference between cardinality functions (as we did for indexing in Lemma 21). Instead, the following lemmas provide scenarios in which the 1-norm difference between cardinality functions can be bounded, which are utilized in the proof of the theorem.

Lemma 22. Consider two 1-dimensional databases $D$ and $D'$ of size $n$ such that $|D[i]-D'[i]| \le \frac{\epsilon}{n}$ for $i \in [n]$. Then $\|c_D-c_{D'}\|_1 \le 2\epsilon$.

Lemma 23. Consider a 1-dimensional database $D$ of size $n$. Assume we are given two mask vectors, $m^1, m^2 \in \{0,1\}^n$, that each create a new dataset, $D^1$ and $D^2$, s.t. $D^i$ consists of the records of $D$ with $m^i = 1$, for $i \in \{1,2\}$. We have that $\|c_{D^1}-c_{D^2}\|_1 \le \frac{1}{2}\sum_{i\in[n]}|m^1[i]-m^2[i]|$.

For the remainder of this section, let $F = \{c_D, D \in [0,1]^{n\times d}\}$ be the set of all possible cardinality functions for $d$-dimensional datasets of size $n$, and consider the metric space $M = (F, \|.\|_1)$. Our proofs in this case reduce the problem to a 1-dimensional setting and then utilize the lemmas stated above to analyze the cardinality functions.

Proof of Theorem 22 Part (i). We prove a lower bound on $M(F, \|.\|_1, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $\frac{u}{2} = \lceil\frac{k}{2\epsilon}\rceil-1$ and let $\mathcal{P} = \{(\frac{i_1}{u}, ..., \frac{i_d}{u}), 0 \le i_1, ..., i_d \le \frac{u}{2}\}$ for an integer $k$ specified later. Let $P, P' \in \mathcal{P}^{\lfloor n/k\rfloor}$, $P \ne P'$, that is, $P$ and $P'$ are datasets of size $\lfloor\frac{n}{k}\rfloor$ only containing points in $\mathcal{P}$.
Let $D$ be the dataset of size $n$ where each point in $P$ is repeated $k$ times, and similarly define $D'$ using $P'$. W.l.o.g., assume $D$ and $D'$ differ on their $i$-th point and its $d$-th dimension. We have
$$\|c_D-c_{D'}\|_1 = \int_{c_1}\!\!...\!\int_{c_{d-1}}\!\int_{r_1}\!\!...\!\int_{r_{d-1}} \|c_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)-c_{D'}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1$$
$$\ge \int_{c_1}\!\!...\!\int_{c_{d-1}}\!\int_{r_1}\!\!...\!\int_{r_{d-1}} \mathbb{I}_{q,D[i]}^{d-1}\,\|c_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)-c_{D'}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1,$$
where $\mathbb{I}_{q,p}^{i}$ is an indicator function equal to 1 if a record $p$ matches the first $i$ dimensions of $q = (c_1, ..., c_d, r_1, ..., r_d)$, that is, if $c_j \le p_j \le c_j+r_j$ for all $j \in [i]$, and zero otherwise.

Next, we show that for any $q$ we have $\mathbb{I}_{q,D[i]}^{d-1}\|c_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)-c_{D'}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1 \ge \mathbb{I}_{q,D[i]}^{d-1}\frac{\epsilon}{2}$. Specifically, if $D[i]$ does not match the first $d-1$ dimensions of $q$, then both sides are zero. Otherwise, let $\bar{D} = \{D[i,d]\ \forall i,\ \mathbb{I}_{q,D[i]}^{d-1} = 1\}$ be the 1-dimensional dataset of the $d$-th attribute of the records of $D$ that match the first $d-1$ dimensions of $q$, and similarly define $\bar{D}' = \{D'[i,d]\ \forall i,\ \mathbb{I}_{q,D'[i]}^{d-1} = 1\}$ for $D'$. By definition, we have $\|c_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)-c_{D'}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1 = \|c_{\bar{D}}-c_{\bar{D}'}\|_1$. We state the following lemma, whose proof is deferred to Appendix 10.8.6, to bound the 1-norm difference between $\bar{D}$ and $\bar{D}'$.

Lemma 24. Let $\bar{D}$ and $\bar{D}'$ be as defined above. For $\epsilon < k$, we have that $\|c_{\bar{D}}-c_{\bar{D}'}\|_1 > \frac{\epsilon}{2}$.

Using Lemma 24 and the preceding discussion, we have
$$\|c_D-c_{D'}\|_1 > \int_{c_1}\!\!...\!\int_{c_{d-1}}\!\int_{r_1}\!\!...\!\int_{r_{d-1}} \mathbb{I}_{q,D[i]}^{d-1}\frac{\epsilon}{2} = \frac{\epsilon}{2}\int_{c_1}\!\!...\!\int_{c_{d-1}}\!\int_{r_1}\!\!...\!\int_{r_{d-1}} \mathbb{I}_{q,D[i]}^{d-1}.$$
For each dimension $j \in [d-1]$, the $j$-th condition is satisfied when $c_j \le D[i,j] \le c_j+r_j$, so that
$$\int_0^1\int_{D[i,j]-r_j}^{\min\{D[i,j],\,1-r_j\}} 1\,dc_j\,dr_j = \int_0^1\big(\min\{D[i,j],1-r_j\}-D[i,j]+r_j\big)\,dr_j$$
$$= \int_0^{1-D[i,j]} r_j\,dr_j + \int_{1-D[i,j]}^1 (1-D[i,j])\,dr_j = \frac{1}{2}(1-D[i,j])^2 + D[i,j](1-D[i,j]) = \frac{1}{2}-\frac{D[i,j]^2}{2} \ge \frac{1}{4},$$
where the last inequality follows because all points in $\mathcal{P}$ have all of their coordinates at most $\frac{1}{2}$. Repeating the above for all the dimensions, we obtain that $\|c_D-c_{D'}\|_1 > \epsilon\,0.25^d$. Thus, for any two different datasets $D, D'$ generated by the above procedure, we have $\|c_D-c_{D'}\|_1 > \epsilon\,0.25^d$. Let $S$ be the set of all datasets generated this way.

It remains to find $|S|$. Each element of $S$ is created by selecting $\lfloor\frac{n}{k}\rfloor$ elements from the $(\frac{u}{2}+1)^d$ possible elements of $\mathcal{P}$ with repetition. Setting $k = \lceil\sqrt{n}\rceil+1$, this is
$$C\Big(\big(\tfrac{u}{2}+1\big)^d+\lceil\sqrt{n}\rceil,\ \lceil\sqrt{n}\rceil+1\Big) \ge \Bigg(\frac{(\frac{u}{2}+1)^d+\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor-1}{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor}\Bigg)^{\lfloor\frac{n}{\lceil\sqrt{n}\rceil}\rfloor} \ge \Bigg(\frac{(\frac{\sqrt{n}}{2\epsilon})^d+\sqrt{n}-1}{\sqrt{n}}\Bigg)^{\sqrt{n}-2} = \Bigg(\frac{\sqrt{n}^{\,d-1}}{(2\epsilon)^d}+1-\frac{1}{\sqrt{n}}\Bigg)^{\sqrt{n}-2}.$$
Thus, we have that $S$ contains elements that are pairwise more than $\epsilon\,0.25^d$ apart, for $\epsilon < \sqrt{n}$. Define $\epsilon' = \epsilon\,0.25^d$ (so that $\epsilon'/0.25^d = \epsilon$) and repeat the above procedure for $\epsilon'$: there exists a set of $(\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon'^{\,d}}+1-\frac{1}{\sqrt{n}})^{\sqrt{n}-2}$ elements that all pairwise differ by more than $\epsilon'$, for $4^d\epsilon' \le \sqrt{n}$. Thus, we have an $\epsilon'$-packing of $M$, so that $(\frac{\sqrt{n}^{\,d-1}}{4^d(d+1)\epsilon'^{\,d}}+1-\frac{1}{\sqrt{n}})^{\sqrt{n}-2} \le M(F, \|.\|_1, \epsilon')$. Using this together with Theorem 26 proves the result.

Proof of Theorem 22 Part (ii). We prove an upper bound on $N(F, \|.\|_1, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $\bar{\mathcal{D}} = \{(\frac{i_1}{\lceil n/\epsilon\rceil}, ..., \frac{i_d}{\lceil n/\epsilon\rceil}), 0 \le i_1, ..., i_d \le \lceil\frac{n}{\epsilon}\rceil\}$, and define $\bar{F} = \{c_D, D \in \bar{\mathcal{D}}^n\}$. We show that $\bar{F}$ covers $F$, so that its size provides an upper bound on the covering number of $F$. For any $D \in [0,1]^{n\times d}$, we show that there exists $c_{\bar{D}} \in \bar{F}$ s.t. $\|c_D-c_{\bar{D}}\|_1 \le (d+1)\epsilon$. Specifically, consider $\bar{D}$ such that $\bar{D}[i,j] = \frac{\lfloor\lceil n/\epsilon\rceil\times D[i,j]\rfloor}{\lceil n/\epsilon\rceil}$ for all $i, j$.
Proof of Theorem 22 Part (ii). We prove an upper bound on $N(F, \|\cdot\|_1, \epsilon)$, which in turn proves the desired result using Theorem 26. Let $\bar{\mathcal{D}} = \{(\frac{i_1}{\lceil n/\epsilon \rceil}, ..., \frac{i_d}{\lceil n/\epsilon \rceil}),\ 0 \le i_1, ..., i_d \le \lceil \frac{n}{\epsilon} \rceil\}$, and define $\bar{F} = \{c_D,\ D \in \bar{\mathcal{D}}^n\}$. We show that $\bar{F}$ is a $(d+1)\epsilon$-cover of $F$, so that its size provides an upper bound on the covering number of $F$. For any $D \in [0,1]^{n \times d}$, we show that there exists $c_{\bar{D}} \in \bar{F}$ s.t. $\|c_D - c_{\bar{D}}\|_1 \le (d+1)\epsilon$. Specifically, consider $\bar{D}$ such that $\bar{D}[i,j] = \lfloor \lceil \frac{n}{\epsilon} \rceil \times D[i,j] \rfloor / \lceil \frac{n}{\epsilon} \rceil$ for all $i$ and $j$. Such a $\bar{D}$ exists in $\bar{\mathcal{D}}^n$, as all its points belong to $\bar{\mathcal{D}}$. Our goal is to show that $\|c_D - c_{\bar{D}}\|_1 \le (d+1)\epsilon$.

To do so, consider the dataset $\hat{D}$ with $\hat{D}[i,j] = \bar{D}[i,j]$ for $j \in [d-1]$ and $\hat{D}[i,d] = D[i,d]$; that is, $\hat{D}$ is a dataset whose first $d-1$ dimensions are the same as $\bar{D}$ and whose $d$-th dimension is the same as $D$. We have
\[
\|c_D - c_{\bar{D}}\|_1 \le \|c_D - c_{\hat{D}}\|_1 + \|c_{\hat{D}} - c_{\bar{D}}\|_1.
\]
We study the two terms separately. Consider the second term, $\|c_{\hat{D}} - c_{\bar{D}}\|_1$. Observe that the first $d-1$ dimensions of $\bar{D}$ and $\hat{D}$ are identical, so applying any predicate on the first $d-1$ attributes of either dataset leads to the same filtered data points. Furthermore, the $d$-th dimensions of corresponding points differ by at most $\frac{\epsilon}{n}$, so by Lemma 22, $\|c_{\hat{D}} - c_{\bar{D}}\|_1 \le 2\epsilon$.

Next, consider the term $\|c_D - c_{\hat{D}}\|_1$. Recall that points in $D$ and $\hat{D}$ have identical last dimensions. Define $I^{i}_{q,p}$ (as in the proof of Part (i)) to be the indicator function equal to 1 if a record $p$ matches the first $i$ dimensions of $q = (c_1, ..., c_d, r_1, ..., r_d)$, i.e., if $c_j \le p_j \le c_j + r_j$ for all $j \in [i]$. Let $D^* = \{D[i,d]\ \forall i,\ I^{d-1}_{q,D[i]} = 1\}$ be the 1-dimensional dataset of the $d$-th attribute of the records of $D$ matching the first $d-1$ dimensions of $q$, and similarly define $\hat{D}^* = \{\hat{D}[i,d]\ \forall i,\ I^{d-1}_{q,\hat{D}[i]} = 1\}$ for $\hat{D}$. Note that the $d$-th dimensions of $\hat{D}^*$ and $D^*$ are identical: both contain a subset of the elements of the $d$-th dimension of $D$, where the subset is selected based on the indicator $I^{d-1}$. Thus, we apply Lemma 23 to $\hat{D}^*$ and $D^*$ (the masks in the lemma are induced by the indicator functions, i.e., $m_1 = (I^{d-1}_{q,\hat{D}[1]}, ..., I^{d-1}_{q,\hat{D}[n]})$ and $m_2 = (I^{d-1}_{q,D[1]}, ..., I^{d-1}_{q,D[n]})$) to obtain
\[
\|c_{D^*} - c_{\hat{D}^*}\|_1 \le \frac{1}{2} \sum_{i \in [n]} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}|.
\]
Moreover, by definition, $\|c_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot) - c_{\hat{D}}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1 = \|c_{D^*} - c_{\hat{D}^*}\|_1$. Combining the last two relations, we have
\[
\|c_D - c_{\hat{D}}\|_1 \le \int_{c_1} \!\!...\! \int_{r_{d-1}} \frac{1}{2} \sum_{i \in [n]} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}| = \frac{1}{2} \sum_{i \in [n]} \int_{c_1} \!\!...\! \int_{r_{d-1}} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}|.
\]
Now observe that for any $q$ and $i$,
\[
|I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}| \le \sum_{j \in [d-1]} |I_{c_j \le D[i,j] \le c_j + r_j} - I_{c_j \le \hat{D}[i,j] \le c_j + r_j}|,
\]
where $I$ is the indicator function, so that
\[
\|c_D - c_{\hat{D}}\|_1 \le \frac{1}{2} \sum_{j \in [d-1]} \sum_{i \in [n]} \int_{c_1} \!\!...\! \int_{r_{d-1}} |I_{c_j \le D[i,j] \le c_j + r_j} - I_{c_j \le \hat{D}[i,j] \le c_j + r_j}|.
\]
Note that for any $j$, $|I_{c_j \le D[i,j] \le c_j + r_j} - I_{c_j \le \hat{D}[i,j] \le c_j + r_j}| = 1$ only in the two following scenarios: (1) $\hat{D}[i,j] \le c_j \le D[i,j]$ and $D[i,j] - c_j \le r_j \le 1$, or (2) $\hat{D}[i,j] - 1 \le c_j \le \hat{D}[i,j]$ and $\hat{D}[i,j] - c_j \le r_j \le D[i,j] - c_j$. For the first scenario, we have
\[
\int_{\hat{D}[i,j]}^{D[i,j]} \int_{D[i,j]-c_j}^{1} 1\, dr_j\, dc_j = \int_{\hat{D}[i,j]}^{D[i,j]} (1 - D[i,j] + c_j)\, dc_j = \Big[ (1 - D[i,j])\, c_j + \frac{c_j^2}{2} \Big]_{\hat{D}[i,j]}^{D[i,j]} = (D[i,j] - \hat{D}[i,j]) - \frac{1}{2} (D[i,j] - \hat{D}[i,j])^2 \le \frac{\epsilon}{n},
\]
and for the second scenario we have
\[
\int_{\hat{D}[i,j]-1}^{\hat{D}[i,j]} \int_{\hat{D}[i,j]-c_j}^{D[i,j]-c_j} 1\, dr_j\, dc_j = \int_{\hat{D}[i,j]-1}^{\hat{D}[i,j]} (D[i,j] - \hat{D}[i,j])\, dc_j = D[i,j] - \hat{D}[i,j] \le \frac{\epsilon}{n}.
\]
Thus, we have
\[
\|c_D - c_{\hat{D}}\|_1 \le \frac{1}{2} \sum_{j \in [d-1]} \sum_{i \in [n]} \frac{2\epsilon}{n} = (d-1)\epsilon.
\]
Putting everything together, we have $\|c_D - c_{\bar{D}}\|_1 \le 2\epsilon + (d-1)\epsilon = (d+1)\epsilon$. Thus, we have shown that for any $D$ there exists $c_{\bar{D}} \in \bar{F}$ such that $\|c_D - c_{\bar{D}}\|_1 \le (d+1)\epsilon$.

Next, we calculate the size of $\bar{F}$. It is equal to the number of ways $n$ elements can be selected from a set of size $(\lceil \frac{n}{\epsilon} \rceil + 1)^d$ with repetition, which is
\[
C\big( (\lceil \tfrac{n}{\epsilon} \rceil + 1)^d + n - 1,\ n \big) \le \Big( e\, \frac{(\lceil \frac{n}{\epsilon} \rceil + 1)^d + n - 1}{n} \Big)^n.
\]
Define $\epsilon' = (d+1)\epsilon$. Then there exists an $\epsilon'$-cover of $F$ with $\big( e \frac{(2n(d+1)/\epsilon')^d + n - 1}{n} \big)^n$ elements for $\epsilon' \le \frac{2}{3} n(d+1)$ (we have used the fact that $2x \ge \lceil x \rceil + 1$ for $x \ge \frac{3}{2}$), which proves $N(F, \|\cdot\|_1, \epsilon') \le \big( e \frac{(2n(d+1)/\epsilon')^d + n - 1}{n} \big)^n$. Note that $d \ge 1$ implies that $\epsilon' \le \frac{2}{3} n(d+1)$ is satisfied for all $\epsilon' \le n$. Combining this with Theorem 26, we obtain that $n \log_2 \big( e (\frac{2(d+1)}{\epsilon})^d n^{d-1} + e - \frac{e}{n} \big)$ is an upper bound on the required model size.
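The qualitative content of this covering argument, that snapping a dataset to a sufficiently fine grid barely changes query answers, can be observed empirically. The following sketch is illustrative only (the helper card is ours, a brute-force evaluator, not anything from the thesis):

```python
import numpy as np

# Qualitative illustration (not a proof) of the covering argument: snapping a
# d-dimensional dataset to the grid with resolution ceil(n/eps) perturbs
# range-count answers by only O(d * eps) records per query on average.
rng = np.random.default_rng(1)
n, d, eps = 500, 2, 0.2
g = int(np.ceil(n / eps))
D = rng.uniform(0, 1, (n, d))
D_bar = np.floor(g * D) / g                    # the grid-snapped dataset D-bar

def card(data, c, r):
    """Count records falling in the box [c_j, c_j + r_j] in every dimension j."""
    return np.all((data >= c) & (data <= c + r), axis=1).sum()

m = 20000
diffs = []
for _ in range(m):
    c, r = rng.uniform(0, 1, d), rng.uniform(0, 1, d)
    diffs.append(abs(card(D, c, r) - card(D_bar, c, r)))
print(f"mean |c_D - c_Dbar| per query: {np.mean(diffs):.4f} records (n = {n})")
```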
10.8.4.3 Proofs for Range-Sum Estimation

Similar to the previous sections, we first present the following lemmas, used to bound the difference between range-sum functions and proved in Appendix 10.8.6. Lemmas 26 and 27 are direct generalizations of Lemmas 22 and 23 to answering range-sum queries, with almost identical proofs. However, Lemma 25 is specific to range-sum queries, allowing us to analyze the attribute that is being aggregated by the query.

Lemma 25. Assume $D$ and $D'$ are two $(d+1)$-dimensional datasets with identical first $d$ dimensions, but with $|D[i, d+1] - D'[i, d+1]| \le \frac{\epsilon}{n}$ for $i \in [n]$. Then $\|s_D - s_{D'}\|_1 \le \epsilon$.

Lemma 26. Consider two 2-dimensional databases $D$ and $D'$ of size $n$ such that $|D[i,1] - D'[i,1]| \le \frac{\epsilon}{n}$ and $D[i,2] = D'[i,2]$ for $i \in [n]$. Then $\|s_D - s_{D'}\|_1 \le 2\epsilon$.

Lemma 27. Consider a 2-dimensional database $D$ of size $n$. Assume that we are given two mask vectors $m_1, m_2 \in \{0,1\}^n$ that each create a new dataset, $D_1$ and $D_2$, s.t. $D_i$ consists of the records of $D$ with $m_i = 1$, for $i \in \{1, 2\}$. We have $\|s_{D_1} - s_{D_2}\|_1 \le \frac{1}{2}\sum_{i \in [n]} |m_1[i] - m_2[i]|$.

In this section, let $F = \{s_D,\ D \in [0,1]^{n \times (d+1)}\}$ be the set of all possible range-sum functions for $(d+1)$-dimensional datasets of size $n$, and consider the metric space $M = (F, \|\cdot\|_1)$.

Proof of Corollary 23. For the purpose of contradiction, assume there exists a training/inference function pair $(\rho, \hat{f})$ of size less than $(\sqrt{n} - 2) \log_2 (1 + \frac{\sqrt{n}^{\,d-1}}{4^{d(d+1)} \epsilon^d} - \frac{1}{\sqrt{n}})$ such that for all datasets $D \in [0,1]^{n \times (d+1)}$ we have $\|\hat{f}(\cdot\,; \rho(D)) - s_D\|_1 \le \epsilon$. We use $(\rho, \hat{f})$ to construct a training/inference function pair $(\rho', \hat{f})$ that answers cardinality estimation queries for any dataset $D \in [0,1]^{n \times d}$ with error at most $\epsilon$. Specifically, define $\rho'$ as the function that takes $D \in [0,1]^{n \times d}$ as input, constructs $D' \in [0,1]^{n \times (d+1)}$ with $D'[i,j] = D[i,j]$ for $j \in [d]$, $i \in [n]$, and $D'[i, d+1] = 1$, and returns $\rho(D')$. Here, $D'$ is a dataset whose first $d$ dimensions are identical to $D$ but whose $(d+1)$-th dimension is set to 1 for all data points. Then, we have $\|\hat{f}(\cdot\,; \rho'(D)) - c_D\|_1 \le \epsilon$ for all $D \in [0,1]^{n \times d}$, because by construction $c_D = s_{D'}$, $\hat{f}(\cdot\,; \rho'(D)) = \hat{f}(\cdot\,; \rho(D'))$, and $\|\hat{f}(\cdot\,; \rho(D')) - s_{D'}\|_1 \le \epsilon$ by assumption. However, $\|\hat{f}(\cdot\,; \rho'(D)) - c_D\|_1 \le \epsilon$ contradicts Theorem 22, and thus we have proven that no training/inference function pair $(\rho, \hat{f})$ of size less than $(\sqrt{n} - 2) \log_2 (1 + \frac{\sqrt{n}^{\,d-1}}{4^{d(d+1)} \epsilon^d} - \frac{1}{\sqrt{n}})$ exists that achieves $\|\hat{f}(\cdot\,; \rho(D)) - s_D\|_1 \le \epsilon$ for all datasets $D \in [0,1]^{n \times (d+1)}$. Thus, $\Sigma_s^1 \ge (\sqrt{n} - 2) \log_2 (1 + \frac{\sqrt{n}^{\,d-1}}{4^{d(d+1)} \epsilon^d} - \frac{1}{\sqrt{n}})$.
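The reduction in this proof is easy to state in code: appending a constant-1 measure column turns any range-count instance into a range-sum instance. Below is a minimal sketch of that lifting (illustrative only; the brute-force helpers card and range_sum are our stand-ins for any evaluator, learned or not):

```python
import numpy as np

# The Corollary 23 reduction: a range-sum answerer also answers range counts.
# Appending a constant-1 measure column to a d-dimensional dataset makes the
# range-sum over that column equal the range count on the original data.
rng = np.random.default_rng(2)
n, d = 1000, 3
D = rng.uniform(0, 1, (n, d))
D_lifted = np.hstack([D, np.ones((n, 1))])     # D'[i, d+1] = 1 for every record

def card(data, c, r):
    return np.all((data >= c) & (data <= c + r), axis=1).sum()

def range_sum(data, c, r):
    """Sum of the last column over records whose first d columns fall in the box."""
    inside = np.all((data[:, :-1] >= c) & (data[:, :-1] <= c + r), axis=1)
    return data[inside, -1].sum()

c, r = rng.uniform(0, 1, d), rng.uniform(0, 1, d)
assert card(D, c, r) == range_sum(D_lifted, c, r)   # identical by construction
```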
Proof of Theorem 24. Let $\bar{\mathcal{D}} = \{(\frac{i_1}{\lceil n/\epsilon \rceil}, ..., \frac{i_{d+1}}{\lceil n/\epsilon \rceil}),\ 0 \le i_1, ..., i_{d+1} \le \lceil \frac{n}{\epsilon} \rceil\}$, and define $\bar{F} = \{s_D,\ D \in \bar{\mathcal{D}}^n\}$. We show that $\bar{F}$ is a $(d+2)\epsilon$-cover of $F$, so that its size provides an upper bound on the covering number of $F$. For any $D \in [0,1]^{n \times (d+1)}$, we show that there exists $s_{\bar{D}} \in \bar{F}$ s.t. $\|s_D - s_{\bar{D}}\|_1 \le (d+2)\epsilon$. Specifically, consider $\bar{D}$ such that $\bar{D}[i,j] = \lfloor \lceil \frac{n}{\epsilon} \rceil \times D[i,j] \rfloor / \lceil \frac{n}{\epsilon} \rceil$ for all $i$ and $j$; such a $\bar{D}$ exists in $\bar{\mathcal{D}}^n$, as all its points belong to $\bar{\mathcal{D}}$. Furthermore, define $D'$ by $D'[i,j] = D[i,j]$ for $i \in [n]$, $j \in [d]$, but $D'[i, d+1] = \bar{D}[i, d+1]$. By Lemma 25, we have
\[
\|s_D - s_{\bar{D}}\|_1 \le \|s_D - s_{D'}\|_1 + \|s_{D'} - s_{\bar{D}}\|_1 \le \|s_{D'} - s_{\bar{D}}\|_1 + \epsilon.
\]
The remainder of the proof closely follows the proof of Theorem 22 Part (ii), with a slight generalization. We state this as a lemma and defer the proof to Appendix 10.8.6.

Lemma 28. Let $\bar{D}$ and $D'$ be as defined above. We have $\|s_{\bar{D}} - s_{D'}\|_1 \le (d+1)\epsilon$.

This, together with the above discussion, implies $\|s_D - s_{\bar{D}}\|_1 \le (d+2)\epsilon$. Thus, we have shown that for any $D$ there exists $s_{\bar{D}} \in \bar{F}$ such that $\|s_D - s_{\bar{D}}\|_1 \le (d+2)\epsilon$. Next, we calculate the size of $\bar{F}$. It is equal to the number of ways $n$ elements can be selected from a set of size $(\lceil \frac{n}{\epsilon} \rceil + 1)^{d+1}$ with repetition, which is
\[
C\big( (\lceil \tfrac{n}{\epsilon} \rceil + 1)^{d+1} + n - 1,\ n \big) \le \Big( e\, \frac{(\lceil \frac{n}{\epsilon} \rceil + 1)^{d+1} + n - 1}{n} \Big)^n.
\]
Define $\epsilon' = (d+2)\epsilon$. Then there exists an $\epsilon'$-cover of $F$ with $\big( e \frac{(2n(d+2)/\epsilon')^{d+1} + n - 1}{n} \big)^n$ elements for $\epsilon' \le \frac{2}{3} n(d+2)$ (we have used the fact that $2x \ge \lceil x \rceil + 1$ for $x \ge \frac{3}{2}$), which proves $N(F, \|\cdot\|_1, \epsilon') \le \big( e \frac{(2n(d+2)/\epsilon')^{d+1} + n - 1}{n} \big)^n$. Note that $d \ge 1$ implies that $\epsilon' \le \frac{2}{3} n(d+2)$ is satisfied for all $\epsilon' \le n$. Combining this with Theorem 26, we obtain that $n \log_2 \big( e (\frac{2(d+2)}{\epsilon})^{d+1} n^d + e - \frac{e}{n} \big)$ is an upper bound on the required model size.

10.8.5 Results with µ-norm

10.8.5.1 Proof of Theorem 25

Theorem 25 can be seen as a direct generalization of Theorem 21. We first present a generalization of Lemma 21 to the case of the $\mu$-norm.

Lemma 29. Let $D$ and $D'$ be 1-dimensional datasets in sorted order. Then $\|r_D - r_{D'}\|_\mu = \|D - D'\|_\mu$, where $\|D - D'\|_\mu = \sum_{i \in [n]} \mu([D[i], D'[i]])$ and $\mu([D[i], D'[i]])$ is the probability of observing a query in the range $[D[i], D'[i]]$.

Using this lemma, the remainder of the proof is a straightforward adaptation of the arguments in the proof of Theorem 21. Let $F = \{r_D,\ D \in [0,1]^n\}$ be the set of all possible rank functions for datasets of size $n$, and consider the metric space $M = (F, \|\cdot\|_\mu)$.

Proof of Theorem 25 Part (i). Let $p_0 = 0$ and define $p_i$ inductively s.t. $\mu([p_{i-1}, p_i]) = \frac{1}{\lceil k/\epsilon \rceil - 1}$ for $i > 0$. Since $\mu$ is a continuous distribution over $[0,1]$, a total of $\lceil \frac{k}{\epsilon} \rceil$ distinct such points exist in $[0,1]$. Let $\mathcal{P} = \{p_0, ..., p_{\lceil k/\epsilon \rceil - 1}\}$ be the set of all such points, for an integer $k$ specified later. Let $P, P' \in \mathcal{P}^{\lfloor n/k \rfloor}$, $P \neq P'$; that is, $P$ and $P'$ are datasets of size $\lfloor n/k \rfloor$ containing only points in $\mathcal{P}$. Let $D$ be the dataset of size $n$ in which each point of $P$ is repeated $k$ times, and similarly define $D'$ using $P'$. Note that $D$ and $D'$ differ in at least $k$ points, so that $D[i] \neq D'[i]$ for $k$ different indices $i$. This means
\[
\|r_D - r_{D'}\|_\mu = \|D - D'\|_\mu \ge k \cdot \frac{1}{\lceil k/\epsilon \rceil - 1} > k \cdot \frac{\epsilon}{k} = \epsilon.
\]
Therefore, as long as we generate datasets the way described above, for every two datasets we have $\|r_D - r_{D'}\|_\mu > \epsilon$. Similar to the case of 1-norm error, the total number of such distinct datasets is greater than $(1 + \frac{1}{\epsilon} - \frac{1}{\sqrt{n}})^{\sqrt{n}-2}$, which bounds the packing entropy and, together with Theorem 26, proves the result (see the proof of Theorem 21 for more details).
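The grid points $p_i$ used in both parts of this proof are simply $\mu$-quantiles. A small sketch of the construction follows; it assumes, purely for illustration, that $\mu$ is a Beta(2, 5) distribution (any continuous $\mu$ on $[0,1]$ works):

```python
import numpy as np
from scipy.stats import beta

# Constructing the mu-equispaced grid used in the proofs above: p_0 = 0 and
# mu([p_{i-1}, p_i]) = 1/m, i.e., the p_i are the (i/m)-quantiles of mu.
# Beta(2, 5) is only an illustrative choice of the query distribution mu.
mu = beta(2, 5)
m = 50                                    # number of cells, e.g. ceil(n/eps)
p = mu.ppf(np.arange(m + 1) / m)          # p_0 = 0, ..., p_m = 1

# Snap each data point to the nearest grid point; by construction each point
# moves by at most 1/m in mu-measure, so ||r_D - r_Dbar||_mu <= n/m.
rng = np.random.default_rng(3)
D = np.sort(rng.uniform(0, 1, 20))
D_bar = p[np.abs(p[None, :] - D[:, None]).argmin(axis=1)]
print(np.abs(mu.cdf(D) - mu.cdf(D_bar)).sum())   # total mu-mass moved
```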
Proof of Theorem 25 Part (ii). Let $p_0 = 0$ and define $p_i$ s.t. $\mu([p_{i-1}, p_i]) = \frac{1}{\lceil n/\epsilon \rceil}$, and let $\mathcal{P}$ be the set of the $\lceil \frac{n}{\epsilon} \rceil + 1$ such points. For any $D \in [0,1]^n$, we show that there exists $\bar{D} \in \mathcal{P}^n$ s.t. $\|D - \bar{D}\|_\mu \le \epsilon$. Specifically, consider $\bar{D}$ with $\bar{D}[i] = \arg\min_{p \in \mathcal{P}} |p - D[i]|$ for all $i$. Note that for every $i$ we have $\mu([\bar{D}[i], D[i]]) \le \frac{\epsilon}{n}$. Therefore, $\|r_D - r_{\bar{D}}\|_\mu \le \sum_{i \in [n]} \frac{\epsilon}{n} = \epsilon$. Similar to the case of 1-norm error, we can calculate the total number of possible datasets in $\mathcal{P}^n$ to be at most $(e + \frac{e}{\epsilon} + \frac{e}{n})^n$, which combined with Theorem 26 proves the result.

10.8.5.2 Proof of Lemma 20

For $f \in \{s, c\}$, we construct a query distribution such that for any error $\epsilon > 0$, we have $\|f_D - f_{D'}\|_\mu \le \epsilon$ for any $D$ and $D'$. Specifically, consider the set $Q = \{(\frac{i}{kn}, \frac{1}{kn}),\ 0 \le i \le kn - 1\}$, for an integer $k$ defined later. Then, define the p.d.f. $g$ by $g(c, r) = 2nk$ if $[c, c+r] \subseteq [\frac{i}{nk}, \frac{i+1}{nk}]$ for some $0 \le i \le kn-1$ (i.e., if the query interval is contained in a single cell, crossing no cell boundary $\frac{i}{kn}$), and $g(c, r) = 0$ otherwise. Note that this is a valid p.d.f. that integrates to 1, as shown below:
\[
\sum_{0 \le i \le kn-1} \int_{c=0}^{\frac{1}{nk}} \int_{r=0}^{\frac{1}{nk}-c} 2nk\, dr\, dc = 2nk \sum_{0 \le i \le kn-1} \int_{0}^{\frac{1}{nk}} \Big( \frac{1}{nk} - c \Big) dc = 2nk \sum_{0 \le i \le kn-1} \frac{1}{2} \Big( \frac{1}{nk} \Big)^2 = 2nk \cdot kn \cdot \frac{1}{2(nk)^2} = 1,
\]
where $c$ is measured relative to the start of cell $i$. Now, for any two datasets $D$ and $D'$, we have
\[
\|f_D - f_{D'}\|_\mu = 2kn \sum_i \int_{c=\frac{i}{nk}}^{\frac{i+1}{nk}} \int_{r=0}^{\frac{i+1}{nk}-c} |f_D(c,r) - f_{D'}(c,r)| \le 2kn \sum_i \int_{c=\frac{i}{nk}}^{\frac{i+1}{nk}} \int_{r=0}^{\frac{i+1}{nk}-c} 2 f_D(c,r) \le 4kn \sum_i \int_{c=\frac{i}{nk}}^{\frac{i+1}{nk}} \int_{r=0}^{\frac{i+1}{nk}-c} f_D\Big( \frac{i}{nk}, \frac{1}{nk} \Big) \le 4kn \sum_i f_D\Big( \frac{i}{nk}, \frac{1}{nk} \Big) \frac{1}{(nk)^2} = \frac{4}{nk} \sum_i f_D\Big( \frac{i}{nk}, \frac{1}{nk} \Big) \le \frac{4}{k}.
\]
The first inequality follows by assuming, w.l.o.g., that $f_D(q) \ge f_{D'}(q)$. The second inequality follows because both the cardinality and range-sum functions are increasing in the length of the query predicate, so every query in cell $i$ is dominated by the full-cell query $(\frac{i}{nk}, \frac{1}{nk})$. The last inequality follows since $\sum_i f_D(\frac{i}{nk}, \frac{1}{nk}) \le n$, because all queries in $Q$ are disjoint, so that every point in $D$ contributes to at most one query in $Q$. Now setting $k$ so that $\frac{4}{k} < \epsilon$ completes the proof.
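The distribution $g$ can be sampled directly, which makes the $\frac{4}{k}$ decay easy to observe empirically. A rough sketch follows (illustrative only; the sampler draws a cell uniformly, then a query uniformly within the cell's triangle, which matches the density $g(c,r) = 2nk$ above):

```python
import numpy as np

# Sampling the adversarial query distribution from the Lemma 20 proof: queries
# are confined to kn disjoint cells of width 1/(kn), so any two datasets give
# nearly identical answers on average, with a gap bounded by 4/k.
rng = np.random.default_rng(4)
n = 200
D1 = rng.uniform(0, 1, n)
D2 = rng.uniform(0, 1, n)          # a completely different dataset

def card(data, c, r):
    return np.count_nonzero((data >= c) & (data <= c + r))

for k in (1, 4, 16, 64):
    w = 1.0 / (n * k)                              # cell width
    cells = rng.integers(0, n * k, 50_000)         # uniform cell choice
    a, b = np.sort(rng.uniform(0, w, (2, 50_000)), axis=0)
    c, r = cells * w + a, b - a                    # query inside the chosen cell
    gap = np.mean([abs(card(D1, ci, ri) - card(D2, ci, ri))
                   for ci, ri in zip(c, r)])
    print(f"k={k:3d}: average |c_D1 - c_D2| under g ~ {gap:.4f} (bound 4/k = {4/k:.4f})")
```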
10.8.6 Proof of Technical Lemmas

Proof of Lemma 21. The 1-norm error is the area between the two curves $r_D$ and $r_{D'}$. We break this area down into rectangles whose areas can be computed in closed form; the main intuition is to calculate the area by considering vertically stacked rectangles. Specifically, $\|r_D - r_{D'}\|_1 = \int_0^1 |r_D(q) - r_{D'}(q)|\, dq$. Let $I_{i,q}$ be an indicator variable denoting whether $q \in [D[i], D'[i])$ (or $[D'[i], D[i])$ if $D'[i] < D[i]$). We have $|r_D(q) - r_{D'}(q)| = \sum_{i \in [n]} I_{i,q}$. Thus,
\[
\|r_D - r_{D'}\|_1 = \int_0^1 \sum_{i \in [n]} I_{i,q}\, dq = \sum_{i \in [n]} \int_0^1 I_{i,q}\, dq = \sum_{i \in [n]} \int_{D[i]}^{D'[i]} 1\, dq = \sum_{i \in [n]} |D[i] - D'[i]|.
\]

Proof of Lemma 22. For a fixed $r$, the set of values of $c$ for which exactly one of $D[i]$ and $D'[i]$ lies in $[c, c+r]$ has measure at most $2\min\{\frac{\epsilon}{n}, r\}$, so
\[
\|c_D - c_{D'}\|_1 = \int_r \int_c \Big| \sum_{i \in [n]} \big( I_{D[i] \in [c,c+r]} - I_{D'[i] \in [c,c+r]} \big) \Big| \le \sum_{i \in [n]} \int_r \int_c \big| I_{D[i] \in [c,c+r]} - I_{D'[i] \in [c,c+r]} \big| = 2 \sum_{i \in [n]} \int_r \min\Big\{\frac{\epsilon}{n}, r\Big\} \le 2 \sum_{i \in [n]} \int_r \frac{\epsilon}{n} = 2\epsilon.
\]

Proof of Lemma 23. We first state the following lemma, whose proof is deferred to the end of this section.

Lemma 30. Consider two 1-dimensional datasets $D$ and $D'$ over points $p_1, ..., p_k$ s.t. the points in $D$ are each repeated $z_1, ..., z_k$ times, for $k \in [n]$ and $0 \le z_i \le n$, and the points in $D'$ are each repeated $z'_1, ..., z'_k$ times, for $k \in [n]$ and $0 \le z'_i \le n$. Let $t = \sum_{i \in [k]} |z_i - z'_i|$. We have $\|c_D - c_{D'}\|_1 \le \frac{1}{2} t$.

In our setting, let $p_1, ..., p_k$ be the distinct elements of $D$, let $z_1, ..., z_k$ be the number of times each element is repeated in $D_1$, and $z'_1, ..., z'_k$ the number of times each element is repeated in $D_2$. By Lemma 30, we have $\|c_{D_1} - c_{D_2}\|_1 \le \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i|$. Now observe that $z_i = \sum_{j \in [n]} m_1[j]\, I_{D[j] = p_i}$ and similarly $z'_i = \sum_{j \in [n]} m_2[j]\, I_{D[j] = p_i}$. Thus,
\[
\|c_{D_1} - c_{D_2}\|_1 \le \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i| = \frac{1}{2} \sum_{i \in [k]} \Big| \sum_{j \in [n]} I_{D[j] = p_i} (m_1[j] - m_2[j]) \Big| \le \frac{1}{2} \sum_{i \in [k]} \sum_{j \in [n]} I_{D[j] = p_i} |m_1[j] - m_2[j]| = \frac{1}{2} \sum_{j \in [n]} |m_1[j] - m_2[j]|.
\]

Proof of Lemma 24. For simplicity of notation, we prove this lemma with $\bar{D}$ renamed as $D$ and $\bar{D}'$ renamed as $D'$. For any fixed $r$, we first bound $\|c_D(\cdot, r) - c_{D'}(\cdot, r)\|_1$. Let $z = r - \lfloor ru \rfloor / u$ (i.e., $r$ modulo the grid width $\frac{1}{u}$). Let $i$ be the index of the first element on which $D$ and $D'$ differ, and let $p = \min\{D[i], D'[i]\}$. Such an index exists as the datasets are different (also recall that any difference is repeated $k$ times, by construction). Therefore, the functions are identical on the range $[-1, p-r)$.

Now consider two cases: $r > \frac{1}{u}$ and $r \le \frac{1}{u}$. In the first case, consider the ranges $[p-r, p-r+z]$ and $[p-r+z, p-r+\frac{1}{u}]$. Over both ranges the functions are constant, and they may only change at $p-r+z$. Since the functions are identical before $p-r$ but change at $p-r$, and one changes more than the other, the difference between the two functions on $[p-r, p-r+z]$ is at least $k$; that is, $|c_D(c,r) - c_{D'}(c,r)| \ge k$ for $c \in [p-r, p-r+z]$. Now observe that $p-r+z$ is a multiple of $\frac{1}{u}$ and is strictly less than $p$. Thus, all points in $D$ and $D'$ at $p-r+z$ are identical, so both functions have an identical change at $p-r+z$ and their difference remains the same. Hence $|c_D(c,r) - c_{D'}(c,r)| \ge k$ for $c \in [p-r, p-r+\frac{1}{u}]$. Thus, in this case $\|c_D(\cdot,r) - c_{D'}(\cdot,r)\|_1 \ge k \times \frac{1}{u} \ge \epsilon$ for $u = 2\lceil \frac{k}{2\epsilon} \rceil - 2$.

In the second case, consider the range $[p-r, p]$. Both functions are identical right before $p-r$, while one changes by at least $k$ more than the other at $p-r$, and the functions are constant over $(p-r, p)$. Therefore, $|c_D(c,r) - c_{D'}(c,r)| \ge k$ for $c \in [p-r, p]$. Thus, in this case $\|c_D(\cdot,r) - c_{D'}(\cdot,r)\|_1 \ge k \times r$. Now we have
\[
\|c_D - c_{D'}\|_1 = \int_r \int_c |c_D(c,r) - c_{D'}(c,r)| = \int_{r < \frac{\epsilon}{k}} \|c_D(\cdot,r) - c_{D'}(\cdot,r)\|_1 + \int_{r \ge \frac{\epsilon}{k}} \|c_D(\cdot,r) - c_{D'}(\cdot,r)\|_1 > k \Big( \int_{r < \frac{\epsilon}{k}} r + \int_{r \ge \frac{\epsilon}{k}} \frac{\epsilon}{k} \Big) = k \Big( \frac{(\epsilon/k)^2}{2} + \frac{\epsilon}{k}\Big(1 - \frac{\epsilon}{k}\Big) \Big) = k \Big( \frac{\epsilon}{k} - \frac{(\epsilon/k)^2}{2} \Big).
\]
Note that, for $\epsilon < k$, $(\frac{\epsilon}{k})^2 \le \frac{\epsilon}{k}$, so that $\frac{\epsilon}{k} - \frac{(\epsilon/k)^2}{2} \ge \frac{\epsilon}{k} - \frac{\epsilon/k}{2} = \frac{\epsilon/k}{2}$. As such, we have $\|c_D - c_{D'}\|_1 \ge \frac{\epsilon}{2}$.

Proof of Lemma 25. We have
\[
\|s_D - s_{D'}\|_1 = \int_{q \in Q_s} \Big| \sum_{i \in [n]} I_{D_i, q} \big( D[i, d+1] - D'[i, d+1] \big) \Big| \le \int_{q \in Q_s} \sum_{i \in [n]} I_{D_i, q} \big| D[i, d+1] - D'[i, d+1] \big| \le \int_{q \in Q_s} \sum_{i \in [n]} I_{D_i, q}\, \frac{\epsilon}{n} \le \sum_{i \in [n]} \frac{\epsilon}{n} \int_{q \in Q_s} 1 = \epsilon,
\]
where $I_{D_i, q}$ indicates whether the $i$-th record of $D$ matches the predicate of $q$, and $Q_s$ is the query space.

Proof of Lemma 26. We have
\[
\|s_D - s_{D'}\|_1 = \int_r \int_c \Big| \sum_{i \in [n]} \big( I_{D[i,1] \in [c,c+r]}\, D[i,2] - I_{D'[i,1] \in [c,c+r]}\, D[i,2] \big) \Big| \le \int_r \int_c \sum_{i \in [n]} \big| I_{D[i,1] \in [c,c+r]} - I_{D'[i,1] \in [c,c+r]} \big|\, D[i,2] \le \sum_{i \in [n]} \int_r \int_c \big| I_{D[i,1] \in [c,c+r]} - I_{D'[i,1] \in [c,c+r]} \big| \le 2 \sum_{i \in [n]} \int_r \min\Big\{\frac{\epsilon}{n}, r\Big\} \le 2 \sum_{i \in [n]} \int_r \frac{\epsilon}{n} = 2\epsilon,
\]
where the second inequality uses $D[i,2] \le 1$.

Proof of Lemma 27. We first state the following lemma, whose proof is deferred to the end.

Lemma 31. Consider two 2-dimensional datasets $D$ and $D'$ over points $p_1, ..., p_k$ s.t. the points in $D$ are each repeated $z_1, ..., z_k$ times, for $k \in [n]$ and $0 \le z_i \le n$, and the points in $D'$ are each repeated $z'_1, ..., z'_k$ times, for $k \in [n]$ and $0 \le z'_i \le n$. Let $t = \sum_{i \in [k]} |z_i - z'_i|$. We have $\|s_D - s_{D'}\|_1 \le \frac{1}{2} t$.

In our setting, let $p_1, ..., p_k$ be the distinct elements of $D$, let $z_1, ..., z_k$ be the number of times each element is repeated in $D_1$, and $z'_1, ..., z'_k$ the number of times each element is repeated in $D_2$. By Lemma 31, we have $\|s_{D_1} - s_{D_2}\|_1 \le \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i|$. Now observe that $z_i = \sum_{j \in [n]} m_1[j]\, I_{D[j] = p_i}$ and similarly $z'_i = \sum_{j \in [n]} m_2[j]\, I_{D[j] = p_i}$. Thus,
\[
\|s_{D_1} - s_{D_2}\|_1 \le \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i| = \frac{1}{2} \sum_{i \in [k]} \Big| \sum_{j \in [n]} I_{D[j] = p_i} (m_1[j] - m_2[j]) \Big| \le \frac{1}{2} \sum_{i \in [k]} \sum_{j \in [n]} I_{D[j] = p_i} |m_1[j] - m_2[j]| = \frac{1}{2} \sum_{j \in [n]} |m_1[j] - m_2[j]|.
\]
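The mask bound of Lemma 27 (and, identically, Lemma 23) can be probed numerically. A Monte-Carlo sketch follows (illustrative only, approximating the query domain as $c \in [-1,1]$, $r \in [0,1]$, matching the integrals above; the helper range_sum is ours):

```python
import numpy as np

# Monte-Carlo probe (illustrative) of Lemma 27: two masked subsets of a 2-d
# dataset (predicate column D[:,0], measure column D[:,1] in [0,1]) have
# range-sum functions within (1/2) * sum|m1 - m2| of each other in 1-norm.
rng = np.random.default_rng(5)
n = 200
D = rng.uniform(0, 1, (n, 2))
m1 = rng.integers(0, 2, n).astype(bool)
m2 = rng.integers(0, 2, n).astype(bool)
D1, D2 = D[m1], D[m2]                           # the two masked datasets

def range_sum(data, c, r):
    inside = (data[:, 0] >= c) & (data[:, 0] <= c + r)
    return data[inside, 1].sum()

m = 50_000
cs, rs = rng.uniform(-1, 1, m), rng.uniform(0, 1, m)
vals = [abs(range_sum(D1, c, r) - range_sum(D2, c, r)) for c, r in zip(cs, rs)]
est = np.mean(vals) * 2.0                       # scale by the (c, r) domain measure
bound = 0.5 * np.sum(m1 != m2)
print(f"estimated ||s_D1 - s_D2||_1 ~ {est:.2f} <= {bound}")
```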
Proof of Lemma 28. For notational consistency with the proof of Theorem 22, we rename $D'$ as $D$ here. Consider the dataset $\hat{D}$ with $\hat{D}[i,j] = \bar{D}[i,j]$ for $j \in [d-1]$ and $\hat{D}[i,d] = D[i,d]$; that is, $\hat{D}$ is a dataset whose first $d-1$ dimensions are the same as $\bar{D}$ and whose $d$-th dimension is the same as $D$. We have
\[
\|s_D - s_{\bar{D}}\|_1 \le \|s_D - s_{\hat{D}}\|_1 + \|s_{\hat{D}} - s_{\bar{D}}\|_1.
\]
We study the two terms separately. Consider the second term, $\|s_{\hat{D}} - s_{\bar{D}}\|_1$. Observe that the first $d-1$ dimensions of $\bar{D}$ and $\hat{D}$ are identical, so applying any predicate on the first $d-1$ attributes of either dataset leads to the same filtered data points. Furthermore, the $d$-th dimensions of corresponding points differ by at most $\frac{\epsilon}{n}$, so by Lemma 26, $\|s_{\hat{D}} - s_{\bar{D}}\|_1 \le 2\epsilon$.

Next, consider the term $\|s_D - s_{\hat{D}}\|_1$. Define $I^{i}_{q,p}$ to be the indicator function equal to 1 if a record $p$ matches the first $i$ dimensions of $q = (c_1, ..., c_d, r_1, ..., r_d)$, i.e., if $c_j \le p_j \le c_j + r_j$ for all $j \in [i]$. Let $D^* = \{D[i, d{:}d{+}1]\ \forall i,\ I^{d-1}_{q,D[i]} = 1\}$ be the 2-dimensional dataset of the $d$-th and $(d+1)$-th attributes of the records of $D$ matching the first $d-1$ dimensions of $q$, and similarly define $\hat{D}^* = \{\hat{D}[i, d{:}d{+}1]\ \forall i,\ I^{d-1}_{q,\hat{D}[i]} = 1\}$ for $\hat{D}$. Note that the $d$-th and $(d+1)$-th dimensions of $\hat{D}^*$ and $D^*$ are identical: both contain a subset of the elements of the $d$-th and $(d+1)$-th dimensions of $D$, where the subset is selected based on the indicator $I^{d-1}$. Thus, we apply Lemma 27 to $\hat{D}^*$ and $D^*$ (the masks in the lemma are induced by the indicator functions, i.e., $m_1 = (I^{d-1}_{q,\hat{D}[1]}, ..., I^{d-1}_{q,\hat{D}[n]})$ and $m_2 = (I^{d-1}_{q,D[1]}, ..., I^{d-1}_{q,D[n]})$) to obtain
\[
\|s_{D^*} - s_{\hat{D}^*}\|_1 \le \frac{1}{2} \sum_{i \in [n]} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}|.
\]
Moreover, by definition, $\|s_D(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot) - s_{\hat{D}}(c_1,...,c_{d-1},\cdot,r_1,...,r_{d-1},\cdot)\|_1 = \|s_{D^*} - s_{\hat{D}^*}\|_1$. Combining the last two relations, we have
\[
\|s_D - s_{\hat{D}}\|_1 \le \int_{c_1} \!\!...\! \int_{r_{d-1}} \frac{1}{2} \sum_{i \in [n]} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}| \le \frac{1}{2} \sum_{i \in [n]} \int_{c_1} \!\!...\! \int_{r_{d-1}} |I^{d-1}_{q,D[i]} - I^{d-1}_{q,\hat{D}[i]}| \le (d-1)\epsilon,
\]
where the last inequality was shown in the proof of Theorem 22. Putting everything together, we have $\|s_D - s_{\bar{D}}\|_1 \le 2\epsilon + (d-1)\epsilon = (d+1)\epsilon$.

Proof of Lemma 29. We have $\|r_D - r_{D'}\|_\mu = \int_0^1 |r_D(q) - r_{D'}(q)|\, d\mu$. Let $I_{i,q}$ be an indicator variable denoting whether $q \in [D[i], D'[i])$ (or $[D'[i], D[i])$ if $D'[i] < D[i]$). We have $|r_D(q) - r_{D'}(q)| = \sum_{i \in [n]} I_{i,q}$. Thus,
\[
\|r_D - r_{D'}\|_\mu = \int_0^1 \sum_{i \in [n]} I_{i,q}\, d\mu = \sum_{i \in [n]} \int_0^1 I_{i,q}\, d\mu = \sum_{i \in [n]} \int_{D[i]}^{D'[i]} d\mu = \sum_{i \in [n]} \mu([D[i], D'[i]]).
\]

Proof of Lemma 30. Let $D^* = D \cup D'$; that is, $D^*$ is a dataset over points $p_1, ..., p_k$, each repeated $z^*_1, ..., z^*_k$ times with $z^*_i = \max\{z_i, z'_i\}$. We have $\|c_D - c_{D'}\|_1 \le \|c_D - c_{D^*}\|_1 + \|c_{D'} - c_{D^*}\|_1$. By Lemma 32 (stated and proved below), we have
\[
\|c_D - c_{D^*}\|_1 \le \frac{1}{2} \sum_{i \in [k]} (z^*_i - z_i) = \frac{1}{2} \sum_{i \in [k]} (\max\{z_i, z'_i\} - z_i),
\]
and similarly $\|c_{D'} - c_{D^*}\|_1 \le \frac{1}{2} \sum_{i \in [k]} (\max\{z_i, z'_i\} - z'_i)$, so that
\[
\|c_D - c_{D'}\|_1 \le \frac{1}{2} \sum_{i \in [k]} \big( \max\{z_i, z'_i\} - z_i + \max\{z_i, z'_i\} - z'_i \big) = \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i|.
\]
Proof of Lemma 31. Let $D^* = D \cup D'$; that is, $D^*$ is a dataset over points $p_1, ..., p_k$, each repeated $z^*_1, ..., z^*_k$ times with $z^*_i = \max\{z_i, z'_i\}$. We have $\|s_D - s_{D'}\|_1 \le \|s_D - s_{D^*}\|_1 + \|s_{D'} - s_{D^*}\|_1$. By Lemma 33 (stated and proved below), we have
\[
\|s_D - s_{D^*}\|_1 \le \frac{1}{2} \sum_{i \in [k]} (z^*_i - z_i) = \frac{1}{2} \sum_{i \in [k]} (\max\{z_i, z'_i\} - z_i),
\]
and similarly $\|s_{D'} - s_{D^*}\|_1 \le \frac{1}{2} \sum_{i \in [k]} (\max\{z_i, z'_i\} - z'_i)$, so that
\[
\|s_D - s_{D'}\|_1 \le \frac{1}{2} \sum_{i \in [k]} \big( \max\{z_i, z'_i\} - z_i + \max\{z_i, z'_i\} - z'_i \big) = \frac{1}{2} \sum_{i \in [k]} |z_i - z'_i|.
\]

Lemma 32. Consider a 1-dimensional database $D$ of size $n$, such that the points $p_1, ..., p_k$ in $D$ are each repeated $z_1, ..., z_k$ times, with $k, z_i \in [n]$ and $\sum_{i \in [k]} z_i = n$. Consider another dataset $D'$ containing a subset of the points $p_1, ..., p_k$, with $0 \le z'_i \le z_i$ and $\sum_{i \in [k]} z'_i = n'$, so that $n' \le n$. Let $t = \sum_{i \in [k]} |z_i - z'_i|$. We have $\|c_D - c_{D'}\|_1 = t \int_0^1 r\, dr = \frac{1}{2} t$.

Proof of Lemma 32. Fix a value of $r$ and let $X = \{i \in [k],\ z_i \neq z'_i\}$. Observe that $c_D$ and $c_{D'}$ differ exactly on queries for which there exists an $i \in X$ with $c \in [p_i - r, p_i]$. Thus,
\[
\|c_D - c_{D'}\|_1 = \int_{r=0}^{1} \int_{c=-1}^{1} \sum_{i \in [k]} I_{p_i \in [c, c+r]}\, (z_i - z'_i) = \int_{r=0}^{1} \sum_{i \in [k]} (z_i - z'_i) \int_{c=-1}^{1} I_{p_i \in [c, c+r]} = \int_{r=0}^{1} \sum_{i \in [k]} (z_i - z'_i) \int_{c=p_i - r}^{p_i} 1 = \int_{r=0}^{1} r\, t = \frac{t}{2}.
\]

Lemma 33. Consider a 2-dimensional database $D$ of size $n$, such that the points $p_1, ..., p_k$ in $D$ are each repeated $z_1, ..., z_k$ times, with $k, z_i \in [n]$ and $\sum_{i \in [k]} z_i = n$. Consider another dataset $D'$ containing a subset of the points $p_1, ..., p_k$, with $0 \le z'_i \le z_i$ and $\sum_{i \in [k]} z'_i = n'$, so that $n' \le n$. Let $t = \sum_{i \in [k]} |z_i - z'_i|$. We have $\|s_D - s_{D'}\|_1 \le t \int_0^1 r\, dr = \frac{1}{2} t$.

Proof of Lemma 33. Fix a value of $r$ and let $X = \{i \in [k],\ z_i \neq z'_i\}$. Observe that $s_D$ and $s_{D'}$ differ only on queries for which there exists an $i \in X$ with $c \in [p_i[1] - r, p_i[1]]$. Thus,
\[
\|s_D - s_{D'}\|_1 = \int_{r=0}^{1} \int_{c=-1}^{1} \sum_{i \in [k]} I_{p_i[1] \in [c, c+r]}\, (z_i - z'_i)\, p_i[2] \le \int_{r=0}^{1} \sum_{i \in [k]} (z_i - z'_i) \int_{c=-1}^{1} I_{p_i[1] \in [c, c+r]} = \int_{r=0}^{1} \sum_{i \in [k]} (z_i - z'_i) \int_{c=p_i[1]-r}^{p_i[1]} 1 = \int_{r=0}^{1} r\, t = \frac{t}{2}.
\]
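Lemma 32 gives an exact closed-form value, which makes it a convenient end-to-end check of this chain of lemmas. A Monte-Carlo sketch follows (illustrative only, with queries sampled over $c \in [-1,1]$, $r \in [0,1]$ as in the proof):

```python
import numpy as np

# Monte-Carlo check (illustrative) of Lemma 32: dropping t records in total
# from a 1-d dataset with repeated points changes the cardinality function by
# exactly t/2 in 1-norm over queries with c in [-1, 1] and r in [0, 1].
rng = np.random.default_rng(6)
p = rng.uniform(0, 1, 10)                 # distinct points p_1..p_k
z = rng.integers(1, 6, 10)                # multiplicities z_i
drop = rng.integers(0, z + 1)             # z_i - z'_i, records removed per point
t = drop.sum()

D = np.repeat(p, z)
D_prime = np.repeat(p, z - drop)          # D' keeps z'_i = z_i - drop_i copies

def card(data, c, r):
    return np.count_nonzero((data >= c) & (data <= c + r))

m = 100_000
cs, rs = rng.uniform(-1, 1, m), rng.uniform(0, 1, m)
vals = [abs(card(D, c, r) - card(D_prime, c, r)) for c, r in zip(cs, rs)]
print(f"MC estimate: {np.mean(vals) * 2:.3f}  vs  exact t/2 = {t / 2}")
```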
Part IV

Conclusion

Chapter 11

Conclusions

In this thesis, we presented a thorough treatment of the function approximation view of database operations. We developed the NeuroDB framework, which extends this function approximation view to build well-optimized learned database systems in practice and under real-world constraints. Moreover, we presented the first theoretical results on the subject, building the theoretical foundation for analyzing learned database operations that follow this function approximation view.

Specifically, we presented the NeuroDB framework, which considers the entire query answering pipeline in database systems as a function that can be approximated. The framework trains neural networks that take a query as input and output the query answer. We showed significant practical benefits of this framework in three different application scenarios: the NeuroDB framework can be used to build efficient systems for approximate query processing, to answer queries accurately while preserving user privacy, and to answer queries accurately when the data contains missing records.

Moreover, we presented theoretical results showing why and when learned models perform well when answering database queries, establishing theoretical guarantees on the performance of the learned models across various operations. We showed that under mild assumptions on the data distribution, learned models are able to answer indexing queries on a static array in O(log log n) expected query time, asymptotically faster than non-learned methods (e.g., B-trees and binary search) that perform the operation in O(log n). We then extended the analysis to more operations, including cardinality estimation and sorting, and to dynamic datasets (i.e., when new points can be inserted) with distribution shift, presenting the distribution learnability framework for the analysis of learned database operations. Furthermore, we discussed the modeling choices required for learned methods to perform well in practice, characterizing the model size needed to perform the operations to a desired accuracy.

Overall, this thesis takes a significant step in the evolution of learned database systems, both in practice and in theory. The NeuroDB framework and our developed theoretical tools build a foundation for the future of learned database systems, and we hope they can be applied more broadly to new applications and database operations, for a better theoretical understanding and improved practical performance of learned database systems.
Abstract
Machine learning models have recently been used to replace various database components (e.g., indexes, cardinality estimators) and have shown substantial performance gains over existing non-learned alternatives. Such approaches take a function approximation view of database operations: they treat a database operation as a function that can be approximated (e.g., an index is a function that maps items to their locations in a sorted array) and learn a model to approximate the operation's output. However, the theoretical characteristics of such approaches are not well understood, and this lack of performance guarantees greatly limits their practical applicability. Moreover, from a practical perspective, existing approaches optimize only specific components within a database system, leaving the end-to-end accuracy and efficiency of the system unoptimized for a given workload.

This thesis addresses both shortcomings. It provides the first theoretical guarantees for various learned database operations and presents novel practical solutions that improve the performance of learned database systems.

From a practical perspective, we develop the Neural Database (NeuroDB) framework, which extends the function approximation view by treating the entire database system as a function to be approximated. In this framework, we train neural networks that take queries as input and output query answer estimates. Using this framework, we show substantial performance benefits for several important database problems, including approximate query processing, privacy-preserving query answering, and query answering on incomplete datasets.

From a theoretical perspective, we present a pioneering theoretical study of the function approximation view of database operations, providing the first theoretical analysis of various learned database operations, including indexing, cardinality estimation, sorting, and range-sum estimation. Our analysis yields guarantees on the performance of the learned models, showing why and when they perform well. Furthermore, we study model size requirements, showing how large a model must be to achieve a desired accuracy level. These results build a foundation for the theoretical analysis of learned database operations, deepening our understanding of them and providing the guarantees needed for robust practical deployment.
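To make the function approximation view concrete, the following is a minimal, illustrative sketch (Python with NumPy) of a learned index in the spirit described above. It is not the thesis's implementation: the cubic-polynomial regressor, the error-window search, and all names are assumptions chosen for brevity. The model approximates the function mapping a key to its position in a sorted array, and a bounded search around the model's guess corrects its error.

import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0.0, 1000.0, size=10_000))  # a sorted key column
ranks = np.arange(len(keys))                           # exact index function: key -> position

# Approximate the key -> rank function with a cubic polynomial; any
# regressor (e.g., a small neural network) could play the same role.
coeffs = np.polyfit(keys, ranks, deg=3)

def predict_rank(q):
    # Clip the model's guess to a valid array position.
    return int(np.clip(np.polyval(coeffs, q), 0, len(keys) - 1))

# Worst-case model error over the data, measured after fitting; this
# gives a search window within which lookups are exact.
err = int(np.max(np.abs(np.polyval(coeffs, keys) - ranks))) + 1

def lookup(q):
    # Start at the model's guess, then binary-search only within the
    # guaranteed error window around it.
    guess = predict_rank(q)
    lo, hi = max(0, guess - err), min(len(keys), guess + err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], q))

q = keys[4217]
assert keys[lookup(q)] == q  # exact answer despite the approximate model

The pattern, a learned approximation plus a cheap correction bounded by the model's measured error, is what lets learned operations answer quickly; the analyses in the thesis supply the formal guarantees for such constructions.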
Conceptually similar
No-regret learning and last-iterate convergence in games
Robust and adaptive online reinforcement learning
Robust and adaptive algorithm design in online learning: regularization, exploration, and aggregation
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
Inferring mobility behaviors from trajectory datasets
Efficient learning: exploring computational and data-driven techniques for efficient training of deep learning models
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
From matching to querying: A unified framework for ontology integration
Leveraging training information for efficient and robust deep learning
Modeling information operations and diffusion on social media networks
Efficient updates for continuous queries over moving objects
Reinforcement learning with generative model for non-parametric MDPs
Efficient reachability query evaluation in large spatiotemporal contact networks
Interactive learning: a general framework and various applications
Characterizing and improving robot learning: a control-theoretic perspective
Physics-aware graph networks for spatiotemporal physical systems
Robust and proactive error detection and correction in tables
Theory of memory-enhanced neural systems and image-assisted neural machine translation
Scalable data integration under constraints
Practice-inspired trust models and mechanisms for differential privacy
Asset Metadata
Creator
Zeighami, Sepanta (author)
Core Title
A function approximation view of database operations for efficient, accurate, privacy-preserving & robust query answering with theoretical guarantees
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2024-05
Publication Date
03/01/2024
Defense Date
02/27/2024
Publisher
Los Angeles, California (original); University of Southern California (original); University of Southern California. Libraries (digital)
Tags
data management, data storage with neural networks, machine learning, theory
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Shahabi, Cyrus (committee chair), Chugg, Keith (committee member), Luo, Haipeng (committee member), Sharan, Vatsal (committee member)
Creator Email
sepzeighami@gmail.com, zeighami@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113841643
Unique identifier
UC113841643
Identifier
etd-ZeighamiSe-12674.pdf (filename)
Legacy Identifier
etd-ZeighamiSe-12674
Document Type
Thesis
Rights
Zeighami, Sepanta
Internet Media Type
application/pdf
Type
texts
Source
20240304-usctheses-batch-1127 (batch); University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu