USC / Digital Library / University of Southern California Dissertations and Theses / Reliability and Robustness in Modern Statistical Learning (USC Thesis)
Reliability and Robustness in Modern Statistical Learning

by

Mohammad Mehrabi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

August 2023

Copyright 2023 Mohammad Mehrabi

Dedication

To Monir and Alireza, my parents, and Mehran, my brother.

Acknowledgements

I have always found immense joy in mathematics and scientific curiosity. I consider myself incredibly fortunate to have been surrounded by remarkable individuals who guided me along this path, teaching me how to learn and inspiring me to pursue it.

I would like to express my sincere gratitude to my advisor, Adel Javanmard, whose guidance and advice from our very first meetings have made me realize the extent of my good fortune. Each encounter with Adel has been enriching, and I greatly appreciate the opportunity to meet and engage in discussions with him about new problems and ideas. Learning from his insightful perspectives on academic research, as well as his deep knowledge and exceptional approach to problem-solving, has been an invaluable privilege.

I would like to extend my heartfelt gratitude to my esteemed dissertation committee members, Jacob Bien and Meisam Razaviyayn. Their invaluable comments and feedback greatly enhanced the quality of this work.

My time at USC Marshall provided me with the extraordinary opportunity to engage with exceptional scholars, from whom I gleaned a wealth of knowledge. I am immensely thankful to Vishal Gupta, Gareth James, Wen Sun, Yingying Fan, Jinchi Lv, Gourab Mukherjee, Hamid Nazerzadeh, and Matteo Sesia for their indispensable advice and enlightening discussions, which played a significant role in my growth as a scholar.

My journey into academic research commenced during my undergraduate summer internship at CUHK, under the guidance of Pascal Vontobel. I am profoundly grateful for the invaluable lessons I learned from him, which instilled in me the confidence to pursue an academic career. Similarly, I express my deep appreciation to Aslan Tchamkerten, who supervised my internship a year later at Telecom ParisTech. From the very beginning, he has been a constant source of encouragement, motivation, and mentorship, nurturing my analytical thinking skills for research problems.

I owe a profound debt of gratitude to the remarkable scholars who shaped my mathematical foundation and knowledge during my undergraduate years at Sharif University of Technology in Tehran, Iran. In particular, I am immensely thankful to Bijan Zohuri-Zangeneh and Amin Gohari for the invaluable insights they shared in their courses on information theory and probability.

My journey in the realm of mathematics was ignited by captivating brain-teaser mathematical questions during my high school years. I am indebted to my brother, Mehran, for introducing me to these challenges and fostering my passion for mathematics. I would like to express my great appreciation to him for his constant motivation and support during those formative years.

I cannot overstate the indispensable role that my friends and family have played in guiding me through the challenging path of my Ph.D. journey. In particular, I am deeply grateful to my parents, who have consistently demonstrated an exemplary work ethic and served as a perpetual source of inspiration throughout my academic endeavors. It is with utmost gratitude and admiration that I dedicate this dissertation to my family, whose unwavering support has served as the bedrock of my achievements.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Notational Conventions

Chapter 1: Introduction
  1.1 Reliable Inference
    1.1.1 Online Debiasing
      1.1.1.1 Why offline debiasing fails
      1.1.1.2 Online debiasing estimator
      1.1.1.3 Contributions and organization
    1.1.2 Pearson Chi-squared Conditional Randomization Test
      1.1.2.1 Background on Conditional Randomization Tests
      1.1.2.2 Related work on model-X CI tests
      1.1.2.3 Summary of contributions and organization
    1.1.3 Goodness-of-fit testing for classification learning
      1.1.3.1 Related work
      1.1.3.2 Summary of contributions and organization
  1.2 Adversarial Robustness
    1.2.1 Tradeoffs in the distributionally adversarial setup
      1.2.1.1 Contributions
      1.2.1.2 Further related work
    1.2.2 Adversarial robustness for latent models
      1.2.2.1 Summary of contributions
      1.2.2.2 Related work

Part I: Reliable Inference

Chapter 2: Online Debiasing
  2.1 Online debiasing for high-dimensional time series
    2.1.1 Constructing the online debiased estimator
    2.1.2 Distributional characterization of online debiasing
  2.2 Batched data collection
    2.2.1 Online debiasing: a distributional characterization
      2.2.1.1 Revisiting the numerical example from Section 1.1.1.1
  2.3 Statistical inference

Chapter 3: Model-X Conditional Independence Testing
  3.1 Pearson Chi-squared randomization (PCR) test
    3.1.1 PCR test statistic
    3.1.2 Decision rule
    3.1.3 Size of the PCR test
  3.2 A power analysis of the PCR test
  3.3 Parameter-free PCR test
  3.4 Robustness of the PCR test
  3.5 Numerical Experiments
    3.5.1 Real data experiment: Capital Bikeshare dataset
  3.6 Comparison with other conditional randomization tests
    3.6.1 Numerical examples

Chapter 4: Goodness-of-fit Testing for Classification Learning
  4.1 Problem formulation
  4.2 Distribution-free setting
    4.2.1 Test statistic
    4.2.2 Decision rule
    4.2.3 Size of the test
    4.2.4 P-values and one-sided confidence intervals
    4.2.5 Choice of the score function
    4.2.6 Testing perfect fit of the model
  4.3 Model-X setting
    4.3.1 Test statistic and decision rule
    4.3.2 Size of the model-X GRASP test
    4.3.3 Choice of the score function
      4.3.3.1 Generative Adversarial Nets (GANs)
  4.4 Numerical Experiments
  4.5 Solving the optimization problem for GRASP decision rules

Part II: Adversarial Robustness

Chapter 5: Fundamental Tradeoffs in Distributionally Adversarial Training
    5.0.1 Adversarial setting
    5.0.2 Background on Wasserstein robust loss
      5.0.2.1 Regularization effect of the Wasserstein adversarial loss
  5.1 Main results
    5.1.1 Linear regression
    5.1.2 Binary classification
    5.1.3 Learning nonlinear functions

Chapter 6: Adversarial Robustness for Latent Models
    6.0.1 Data generative model
    6.0.2 Background on optimal classifiers
  6.1 Main results
    6.1.1 Gaussian mixture model
    6.1.2 Binary classification under generalized linear models
  6.2 Discussion
    6.2.1 Is it necessary to learn the latent structure to obtain a vanishing boundary risk? A simple case
    6.2.2 Effect of perturbation in the lifting matrix W
  6.3 Boundary risk of Bayes-optimal image classifiers
    6.3.1 Learning image data with Gaussian Mixture Models (GMM)
    6.3.2 PGD and FGM adversarial attacks
    6.3.3 Main experiments and key findings

Chapter 7: Conclusions

Bibliography

Appendix A: Supplement to Chapter 2
  A.1 Sparse inverse covariance
  A.2 Estimating noise variance for the VAR model
  A.3 Robustness to the episode growth rate
  A.4 Implementation and extensions
    A.4.1 Iterative schemes to implement online debiasing
      A.4.1.1 Coordinate descent algorithms
      A.4.1.2 Gradient descent algorithms
  A.5 Numerical experiments
    A.5.1 Real data experiments: a marketing application
  A.6 Proofs of Section 3
    A.6.1 Technical preliminaries
    A.6.2 Remarks on proof of Theorem 2.1.2
    A.6.3 Proof of Lemma 2.1.3
    A.6.4 Proof of Theorem 2.1.4
    A.6.5 Proof of Lemma 2.1.6
    A.6.6 Proof of Proposition 2.1.8
  A.7 Proofs of Section A.4
    A.7.1 Proof of Lemma A.4.1
    A.7.2 Proof of Lemma A.1.1
    A.7.3 Proof of Theorem A.1.2
      A.7.3.1 Proof of Lemma A.7.1
  A.8 Proofs of Section 2.2
    A.8.1 Consistency results for LASSO under adaptively collected samples
      A.8.1.1 Proof of Theorem 2.2.2
    A.8.2 Bias control: Proof of Theorem 2.2.7
    A.8.3 Central limit asymptotics: proofs of Proposition 2.2.9 and Theorem 2.2.10
    A.8.4 Proofs for Gaussian designs
  A.9 Technical preliminaries

Appendix B: Supplement to Chapter 3
  B.1 Proofs of theorems and technical lemmas
    B.1.1 Technical preliminaries
    B.1.2 Proof of Theorem 3.6.1
    B.1.3 Proof of Proposition B.1.3
      B.1.3.1 Proof of Lemma B.1.4
      B.1.3.2 Proof of Lemma B.1.5
      B.1.3.3 Proof of Lemma B.1.6
      B.1.3.4 Proof of Lemma B.1.7
    B.1.4 Proof of Theorem 3.1.2
    B.1.5 Proof of Remark 3.2.5
    B.1.6 Proof of Proposition 3.2.6
    B.1.7 Proof of Theorem 3.2.7
    B.1.8 Proof of Theorem 3.2.8
    B.1.9 Proof of Theorem 3.6.5
    B.1.10 Proof of Theorem 3.4.2
    B.1.11 Proof of Theorem 3.4.1

Appendix C: Supplement to Chapter 4
  C.1 Proofs of theorems and lemmas
    C.1.1 Proof of Lemma 4.1.2
    C.1.2 Proof of Proposition 4.2.1
    C.1.3 Proof of Proposition 4.2.2
    C.1.4 Proof of Proposition 4.2.4
    C.1.5 Proof of Theorem 4.2.6
    C.1.6 Proof of Proposition 4.3.1
    C.1.7 Proof of Proposition 4.2.7
    C.1.8 Proof of Proposition 4.2.8
    C.1.9 Proof of Proposition 4.2.9
    C.1.10 Proof of Proposition 4.3.3
    C.1.11 Proof of Theorem 4.3.4
    C.1.12 Proof of Lemma 4.3.6
    C.1.13 Proof of Proposition 4.3.7
  C.2 Proof of Proposition 4.5.1

Appendix D: Supplement to Chapter 5
  D.1 Proofs of theorems and technical lemmas
    D.1.1 Proof of Proposition 5.1.2
    D.1.2 Proof of Theorem 5.1.3
    D.1.3 Proof of Corollary 5.1.4
    D.1.4 Proof of Proposition 5.1.5
    D.1.5 Proof of Remark 5.1.7
    D.1.6 Proof of Corollary 5.1.8
      D.1.6.1 Proof of Lemma D.1.2

Appendix E: Supplement to Chapter 6
  E.1 Proofs of theorems and technical lemmas
    E.1.1 Proof of Corollary 6.0.3
    E.1.2 Proof of Theorem 6.1.1
    E.1.3 Proof of Proposition 6.1.3
    E.1.4 Proof of Proposition 6.1.2
    E.1.5 Proof of Proposition 6.1.5
    E.1.6 Proof of Theorem 6.1.4
    E.1.7 Proof of Theorem 6.2.1

List of Tables

3.1 Size (a = 0) and power (a = 4) of the robust PCR test for the setting of (3.18), with the approximate distribution (3.19) available for sampling the counterfeits. We consider a = 0, 4 and n = 5000 data points. The PCR test is run with L = 4 labels. For each value of the discrepancy level η, we use (3.20) to get an upper bound δ on the expected total variation distance E_Z[d_TV(P_{X|Z}(·|Z), P̂_{X|Z}(·|Z))], and use it in constructing the robust statistic U_{n,L}(δ). Note that η = 0 implies δ = 0, in which case the robust statistic U_{n,L}(δ) matches the statistic U_{n,L} in Algorithm 1. For a = 0 (true null hypothesis), the size is controlled at the significance level α = 0.1. Reported numbers are obtained by averaging over 1000 trials.
3.2 P-values computed from the PCR test on the Capital Bikeshare dataset. The null hypothesis (1.10) is considered with X the duration of the ride, and the confounder variable Z encoding the start and end locations, as well as the time of day at the start of the ride. We consider three different response variables Y: (1) user type, (2) date of the month, (3) weekday. The p-values are obtained as per (3.5) with number of labels L = 10 and counterfeit ratio K = 200.

4.1 Size of the distribution-free GRASP test, cf. Algorithm 4. In this experiment, η̂(x) = η(x) is considered, with η(x) a logistic model (4.18). We consider both the asymptotic (4.5) and finite-sample (4.6) decision rules. The first and second entries of each pair in the table report the rejection rates of these decision rules, respectively. The numbers are averaged over 200 independent experiments for three significance levels α = 0.05, 0.1, 0.15 and for different sample sizes (n) and numbers of labels (L).

4.2 Size of the model-X GRASP test, cf. Algorithm 5. In this experiment, η̂(x) = η(x) is considered, with η(x) a logistic model (4.18). Algorithm 5 is run with the score function T(x,w) given in (4.19). We consider both the asymptotic (4.5) and finite-sample (4.6) decision rules; the first and second entries of each pair in the table report their respective rejection rates. The numbers are averaged over 200 independent experiments for three significance levels α = 0.05, 0.1, 0.15 and for different sample sizes (n), numbers of labels (L), and numbers of randomizations per label (K).

4.3 Statistical power of distribution-free GRASP, as outlined in Algorithm 4, with the true model η(x) given by (4.18) and the test model η̂(x) in (4.20). In this experiment, we consider three f-divergences (KL, TV, and Hellinger, as per Lemma 4.1.2) for different choices of sample size (n) and number of labels (L). Each cell has two numbers, where the first and second respectively refer to the rejection rates under the asymptotic decision rule (4.5) and the finite-sample decision rule (4.6), at significance level α = 0.1. The reported numbers are averaged over 50 independent experiments.

4.4 Statistical power of model-X GRASP, as outlined in Algorithm 5, with the true model η(x) given by (4.18) and the test model η̂(x) in (4.20). We consider three f-divergences (KL, TV, and Hellinger, as per Lemma 4.1.2) for different choices of sample size (n), number of labels (L), and number of randomizations per label (K). The model-X algorithm is run with the score function T(x,w) given in (4.19). Each cell contains an ordered pair, where the first and second numbers respectively refer to the rejection rates under the asymptotic decision rule (4.5) and the finite-sample decision rule (4.6), at significance level α = 0.1. The reported numbers are averaged over 50 independent experiments.

A.1 Overall performance of online debiasing (test (2.34)) and the U-test and R-test proposed by [208] for testing the entries of the generative matrix A of a VAR(1) model. We consider A = diag(A_0, ..., A_0) with A_0 given by (A.6). For each configuration, we report the true positive rate (TPR), false positive rate (FPR), and running time of each test. The reported numbers are averaged over 10 independent realizations of each configuration. Running times are in seconds.

A.2 Coefficient of variation (CV) of the p-values and the confidence interval lengths for the six selected entries of the A matrix.

A.3 Evaluation of the online debiasing approach for statistical inference on the coefficients of a VAR(d) model under different configurations. Here the noise terms ζ_i are Gaussian with covariance matrix Σ_ζ(i,j) = 0.1^{|i−j|}. The results are reported in terms of four metrics at significance level α = 0.05: false positive rate (FPR), true positive rate (TPR), coverage rate, and average length of confidence intervals (Avg CI length).

A.4 Evaluation of the online debiasing approach for statistical inference on the coefficients of a VAR(d) model under different configurations. Here the noise terms ζ_i are Gaussian with covariance matrix Σ_ζ(i,j) = 0.1^{I(i≠j)}. The results are reported in terms of four metrics at significance level α = 0.05: false positive rate (FPR), true positive rate (TPR), coverage rate, and average length of confidence intervals (Avg CI length).

List of Figures

1.1 Histograms of the offline debiased estimate θ̂^{off} restricted to the support of θ_0. The dashed line indicates the true coefficient size. Recall that the second batch is chosen based on an intermediate estimator θ̂_1 computed on the first batch. (Left) θ̂_1 is the debiased LASSO on the first batch; (right) θ̂_1 is the ridge estimate on the first batch. As we observe, offline debiasing (even with access to the precision matrix Ω of the random designs) has a significant bias and does not admit a Gaussian distribution.

1.2 Empirical behavior of the noise term associated with the offline debiased estimate of a fixed coordinate of a VAR(d) model. In this example, d = 5, p = 15, T = 60, Σ_ζ(i,j) = 0.5^{|i−j|}, and the A^{(ℓ)} matrices are diagonal with value b = 0.15 on their diagonals. Plots 1.2a, 1.2b, and 1.2c show the QQ plot, PP plot, and histogram of the offline debiased noise terms (red) over 1000 independent experiments, respectively; the black curve/lines denote the ideal standard normal distribution.
As we see offline debiasing method fails for statistical inference purposes when the samples are correlated. . . . . . . . . . . . . . . . . . . . . . 10 1.3 Schematic representations of the distribution of the normalized rankp under alternative hypothesis. The horizontal purple line indicates the uniform distribution. In (a), (b), the non-null p-value has higher tail compared to the null uniform distribution and the CRT (one-sided for (a) and two-sided for (b)) would have descent power. In (c), the non-null p-value is more likely to fall in the middle range which causes the CRT to have low power 18 2.1 Schematic for constructing the debiasing matricesM (ℓ) . We divide time intoK episodes E 0 ,...,E K− 1 ; in episodeℓ,M i is held constant atM (ℓ) , which is a function ofx t in all prior episodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.2 Heat maps of magnitudes of elements of covariance matrixΣ ≡ E(x i x T i ) (left plot), and precision matrixΩ=Σ − 1 (right plot). In this example. x i ’s are generated from aVAR(d) model with covariance matrix of noise Σ ζ (i,j) = ρ |i− j| with values d = 5, p = 15, T =60,ρ =0.5, and diagonalA (i) matrices withb=0.15 on diagonals. . . . . . . . . . . 48 xii 2.3 Plots 2.3a, 2.3b, and 2.3c show the QQ plots, PP plots, and the histogram of online debiased noise terms (blue) and offline debiased noise terms (red) over 1000 independent experiments, respectively and black curve/lines denote the ideal standard normal distribution. The solid and dash vertical lines in plot (c) indicate the location of the mean of offline and online debiased noise terms, respectively. . . . . . . . . . . . . . . . . . . . . 49 2.4 (a,c) Histograms of the online debiased estimate b θ on and the ridge debiased estimator [44], restricted to the support ofθ 0 . (b,d) Histograms of the offline debiased estimate onlyusing the first batch , b θ off ,1 given by (2.30) and the online debiased estimate b θ on . 
In (a,b), we use b θ 1 the debiased Lasso on the first batch and in (c,d) we use b θ 1 the ridge estimate on the first batch. The dashed line indicates the true coefficient size. Offline debiasing b θ off ,1 using only the first batch works well (green histograms called offlineFirstBatch), but then loses power in comparison. Online debiasing is cognizant of the adaptivity and debiases without losing power even in the presence of adaptivity. . . . . . . . . . . . . . . . . . . . 62 3.1 Size of PCR test applied on a dataset consisting of n = 100 samples generated from model (3.16), where X ⊥ ⊥ Y|Z. Three significance levels α = 0.05,0.1, and 0.15 are considered. Statistic U n,L is obtained from Algorithm 1 by using the score function T(x,z,y) = (y− x− z T 1) 2 , and the decision rule (3.1) is employed with the threshold θ finite L,α . Reported numbers are averaged out over10,000 independent realizations. . . . . . 85 3.2 Size of PCR test applied on a dataset consisting of n = 100 samples generated from model (3.16), where X ⊥ ⊥ Y|Z. Three significance levels α = 0.05,0.1, and 0.15 are considered. Statistic U n,L is obtained from Algorithm 1 by using the score function T(x,z,y) = (y− x− z T 1) 2 , and the decision rule (3.1) is employed with the threshold θ asym L,α . Reported numbers are averaged out over10,000 independent realizations. . . . . . 85 3.3 Power of PCR test for (left)n = 1000 and (right)n = 2000 data points. Data points are generated under the setup (3.16) and the conditional law (3.17). We consider the score function (y− x− z T 1) 2 and choose the significance level α = 0.1. We consider the decision rule (3.1) with both of the rejection thresholdsθ asym L,α andθ finite L,α . Each reported power is obtained by averaging over1000 trials. . . . . . . . . . . . . . . . . . . . . . . . . 
3.4 Comparison between the statistical power of PCR and a group of model-X CI tests for the data generating law (3.25), for low-dimensional (left) and high-dimensional (right) settings. For the low-dimensional setting, we consider $n=8000$, $p=50$ with ordinary least squares as the score function. For the high-dimensional setting, we consider $n=5000$ and $p=6000$ with the cross-validated lasso as the score function.
3.5 Comparison between the statistical power of PCR and a group of model-X CI tests for the data generating law (3.26). In this experiment, we consider $n=2000$, $p=50$ with ordinary least squares as the score function. The results are averaged over 100 experiments at significance level $\alpha=0.1$.
4.1 Statistical power of distribution-free GRASP for the trained score function $T(x,w) = |w - x^T\hat\theta|$. The linear model $\hat\theta$ is trained on an auxiliary dataset of size 4000. The true model $\eta(x)$ is given in (4.18) and the test model $\hat\eta(x)$ is given in (4.20). In this case, the nominal values are $\tau_0^{\mathrm{KL}} = 2.7819$, $\tau_0^{\mathrm{TV}} = 0.7330$, and $\tau_0^{\mathrm{H}} = 0.9576$ (see (4.21)). For three $f$-divergences (KL, TV, Hellinger) we run distribution-free GRASP with the two decision rules $\Phi^{\mathrm{asym}}_{n,L,\alpha,\tau}$ and $\Phi^{\mathrm{finite}}_{n,L,\alpha,\tau}$ given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with different $\tau$ values. Here, the sample size is $n=5000$ with number of labels $L=50$, and the significance level is $\alpha=0.1$. Reported rates are averaged over 50 experiments.
4.2 Statistical power of distribution-free GRASP versus the number of labels $L$. The true model $\eta(x)$ is given in (4.18) and the test model $\hat\eta(x)$ is given in (4.20). In this case, the nominal values are $\tau_0^{\mathrm{KL}} = 2.7819$, $\tau_0^{\mathrm{TV}} = 0.7330$, and $\tau_0^{\mathrm{H}} = 0.9576$ (see (4.21)).
For three $f$-divergences (KL, TV, Hellinger) we run distribution-free GRASP with the two decision rules $\Phi^{\mathrm{asym}}_{n,L,\alpha,\tau}$ and $\Phi^{\mathrm{finite}}_{n,L,\alpha,\tau}$ given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with the following $\tau$ values: $\tau^{\mathrm{KL}}_{\mathrm{asym}} = 0.84$, $\tau^{\mathrm{KL}}_{\mathrm{finite}} = 0.73$, $\tau^{\mathrm{TV}}_{\mathrm{asym}} = 0.46$, $\tau^{\mathrm{TV}}_{\mathrm{finite}} = 0.42$, $\tau^{\mathrm{H}}_{\mathrm{asym}} = 0.34$, and $\tau^{\mathrm{H}}_{\mathrm{finite}} = 0.30$. Here, the sample size is $n=5000$, the significance level is $\alpha=0.1$, and the reported rates are averaged over 50 experiments.
4.3 Statistical power of model-X GRASP versus the number of labels $L$. The true model $\eta(x)$ is given in (4.18) and the test model $\hat\eta(x)$ is given in (4.20). In this case, the nominal values are $\tau_0^{\mathrm{KL}} = 2.7819$, $\tau_0^{\mathrm{TV}} = 0.7330$, and $\tau_0^{\mathrm{H}} = 0.9576$ (see (4.21)). For three $f$-divergences (KL, TV, Hellinger) we run model-X GRASP with the score function $T(x,w)$ as per (4.19) and the two decision rules $\Phi^{\mathrm{asym}}_{n,L,\tau}$ and $\Phi^{\mathrm{finite}}_{n,L,\tau}$ given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with the following $\tau$ values: $\tau^{\mathrm{KL}}_{\mathrm{asym}} = 1.87$, $\tau^{\mathrm{KL}}_{\mathrm{finite}} = 1.6$, $\tau^{\mathrm{TV}}_{\mathrm{asym}} = 0.67$, $\tau^{\mathrm{TV}}_{\mathrm{finite}} = 0.62$, $\tau^{\mathrm{H}}_{\mathrm{asym}} = 0.75$, and $\tau^{\mathrm{H}}_{\mathrm{finite}} = 0.6$. Here, the sample size is $n=5000$, with $K=1$ randomizations per sample, and the significance level is $\alpha=0.1$. The reported rates are averaged over 50 experiments.
4.4 Comparison between the statistical power of model-X GRASP with a GAN-based score function and the agnostic score function (4.19). In this experiment, the test size is $n=5000$, the number of labels is $L=50$, and randomizations per label $K=1$. The hypothesis testing problem (4.2) is considered for a variety of $\tau$ values, three $f$-divergence functions, and the two asymptotic and finite decision rules (4.5) and (4.6). In each setting, it can be observed that the GAN-based score function achieves higher statistical power than the agnostic score function. The reported numbers are averaged over 50 experiments.
5.1 The effect of feature dimension ($d$), dependency across features ($\rho$), and the adversary's power ($\varepsilon$) on the Pareto-optimal tradeoff between adversarial (AR) and standard risks (SR) in the linear regression setting.
5.2 The effect of the $\ell_r$ norm defined on the feature space, dependency across features ($\rho$), and the adversary's power $\varepsilon$ on the Pareto-optimal tradeoff between adversarial and standard risks in binary classification under the Gaussian mixture model.
5.3 Pareto-optimal tradeoff curves for learning random quadratic functions using the random features model. Data is generated according to (5.27) with $\sigma=2$ and $f_d(x)$ given by (5.30). Here, $d=10$ and $N$ is the number of random features (width of the neural network).
6.1 Effect of the dimensions ratio $d/k$ on the standard, adversarial, and boundary risks of the Bayes-optimal classifier with $\ell_2$ perturbations under the Gaussian mixture model (6.5), where features lie on a low-dimensional manifold. Solid curves represent the average values, and the shaded area around each curve represents one standard deviation above and below the computed average curve over the $M=100$ realizations.
6.2 Effect of the dimensions ratio $d/k$ on the standard, adversarial, and boundary risks of the Bayes-optimal classifier of the generalized linear model (6.6) with $\ell_2$ perturbations, in which features come from a low-dimensional manifold. Solid curves represent the average values, and the shaded areas represent one standard deviation above and below the corresponding curves over $M=100$ realizations.
6.3 Effect of the dimensions ratio $d/k$ on the standard, adversarial, and boundary risks of the linear classifier $h_\theta(x) = \mathrm{sign}(x^T\theta)$, with $\theta$ being the robust empirical risk minimizer (B.26).
Samples are generated from the Gaussian mixture model (6.5) with balanced classes ($\pi = 1/2$) and four choices of feature mapping $\phi$: (a) $\phi(t) = t$, (b) $\phi(t) = 3t/4 + \mathrm{sign}(t)t/4$, (c) $\phi(t) = t + \mathrm{sign}(t)t^2$, and (d) $\phi(t) = \tanh(t)$. In these experiments, the ambient dimension $d$ is fixed at 100, and the manifold dimension $k$ varies from 1 to 100. The sample size is $n=300$; the class average $\mu$ and the weight matrix $W$ have i.i.d. entries from $N(0,1/k)$. The adversary's power is fixed at $\varepsilon = 1$. For each fixed pair of values $k$ and $d$, we consider $M=20$ trials of the setup. Solid curves represent the average results across these trials, and the shaded areas represent one standard deviation above and below the corresponding curves.
6.4 Effect of the dimensions ratio $d/k$ on the boundary risk of the linear classifier $h_\theta(x) = \mathrm{sign}(x^T\theta)$, with $\theta$ being the robust empirical risk minimizer (B.26). Samples are generated from the Gaussian mixture model (6.5) with balanced classes ($\pi = 1/2$) and four choices of feature mapping $\phi$: (a) $\phi(t) = t$, (b) $\phi(t) = 3t/4 + \mathrm{sign}(t)t/4$, (c) $\phi(t) = t + \mathrm{sign}(t)t^2$, and (d) $\phi(t) = \tanh(t)$. In these experiments, the ambient dimension $d$ is fixed at 100, and the manifold dimension $k$ varies from 1 to 100. The sample size is $n=300$; the class average $\mu$ and the weight matrix $W$ have i.i.d. entries from $N(0,1/k)$. We consider different levels of the adversary's power $\varepsilon \in \{1,2,4\}$. For each fixed pair of values $k$ and $d$, we consider $M=20$ trials of the setup. Solid curves represent the average results across these trials, and the shaded areas represent one standard deviation above and below the corresponding curves.
6.5 Effect of the dimensions ratio $d/k$ on the standard, adversarial, and boundary risks of the Bayes-optimal classifier obtained from a noisy weight matrix $\widehat{W}$ with $\ell_2$ perturbations under the Gaussian mixture model (6.5), where features lie on a low-dimensional manifold. Solid curves represent the average values, and the shaded area around each curve represents one standard deviation above and below the computed average curve over the $M=200$ realizations.
6.6 Effect of the dimensions ratio $d/k$ on the standard, adversarial, and boundary risks of the Bayes-optimal classifier obtained from a noisy weight matrix $\widehat{W}$ with $\ell_2$ perturbations under the generalized linear model (6.6). Solid curves represent the average values, and the shaded area around each curve represents one standard deviation above and below the computed average curve over the $M=50$ realizations.
6.7 Sample images generated from a GMM fit on MNIST training-set images with label six. Three GMM models with number of components $K=10$ and manifold dimensions $\ell = 1, 10, 100$ (from left to right) are considered.
6.8 Sample attacks and perturbed images for three different latent dimensions $\ell = 1, 10, 100$, respectively from top to bottom. In each row, from left to right, the original sample, the adversarially crafted perturbation, and the perturbed image are exhibited. The original images are generated from a GMM with the number of components $K=10$. In all images under the PGD adversarial perturbation, we start with the $\ell_2$-bounded adversarial power $\varepsilon = 0$ and incrementally increase it to the point that the Bayes-optimal classifier fails to infer the correct label. In this experiment, the Bayes-optimal misclassification on sampled images with $\ell = 1, 10, 100$ occurs at $\ell_2$-bounded adversarial power $\varepsilon = 22.6, 11, 7$, respectively.
It can be seen that samples coming from a smaller latent dimension $\ell$ require a stronger perturbation to get misclassified by the Bayes-optimal classifier.
6.9 Boundary risk of the Bayes-optimal classifier on 1000 test images generated from GMM models with number of components $K=1$ and $K=10$ (from left to right). In each experiment, the adversary's $\ell_2$-bounded perturbation power is fixed at $\varepsilon = 12$, and both adversarial attacks PGD and FGM are considered. It can be observed that the boundary risk of the Bayes-optimal classifier converges to zero as the dimension ratio $d/\ell$ grows.
A.1 A simple example of a VAR(d) process with parameters $p=25$, $d=3$, $T=70$, and noise covariance matrix $\Sigma_\zeta$ with $\Sigma_\zeta(i,j) = \rho^{|i-j|}$, $\rho = 0.1$. The $A_i$ matrices have independent elements drawn as $b \cdot \mathrm{Bern}(q) \cdot \mathrm{Unif}(\{+1,-1\})$ with $b=0.15$, $q=0.05$. Normality of the rescaled residuals (Figures A.1b, A.1c, and A.1d) validates the successful performance of the offline debiasing estimator under sparsity of the precision matrix $\Omega$ (Figure A.1a), as discussed in Theorem A.1.2.
A.2 Boxplots of p-values and CI lengths for 3 zero and 3 nonzero coordinates of the autoregressive matrix $A$, for 11 choices of the episode growth rate $\beta \in \{2, 2.2, \dots, 3.8, 4\}$. It can be seen that the variation in each box is small. This implies the robustness of the online debiasing framework with respect to the choice of the episode growth rate $\beta$.
A.3 A simple example of an online debiased VAR(3) process with dimension $p=20$ and $T=50$ sample data points. Plots A.3a, A.3c, and A.3e show, respectively, the histogram, QQ plot, and PP plot of the noise values of all $dp^2 = 1200$ entries of the $A_i$ matrices in the linear time series model (1.5). Plots A.3b, A.3d, and A.3f are the histogram, QQ plot, and PP plot of the rescaled residuals of all coordinates as well.
Alignment of the data points in these plots with their corresponding standard normal $N(0,1)$ line corroborates our theoretical results on the asymptotically normal behavior of the noise terms and rescaled residuals, discussed in Corollary 2.1.7 and Proposition 2.1.8, respectively.
A.4 Figures A.4a and A.4b respectively show the p-values for cross-category effects of sales and prices of $x$-axis categories on sales of $y$-axis categories after one week.
A.5 Figures A.5a and A.5b show p-values for cross-category effects of promotions of $x$-axis categories on sales of $y$-axis categories, after one week and two weeks.
A.6 Figures A.6a and A.6b respectively show p-values for cross-category effects of sales and prices of $x$-axis categories on sales of $y$-axis categories after two weeks.
Abstract
The generation of large-scale fine-grained datasets is experiencing an unprecedented surge, propelling the extensive deployment of modern statistical learning models across a wide range of safety-critical applications. These include recommendation and forecasting systems, drug discovery, fraud detection, and personalized healthcare procedures. However, despite their widespread usage, decision-makers harbor reservations when it comes to confidently relying on such models. These concerns arise due to the replicability crisis observed in scientific discoveries, where the quest for reliable data-driven findings faces challenges in terms of reproducibility. As a result, findings from one study frequently struggle to be replicated in other independent investigations. In addition, there is a growing apprehension regarding the vulnerability of modern statistical learning models to minor perturbations in input data, as well as their limited ability to effectively generalize to unseen populations.
These challenges underscore the utmost importance of prioritizing the reliability and robustness of modern statistical learning methods. By addressing these key concerns, we can fortify the confident deployment of these methods across diverse domains.
Notational Conventions
We use the shorthands $[n] = \{1,2,\dots,n\}$ for an integer $n \ge 1$, and $a \wedge b = \min\{a,b\}$, $a \vee b \equiv \max\{a,b\}$. We use capital letters for random variables and small letters for the specific values they may take. For random variables or vectors $U, V$, $\mathcal{L}(U)$ represents the probability law (distribution) of $U$ and $\mathcal{L}(U|V)$ represents the conditional distribution of $U$ given $V$. We write $U \stackrel{d}{=} V$ to indicate that $U$ and $V$ have the same distribution. For an event $E$, we denote its probability by $\mathbb{P}(E)$. We use $\stackrel{p}{\Rightarrow}$ to indicate convergence in probability and $\stackrel{d}{\Rightarrow}$ for convergence in distribution. Throughout, $\phi(t) = e^{-t^2/2}/\sqrt{2\pi}$ is the Gaussian density and $\Phi(u) = \int_{-\infty}^{u}\phi(t)\,dt$ is the Gaussian distribution function. We denote the distribution of a Bernoulli random variable with success parameter $p$ by $\mathrm{Bern}(p)$, and let $\mathrm{multi}(p_1,\dots,p_L)$ denote a multinoulli distribution with $L$ categories, where the probability of observing category $\ell \in [L]$ is $p_\ell$. For the case of $p_1 = \dots = p_L = 1/L$ we use the shorthand $\mathrm{multi}(L)$. In addition, for positive real-valued functions $f$ and $g$, we write $f(d) = o(g(d))$ if for every $\delta > 0$ we can find a positive integer $\ell$ such that for $d \ge \ell$ we have $f(d)/g(d) \le \delta$. In addition, let $N(\mu, \Sigma)$ denote the probability density of a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$.
Chapter 1
Introduction
1.1 Reliable Inference
Statistical inference serves as a fundamental pillar in data-driven discoveries, enabling researchers to draw statistically valid conclusions and infer underlying patterns from experimental and observational data.
In classical statistics, this involves considering a family of parametric models and conducting hypothesis testing on components of the model, such as identifying non-null covariates or constructing confidence intervals for the model parameters. However, because of problems such as overfitting, high dimensionality, adaptivity in data collection, and model misspecification, previously developed tools in classical statistics—particularly parametric models for i.i.d. data in low-dimensional settings—are rendered unreliable and ineffective. From the perspective of reliable statistical inference, we consider three prevailing settings: (i) Section 1.1.1 on reasoning about uncertainty in high-dimensional adaptively collected data, (ii) Section 1.1.2 on model-X conditional independence testing, and (iii) Section 1.1.3 on distribution-free goodness-of-fit testing for classification learning problems. In the following sections, we present a concise overview of each setting, delineate the problem formulation, examine related problems, and outline our contributions.
1.1.1 Online Debiasing
Modern data collection, experimentation, and modeling are often adaptive in nature. For example, clinical trials are run in phases, wherein the data from a previous phase inform and influence the design of future phases. In commercial recommendation engines, algorithms collect data by eliciting feedback from their users; data which is ultimately used to improve the algorithms underlying the recommendations and so influence the future data. In such applications, adaptive data collection is often carried out for objectives correlated to, but distinct from, statistical inference. In clinical trials, an ethical experimenter might prefer to assign more patients a treatment that they might benefit from, instead of the control treatment. In e-commerce, recommendation engines aim to minimize revenue loss.
In other applications, collecting data is potentially costly, and practitioners may choose to collect samples that are a priori deemed most informative. Since such objectives are intimately related to statistical estimation, it is not surprising that adaptively collected data can be used to derive statistically consistent estimates, often using standard estimators. The question of statistical inference, however, is more subtle: on the one hand, consistent estimation indicates that the collected samples are informative enough. On the other hand, adaptive collection induces endogenous correlation in the samples, resulting in bias in the estimates. In Chapter 2, we address the following natural question raised by this dichotomy: can adaptively collected data be used for ex post statistical inference? We focus on the linear model where the samples $(y_1,x_1), (y_2,x_2), \dots, (y_n,x_n)$ satisfy:
$$y_i = \langle x_i, \theta_0 \rangle + \varepsilon_i, \qquad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} N(0,\sigma^2). \tag{1.1}$$
Here $\theta_0 \in \mathbb{R}^p$ is an unknown parameter vector relating the covariates $x_i$ to the response $y_i$, and the noise terms $\varepsilon_i$ are i.i.d. from $N(0,\sigma^2)$. In vector form, we write Eq. (1.1) as $y = X\theta_0 + \varepsilon$, where $y = (y_1,y_2,\dots,y_n)$, $\varepsilon = (\varepsilon_1,\varepsilon_2,\dots,\varepsilon_n)$, and the design matrix $X \in \mathbb{R}^{n\times p}$ has rows $x_1^T,\dots,x_n^T$. When the samples are adaptively collected, the data point $(y_i,x_i)$ is obtained after viewing the previous data points $(y_1,x_1),\dots,(y_{i-1},x_{i-1})$.* In the 'sample-rich' regime when $p < n$, the standard approach would be to compute the least squares estimate $\hat\theta^{\mathrm{LS}} = (X^TX)^{-1}X^Ty$. While the estimator $\hat\theta^{\mathrm{LS}}$ is consistent under fairly weak conditions, even in an adaptive setting [103], its asymptotic distribution can become highly complicated due to adaptivity in data collection. Adaptive data collection creates correlation between the responses $y_i$ (and therefore $\varepsilon_i$) and the covariate vectors $x_{i+1}, x_{i+2}, \dots, x_n$ observed in the future.
*Formally, we assume a filtration $(\mathcal{F}_i)_{i\le n}$ to which the sequence $(y_i,x_i)_{i\le n}$ is adapted, and with respect to which the sequence $(x_i)_{i\le n}$ is predictable.
Due to this correlation, the asymptotic distribution of $\hat\theta^{\mathrm{LS}}$ is in general hard to characterize and depends heavily on the specific data collection procedure. In particular, in stark contrast with the non-adaptive setting, $\hat\theta^{\mathrm{LS}} = \theta_0 + (X^TX)^{-1}X^T\varepsilon$ is in general a biased estimate of $\theta_0$. In the context of multi-armed bandits, where the estimator $\hat\theta^{\mathrm{LS}}$ for model (1.1) reduces to sample averages, [198, 187] observed such bias empirically, and [131, 162] characterized and developed upper bounds on the bias. While bias is an important problem, estimates may also show higher-order distributional defects that complicate inferential tasks. This phenomenon is exacerbated in the high-dimensional or 'feature-rich' regime when $p > n$. Here the design matrix $X$ becomes rank-deficient, and consistent parameter estimation requires (i) additional structural assumptions on $\theta_0$ and (ii) regularized estimators beyond $\hat\theta^{\mathrm{LS}}$, such as the LASSO. Such estimators are non-linear and non-explicit, and consequently it is difficult to characterize their distribution even with strong random design assumptions [91]. In analogy to the low-dimensional regime, it is relatively easier to develop consistency guarantees for estimation using the LASSO when $p > n$. Given the samples $(y_1,x_1),\dots,(y_n,x_n)$, one can compute the LASSO estimate $\hat\theta^{\mathrm{L}} = \hat\theta^{\mathrm{L}}(y,X;\lambda_n)$:
$$\hat\theta^{\mathrm{L}} = \arg\min_\theta \Big\{ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}. \tag{1.2}$$
If $\theta_0$ is sparse with at most $s_0 \ll p$ non-zero entries and the design $X$ satisfies some technical conditions, the LASSO estimate, for an appropriate choice of $\lambda_n$, has estimation error $\|\hat\theta^{\mathrm{L}} - \theta_0\|_2^2$ of order $\sigma^2 s_0 (\log p)/n$ with high probability [12]. In particular, the estimate is consistent provided the sparsity satisfies $s_0 = o(n/\log p)$. This estimator is biased, though, for two distinct reasons.
The first is the regularization imposed in Eq. (1.2), which disposes $\hat\theta^{\mathrm{L}}$ to have small $\ell_1$ norm. The second is the correlation induced between $X$ and $\varepsilon$ due to adaptive data collection. To address the first source, [200, 90, 185] proposed a so-called "debiased estimate" of the form
$$\hat\theta^{\mathrm{off}} = \hat\theta^{\mathrm{L}} + \frac{1}{n} M X^T (y - X\hat\theta^{\mathrm{L}}), \tag{1.3}$$
where $M$ is chosen as an 'approximate inverse' of the sample covariance $\hat\Sigma = X^TX/n$. The intuition for this idea is the following decomposition, which follows directly from Equations (1.1) and (1.3):†
$$\hat\theta^{\mathrm{off}} - \theta_0 = (I_p - M\hat\Sigma)(\hat\theta^{\mathrm{L}} - \theta_0) + \frac{1}{n} M X^T \varepsilon. \tag{1.4}$$
†The notation $\hat\theta^{\mathrm{off}}$ stands for "offline" debiasing. We use this notation/terminology to highlight its main difference from the "online" debiasing that will be introduced later in Chapter 2.
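To make the construction (1.3) concrete, here is a minimal numpy/scikit-learn sketch of the debiased Lasso in the non-adaptive i.i.d. setting, with $M$ taken as a ridge-regularized inverse of the sample covariance. The dimensions, regularization level, and ridge term are illustrative choices of ours, not the ones used later in the dissertation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s0, sigma = 400, 60, 5, 1.0
theta0 = np.zeros(p)
theta0[:s0] = 1.0                       # sparse signal with s0 active coordinates
X = rng.standard_normal((n, p))         # i.i.d. (non-adaptive) Gaussian design
y = X @ theta0 + sigma * rng.standard_normal(n)

# Lasso estimate (1.2); sklearn's objective is (1/2n)||y - X theta||^2 + alpha ||theta||_1
lam = 2.5 * sigma * np.sqrt(np.log(p) / n)
theta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Offline debiasing (1.3), with M a (ridge-regularized) inverse of the sample covariance
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat + 0.01 * np.eye(p))
theta_off = theta_lasso + M @ X.T @ (y - X @ theta_lasso) / n
```

On the support of $\theta_0$, the raw Lasso coefficients are shrunk toward zero by roughly $\lambda_n$, while the correction term moves them back close to the truth; this is the first source of bias that (1.3) removes.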
Since M,X and the noiseε are correlated, we can no longer assert that the termMX T ε/n is unbiased. Indeed, characterizing its distribution can be quite difficult, given the intricate correlation between M, X and ε induced by the data collecting policy and the procedure for choosingM. We illustrate the failure of offline debiasing in two scenarios of interest in Chapter 2:(i) batched data collection and(ii) autoregressive time series. 1.1.1.1 Whyofflinedebiasingfails? Batcheddatacollection Consider an adaptive data collection setting where the experimenter (or analyst) collects data in two phases or batches. In the first phase, the experimenter collects an initial set of samples (y 1 ,x 1 ),...,(y n 1 ,x n 1 ) of size n 1 < n where the responses follow Eq.(1.1) and the covariates are i.i.d. from a distribution 5 P x . Following this, she computes an intermediate estimate b θ 1 ofθ 0 and then collects additional samples (y n 1 +1 ,x n 1 +1 ),...,(y n ,x n ) of sizen 2 = n− n 1 , where the covariatesx i are drawn independently from the law ofx 1 , conditional on the event{⟨x 1 , b θ 1 ⟩≥ ς}, whereς is a threshold, that may be data-dependent. This is a typical scenario where the responsey i represents an instantaneous reward that the experimenter wishes to maximize, as in multi-armed bandits [102, 25]. For instance, clinical trials may be designed to be response-adaptive and allocate patients to treatments that they are likely to benefit from based on prior data [209, 98, 68]. The multi-armed bandit problem is a standard formalization of this trade-off, and a vari- ety of bandit algorithms are designed to operate in distinct phases of ‘explore–then exploit’. Data collected from one arm in such a bandit algorithm can be modeled within our setting. 
Beyond multi-armed bandit algorithms for response-adaptive trials, the FDA guidance document [68] outlines other motivations for adaptive designs, such as 'enrichment', where the trial is targeted to an adaptively chosen subpopulation. Such settings can be modeled within our framework by replacing $\hat\theta_1$ with another 'enrichment direction' vector that can be computed from the data from the first batch. With the full samples $(y_1,x_1),\dots,(y_n,x_n)$ at hand, the experimenter would like to perform inference on a fixed coordinate $\theta_{0,a}$ of the underlying parameter. As a numerical example, we consider $\theta_0 \in \{0,1\}^{600}$ with exactly $s_0 = 10$ non-zero entries. We obtain the first batch $(y_1,x_1),\dots,(y_{500},x_{500})$ of observations with $y_i = \langle x_i, \theta_0\rangle + \varepsilon_i$, $x_i \stackrel{\mathrm{iid}}{\sim} N(0,\Sigma)$, and $\varepsilon_i \stackrel{\mathrm{iid}}{\sim} N(0,1)$, where the diagonal of $\Sigma$ is all ones, $\Sigma_{a,b} = 0.1$ if $|a-b|=1$, and $\Sigma_{a,b} = 0$ otherwise. Based on this data, we construct an intermediate estimator $\hat\theta_1$ on $(y^{(1)}, X_1)$ using two different strategies: (i) the debiased LASSO and (ii) ridge regression with cross-validation. With this estimate we now sample new covariates $x_{501},\dots,x_{1000}$ independently from the law of $x \mid \langle x,\hat\theta_1\rangle \ge \langle \hat\theta_1, \Sigma\hat\theta_1\rangle^{1/2}$, and the corresponding outcomes $y_{501},\dots,y_{1000}$ are generated according to Eq. (1.1). Unconditionally, $\langle x, \hat\theta_1\rangle \sim N(0, \langle \hat\theta_1, \Sigma\hat\theta_1\rangle)$, so this choice of threshold corresponds to sampling covariates
(a) with $\hat\theta_1$ the debiased LASSO on the first batch; (b) with $\hat\theta_1$ the ridge estimate on the first batch.
Figure 1.1: Histograms of the offline debiased estimate $\hat\theta^{\mathrm{off}}$ restricted to the support of $\theta_0$. The dashed line indicates the true coefficient size. Recall that the second batch is chosen based on an intermediate estimator $\hat\theta_1$ computed on the first batch. (Left) $\hat\theta_1$ is the debiased LASSO on the first batch; (Right) $\hat\theta_1$ is the ridge estimate on the first batch.
As we observe, the offline debiasing (even with access to the precision matrix $\Omega$ of the random designs) has a significant bias and does not admit a Gaussian distribution.
that correlate with $\hat\theta_1$ at least one standard deviation higher than expected unconditionally. This procedure yields two batches of data, each of $n_1 = n_2 = 500$ data points, combining to a set of 1000 samples. From the full dataset $(y_1,x_1),\dots,(y_{1000},x_{1000})$ we compute the LASSO estimate $\hat\theta^{\mathrm{L}} = \hat\theta^{\mathrm{L}}(y,X;\lambda)$ with $\lambda = 2.5\,\lambda_{\max}(\Sigma)\sqrt{(\log p)/n}$. Offline debiasing yields the following prescription to debias the Lasso estimator:
$$\hat\theta^{\mathrm{off}} = \hat\theta^{\mathrm{L}} + \frac{1}{n}\,\Omega(\hat\theta_1)\, X^T (y - X\hat\theta^{\mathrm{L}}),$$
where $\Omega(\hat\theta_1)$ is the population precision matrix:
$$\Omega(\hat\theta_1)^{-1} = \frac{1}{2}\,\mathbb{E}\{xx^T\} + \frac{1}{2}\,\mathbb{E}\big\{xx^T \,\big|\, \langle x,\hat\theta_1\rangle \ge \|\Sigma^{1/2}\hat\theta_1\|\big\}.$$
We generate the dataset for 1000 Monte Carlo iterations and compute the offline debiased estimate $\hat\theta^{\mathrm{off}}$ for each iteration. Figure 1.1 shows the histogram of the entries of $\hat\theta^{\mathrm{off}}$ on the support of $\theta_0$ for the two choices of $\hat\theta_1$. As we see, $\hat\theta^{\mathrm{off}}$ still has considerable bias, due to adaptivity in the data collection.
Autoregressive time series
A vector autoregressive (VAR) time series model posits that data points $z_t$ evolve according to the dynamics
$$z_t = \sum_{\ell=1}^{d} A^{(\ell)} z_{t-\ell} + \zeta_t, \tag{1.5}$$
where $A^{(\ell)} \in \mathbb{R}^{p\times p}$ are time-invariant coefficients and $\zeta_t$ is the noise term satisfying $\mathbb{E}(\zeta_t) = 0$ (zero mean), $\mathbb{E}(\zeta_t\zeta_t^T) = \Sigma_\zeta$ (stationary covariance), and $\mathbb{E}(\zeta_t\zeta_{t-k}^T) = 0$ for $k > 0$ (no serial correlation). Given the data $z_1,\dots,z_T$, the task of interest is to perform statistical inference on the model parameters, i.e., the coefficient matrices $A^{(1)},\dots,A^{(d)}$. Clearly, the samples $z_t$ are 'adaptively collected', in the sense that there is serial correlation in the samples. Indeed, the data point $z_t$ depends on the previous data points $z_{t-1}, z_{t-2},\dots,z_1$. As in the batched data example, we will carry out a simple illustration.
We generate data from a VAR(d) model with $p=15$, $d=5$, $T=60$, and diagonal $A^{(\ell)}$ matrices with value $b=0.15$ on their diagonals. We also generate $\zeta_t \stackrel{\mathrm{iid}}{\sim} N(0,\Sigma_\zeta)$, with $\Sigma_{\zeta,ij} = 0.5^{\,\mathbb{I}(i\ne j)}$ for $i,j\in[p]$. Note that this is a high-dimensional setting, as the number of parameters $dp^2$ exceeds the sample size $(T-d)p$. As we will discuss in Section 2.1, the VAR model can be represented as a linear regression model with covariate vectors $x_t = (z_{t+d-1}^T,\dots,z_t^T)^T \in \mathbb{R}^{dp}$ and noise term $\varepsilon_t = \zeta_{d+t,i}$, if the focus is on estimating the $i$-th rows of the coefficient matrices.
We focus on the noise component of the offline debiased estimate, i.e., $W^{\mathrm{off}} = \frac{1}{\sqrt{T-d}} M \sum_{t=1}^{T-d} x_t\varepsilon_t$, with $M$ denoting the decorrelating matrix in the debiased estimate as per (1.3). The offline debiasing matrix $M$ has rows $(m_a)_{a\in[dp]}$, where $m_a$ is constructed from the following optimization problem:
$$\text{minimize } \; m^T\hat\Sigma_n m \quad \text{subject to } \; \|\hat\Sigma_n m - e_a\|_\infty \le \mu_n, \tag{1.6}$$
with $\hat\Sigma_n = \frac{1}{n}\sum_{t=1}^{n} x_t x_t^T$. The theoretical analysis of [90] suggests the choice $\mu_n \asymp \sqrt{\log(dp)/n}$. In simulations we first solve a linear optimization to find the smallest $\mu_n$ that makes the optimization (1.6) feasible, and then construct $M$ using optimization (1.6) for that value of $\mu_n$.
To build some insight regarding optimization (1.6), let us invoke the bias-variance decomposition in (1.4). For any fixed coordinate $a$, the corresponding entry of the bias, $\Delta_a$, is upper bounded by
$$|\Delta_a| \equiv \big|(e_a - \hat\Sigma m_a)^T(\hat\theta^{\mathrm{L}} - \theta_0)\big| \le \|\hat\Sigma m_a - e_a\|_\infty \,\|\hat\theta^{\mathrm{L}} - \theta_0\|_1, \tag{1.7}$$
while the noise term $Z_a \equiv \frac{1}{n} m_a^T X^T \varepsilon$ is distributed as $N(0, \frac{\sigma^2}{n} m_a^T\hat\Sigma m_a)$. Optimization (1.6), on the one hand, tries to control $\|\hat\Sigma m_a - e_a\|_\infty$ and hence, by (1.7), controls the non-Gaussianity and bias of $\hat\theta^{\mathrm{off}}$. On the other hand, it minimizes the noise variance $m_a^T\hat\Sigma m_a$, which controls the length of the confidence interval for $\theta_{0,a}$ and the statistical power when testing the null hypothesis $H_{0,a}: \theta_{0,a} = 0$.
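A direct way to compute one row $m_a$ of the debiasing matrix is to solve the quadratic program (1.6) numerically. The sketch below (an illustrative setup of ours, not the dissertation's simulation code) uses scipy's SLSQP solver, writing the $\ell_\infty$ constraint as $2p$ linear inequalities and starting from the exactly feasible point $\hat\Sigma_n^{-1}e_a$; in practice a dedicated convex solver would be the more robust choice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
mu = np.sqrt(np.log(p) / n)        # constraint level, mu_n of order sqrt(log p / n)

def debias_row(a):
    """Solve (1.6): minimize m^T Sigma_hat m  s.t.  ||Sigma_hat m - e_a||_inf <= mu."""
    e = np.zeros(p)
    e[a] = 1.0
    # the inf-norm constraint as 2p linear inequalities, each required to be >= 0
    cons = [{"type": "ineq", "fun": lambda m, s=s, j=j: mu - s * (Sigma_hat[j] @ m - e[j])}
            for j in range(p) for s in (1.0, -1.0)]
    m0 = np.linalg.solve(Sigma_hat, e)       # feasible start: constraint value is 0
    res = minimize(lambda m: m @ Sigma_hat @ m, m0, method="SLSQP", constraints=cons)
    return res.x

m_a = debias_row(0)
```

Relaxing the constraint from exact inversion ($\mu = 0$) to slack $\mu_n$ lets the solver trade a small amount of bias control for a smaller noise variance $m_a^T\hat\Sigma m_a$, exactly the tradeoff discussed around (1.7).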
In Figure 1.2, we show the QQ plot, PP plot, and histogram of $W^{\mathrm{off}}_1$ (corresponding to the entry $(1,1)$ of matrix $A_1$) for 1000 different realizations of the noise $\zeta_t$. As we observe, even the noise component $W^{\mathrm{off}}$ is biased, because the offline construction of $M$ depends on all features $x_t$ and hence on the endogenous noise $\zeta_t$. Recall that for the setting with an i.i.d. sample, the noise component is zero-mean Gaussian for any finite sample size $n$. This further highlights the challenge of high-dimensional statistical inference with
Figure 1.2: Empirical behavior of the noise term associated with the offline debiased estimate of a fixed coordinate of a VAR(d) model. In this example, $d=5$, $p=15$, $T=60$, $\Sigma_\zeta(i,j) = 0.5^{|i-j|}$, and the $A^{(\ell)}$ matrices are diagonal with value $b=0.15$ on their diagonals. Plots 1.2a, 1.2b, and 1.2c show the QQ plot, PP plot, and histogram of the offline debiased noise terms (red) over 1000 independent experiments, respectively; the black curve/lines denote the ideal standard normal distribution. As we see, the offline debiasing method fails for statistical inference purposes when the samples are correlated.
adaptively collected samples and demonstrates why the classical debiasing approach will not work in this case. It is worth noting that the classical (offline) debiasing may still work for some adaptive data settings under further structural assumptions. In Appendix A.1 we study adaptive settings where the population inverse covariance (precision matrix) is sufficiently sparse and prove that offline debiasing asymptotically removes the bias in the Lasso estimate, leading to valid statistical inference in this case.
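The VAR(d) data-generating process used in the illustration above can be simulated directly. The sketch below uses the same diagonal $A^{(\ell)}$ matrices and an AR(1)-style noise covariance $\Sigma_\zeta(i,j) = \rho^{|i-j|}$, plus a burn-in period (our own choice, not from the text) so the chain is close to stationarity before the $T$ retained observations; it then forms the linear-regression view with covariates $x_t = (z_{t+d-1}^T,\dots,z_t^T)^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, d, T, b, rho = 15, 5, 60, 0.15, 0.5

A = [b * np.eye(p) for _ in range(d)]          # diagonal coefficient matrices A^(l)
idx = np.arange(p)
Sigma_zeta = rho ** np.abs(np.subtract.outer(idx, idx))
C = np.linalg.cholesky(Sigma_zeta)             # for sampling zeta_t ~ N(0, Sigma_zeta)

# simulate z_t = sum_l A^(l) z_{t-l} + zeta_t, discarding a burn-in prefix
burn = 200
z = np.zeros((burn + T, p))
for t in range(d, burn + T):
    zeta = C @ rng.standard_normal(p)
    z[t] = sum(A[l] @ z[t - 1 - l] for l in range(d)) + zeta
z = z[burn:]

# regression form: x_t stacks the d lags (z_{t+d-1}, ..., z_t); the response is z_{t+d}
X = np.array([np.concatenate([z[t + d - 1 - l] for l in range(d)]) for t in range(T - d)])
Y = z[d:]
```

Note the resulting design has $T-d = 55$ rows but $dp = 75$ columns, so this tiny instance is already high-dimensional, matching the remark that $dp^2$ exceeds $(T-d)p$.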
1.1.1.2 Online debiasing estimator

Motivated by the estimate (1.3), we propose the online debiased estimator θ̂^on = θ̂^on(y, X; (M_i)_{i≤n}, λ) that takes the form

    θ̂^on ≡ θ̂^L + (1/n) Σ_{i=1}^n M_i x_i (y_i − x_i^T θ̂^L).        (1.8)

The term 'online' comes from the first crucial constraint of predictability imposed on the sequence (M_i)_{i≤n}.

Definition 1.1.1 (Predictability). Without loss of generality, there exists a filtration (F_i)_{i≥0} so that, for i = 1, 2, ..., n, ε_i is adapted to F_i and ε_i is independent of F_j for j < i. We assume that the sequences (x_i)_{i≥1} and (M_i)_{i≥1} are predictable with respect to F_i, i.e., for each i, x_i and M_i are measurable with respect to F_{i−1}.

With predictability, the data points (y_i, x_i) are adapted to the filtration (F_i)_{i≤n} and, moreover, the covariates x_i are predictable with respect to F_i. Intuitively, the σ-algebra F_i contains all information in the data, as well as potential external randomness, that is used to query the new data covariate x_{i+1}. Predictability ensures that only this information may be used to construct the matrix M_{i+1}.

Analogous to Eq. (1.4), we can decompose θ̂^on into two components:

    √n (θ̂^on − θ_0) = B_n (θ̂^L − θ_0) + W_n,        (1.9)

where

    B_n ≡ √n (I_p − (1/n) Σ_i M_i x_i x_i^T),   and   W_n ≡ (1/√n) Σ_i M_i x_i ε_i.

Predictability of (M_i)_{i≤n} ensures that W_n is unbiased, and the bias in θ̂^on is contained entirely in the first term B_n(θ̂^L − θ_0). Suppose that, analogous to offline debiasing, we prove that the bias term B_n(θ̂^L − θ_0) is of smaller order than the variance term W_n. We are then left with the problem of characterizing the asymptotic distribution of the sequence W_n. As the sequence √n W_n = Σ_i M_i x_i ε_i is a martingale with respect to the filtration F_i, one might expect that W_n is asymptotically Gaussian. The following 'stability' property, first identified by Lai and Wei [103] in this context, is crucial to ensure that this intuition is correct.
Definition 1.1.2 (Stability). Consider a square-integrable triangular martingale array {Z_{i,n}}_{i≤n, n≥1} adapted to a filtration F_i, and its quadratic variation V_n = Σ_{i≤n} E{(Z_{i,n} − Z_{i−1,n})² | F_{i−1}}. Note that V_n is a non-negative random variable, measurable with respect to F_{n−1}. We say that the martingale array {Z_{i,n}} is stable if there exists a constant v_∞ > 0 such that lim_{n→∞} V_n = v_∞ in probability.

For sums of independent random variables, guaranteeing that (i) the mean of the summands vanishes and (ii) each individual summand is negligible in comparison to the sum (e.g., as in the classical Lindeberg condition) is necessary and sufficient for asymptotic normality. Martingales are, however, sums of dependent random variables, and therefore (i) and (ii) are not sufficient for asymptotic normality. One needs, in general, a further condition on the predictable quadratic variation V_n (see [71, 51]). The notion of 'stability' given in Definition 1.1.2 is a commonly used such condition.

1.1.1.3 Contribution and organization

An important contribution of our work is to develop online debiasing estimators θ̂^on whose underlying martingales are stable. The specifics of constructing the predictable sequence (M_i)_{i≤n} and deriving the distributional characterization of the debiased estimator θ̂^on depend on the context of the problem at hand. In Chapter 2‡, we instantiate this idea in two concrete contexts: (i) time series analysis (Section 2.1) and (ii) batched data collection (Section 2.2). For both of these settings,

1. We first establish estimation results for the LASSO estimate, showing that even with adaptive data collection, the LASSO estimate enjoys good estimation error (Theorems 2.1.2 and 2.2.2). These results draw significantly on prior work in high-dimensional estimation [12, 26].

‡ This chapter is based on Publication [43]: Yash Deshpande, Adel Javanmard, and Mohammad Mehrabi.
Online debiasing for adaptively collected high-dimensional data with applications to time series analysis. Journal of the American Statistical Association, 118(542):1126-1139, 2023.

2. Next, we propose constructions for the online debiasing sequence (M_i)_{i≤n}, using an optimization program that trades off variance with bias while ensuring stability. This optimization program is a novel modification of the approximate inverse construction in [90]. The important change is the inclusion of an ℓ_1 constraint in the program, which ensures stability of the underlying martingales and allows the use of a martingale CLT in our analysis.

3. We establish a distributional characterization of the resulting online debiased estimate θ̂^on (Theorems 2.1.8 and 2.2.10). Informally, this demonstrates that coordinates of θ̂^on are approximately Gaussian with a covariance computable from data.

In Section 2.3, we demonstrate how the online debiased estimate θ̂^on can be used to compute standard inferential primitives like confidence intervals and p-values. In Appendix A.4 we develop computationally efficient iterative descent methods to construct the online debiasing sequence (M_i)_{i≤n}. Due to space constraints, we deferred our numerical experiments to Appendix A.5, where we demonstrate the validity of the methodology on both synthetic and real data. In the interest of reproducibility, we made an R implementation of our algorithm publicly available at http://faculty.marshall.usc.edu/Adel-Javanmard/OnlineDebiasing/. Proofs of theorems and technical lemmas are given in Appendices A.6 to E.1.

Our proposal of the online debiasing approach builds on the insight in [44], which studied a similar problem in the low-dimensional setting (p < n). We provide a detailed discussion of that work in Section 3.6, highlighting the main distinctions and the inefficacy of that method in the high-dimensional setting, to further motivate our work and contributions.
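As a small illustrative sketch (not the paper's implementation), the estimator (1.8) and the exact algebraic identity behind the bias/noise decomposition can be checked numerically; the θ̂^L and the matrices M_i below are arbitrary placeholders, not the Lasso estimate and the optimized predictable sequence of Chapter 2:

```python
import numpy as np

# The online debiased estimate (1.8) satisfies, exactly and for ANY matrices M_i,
#   sqrt(n) (theta_on - theta0) = B_n (theta_L - theta0) + W_n,
# which is the decomposition (1.9) into a bias term and a martingale noise term.
rng = np.random.default_rng(1)
n, p = 200, 10
theta0 = np.zeros(p); theta0[:3] = 1.0                    # 3-sparse ground truth
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
y = X @ theta0 + eps
theta_L = theta0 + 0.05 * rng.standard_normal(p)          # stand-in for the Lasso estimate
M = [np.eye(p) + 0.01 * rng.standard_normal((p, p)) for _ in range(n)]

theta_on = theta_L + sum(M[i] @ X[i] * (y[i] - X[i] @ theta_L) for i in range(n)) / n
B_n = np.sqrt(n) * (np.eye(p) - sum(M[i] @ np.outer(X[i], X[i]) for i in range(n)) / n)
W_n = sum(M[i] @ X[i] * eps[i] for i in range(n)) / np.sqrt(n)   # martingale noise term
assert np.allclose(np.sqrt(n) * (theta_on - theta0), B_n @ (theta_L - theta0) + W_n)
```

The statistical content of online debiasing is not this identity (which always holds) but the predictability of the M_i, which makes √n W_n a martingale amenable to a CLT.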
1.1.2 Pearson Chi-squared Conditional Randomization Test

Understanding the statistical relationship between random variables is a cornerstone of many scientific experiments. Various measures of dependency have been developed in the statistics literature to capture the association between random variables, such as mutual information and information-theoretic coefficients [148], kernel-based measures [143, 207], correlation coefficients based on sample ranks [49, 41, 191], and dependency metrics based on copulas [204, 161]; we refer to the survey by [96] for other dependency measures.

Inferential tasks in data science and statistics often require a more thorough analysis of the associations between random variables. In particular, a desired analysis must control for the presence of confounding factors. This happens when an (often unmeasured) factor Z affects both of the variables of interest (say X and Y), and hence can lead to misleading conclusions about the association of the variables. For example, in genome-wide association studies (GWAS), researchers are interested in finding loci that are causal for the trait. However, spurious associations can arise due to ancestry-induced correlations between causal and non-causal loci, or when ancestry is correlated with both the genotype and the trait [28, 22]. Conditional independence (CI) testing controls for the effect of such confounding factors.

To further highlight the significance of the CI problem, it is worth noting that many important problems in statistics can indeed be cast as CI testing problems, with examples ranging from the classic concepts of sufficient and ancillary statistics [40], to well-known concepts in graphical models [99, 58, 46], and causal discovery problems [140, 205, 141]; at the heart of all these settings one can find a CI testing problem.
In the recent work of [159], it is argued that CI testing is provably a hard problem without assumptions placed on the distribution of the variables. Concretely, [159] shows that no uniformly valid test§ can have nontrivial power (power exceeding α) against any alternative hypothesis (a triple (X, Z, Y) that is not conditionally independent). By and large, this impossibility result can be perceived as a consequence of an interesting phenomenon in the CI testing problem: while the space of null distributions is separated from the alternatives, the convex hull of the null space is in fact dense in the alternative space with respect to the total variation metric [159].

§ A test that controls the type I error at a predetermined significance level α for all absolutely continuous (with respect to the Lebesgue measure) random variables (X, Z, Y) that are conditionally independent.

The discouraging result of [159] highlights the crucial role of assumptions on the distribution of (X, Z, Y) in the CI testing problem. It is a noteworthy observation that such assumptions may make the null space smaller, so that the aforementioned no-free-lunch theorem no longer applies. During the past few years, several methods have been developed for CI testing under different setups, such as [129] for one-dimensional variables satisfying certain smoothness assumptions, and [30] for discrete variables. There is also quite a large body of work on model-specific methods, where a parametric model is assumed between the response and the covariates (assumptions on the law L(Y|X,Z)) [114, 37, 16]. Other concurrent work goes beyond testing for conditional independence and aims at measuring the strength of dependency when the CI hypothesis does not hold; e.g., [206, 4]. Another complementary line that has been pursued in the past few years is the model-X perspective [29].
In this framework, contrary to the classic setup, no assumption is made on the conditional law L(Y|X,Z); rather, the focus shifts to (X,Z), and extensive knowledge of the law L(X,Z) is required. To emphasize the importance of the model-X setup, one should note that a CI test developed for a certain family of distributions L(Y|X,Z) leads to type I error inflation under model misspecification. On the other hand, in many settings one may have access to abundant unlabeled data, which allows for a good approximation of L(X,Z). For example, in genetic studies [142, 35] the joint distribution of the covariates can be well approximated. In particular, [192] proposed an estimator to approximate the covariance matrix of covariates for genome-wide association studies (GWAS), in which genetic distance information is used.

In Chapter 3, we will focus on CI testing in the model-X setup. In this setting, we would like to examine the independence of a covariate X ∈ R and a response value Y ∈ R, while controlling for the effect of a potentially high-dimensional confounding covariate vector Z ∈ R^q. This is formalized via a hypothesis testing problem:

    H_0: X ⊥⊥ Y | Z,     H_A: X ̸⊥⊥ Y | Z.        (1.10)

In the model-X CI testing problem, we are given access to the conditional law P_{X|Z}, along with n i.i.d. observations (X_i, Z_i, Y_i) as data, while the conditional laws Y|X,Z and Y|Z are unknown. A large body of proposed CI tests in the model-X setup, such as the conditional randomization test (CRT) [29] and the holdout randomization test (HRT) [175], are based on constructing counterfeit data sets using the law P_{X|Z} and scoring them by a certain score function T. We proceed by providing a detailed description of the CRT and the motivation behind our proposal, followed by a discussion of related work on model-X CI tests.
1.1.2.1 Background on Conditional Randomization Tests

For a data set (X, Z, Y) consisting of n independent samples, with X ∈ R^n, Z ∈ R^{n×q} and Y ∈ R^n, the CRT constructs M counterfeits (X̃^1, Z, Y), ..., (X̃^M, Z, Y), where X̃^j is sampled independently from the conditional law P_{X|Z}(·|Z). (By independence of samples, this means that the entries X̃^j_ℓ are drawn independently from the law X|Z, for ℓ ∈ {1, 2, ..., n}.) Then, for a score function T, define the normalized rank:

    p = (1 + Σ_{j=1}^M I{T(X, Z, Y) ≥ T(X̃^j, Z, Y)}) / (M + 1).        (1.11)

Given that X̃ | Z, Y ∼ L(X|Z), under the null hypothesis we have T(X, Z, Y) =_d T(X̃, Z, Y) | Z, Y, so the original and counterfeit scores are i.i.d. and hence exchangeable, and therefore the normalized rank p follows a uniform distribution, provided that the number of counterfeits is sufficiently large. The CRT perceives extreme values of the normalized rank (close to 0 or 1) as evidence against the null hypothesis. Specifically, it rejects the null hypothesis for p smaller than α (interpreting it as a p-value), or rejects the two extreme α/2 tails (two-sided CRT [190]). Note that, working with M counterfeits, the smallest value the normalized rank can achieve is 1/(M+1), so M should be chosen sufficiently large to test at a small significance level α.

A salient feature of the CRT is that it is a valid test (controlling the type I error) for any choice of score function T. Of course, the specific choice of T impacts the power of the test. Indeed, [97] prove that the most powerful model-X conditional independence test against any given point alternative is a CRT, obtained by taking T to be the corresponding likelihood score, which requires knowing the alternative distribution. There are some common choices for the score function, such as the marginal covariance [196, 121] or the absolute value of the Lasso coefficient for X [196], which do not require knowledge of the alternative distribution.
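The CRT recipe above can be sketched in a few lines; this toy example (ours, not from the dissertation) takes P_{X|Z} to be Gaussian by construction and uses a simple marginal-covariance score:

```python
import numpy as np

# Toy CRT: the conditional law X | Z is N(Z beta, 1) by construction, the score
# T is the absolute empirical covariance of X and Y, and the normalized rank
# follows (1.11). Here the null holds, since Y depends on Z only.
rng = np.random.default_rng(2)
n, q, M = 300, 4, 99
beta = rng.standard_normal(q)
Z = rng.standard_normal((n, q))
X = Z @ beta + rng.standard_normal(n)                    # known conditional law X | Z
Y = Z @ rng.standard_normal(q) + rng.standard_normal(n)  # H_0: Y independent of X given Z

def score(x, y):
    return abs(np.mean((x - x.mean()) * (y - y.mean())))

t_obs = score(X, Y)
t_cf = [score(Z @ beta + rng.standard_normal(n), Y) for _ in range(M)]  # counterfeit scores
p_val = (1 + sum(t_obs >= t for t in t_cf)) / (M + 1)    # normalized rank, eq. (1.11)
assert 1 / (M + 1) <= p_val <= 1                         # discrete grid {1/(M+1), ..., 1}
```

Repeating this over many independent data sets under the null would produce normalized ranks that are (approximately) uniform on the discrete grid, which is what the test's validity rests on.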
[Figure 1.3: Schematic representations of the distribution of the normalized rank p under the alternative hypothesis: (a) one-sided inflation, (b) two-sided inflation, (c) middle-value inflation. The horizontal purple line indicates the uniform distribution. In (a) and (b), the non-null p-value has heavier tails compared to the null uniform distribution, and the CRT (one-sided for (a), two-sided for (b)) would have decent power. In (c), the non-null p-value is more likely to fall in the middle range, which causes the CRT to have low power.]

An observation about the CRT, which also partly motivates our proposal, is that it only looks for extreme values of the normalized rank to reject the null. Nonetheless, the deviation from the uniform distribution may occur at central values (see Figure 1.3 for a schematic illustration). In this work, we propose a novel conditional test, called the Pearson Chi-squared Conditional Randomization (PCR) test, which is built on the idea of the CRT but aims to examine deviations of the normalized rank from the uniform distribution across the entire range [0, 1], and to obtain high-resolution p-values with a very small number of conditional randomizations. In addition, instead of computing scores of the entire data set (or a counterfeit of it), our PCR test offers the flexibility of scoring subsets of data samples, using the i.i.d. property of the samples.

1.1.2.2 Related work on model-X CI tests

The Conditional Randomization Test (CRT) was originally proposed by [29] as a generic framework that exploits the distributional information X|Z to control the type I error for any score function and any number of data points (valid finite-sample results). This flexibility of the CRT allows for using any advanced black-box predictive model, which plays a key role in achieving high statistical power in the CI testing problem.
In [190], the authors analyze the power of the CRT in a high-dimensional linear regression setting for three different score functions: marginal-covariance-based scores, the ordinary least squares coefficient, and the LASSO [178]. Further, [97] shows that any valid MX CI test must also be valid conditionally on Y, Z, and this conditioning allows one to reduce the composite null to a point null. Also, as a result of the Neyman-Pearson lemma, it is argued that the CRT based on the likelihood score is the optimal test against a point alternative. That work also considers the MX(2) model, under which only the first two moments of X|Z are known (as compared to the vanilla CRT, which requires full knowledge of the law of X|Z), and proposes an MX(2) F-test building upon the generalized covariance measure statistics of [159]. In addition, it derives the asymptotic power of the CRT against local semiparametric alternatives of the form H_1: L(Y|X,Z) = N(X^T β + g(Z), σ²).

On the computational side, using advanced black-box predictive models in the CRT can be prohibitively expensive, due to the repeated fitting of the score function on the resampled data. This issue is exacerbated in multiple testing, where the CRT is used for the feature selection problem. In this approach, the CRT is run for each covariate separately to test its relevance to the response, conditioned on the other covariates. Such multiple usage of the CRT is computationally prohibitive in high-dimensional problems. Alternatively, one can use the model-X knockoff approach proposed by [29] to circumvent this issue, which of course assumes knowledge of the covariates' joint distribution. Several recent works extended this procedure beyond the multivariate Gaussian distribution to a broader range of covariate joint populations; see [156] for hidden Markov models, and [13], which introduced Metropolis knockoff sampling for cases where the covariates are continuous and follow a graphical model.
Despite the fact that the model-X knockoff procedure has alleviated the CRT's computational burden, this benefit often comes at the cost of lower statistical power [29, Section 5.3]. For high-dimensional linear models, [190] shows that the CRT provably dominates model-X knockoffs in the variable selection problem. More precisely, they show that under the high-dimensional linear setup, when the Benjamini-Hochberg (BH) procedure [19] or the adaptive p-value thresholding (AdaPT) procedure [110] is applied to the CRT p-values, a higher statistical power is achieved in comparison to the model-X knockoff framework.

Several other methods have also been proposed recently to reduce the heavy computational cost of the CRT, such as the Holdout Randomization Test (HRT) [175] and the Conditional Randomization Test with Distillation (dCRT) [117]. In [21], the authors proposed the Conditional Permutation Test (CPT) to enhance the robustness of the CRT with respect to approximation errors in the law of X|Z. In addition, to use the CRT for variable selection with an FDR control guarantee, a natural choice is to apply the BH procedure [19] to the p-values returned by the CRT. However, this can be challenging for problems with a large number of predictors p, because at a significance level α, in order to make at least one rejection the number of randomizations M should be large enough that 1/(M+1) ≤ α/p. The reason is that the CRT p-values are inherently discrete and belong to the set {1/(M+1), 2/(M+1), ..., 1}. To this end, [112] proposes the sequential CRT, which combines CRT p-values with the Selective SeqStep+ procedure [7] to address the variable selection problem. Our method can be seen as an alternative approach, where we leverage the i.i.d. property of data samples to construct high-resolution p-values using a small number of randomizations.
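The requirement 1/(M+1) ≤ α/p translates into a concrete, and daunting, number of randomizations; a hypothetical back-of-the-envelope helper (ours, for illustration):

```python
import math

# CRT p-values live on the grid {1/(M+1), ..., 1}, so for BH to possibly reject
# at its smallest threshold alpha/p one needs 1/(M+1) <= alpha/p, i.e.
# M >= p/alpha - 1.
def min_randomizations(alpha: float, p: int) -> int:
    return math.ceil(p / alpha) - 1       # smallest integer M with 1/(M+1) <= alpha/p

assert min_randomizations(0.05, 1000) == 19999   # ~20,000 counterfeits per covariate
assert min_randomizations(0.05, 10) == 199
```

With a thousand predictors at level 0.05 this means roughly 20,000 counterfeit data sets (and score-function fits) per covariate, which is the computational burden the sequential CRT and our high-resolution p-values aim to avoid.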
1.1.2.3 Summary of contributions and organization

In Chapter 3¶, we introduce a novel method called the Pearson Chi-squared Conditional Randomization (PCR) test, which is a general method for investigating the conditional independence of two variables X, Y in the presence of a high-dimensional confounding variable Z. The proposed method is designed for the model-X framework, where the conditional law X|Z is assumed to be known, but the method assumes nothing whatsoever about Y|(X,Z).

¶ This chapter is based on Publication [87]: Adel Javanmard and Mohammad Mehrabi. Pearson chi-squared conditional randomization test. arXiv preprint arXiv:2111.00027, 2021.

Similar to the CRT, the PCR test uses randomization to construct multiple counterfeits of the data and ranks the original data among the counterfeits according to a score function. The score function can be based on arbitrary (potentially complex) predictive models. Unlike the CRT, whose score function takes in the entire dataset, the PCR test works with a score function that applies to subgroups of data, where by changing the size of the groups it can go from the sample level to the entire-dataset level. The PCR test includes a Pearson χ²-test at its heart, which allows it to scrutinize the entire support of the rank distribution to detect statistically significant deviations from the discrete uniform, while the CRT focuses on deviations at the tails of the distribution. This allows us to obtain high-resolution p-values. As we show in numerical experiments, the PCR test achieves relatively high power even with a small number of counterfeits, while other model-X CI tests require a substantially larger number of counterfeits to achieve high power.

In Chapter 3, we present the following contributions:

1. Section 3.1: We present the PCR test statistic, and provide two rejection thresholds for it to control the size of the test under a target level α.
One threshold, denoted θ^finite_{L,α}, is guaranteed to control the size even in the finite-sample regime, while the other threshold, θ^asymp_{L,α}, controls the size for large enough sample sizes (asymptotic regime). Of course, the former turns out to be more conservative, and in our numerical study we observe that for n of the order of a few hundred, the size of the test is already controlled using the threshold θ^asymp_{L,α}.

2. Section 3.2: We provide a power analysis of the PCR test. The distance of alternative distributions to the set of null distributions is measured via a notion called 'conditional relative density', which depends on both the joint law L(X,Z,Y) and the score function. Our analysis reveals the role of different factors, such as the sample size, the number of counterfeits, and the number of labels, which are the input parameters of the PCR test.

3. Section 3.3: As our power analysis reveals, the number of labels (L) used in the PCR test affects its power in a non-trivial way. Here, we suggest running the PCR test for different choices of L and then using Bonferroni's correction to combine the resulting p-values into a valid p-value for the conditional independence hypothesis.

4. Section 3.4: While in the model-X framework it is assumed that the conditional law L(X|Z) is known, in practice one may need to estimate this distribution (e.g., from unlabeled data). In this section, we provide a more conservative version of the PCR test which is more robust to errors in estimating L(X|Z), and avoids inflation of the type I error.

5. Section 3.5: We evaluate the performance of the PCR test on several simulated data sets and on the Capital Bikeshare data set.

6. Section 3.6: We discuss the potential advantages of our PCR test over other CRT-type procedures, both via a set of theoretical results and via thorough numerical examples considering different alternative hypotheses and different choices of score function.
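A toy numeric illustration (ours, not from Chapter 3) of the middle-value inflation depicted in Figure 1.3(c): a Pearson χ² statistic over equal-width bins of the ranks flags a middle-concentrated rank distribution even though neither tail is enriched, which is exactly the regime where a tail-based rejection rule has low power:

```python
import numpy as np

# Ranks concentrated near 1/2 blow up a Pearson chi-squared statistic over
# L equal-width bins of [0, 1], while the extreme tails look perfectly null.
rng = np.random.default_rng(3)
n, L = 2000, 10
ranks = np.clip(0.5 + 0.1 * rng.standard_normal(n), 0.0, 1.0 - 1e-9)  # middle inflation
counts = np.bincount((ranks * L).astype(int), minlength=L)
expected = n / L
chi2 = ((counts - expected) ** 2 / expected).sum()     # Pearson chi-squared, L cells
tail_mass = np.mean((ranks < 0.05) | (ranks > 0.95))   # what a tail-based test sees
assert chi2 > 100        # far above the 5% critical value (about 16.9 for 9 dof)
assert tail_mass < 0.05  # yet the extreme tails show no enrichment at all
```

This is only a caricature of the mechanism; the PCR test of Chapter 3 applies the χ² idea to the labels produced by the randomization procedure, with rejection thresholds calibrated as in contributions 1 and 2 above.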
1.1.3 Goodness-of-fit testing for classification learning

In classification learning, one is given a set of training data {(x_i, y_i)}_{i≤n} (with x_i representing a multi-dimensional feature vector and y_i representing a label variable), and aims to learn a model which can be used to predict the labels of new feature vectors. Classification algorithms are the backbone of machine learning systems and undoubtedly among the most prominent statistical learning tools in data processing. A plethora of classification methods has been proposed in the literature, ranging from logistic regression and generalized linear models to more complex models such as boosting, random forests, and neural networks.

In practice, the performance of these methods is often assessed in terms of accuracy on a test (hold-out) dataset, with the hope that it is a good indicator of the predictive performance of the model on unseen data points. Despite being used widely, classification accuracy alone does not necessarily characterize the deviations of the learnt model from the underlying data generating process. Indeed, relying solely on it as a measure of performance can be misleading, due to model misspecification and over/under-fitting. This leads to the following fundamental question:

(*) How well does a classifier learn the ground truth data generating law between the feature vector and the label?

Developing a statistical test for the above question has a myriad of applications. It can flag the inherent and systematic flaws of a model, and its poor generalization to an unseen population. Further, it provides a more holistic and honest assessment of model performance, which is of paramount importance with the rise of reproducibility issues in modern data analysis. In particular, with the practice of data sharing, many datasets are used routinely as benchmarks to compare different models.
However, perpetual use of public datasets, without a proper mechanism for accessing them that ensures the validity of inferences, causes spurious discoveries and overfitting: learnt models perform well on benchmark datasets but generalize poorly to unseen datasets [152, 93, 53, 52]. Another application of such a test is for models built by commercial "machine learning as a service" providers such as Google and Amazon. They provide platforms where one can upload a dataset and a data classification task and pay to construct a model. It is therefore important to decide whether the current model is sufficiently good (in a statistically sound sense) or whether to continue the training process, which incurs additional cost. Among other applications, the proposed methodology can be used in a K-fold framework to choose the optimal model with respect to its goodness-of-fit, rather than its empirical accuracy on the hold-out set.

In statistics, question (*) is often formulated as a goodness-of-fit test. However, most of the existing literature focuses on specific parametric models, such as logistic regression, and does not apply to more complicated models such as neural networks or random forests. An exception is the very recent seminal work of [203]. We refer to Section 1.1.3.1 for a detailed discussion.

In Chapter 4, we develop a goodness-of-fit test for a broad class of (unknown) data generating rules and black-box models, with only query access. We propose a novel methodology named GRASP (short for Goodness-of-fit with Randomization And Scoring Procedure) for this end, which controls the type I error in finite-sample settings, no matter the features, and makes no parametric assumption (it is distribution-free). We also propose a (less conservative) variant test which comes with asymptotic validity. Both variants show high power in identifying deviations of the classification procedure from the true conditional law of the labels.
We also consider model-X settings, where no knowledge of the conditional distribution of labels (Y|X) is assumed, but we do assume that the joint distribution of the features X is known, e.g., by having access to abundant unlabeled data. This setting has been studied in several recent works on variable selection and conditional testing; see [29, 8, 14, 87] for a non-exhaustive list. We propose model-X GRASP, which leverages this information to obtain better statistical power.

Our focus will be on the binary classification setup. We denote the feature vectors by x ∈ X ⊂ R^d and the binary labels by y ∈ {0, 1}, and define the underlying conditional rule as η(x) = P(Y = +1 | X = x). This simply reflects the likelihood of a feature vector x admitting the positive label. We have access to η(x) only through a set of queries D consisting of n samples drawn i.i.d. from a common law P over X × {0, 1}. We are provided with an estimated model η̂: X → [0, 1], and we would like to assess the performance of this model. We assume that this model has been trained on a dataset disjoint from D (D has no share in the training procedure). The training mechanism is arbitrary; e.g., it can be a neural network or a decision tree, among many others.

As a first step towards answering (*), we pose the following hypothesis testing problem:

    H_0: E[D_f(Bern(η(X)) ‖ Bern(η̂(X)))] ≤ τ,        (1.12)

where the expectation is with respect to the distribution of X. The f-divergence D_f is a given metric to measure the distance between η̂(·) and η(·) (see equation (4.1) for a formal definition). Different divergence functions f lead to different distance measures. For instance, setting f(t) = (1/2)|t − 1| gives the total variation distance E[|η̂(X) − η(X)|]. By letting τ be zero, one can test for perfect alignment of the test model η̂ and the ground truth rule η.

1.1.3.1 Related works

Goodness-of-fit. Once a model is fitted to data, it is important to assess the quality of the fit.
Several methods have been developed for testing the goodness-of-fit of generalized linear models in the low-dimensional setting (d ≪ n), with a focus on logistic regression and multinomial models [76, 105, 180, 171, 115, 75, 134, 56]. For the high-dimensional setting, [158] proposed a framework for testing goodness-of-fit of high-dimensional linear models by using a parametric bootstrap to calibrate the estimated model for scaled residuals. [82] proposed the generalized residual prediction (GRP) test for goodness-of-fit testing of high-dimensional generalized linear models. The aforementioned works focus on classes of parametric models, particularly linear and generalized linear models. For a non-parametric setting, [203] proposed the binary adaptive goodness-of-fit test (BAGofT), which tests for a perfect match between the test model and the ground-truth conditional probability, in an asymptotic regime.

While the existing methods for testing goodness-of-fit mostly focus on contexts where the estimated model converges asymptotically to the ground truth law, we propose a method that is flexible enough for arbitrarily complex classifiers, independent of their predictive performance. In addition, it can be used in the high-dimensional setting (d ≫ n) as well, and still generates a high-resolution p-value for a moderately large number of samples. Furthermore, we consider the tolerance testing scheme, which is more general than perfect-match testing (i.e., η = η̂, a.s.), making it more useful in practice. Our proposed method also accommodates a variety of metrics, including the average absolute distance and the excess cross entropy, among many others.

We would like to highlight a distinction between our problem and the conventional setup for testing goodness-of-fit. In our setup, the model estimate η̂ is learned on the training data and is subsequently evaluated on an independent test data set.
In hypothesis (4.2), the model estimate η̂ is fixed, and the randomness stems solely from the test data. This differs from other goodness-of-fit setups, such as those described in [203, 82], where the statistical inference takes into account the variability of the training set. For example, the phrase 'finite sample size' in those works refers to the size of the training data, while in our setting it refers to the size of the test data.

Prediction error. Estimating the prediction performance of a model is one of the core tasks in data-driven applications [72]. In particular, it can serve as a benchmark for model selection. Covariance penalty (CP) and cross validation (CV) are two widely used methods to estimate the prediction performance of a regression model [2, 120, 54, 55]. However, it has been shown recently that such methods are not statistically accurate in measuring the out-of-sample performance of models. In fact, [153, 189] showed that the covariance penalty and AIC statistic [2] methods reflect the in-sample prediction error. In addition, [15] revealed potential inaccuracies of cross validation, and argued that it indeed measures the average prediction accuracy over many hypothetical datasets. In this work, we take another perspective on a model's prediction error. For a broad class of models (e.g., neural networks, decision trees, boosting algorithms, etc.), and for the widely used negative log-likelihood loss (a.k.a. cross entropy), our framework allows us to compare the model's loss with the optimal oracle loss. Formally, for the estimated probabilistic model p̂_{Y|X}(y|x) and a specific choice of f-divergence function, we can test the null hypothesis:

    E_{(x,y)∼P}[−log p̂_{Y|X}(y|x)] − min_{p̃_{Y|X}} E_{(x,y)∼P}[−log p̃_{Y|X}(y|x)] ≤ τ.

Note that the minimum in the second term is achieved at p̃_{Y|X} = p_{Y|X}.

Calibration.
With the rise of deploying machine learning systems in real life, the confidence of these systems in their predictions is of great importance. Classification procedures often output a confidence value $\hat\pi \in [0,1]$ along with their predicted value $\hat y$, which is supposed to indicate the model's certainty about $\hat y$ being equal to $y$ (correct prediction). Calibration of a model refers to aligning such certainty with its long-run accuracy. A rather surprising observation made recently is that many modern machine learning methods are not well calibrated [69, 1, 182, 193, 100]. Expected calibration error (ECE) [127] is a well-known metric to measure the calibration of models, formally defined as
$$\mathbb{E}_{\hat\pi}\big[\,|\mathbb{P}(\hat y = y \mid \hat\pi) - \hat\pi|\,\big].$$
In [106], the authors proposed a method for testing for the perfect calibration of generic predictive models in multi-class classification problems. The initial intuition behind model calibration is to test for the closeness of classification models to the ground truth law, but mathematically it has a different formulation from the goodness-of-fit testing problem considered in the current work. In an extreme instance, a model is well calibrated if $\mathbb{E}[\eta(X) \mid \hat\eta(X) = \eta] = \eta$ (regression setting), but in (1.12) with, e.g., the average absolute difference metric, a perfect fit ($\tau = 0$) corresponds to $\hat\eta(X) = \eta(X)$, almost surely.

Hypothesis testing for nonparametric regression functions. In [126], the authors construct confidence sets for the regression function in nonparametric binary regression with an unknown design density. It is assumed that the regression function $\eta(\cdot)$ and the marginal probability density function of the features belong to a continuous class of Sobolev-type spaces.
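To make the ECE metric concrete, here is a minimal binned estimator in the style commonly used in practice (the binning scheme and function name are our own illustration, not taken from [127]):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned estimate of ECE = E[|P(y_hat = y | pi_hat) - pi_hat|]:
    average |accuracy - confidence| over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # bin (lo, hi]; put exact zeros in the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# A model that is right half the time but always reports 0.9 confidence:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```

Note how this miscalibrated model can still have an accurate $\hat\eta$ or not; ECE measures only the alignment of confidence with accuracy, which is the distinction from goodness-of-fit drawn above.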
Beyond results on adaptive parameter estimation, this work provides a framework for testing the null hypothesis that the regression function is equal to $1/2$ versus its alternative, while allowing the marginal density function to be a general function in a Sobolev-type space. Furthermore, it is shown that the complexity of the null hypothesis does not affect the minimal rate of separation between the null and the alternative. Further along this line of research, [111] considers the detection problem for a response function $f$ in a stochastic model $dX(t) = f(t)\,dt + \varepsilon\, dW(t)$, with $W(t)$ the standard Wiener process. The detection problem corresponds to testing the null hypothesis $f = 0$ against its alternative. In addition, [79] considers simple hypothesis testing of the form $f = f_0$ in a multivariate setup $y = f(x) + \varepsilon$ with isotropic Gaussian noise on $[0,1]^d$. We refer to [78] for more details on nonparametric hypothesis testing for stochastic Gaussian models.

1.1.3.2 Summary of contributions and organization

In Chapter 4∥, we introduce a novel method for tolerance testing of the average distance of classifiers to the underlying conditional law of the labels. Our proposed method, called GRASP, can be used for arbitrarily complex black-box models (with only query access), with no parametric assumption whatsoever, regardless of the features. GRASP can be used in the high-dimensional setting, where the feature dimension can exceed the sample size. We propose two variants of GRASP: one controls the type I error in finite-sample settings, and the other (less conservative) version controls the type I error in the asymptotic setting. Through extensive numerical experiments, we show that GRASP achieves high statistical power. For model-X settings, where the distribution of the features can be well approximated, we propose model-X GRASP, which is built upon similar ideas as the distribution-free version but harnesses knowledge of the features distribution to improve statistical power.
Here is an outline of Chapter 4:

• Section 4.1: We first provide a brief review of f-divergence functions, and then formulate the goodness-of-fit test as a tolerance hypothesis testing problem. We end this section with a short review of some convex analysis definitions.

• Section 4.2: We start by focusing on the distribution-free regime, and provide the high-level intuitions behind the GRASP algorithm. We next formally introduce the GRASP procedure along with its test statistics and decision rules. Next, we prove that the type I error of GRASP (size of the test) can be controlled in finite-sample settings, for arbitrary data generative rules and general classifiers. We also introduce a less conservative version of GRASP that has asymptotic control on the type I error. The GRASP test uses a score function $T$ in forming the test statistics. It is worth noting that the size of the test is controlled for arbitrary score functions. The choice of score function, though, impacts the power of the test, and we discuss some choices in Section 4.3.3. We conclude Section 4.2 by characterizing one-sided confidence intervals and p-values for hypothesis (1.12).

• Section 4.3: We move to the model-X setting, where abundant unlabeled data points are available. We propose model-X GRASP, which uses this data to learn the features distribution and utilizes it for a statistically more powerful procedure. Similar to the distribution-free setting, we show that the size of the test is controlled under the pre-determined level, for a finite number of data points. Further, a less conservative decision rule is introduced which has asymptotic control on the type I error. In Section 4.3.3 we discuss the role of the score function $T$ on the power.

∥ This chapter is based on Publication [88]: Adel Javanmard and Mohammad Mehrabi. GRASP: A goodness-of-fit test for classification learning. arXiv preprint arXiv:2209.02064, 2022.
We derive the form of the optimal score function, which depends on the data generative law $\eta(x)$. We discuss two approaches: (i) a model-agnostic approach, which replaces $\eta(x)$ by $1/2$ (random guessing) in the optimal score function; (ii) a GAN-based approach, which uses generative adversarial networks (GANs) to estimate the densities required by the optimal score function.

• Section 4.4: We provide extensive numerical experiments to evaluate the performance of the distribution-free and model-X GRASP (type I control), the power of these tests and the advantage of the model-X framework in achieving a higher power, as well as the impact of the score function $T$ on the power.

1.2 Adversarial Robustness

Modern machine learning algorithms, and in particular deep neural networks, have demonstrated breakthrough empirical performance, and have been deployed in a multitude of application domains ranging from visual object classification to speech recognition, robotics, natural language processing and healthcare. The common practice to train these models is by empirical loss minimization on the training data. Nonetheless, it has been observed that the resulting models are surprisingly vulnerable to minute discrepancies between the test and the training data distributions. There are several well-documented examples of such behavior in computer vision and image processing, where small imperceptible manipulations of images can significantly compromise the performance of state-of-the-art classifiers [173, 23]. Other examples include well-designed malicious content like malware which can be labeled legitimate by the classifier and harm the system [33, 137], or adversarial attacks on speech recognition systems, such as Google Now or Siri, which consist of voice commands that are incomprehensible or even completely inaudible to humans and can still control the systems [31, 183, 201]. It is evident that in practice such vulnerability can have catastrophic consequences.
By studying adversarial samples, one can in turn improve the robustness of machine learning algorithms against adversarial attacks. In the past few years, there has been significant research on generating various adversarial samples [32, 3, 66, 138] and defenses [118, 34, 139]. Among the considerable efforts to improve the adversarial robustness of algorithms, adversarial training is one of the most effective techniques. Adversarial training is often formulated as a minimax optimization problem, where the inner maximization aims to find an adversarial example that maximizes a predictive loss function, while the outer minimization aims to train a robust estimator using the generated adversarial examples [67, 101, 118, 145, 195].

1.2.1 Tradeoffs in the distributionally adversarial setup

While adversarial training techniques have been successful in improving the adversarial robustness of models, their full effect on machine learning systems is not well understood. In particular, some studies [118] observed that the robustness virtue of adversarial training comes at the cost of worsening performance on natural unperturbed inputs, i.e., increasing the generalization error. However, [181] observes empirically that when there is very little training data, adversarial training can help with reducing the generalization error. Complicating matters further, [146] argues that additional unlabeled data can mitigate the tension between adversarial risk (predictive performance against adversarial perturbations) and the standard risk (predictive performance when there is no adversary, a.k.a. generalization error). These observations raise the following important question regarding adversarial training:

Is there a 'fundamental' tradeoff between adversarial risk and standard risk? Or do there exist models that are optimal with respect to both of these measures?
What are the roles of different factors, such as the adversary's power, the problem dimension, and the complexity of the model class (e.g., the number of neurons), in the interplay between standard risk and adversarial risk?

Here, by 'fundamental tradeoff' we mean a tradeoff that holds given unlimited computational power and infinite training data to train a model. In this work, we answer these questions for adversarial distribution shifts, where the adversary can shift the test data distribution, making it different from the training data distribution. The test data distribution can be an arbitrary but fixed distribution in a neighborhood of the training data distribution, and the radius of this neighborhood is a measure of the adversary's power.

1.2.1.1 Contributions

We next summarize our contributions in Chapter 5∗∗:

• We characterize the fundamental tradeoff between standard risk and adversarial risk for distributionally adversarial training in the settings of linear regression and binary classification (under a Gaussian mixtures model). We focus on the infinite-data limit ($n \to \infty$) with finite feature dimension ($d$), and hence our analysis is at the population level. The fundamental tradeoff is characterized by studying the Pareto optimal fronts of the achievable region in the two-dimensional standard risk-adversarial risk plane. The Pareto optimal front consists of the set of estimators for which one cannot decrease both the standard and the adversarial risk by deviating from these estimators. Similar tradeoffs have been derived for the linear regression setting with norm-bounded adversarial perturbations and isotropic Gaussian features [95]. Here we focus on distribution-perturbing adversaries and consider general anisotropic Gaussian features.

• For binary classification we consider a Gaussian mixtures model with general feature covariance and a distribution-perturbing adversary, where the perturbation is measured in terms of the Wasserstein metric with a general $\ell_r$ norm.
(We refer to Sections 5.0.2 and 5.1.2 for further details and formal definitions.) Our analysis shows how the fundamental tradeoff between standard and adversarial risk is impacted by a variety of factors, such as the adversary's power, feature dimension, feature correlation, and the choice of the $\ell_r$ perturbation norm. An interesting observation is that for $r = 2$ the tradeoff between standard and adversarial risk vanishes. In other words, there exists a model which achieves both the optimal standard risk and the optimal adversarial risk.

• We also study the Pareto optimal tradeoffs between the standard and adversarial risks for the problem of learning an unknown function over the $d$-dimensional sphere using the random features model. This can be represented as a linear model with $N$ random nonlinear features of the form $\sigma(w_a^T x)$, $1 \le a \le N$, with $\sigma(\cdot)$ a nonlinear activation. Equivalently, this can be characterized as fitting a two-layer neural network with a random first layer. Building upon an approximation formula for the adversarial risk, we study the effect of the network width $N$ on the tradeoff between standard and adversarial risks.

1.2.1.2 Further related work

Very recent works [94, 174] have focused on binary classification under the Gaussian mixtures model and proposed a precise characterization of the standard and adversarial risk achieved by a specific class of adversarial training approaches [181, 119]. These works consider an asymptotic regime where the sample size grows in proportion to the problem dimension $d$ and focus on norm-bounded adversarial perturbations. In comparison, we consider a fixed-$d$, infinite-$n$ setting and consider distribution-perturbing adversaries.

∗∗ This chapter is based on Publication [122]: Mohammad Mehrabi, Adel Javanmard, Ryan A. Rossi, Anup Rao, and Tung Mai. Fundamental tradeoffs in distributionally adversarial training. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 7544–7554. PMLR, 2021.
Also, we focus on fundamental tradeoffs achieved by any linear classifier, while [94, 174] work with a specific class of saddle-point estimators. The work [47] also considers norm-bounded adversarial perturbations for the classification problem and studies the optimal $\ell_2$- and $\ell_\infty$-robust linear classifiers, assuming access to the class averages. Furthermore, it also studies the tradeoff between standard and robust accuracies from a Bayesian perspective by contrasting this optimal robust classifier with the Bayes optimal classifier in a non-adversarial setting.

1.2.2 Adversarial robustness for latent models

As discussed in the previous section, in response to the observed fragility of machine learning systems, a growing body of work in the past few years has sought to improve the robustness of machine learning systems against adversarial attacks. Despite remarkable progress in designing robust training algorithms and certifiable defenses, it is often observed that these methods compromise the statistical accuracy on unperturbed test data (i.e., test data drawn from the same distribution as the training data). Such observations have led prior work to speculate a tradeoff between the two fundamental notions of robustness and generalization (for a non-exhaustive list see, e.g., [118, 146, 124, 122]). For example, the highest obtained $\ell_\infty$-robust accuracy on CIFAR10 (without using additional data) with $\varepsilon_\infty = 8/255$ is 60%, with a standard accuracy of 85% (which is 10% less than the state-of-the-art standard accuracy for $\varepsilon_\infty = 0$). Some of the promising adversarial training methods, such as TRADES [202], acknowledge such a tradeoff by including a regularization parameter which allows tuning between these two measures of performance.
There has also been a recent line of work [94, 95] which provides a precise asymptotic theory for this tradeoff and how it is quantitatively shaped by different components of the learning problem (e.g., the adversary's power, the geometry of the perturbation set, overparametrization, the noise level in the training data, etc.). For the settings of linear regression and binary classification it is proved that there is an inherent tradeoff between robustness and standard accuracy (generalization) which holds at the population level and for any (potentially computationally intensive) training algorithm [95, 47, 122]. Nonetheless, these works make strong assumptions on the distribution of the data (e.g., Gaussian or Gaussian mixture models), which fail to capture various natural structures in data. This stimulates the following tantalizing question:

(*) Are there natural data generative models under which the tradeoff between robustness and the standard accuracy (generalization) vanishes, in the sense that one can find models which perform well (or are even optimal) with respect to both measures?

As a step toward answering this question, [199] show that when data is well separated, there is no inherent conflict between standard accuracy and robustness. It also provides numerical experiments on a few image datasets to argue that these data are indeed $r$-well separated for some value $r$ larger than the perturbation radii used in adversarial attacks (i.e., data from different classes are at least distance $r$ apart in the pixel domain). In [197], adversarially robust estimators are studied for the setup of linear regression, and a lower bound on their statistical minimax rate is derived. The minimax rate lower bound for the sparse model is much smaller than the one for the dense model, whereby [197] argues the importance of incorporating sparsity structure in improving robustness. The current work takes another perspective toward question (*) by considering low-dimensional manifold structures in data.
Many high-dimensional real-world data sets enjoy low-dimensional structures, and learning low-dimensional representations of raw data is a common task in information processing. In fact, the entire fields of dimensionality reduction and manifold learning have been developed around this task. To give concrete examples, the MNIST database of handwritten digits consists of images of size $28 \times 28$ (i.e., an ambient dimension of 784), while its intrinsic (manifold) dimension is estimated to be $\approx 14$, based on local neighborhoods of data. Likewise, the CIFAR10 database consists of color images of size $32 \times 32$ (i.e., an ambient dimension of 3,072), but its intrinsic dimension is estimated to be $\approx 35$ [36, 154, 166]. The high-level message of the current work is that low-dimensional structures in data can mitigate the tradeoff between standard accuracy and robustness, and potentially enable training models that perform gracefully (or even optimally) with respect to both measures.

1.2.2.1 Summary of contributions

In Chapter 6††, we focus on two widely used models for binary classification, namely the Gaussian-mixture model and the generalized linear model, where we also assume that the feature vectors lie on a $k$-dimensional manifold in a $d$-dimensional space ($k < d$). We consider an adversarial setting with norm-bounded perturbations (in $\ell_p$ norm), for general $p \ge 2$. We use the minimum nonzero singular value of the 'lifting matrix' $W \in \mathbb{R}^{d \times k}$ (between the manifold and the ambient space) as a measure of the low-dimensional structure of the data; cf. (6.5) and (6.6). We assess the generalization property of a model through the notion of standard risk, and its robustness against adversarial perturbations through the notion of adversarial risk (see Chapter 6 for the formal definitions). Our main contributions are summarized as follows:

• Under both data generative models, we derive the Bayes-optimal estimators, which provably attain the minimum standard risk.
We prove that as long as $\sigma_{\min}(W)$ diverges as $d \to \infty$, with a growth rate that depends on the adversary's power $\varepsilon_p$ and the perturbation norm $\ell_p$, then the Bayes-optimal estimator asymptotically achieves the minimum adversarial risk as well. This implies that the tradeoff between robustness and generalization asymptotically disappears as the data becomes more structured.

• While the gap between the optimal standard risk and the optimal adversarial risk shrinks for data with low-dimensional structure, we show that these two risk measures (as functions of estimators) stay away from each other. Specifically, we come up with an estimator for which the standard and adversarial risks remain away from each other by a constant $c > 0$ independent of $k, d$.

• In Section 6.2.1, we consider an adversarial training method based on robust empirical loss minimization. While this algorithm is structure-agnostic, we provably show that it results in models that are robust and also generalize well, under low-dimensional data structure. Note that the data structure (distribution), even if not deployed by the training procedure, still comes into the picture, as the adversarial risk and the standard risk are defined with respect to this data distribution.

• We corroborate our theoretical findings with several synthetic simulations. We also train Mixture of Factor Analyzers (MFA) models on the MNIST image dataset. This results in low-rank models from which we can generate new images. Furthermore, the Bayes-optimal classifier can be computed precisely for the MFA model. We show empirically that as the ratio of the ambient dimension to the rank diverges (data becomes more structured), the gap between the standard risk and the adversarial risk vanishes for the Bayes-optimal classifier.

†† This chapter is based on Publication [86]: Adel Javanmard and Mohammad Mehrabi. Adversarial robustness for latent models: Revisiting the robust-standard accuracies tradeoff. arXiv preprint arXiv:2110.11950, 2021.
In other words, the Bayes-optimal estimator becomes optimal with respect to both risks.

1.2.2.2 Related work

There is a growing body of work on the tradeoff between robustness and generalization (see, e.g., [181, 118, 202, 146, 199, 124, 122]). In particular, [47] considers the isotropic Gaussian-mixture model with two and three classes, and derives Bayes-optimal robust classifiers for $\ell_2$ and $\ell_\infty$ adversaries. This work proves a tradeoff between the standard and the robust risks, which grows when the classes are imbalanced. The prior works [81, 165, 170] proposed the concept of on-manifold attacks, where the adversarial perturbations are carried out in the latent low-dimensional space. In [170], it is argued that on-manifold adversarial examples act as generalization error, and adversarial training against such attacks improves the generalization of the model as well. In addition, a so-called on-manifold adversarial training (based
Gaussian-mixture model and generalized linear model) 39 PartI ReliableInference 40 Chapter2 OnlineDebiasing 2.1 Onlinedebiasingforhigh-dimensionaltimeseries Vector autoregressive (VAR) model is among the most popular models in time series analysis. These models are extensively used across science and engineering (see [59, 169, 74, 157] for notable examples in macroe- conomics, genomics and neuroscience). Given the data z 1 ,...,z T , the fundamental task is to estimate the parameters of theVAR model, viz. the matricesA (1) ,...,A (d) in Equation (1.5). The estimates of the parameters can be used in a variety of ways depending on the context: to detect or test for stationarity, forecast future data, or suggest causal links. Since each matrix isp× p, this forms a putative total ofdp 2 parameters, which we estimate from a total of(T − d)p linear equations (Eq.(1.5) witht=d+1,...,T ). For thei θ coordinate ofz t , Eq.(1.5) reads z t,i = d X ℓ=1 ⟨z t− ℓ ,A (ℓ) i ⟩+ζ t,i , (2.1) 41 where A (ℓ) i denotes the i θ row of the matrix A (ℓ) . This can be interpreted in the linear regression form, Eq.(1.1), in dimensiondp withθ 0 ∈R dp ,X ∈R (T− d)× dp ,y,ε∈R T− d identified as: θ 0 =(A (1) i ,A (2) i ,...,A (d) i ) T , X = z T d z T d− 1 ... z T 1 z T d+1 z T d ... z T 2 . . . . . . . . . . . . z T T− 1 z T T− 2 ... z T T− d , y =(z d+1,i ,z d+2,i ,...,z T,i ), ε=(ζ d+1,i ,ζ d+2,i ,...,ζ T,i ). (2.2) We fix i and in order to lighten the notation, we hereafter remove the dependence oni. We letσ 2 ≡ Σ ζ t,i the variance of entries ofε, and denote the rows ofX byx 1 ,...,x n ∈R dp , withn=T− d. Given sufficient data, or whenT is large in comparison withdp, it is possible to estimate the parameters using least squares [163, 103]. In [12], Basu and Michailidis consider the problem of estimating the parameters when number of time pointsT is small in comparison with the total number of parametersdp, with the proviso that the matricesA (ℓ) are sparse. 
Their estimation results build on similar ideas as [26, Theorem 6.1], relying on proving a restricted eigenvalue property for the design $X^T X / n$. This result hinges on stationarity properties of the model (1.5), which we summarize prior to stating the estimation result. We recall the following definition from [12].

Definition 2.1.1 (Stability and invertibility of the VAR(d) process [12]). A VAR(d) process with associated reverse characteristic polynomial
$$A(\gamma) = I - \sum_{\ell=1}^{d} A^{(\ell)} \gamma^\ell \qquad (2.3)$$
is called stable and invertible if $\det(A(\gamma)) \neq 0$ for all $\gamma \in \mathbb{C}$ with $|\gamma| = 1$. Based on this characteristic polynomial, we also define the following spectral parameters:
$$\mu_{\min}(A) = \min_{|\gamma|=1} \lambda_{\min}\big(A^*(\gamma) A(\gamma)\big), \qquad \mu_{\max}(A) = \max_{|\gamma|=1} \lambda_{\max}\big(A^*(\gamma) A(\gamma)\big).$$

Theorem 2.1.2 (Estimation bound). Recall the relation $y = X\theta_0 + \varepsilon$, where $X, y, \theta_0$ are given by (2.2), and let $\hat\theta^L$ be the Lasso estimator
$$\hat\theta^L = \arg\min_{\theta \in \mathbb{R}^{dp}} \Big\{ \frac{1}{2n} \|y - X\theta\|_2^2 + \lambda_n \|\theta\|_1 \Big\}. \qquad (2.4)$$
Assume that $|{\rm supp}(\theta_0)| \le s_0$, and define
$$\omega = \frac{d\, \lambda_{\max}(\Sigma_\zeta)}{\lambda_{\min}(\Sigma_\zeta)} \cdot \frac{\mu_{\max}(A)}{\mu_{\min}(A)}, \qquad \alpha = \frac{\lambda_{\min}(\Sigma_\zeta)}{\mu_{\max}(A)}.$$
There exists a universal constant $C > 0$ such that, for any $n \ge C \alpha \omega^2 s_0 \log(dp)$ and $\lambda_n = \lambda_0 \sqrt{\log(dp)/n}$, with $\lambda_0 \ge 4 \lambda_{\max}(\Sigma_\zeta)(1 \vee \mu_{\max}(A))/\mu_{\min}(A)$, the following happens. With probability at least $1 - (dp)^{-6}$, the estimate satisfies:
$$\|\hat\theta^L - \theta_0\|_1 \le C\, \frac{\lambda_0}{\alpha} \sqrt{\frac{s_0^2 \log(dp)}{n}}.$$

In short, in the standardized setting where $\lambda_0, \alpha$ are of order one, the $\ell_1$ estimation error rate is of order $s_0 \sqrt{\log(dp)/n}$, which is the same rate obtained for data without temporal dependence. Our proof is similar to that of Basu and Michailidis [12], and relies on establishing a now-standard restricted eigenvalue property for the design $X^T X / n$. The spectral characteristics of the time series quantified in Definition 2.1.1 play an important part in establishing this. We refer the reader to Appendix A.6 for the proof, as well as a discussion of the differences with the proof of [12].
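To make the Lasso program (2.4) concrete, here is a small pure-Python coordinate-descent sketch (our own illustration under the standard soft-thresholding update; the dissertation's experiments presumably use an optimized solver):

```python
def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for the Lasso (2.4):
    argmin_theta (1/2n)||y - X theta||_2^2 + lam * ||theta||_1."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of column j with the partial residual (excluding j)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * theta[k]
                      for k in range(p) if k != j)) for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # soft-thresholding update for coordinate j
            if rho > lam:
                theta[j] = (rho - lam) / z
            elif rho < -lam:
                theta[j] = (rho + lam) / z
            else:
                theta[j] = 0.0
    return theta

# Orthogonal design: the active coordinate is shrunk toward zero, the
# inactive one is set exactly to zero.
print(lasso_cd([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0], lam=0.25))  # [1.5, 0.0]
```

The exact zeroing of small coordinates is what produces the sparse estimates that Theorem 2.1.2 analyzes.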
2.1.1 Constructing the online debiased estimator

Our task now is to construct a predictable sequence of debiasing matrices $\{M_i\}_{i \le n}$. One simple approach is the 'sample-splitting' approach: construct a generalized inverse $M$ based on the first $n/2$ data points using, for example, the program of [90], and let the sequence $\{M_i\}_{i \le n}$ be defined by $M_i = 0$ if $i \le n/2$, and $M_i = M$ if $n/2 < i \le n$. It is easy to see that this is a valid predictable sequence. However, due to sample-splitting, it does not make efficient use of the data and loses power. More importantly, it is not clear that the underlying martingale (the noise component of the debiased estimator $\sqrt{n} W_n = \sum_i M_i x_i \varepsilon_i$) will be stable in the sense of Definition 1.1.2.

Our proposal generalizes sample-splitting via an episodic structure and, importantly, regularizes to ensure stability. We partition the time indices $[n]$ into $K$ episodes $E_0, \ldots, E_{K-1}$, with $E_\ell$ of length $r_\ell$, so that $\sum_{\ell=0}^{K-1} r_\ell = n$. Over an episode $\ell$, we keep the debiasing matrix $M_i = M^{(\ell)}$ fixed over the time points in the episode. Moreover, $M^{(\ell)}$ is constructed using all the time points in previous episodes $E_0, \ldots, E_{\ell-1}$ in the following way. Let $n_\ell = r_0 + \ldots + r_{\ell-1}$, for $\ell = 1, \ldots, K$; hence, $n_K = n$. Define the sample covariance of the features in the first $\ell$ episodes,
$$\hat\Sigma^{(\ell)} = \frac{1}{n_\ell} \sum_{t \in E_0 \cup \ldots \cup E_{\ell-1}} x_t x_t^T.$$
The matrix $M^{(\ell)}$ has rows $(m^\ell_a)_{a \in [dp]}$ given as the solution of the optimization
$$\text{minimize} \quad m^T \hat\Sigma^{(\ell)} m \qquad \text{subject to} \quad \|\hat\Sigma^{(\ell)} m - e_a\|_\infty \le \mu_\ell, \quad \|m\|_1 \le L, \qquad (2.5)$$
for appropriate values of $\mu_\ell, L > 0$.

[Figure 2.1: Schematic for constructing the debiasing matrices $M^{(\ell)}$. We divide time into $K$ episodes $E_0, \ldots, E_{K-1}$; in episode $\ell$, $M_i$ is held constant at $M^{(\ell)}$, which is a function of $x_t$ in all prior episodes.]
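Whether a candidate row $m$ satisfies the constraints of program (2.5), and what objective value it attains, is easy to check; a minimal sketch (function name is ours):

```python
def check_row_feasible(Sigma_hat, m, a, mu, L):
    """Check the constraints of program (2.5) for a candidate row m:
    ||Sigma_hat m - e_a||_inf <= mu and ||m||_1 <= L.
    Returns (feasible, objective m^T Sigma_hat m)."""
    p = len(Sigma_hat)
    Sm = [sum(Sigma_hat[i][j] * m[j] for j in range(p)) for i in range(p)]
    inf_norm = max(abs(Sm[i] - (1.0 if i == a else 0.0)) for i in range(p))
    l1 = sum(abs(v) for v in m)
    obj = sum(m[i] * Sm[i] for i in range(p))
    return (inf_norm <= mu and l1 <= L), obj

# With Sigma_hat = I, the row m = e_a is feasible even for mu = 0,
# with objective value 1.
print(check_row_feasible([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], 0, 0.0, 2.0))
```

In practice one searches for the smallest $\mu_\ell$ for which a feasible $m$ exists, which is the linear-program step mentioned below.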
We then construct the online debiased estimator for coordinate $a$ of $\theta_0$ as follows:
$$\hat\theta^{\rm on} = \hat\theta^L + \frac{1}{n} \sum_{\ell=1}^{K-1} \sum_{t \in E_\ell} M^{(\ell)} x_t \big( y_t - \langle x_t, \hat\theta^L \rangle \big). \qquad (2.6)$$
Figure 2.1 shows a schematic illustration of this construction. As we discuss later, the tuning parameters $\mu_\ell$ and $L$ control the bias of $\hat\theta^{\rm on}$ (see the bias-variance decomposition of $\hat\theta^{\rm on}$ given by (2.11) and Lemma A.6.5 for the rigorous claim). The objective function $m^T \hat\Sigma^{(\ell)} m$ also directly impacts the variance of $\hat\theta^{\rm on}$, and hence controls the length of confidence intervals, or the statistical power when testing the null hypothesis $H_{0,a}: \theta_{0,a} = 0$. In other words, the tuning parameters $\mu_\ell$ control the interplay between the bias and the variance of $\hat\theta^{\rm on}$. We would like to choose $\mu_\ell$ small enough to reduce the bias, but large enough that the optimization (2.5) is still feasible. The following lemma shows that, with high probability, $\mu_\ell$ of order $\omega \sqrt{\log(dp)/n_\ell}$ is sufficient to make the optimization feasible. Recall that $\omega$ is defined in Theorem 2.1.2.

Lemma 2.1.3. Let $\Omega = \Sigma^{-1} = (\mathbb{E}\{x_t x_t^T\})^{-1}$ be the precision matrix of the time series. There exist universal constants $C, C'$ such that the following happens. Suppose that $n_\ell \ge C \omega^2 \log(dp)$. Then, with probability $1 - (dp)^{-6}$:
$$\max_{i,j} \big| (\Omega \hat\Sigma^{(\ell)})_{ij} - \mathbb{I}(i = j) \big| \le C' \omega \sqrt{\frac{\log(dp)}{n_\ell}}.$$

The proof of Lemma 2.1.3 is given in Appendix A.6.3. In simulations and in our developed software package∗, we first solve a linear optimization to find the smallest $\mu_\ell$ that makes the optimization (2.5) feasible, and then construct $M^{(\ell)}$ using that value of $\mu_\ell$.

Choice of episode length. In Section 2.1.2, we show that the constructed online debiased estimator $\hat\theta^{\rm on}$ is asymptotically unbiased and admits a normal distribution. To do so, we provide a high-probability bound on the bias of $\hat\theta^{\rm on}$ (see Lemma A.6.5).
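The update (2.6) itself is a straightforward weighted sum of per-episode residual corrections. A minimal sketch (our own illustration; episode indices and the matrices $M^{(\ell)}$ are taken as given inputs):

```python
def online_debias(theta_lasso, episodes, M, X, y):
    """Online debiased estimator (2.6):
    theta_on = theta_lasso
               + (1/n) sum_l sum_{t in E_l} M^(l) x_t (y_t - <x_t, theta_lasso>).
    episodes: list of index lists E_1, ..., E_{K-1};
    M: list of the matching debiasing matrices M^(1), ..., M^(K-1)."""
    p, n = len(theta_lasso), len(X)
    theta_on = list(theta_lasso)
    for E, Ml in zip(episodes, M):
        for t in E:
            resid = y[t] - sum(X[t][j] * theta_lasso[j] for j in range(p))
            for a in range(p):
                theta_on[a] += sum(Ml[a][j] * X[t][j] for j in range(p)) * resid / n
    return theta_on

# Sanity check: with an exact pilot estimate and noiseless data, every
# residual is zero, so the correction term vanishes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
theta0 = [2.0, -1.0]
y = [sum(x[j] * theta0[j] for j in range(2)) for x in X]
M = [[[1.0, 0.0], [0.0, 1.0]]]
print(online_debias(theta0, [[2, 3]], M, X, y))  # [2.0, -1.0]
```

Only the samples in episodes $E_1, \ldots, E_{K-1}$ enter the correction, matching the fact (noted below) that the first episode is never debiased.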
This bound is in terms of the batch sizes $r_\ell$, from which we propose the following guideline for choosing them to achieve the optimal order of the bias: $r_0 = \sqrt{n}$ and $r_\ell = \beta^\ell$, for a constant $\beta > 1$ and $\ell \ge 1$. Here, we would like to explain the role of the batch sizes qualitatively. Note that online debiasing does not debias the samples in the first episode. On the one hand, we want to keep $r_0$ small enough to control the contribution of this episode to the bias of $\hat\theta^{\rm on}$. On the other hand, we want a larger $r_0$ so that the concentration of $\hat\Sigma^{(\ell)}$ kicks in faster, which yields a smaller $\mu_\ell$ for the next episodes (see Lemma 2.1.3) and therefore better control of the bias in the optimization (2.5). For the other values of $r_\ell$: as the index $\ell$ progresses, the debiasing matrix $M^{(\ell)}$ becomes more effective as $\mu_\ell$ gets smaller, and hence we would like to use it over a longer episode. As we show in Theorem 2.1.4, for any constant $\beta > 1$, the online debiased estimator $\hat\theta^{\rm on}$ is asymptotically unbiased assuming $s_0 \log(dp)/\sqrt{n} = o(1)$. For finite sample sizes, however, different choices of $\beta$ may have different effects in removing the bias. In our software package, we have left it as a user-defined input parameter. We refer to Appendix A.3 for further discussion on the robustness of the inference outputs to the choice of $\beta$.

Before proceeding to the distributional characterization of the online debiased estimator for $\theta_0$ (entries of the coefficient matrices $A^{(\ell)}$), we revisit the numerical example from Section 1.1.1.1, in which the (offline) debiased estimator of [90] does not display an unbiased normal distribution. However, as we will observe, the constructed online debiased estimator empirically admits an unbiased normal distribution.

Revisiting the numerical example from Section 1.1.1.1. In Section 1.1.1.1, we considered a VAR(d) model with $p = 15$, $d = 5$, $T = 60$, and diagonal $A^{(\ell)}$ matrices with value $b = 0.15$ on their diagonals.

∗ http://faculty.marshall.usc.edu/Adel-Javanmard/OnlineDebiasing/
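The episode-length guideline $r_0 = \sqrt{n}$, $r_\ell = \beta^\ell$ can be sketched as follows (our own sketch under one extra assumption not stated in the text: the last episode is truncated so that the lengths sum exactly to $n$):

```python
import math

def episode_lengths(n, beta=2.0):
    """Episode lengths following the guideline r_0 = sqrt(n), r_l = beta^l,
    with the final episode truncated so the lengths sum to n."""
    r = [max(1, math.isqrt(n))]   # r_0 ~ sqrt(n)
    l = 1
    while sum(r) < n:
        r.append(min(int(beta ** l), n - sum(r)))
        l += 1
    return r

lengths = episode_lengths(100, beta=2.0)
print(lengths, sum(lengths))  # [10, 2, 4, 8, 16, 32, 28] 100
```

With geometric growth, the number of episodes $K$ is only logarithmic in $n$, so later (better-concentrated) debiasing matrices are used over most of the data.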
The covariance matrix $\Sigma_\zeta$ of the noise terms $\zeta_t$ is chosen as $\Sigma_\zeta(i,j) = \rho^{\mathbb I(i\ne j)}$ with $\rho = 0.5$ and $i,j\in[p]$. The population covariance matrix of the vector $x_t = (z_{t+d-1}^{\sf T},\dots,z_t^{\sf T})^{\sf T}$ is a $dp\times dp$ matrix $\Sigma$ consisting of $d^2$ blocks of size $p\times p$, with $\Gamma_z(r-s)$ as block $(r,s)$. The analytical formula to compute $\Gamma_z(\ell)$ is given by [12]:

$$\Gamma_z(\ell) = \frac{1}{2\pi}\int_{-\pi}^{\pi} A^{-1}(e^{-j\theta})\,\Sigma_\zeta\,\big(A^{-1}(e^{-j\theta})\big)^{*}\,e^{j\ell\theta}\,d\theta,$$

where $A(\gamma)$ is given in equation (2.3). Figure 2.2 shows the heat maps of the magnitudes of the elements of $\Sigma$ and of the precision matrix $\Omega = \Sigma^{-1}$ for the VAR(5) process at hand.

[Figure 2.2: Heat maps of magnitudes of elements of the covariance matrix $\Sigma \equiv \mathbb E(x_i x_i^{\sf T})$ (left plot) and the precision matrix $\Omega = \Sigma^{-1}$ (right plot). In this example, the $x_i$'s are generated from a VAR(d) model with noise covariance $\Sigma_\zeta(i,j) = \rho^{|i-j|}$, with values $d = 5$, $p = 15$, $T = 60$, $\rho = 0.5$, and diagonal $A^{(i)}$ matrices with $b = 0.15$ on the diagonals.]

As evident from Figure 1.2, the noise component of the offline debiased estimator is biased. Here, we look into the noise component of the online debiased estimator, given by

$$W^{\rm on} = \frac{1}{\sqrt n}\sum_{\ell=1}^{K-1} M^{(\ell)}\sum_{t\in E_\ell} x_t\varepsilon_t, \quad (2.7)$$

with $M^{(\ell)}$ constructed from the solutions to the optimization (2.5) for $\ell = 1,\dots,K-1$. Also, recall that $\varepsilon = (\zeta_{d+1,i}, \zeta_{d+2,i},\dots,\zeta_{T,i})$ by equation (2.2). In Figure 2.3, we show the QQ-plot, PP-plot, and histogram of $W^{\rm on}_1$ and $W^{\rm off}_1$ (corresponding to entry $(1,1)$ of matrix $A_1$) for 1000 different realizations of the noise $\zeta_t$. As we observe, even the noise component $W^{\rm off}$ is biased, because the offline construction of $M$ depends on all features $x_t$ and hence on the endogenous noise $\zeta_t$. However, the online construction of the decorrelating matrices $M^{(\ell)}$ makes the noise term a martingale, and hence $W^{\rm on}$ converges in distribution to a zero-mean normal vector, allowing for a distributional characterization of the online debiased estimator.
[Figure 2.3: Panels (a), (b), and (c) show the QQ plots, PP plots, and histograms of the online debiased noise terms (blue) and offline debiased noise terms (red) over 1000 independent experiments; the black curves/lines denote the ideal standard normal distribution. The solid and dashed vertical lines in panel (c) indicate the location of the mean of the offline and online debiased noise terms, respectively.]

2.1.2 Distributional characterization of online debiasing

We start our analysis of the online debiased estimator $\hat\theta^{\rm on}$ by considering a bias-variance decomposition. Using $y_t = \langle x_t, \theta_0\rangle + \varepsilon_t$ in the definition (2.6):

$$\hat\theta^{\rm on} - \theta_0 = \hat\theta^{\rm L} - \theta_0 + \frac1n\sum_{\ell=1}^{K-1}\sum_{t\in E_\ell} M^{(\ell)} x_t x_t^{\sf T}(\theta_0 - \hat\theta^{\rm L}) + \frac1n\sum_{\ell=1}^{K-1}\sum_{t\in E_\ell} M^{(\ell)} x_t\varepsilon_t$$
$$= \Big(I - \frac1n\sum_{\ell=1}^{K-1}\sum_{t\in E_\ell} M^{(\ell)} x_t x_t^{\sf T}\Big)(\hat\theta^{\rm L} - \theta_0) + \frac1n\sum_{\ell=1}^{K-1}\sum_{t\in E_\ell} M^{(\ell)} x_t\varepsilon_t. \quad (2.8)$$

With the shorthand $R^{(\ell)} = (1/r_\ell)\sum_{t\in E_\ell} x_t x_t^{\sf T}$ for the sample covariance of the features in episode $\ell$, and the bias $B_n$ and variance term $W_n$ below,

$$B_n \equiv \sqrt n\,\Big(I - \frac1n\sum_{\ell=1}^{K-1} r_\ell\, M^{(\ell)} R^{(\ell)}\Big), \quad (2.9)$$
$$W_n \equiv \frac1{\sqrt n}\sum_{\ell=1}^{K-1} M^{(\ell)}\sum_{t\in E_\ell} x_t\varepsilon_t, \quad (2.10)$$

we arrive at the following decomposition:

$$\hat\theta^{\rm on} = \theta_0 + \frac1{\sqrt n}\,B_n(\hat\theta^{\rm L} - \theta_0) + \frac1{\sqrt n}\,W_n. \quad (2.11)$$

Our first set of results concerns the bias of $\hat\theta^{\rm on}$, establishing that it is of smaller order compared to the bias of the LASSO estimate. The following theorem uses Lemma 2.1.3 to control the bias of the online debiased estimator.

Theorem 2.1.4 (Bias control). Consider the VAR(d) model (1.5) and let $\hat\theta^{\rm on}$ be the debiased estimator (2.6), where the decorrelating matrices $M^{(\ell)}$ are computed according to Eq. (2.5), with $\mu_\ell = c_1\omega\sqrt{\log(dp)/n_\ell}$ and $L \ge \|\Omega\|_1$.
Further assume that the base estimator is b θ L computed with λ = λ 0 p log(dp)/n where λ 0 ≥ 4λ max (Σ ζ )(1∨µ max (A))/µ min (A). Then, under the sample size conditionn≥ Cω 2 s 0 log(dp), we have √ n( b θ on − θ 0 )=W n +∆ n , (2.12) whereE{W n }=0 and P n ∥∆ n ∥ ∞ ≥ C 1 λ 0 (ω+Lγ ) α s 0 log(dp) √ n o ≤ (dp) − 4 , (2.13) The parameters ω,α are defined in Theorem 2.1.2, and γ = dλ max (Σ ζ )/µ min (A). Further, the bias satisfies ∥E{ b θ on − θ 0 }∥ ∞ ≤ C 1 λ 0 (ω+Lγ ) α s 0 log(dp) n + C 2 ∥θ 0 ∥ 1 (dp) 6 50 We refer to Appendix A.6.4 for the proof of Theorem 2.1.4. Note that the above theorem bounds the bias term∆ n for finite sample size n. To study these bounds in an asymptotic regime, we make the following assumption to simplify our presentation. Assumption2.1.5. Suppose that 1. The parameters λ min (Σ ζ ), λ max (Σ ζ ), µ min (A) and µ max (A) are bounded away from 0 and∞, as n,p→∞. 2. With Ω = Σ − 1 = (E{x t x T t }) − 1 the precision matrix of the data points{x t }, and s 0 the sparsity of θ 0 =(A (1) i ,...,A (d) i ) T , we assume that∥Ω ∥ 1 =o( √ n/log(dp)). Under Assumption 2.1.5 the spectral quantities ω,γ,α and therefore λ 0 are order one. We can also ignore the lower order term∥θ 0 ∥ 1 /(dp) 6 in the high-dimensional regime. Indeed, the denominator(dp) 6 can be changed to(dp) c for arbitrary largec>0, by adjusting constantC 1 and the tail bound in Eq.(2.13). Therefore, as far as∥θ 0 ∥ 1 grows polynomially atp, then this term vanishes asymptotically. The theorem, hence, shows that the bias of the online debiased estimator is of orderLs 0 (logp)/n. On the other hand, re- call the filtration F t generated by{ε 1 ,...,ε t } and rewrite (2.10) asW n = P t v t ε t , wherev t =M (ℓ) x t / √ n (Samplet belongs to episodeℓ). We use Assumption 2.1.5 in Lemma 2.1.6 below, to show that for each co- ordinatea∈[dp], the conditional variance P n t=1 E(ε 2 t v 2 t,a |F t− 1 )=(σ 2 /n) P n t=1 ⟨m ℓ a ,z t ⟩ 2 is of order one. 
Hence∥∆ n ∥ ∞ is asymptotically dominated by the noise variance whens 0 =o √ n Llog(dp) . Another virtue of Lemma 2.1.6 is that it shows the martingale sumW n is stable in an appropriate sense. This is a key technical step that allows us to characterize the distribution of the noise termW n by applying the martingale CLT (e.g., see [71, Corollary 3.2]) and conclude that the unbiased componentW n admits a Gaussian limiting distribution. 51 Lemma2.1.6. (StabilityofmartingaleW n )Let b θ on bethedebiasedestimator (2.6)withµ ℓ =τ p (logp)/n ℓ andL = L 0 ∥Ω ∥ 1 , for an arbitrary constantL 0 ≥ 1. Under Assumption 2.1.5, and for any fixed sequence of integersa(n)∈[dp], † we have V n,a ≡ σ 2 n K− 1 X ℓ=1 X t∈E ℓ ⟨m ℓ a ,x t ⟩ 2 =σ 2 Ω a,a +o P (1). (2.14) In addition, we have max n 1 √ n |⟨m ℓ a ,x t ⟩ε t |: ℓ∈[K− 1], t∈[n− 1] o =o P (1). (2.15) Proof of Lemma 2.1.6 is given in Appendix A.6.5. With Lemma 2.1.6 in place, we apply a martingale central limit theorem [71, Corollary 3.2] to obtain the following result. Corollary2.1.7. ConsidertheVAR(d)model (1.5)fortimeseriesandlet b θ on bethedebiasedestimator (2.6) withµ ℓ = C 1 ω p (logp)/n ℓ andL = L 0 ∥Ω ∥ 1 , for an arbitrary constantL 0 ≥ 1. For any fixed sequence of integersa(n)∈[dp], define the conditional variance V n as V n,a ≡ σ 2 n K− 1 X ℓ=1 X t∈E ℓ ⟨m ℓ a ,x t ⟩ 2 . Under Assumption 2.1.5, for any fixed coordinate a∈[dp], and for allx∈R we have lim n→∞ P n W n,a p V n,a ≤ x o =Φ( x), (2.16) whereΦ is the standard Gaussian cdf. † We index the sequence with the sample size n that is diverging. Since we are in high-dimensional setting p ≥ n is also diverging. 52 For the task of statistical inference, Theorem 2.1.4 and Corollary 2.1.7 suggest to consider the scaled residual √ n( b θ on a − θ 0,a )/ p V n,a as the test statistics. Our next proposition characterizes its distribution. The proof is straightforward given the result of Theorem 2.1.4 and Corollary 2.1.7 and is deferred to Ap- pendix A.6.6. 
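The bias-variance decomposition (2.8)-(2.11), equivalently $\sqrt n(\hat\theta^{\rm on}-\theta_0) = B_n(\hat\theta^{\rm L}-\theta_0) + W_n$ as in (2.12), is a finite-sample algebraic identity and can be verified numerically. In the sketch below, the LASSO estimate and the matrices $M^{(\ell)}$ are illustrative stand-ins (a perturbed truth and ridge-regularized inverse covariances), not solutions of (2.5); the identity holds for any such choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 4
X = rng.standard_normal((n, p))
theta0 = np.array([1.0, 0.0, -0.5, 0.0])
eps = 0.1 * rng.standard_normal(n)
y = X @ theta0 + eps
theta_L = theta0 + 0.05 * rng.standard_normal(p)       # stand-in for the LASSO

episodes = [np.arange(40, 80), np.arange(80, 120)]     # E_1, E_2 (E_0 = rows 0-39)
Ms = [np.linalg.inv(X[:E[0]].T @ X[:E[0]] / E[0] + 0.1 * np.eye(p))
      for E in episodes]                               # predictable stand-ins for M^(ell)

# Online debiased estimator, Eq. (2.6)
theta_on = theta_L + sum(M @ X[E].T @ (y[E] - X[E] @ theta_L)
                         for M, E in zip(Ms, episodes)) / n
# Bias matrix B_n (2.9) and martingale term W_n (2.10)
B = np.sqrt(n) * (np.eye(p) - sum(M @ X[E].T @ X[E]
                                  for M, E in zip(Ms, episodes)) / n)
W = sum(M @ X[E].T @ eps[E] for M, E in zip(Ms, episodes)) / np.sqrt(n)
# Identity (2.12): sqrt(n)(theta_on - theta0) = B_n(theta_L - theta0) + W_n
gap = np.sqrt(n) * (theta_on - theta0) - (B @ (theta_L - theta0) + W)
print(np.max(np.abs(gap)))    # zero up to floating-point error
```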
In its statement, we omit explicit constants that can be easily derived from Theorem 2.1.4.

Theorem 2.1.8. Consider the VAR(d) model (1.5) for time series and let $\hat\theta^{\rm on}$ be the debiased estimator (2.6) with $\mu_\ell = C_1\omega\sqrt{(\log p)/n_\ell}$, $\lambda = \lambda_0\sqrt{\log(dp)/n}$, and $L = L_0\|\Omega\|_1$, for an arbitrary constant $L_0\ge1$. Suppose that Assumption 2.1.5 holds and $s_0 = o\big(\tfrac{\sqrt n}{\|\Omega\|_1\log(dp)}\big)$; then the following holds true for any fixed sequence of integers $a(n)\in[dp]$. For all $x\in\mathbb R$, we have

$$\lim_{n\to\infty}\Big|\,\mathbb P\Big\{\frac{\sqrt n\,(\hat\theta^{\rm on}_a - \theta_{0,a})}{\sqrt{V_{n,a}}}\le x\Big\} - \Phi(x)\Big| = 0. \quad (2.17)$$

Remark 2.1.9. Note that the conditional variance $V_{n,a}$ involves the noise variance $\sigma^2$, which is unknown in practice and needs to be estimated from data. In Appendix A.2 we give a proposal to estimate $\sigma^2$ from data.

2.2 Batched data collection

Recall the setting of adaptive data collection in batches from Section 1.1.1.1. In this setting, the samples naturally separate into two batches: the first $n_1$ data points, where the covariates are i.i.d. from a distribution $P_x$, and the second batch of $n_2$ data points, where the covariates $x_i$ are drawn independently from the law of $x_1$, conditional on the event $\{\langle x_1, \hat\theta_1\rangle\ge\varsigma\}$, with $\varsigma$ potentially a data-dependent threshold. The following theorem is a version of Theorem 6.1 in [26] and is proved in an analogous manner. It demonstrates that consistent estimation using the LASSO is possible even with adaptive data collection. Before proceeding to the theorem statement, for the reader's convenience and to be self-contained, let us recall the notion of compatibility.

Definition 2.2.1. For a symmetric matrix $\Gamma\in\mathbb R^{p\times p}$ and a subset $S\subseteq[p]$, we say that $\Gamma$ is $(\phi_0, S)$-compatible if

$$\min\Big\{\frac{|S|\,\langle\theta, \Gamma\theta\rangle}{\|\theta_S\|_1^2}:\ \theta\in\mathbb R^p,\ \|\theta_{S^c}\|_1\le 3\|\theta_S\|_1\Big\}\ \ge\ \phi_0.$$

Theorem 2.2.2 ([26, Theorem 6.1]).
Suppose that the true parameter θ 0 is s 0 -sparse and the distribution P x is such that with probability one the following two conditions hold: (i) the covariance E{xx T } and E{xx T |⟨x, b θ 1 ⟩≥ ς} are(ϕ 0 ,supp(θ 0 ))-compatible and(ii)x as well asx| ⟨x, b θ 1 ⟩≥ ς areκ -subgaussian. Sup- pose thatn≥ C 1 κ 4 ϕ 2 0 s 2 0 logp. Then, the LASSO estimate b θ L (y,X;λ n ) withλ n = C 2 κσ p (logp)/n satisfies the following bound, with probability exceeding1− p − 3 , ∥ b θ L − θ 0 ∥ 1 ≤ C ′ s 0 λ n ϕ 0 = Cκσ ϕ 0 s 0 r logp n . Remark2.2.3. (Estimatingthenoisevariance)ForthecorrectestimationrateusingtheLASSO,Theorem2.2.2 requiresknowledgeofthenoiselevelσ ,whichisusedtocalibratetheregularizationλ n . Otherestimatorslike the scaled LASSO [172] or the square-root LASSO [17] allow to estimate σ consistently when it is unknown. This can be incorporated into the present setting, as done in [90]. For simplicity, we focus on the case when the noise level is known. However, the results hold as far as a consistent estimate of σ is used. Formally, a consistent estimator refers to an estimateb σ =b σ (y,X) of the noise level satisfying, for anyε>0, lim n→∞ sup ∥θ 0 ∥ 0 ≤ s 0 P b σ σ − 1 ≥ ε =0. (2.18) Remark2.2.4. AttheexpenseofincreasingtheabsoluteconstantsinTheorem2.2.2,theprobability1− p − 3 can be made1− p − C for any arbitrary constantC >1. 54 Let X 1 and X 2 denote the design matrices of the two batches and, similarly, y (1) and y (2) the two responses vectors. In this setting, we use an online debiased estimator as follows: b θ on = b θ L + 1 n M (1) X T 1 (y (1) − X 1 b θ L )+ 1 n M (2) X T 2 (y (2) − X 2 b θ L ), (2.19) where we will constructM (1) as a function ofX 1 andM (2) as a function ofX 1 as well asX 2 . The proposal in Eq.(2.19) follows from the general recipe in Eq.(1.8) by setting • M i =M (1) fori=[n 1 ] andM i =M (2) fori=n 1 +1,...,n. • FiltrationsF i constructed as follows. 
For $i < n_1$: $y_1,\dots,y_i$, $x_1,\dots,x_{n_1}$, and $\varepsilon_1,\dots,\varepsilon_i$ are measurable with respect to $\mathcal F_i$. For $i\ge n_1$: $y_1,\dots,y_i$, $x_1,\dots,x_n$, and $\varepsilon_1,\dots,\varepsilon_i$ are measurable with respect to $\mathcal F_i$.

By construction, this choice satisfies the predictability condition given by Definition 1.1.1. Note that Eq. (2.19) nests an intuitive 'sample splitting' approach. Indeed, debiasing $\hat\theta^{\rm L}$ using exactly one of the two batches is equivalent to setting one of $M^{(1)}$ or $M^{(2)}$ to 0. While sample splitting can be shown to work under appropriate conditions, our approach uses the data more efficiently and gains power in comparison.

We construct $M^{(1)}$ and $M^{(2)}$ using a modification of the program used in [90]. Let $\hat\Sigma^{(1)} = (1/n_1)X_1^{\sf T} X_1$ and $\hat\Sigma^{(2)} = (1/n_2)X_2^{\sf T} X_2$ be the sample covariances of each batch; let $M^{(1)}$ have rows $(m^{(1)}_a)_{1\le a\le p}$, and similarly for $M^{(2)}$. Using parameters $\mu_\ell, L > 0$ that we set later, we choose $m^{(\ell)}_a$, the $a$-th row of $M^{(\ell)}$, as a solution to the program

$$\text{minimize } \langle m, \hat\Sigma^{(\ell)} m\rangle \quad \text{subject to } \|\hat\Sigma^{(\ell)} m - e_a\|_\infty\le\mu_\ell,\ \ \|m\|_1\le L. \quad (2.20)$$

Here $e_a$ is the vector with one at the $a$-th coordinate and zero everywhere else. The intuition for the program (2.20) is simple. The first constraint ensures that $\hat\Sigma^{(\ell)} m$ is close, in the $\ell_\infty$ sense, to $e_a$, the $a$-th basis vector; as we will see in Theorem 2.2.7, it controls the bias term $\Delta$ of $\hat\theta^{\rm on}$. The objective is a multiple of the variance of the martingale term $W$ in $\hat\theta^{\rm on}$ (cf. Eq. (2.24)). We wish to minimize this, as it directly affects the power of the test statistic and the length of valid confidence intervals constructed based on $\hat\theta^{\rm on}$. The $\ell_1$ constraint on $m$, which is missing in [90], is crucial for our adaptive data setting. This constraint ensures that the value of the program, $\langle m^{(\ell)}_a, \hat\Sigma^{(\ell)} m^{(\ell)}_a\rangle$, is stable and does not fluctuate much from sample to sample (this is formalized as the 'stability condition' in Lemmas A.8.8 and 2.1.6).
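The program (2.20) is a quadratic objective with an $\ell_\infty$ and an $\ell_1$ constraint. A small sketch of solving it is below, assuming scipy is available; it reformulates $m = u - v$ with $u, v \ge 0$ so that all constraints are smooth, and uses SLSQP. This is an illustrative solver, not the authors' implementation (which first finds the smallest feasible $\mu_\ell$ via a linear program).

```python
import numpy as np
from scipy.optimize import minimize

def decorrelating_row(Sigma_hat, a, mu, L):
    """Solve program (2.20): minimize <m, Sigma_hat m> subject to
    ||Sigma_hat m - e_a||_inf <= mu and ||m||_1 <= L.
    Split m = u - v with u, v >= 0, so ||m||_1 <= sum(u + v) and every
    constraint is smooth; solved with SLSQP (a sketch)."""
    p = Sigma_hat.shape[0]
    e_a = np.zeros(p)
    e_a[a] = 1.0
    split = lambda z: z[:p] - z[p:]
    cons = [
        # inf-norm constraint, written as 2p componentwise inequalities
        {"type": "ineq", "fun": lambda z: mu - (Sigma_hat @ split(z) - e_a)},
        {"type": "ineq", "fun": lambda z: mu + (Sigma_hat @ split(z) - e_a)},
        {"type": "ineq", "fun": lambda z: L - np.sum(z)},  # l1 budget
    ]
    m_init = np.linalg.solve(Sigma_hat, e_a)               # feasible start
    z0 = np.concatenate([np.maximum(m_init, 0), np.maximum(-m_init, 0)])
    res = minimize(lambda z: split(z) @ Sigma_hat @ split(z), z0,
                   method="SLSQP", bounds=[(0, None)] * (2 * p),
                   constraints=cons)
    return split(res.x)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 4))
Sigma_hat = A.T @ A / 50
m0 = decorrelating_row(Sigma_hat, a=0, mu=0.1, L=10.0)
print(np.max(np.abs(Sigma_hat @ m0 - np.array([1.0, 0, 0, 0]))))  # feasibility: <= mu
```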
It is this stability that ensures that the martingale part of the residual displays central limit behavior.

Note that in the non-adaptive setting, inference can be performed conditional on the design $X$, and the fluctuation in $\langle m^{(\ell)}_a, \hat\Sigma^{(\ell)} m^{(\ell)}_a\rangle$ is conditioned out. In the adaptive setting, this is not possible: one effectively cannot condition on the design without conditioning on the noise realization $\varepsilon$, and therefore we perform inference unconditionally on $X$.

2.2.1 Online debiasing: a distributional characterization

We begin the analysis of the online debiased estimator $\hat\theta^{\rm on}$ with a decomposition that mimics the classical debiasing:

$$\hat\theta^{\rm on} = \theta_0 + \frac1{\sqrt n}\,B_n(\hat\theta^{\rm L} - \theta_0) + \frac1{\sqrt n}\,W_n, \quad (2.21)$$
$$B_n = \sqrt n\,\Big(I_p - \frac{n_1}{n} M^{(1)}\hat\Sigma^{(1)} - \frac{n_2}{n} M^{(2)}\hat\Sigma^{(2)}\Big), \quad (2.22)$$
$$W_n = \frac1{\sqrt n}\sum_{i\le n_1} M^{(1)} x_i\varepsilon_i + \frac1{\sqrt n}\sum_{n_1 < i\le n} M^{(2)} x_i\varepsilon_i. \quad (2.23)$$

Assumption 2.2.5. Suppose that the following conditions hold:

1. There exists $\Lambda_0 > 0$ so that the eigenvalues of $\mathbb E\{xx^{\sf T}\}$ and $\mathbb E\{xx^{\sf T}\,|\,\langle x, \hat\theta_1\rangle\ge\varsigma\}$ are bounded below by $\Lambda_0$.

2. The laws of $x$ and $x\,|\,\langle x, \hat\theta_1\rangle\ge\varsigma$ are $\kappa$-subgaussian for a constant $\kappa > 0$.

3. The precision matrices $\Omega = (\mathbb E\{xx^{\sf T}\})^{-1}$ and $\Omega^{(2)}(\hat\theta_1) = (\mathbb E\{xx^{\sf T}\,|\,\langle x, \hat\theta_1\rangle\ge\varsigma\})^{-1}$ satisfy $\|\Omega\|_1\vee\|\Omega^{(2)}(\hat\theta_1)\|_1\le L$.

4. The conditional covariance $\Sigma^{(2)}(\theta) = \mathbb E\{xx^{\sf T}\,|\,\langle x, \theta\rangle\ge\varsigma\}$ is $K$-Lipschitz in its argument $\theta$, i.e., $\|\Sigma^{(2)}(\theta') - \Sigma^{(2)}(\theta)\|_\infty\le K\|\theta - \theta'\|_1$.

The first two conditions of Assumption 2.2.5 ensure that the base LASSO estimator $\hat\theta^{\rm L}$ has small estimation error. In addition, our debiasing makes use of the third and fourth constraints on the precision matrices of the sampling distributions. In the above, we will typically allow $L = L_n$ to diverge with $n$.

In the following example, we show that Gaussian random designs satisfy all the conditions of Assumption 2.2.5. We refer to Section A.8.4 for its proof.

Example 2.2.6. Let $P_x = N(0,\Sigma)$ and $\hat\theta$ be any vector such that $\|\hat\theta\|_1\|\hat\theta\|_\infty\le L_\Sigma\,\lambda_{\min}(\Sigma)\,\|\hat\theta\|/2$ and $\|\Sigma^{-1}\|_1\le L_\Sigma/2$.
Thenthedistributionsofxandx| ⟨x, b θ ⟩≥ ς ,withς = ¯ς⟨ b θ, Σ b θ ⟩ 1/2 foraconstant¯ς ≥ 0satisfy the conditions of Assumption 2.2.5 with Λ 0 =λ min (Σ) , κ =3λ 1/2 max (Σ)(¯ ς∨¯ς − 1 ), K = √ 8(1+¯ς 2 ) λ max (Σ) 3/2 λ min (Σ) 1/2 , L=L Σ . 57 Under Assumption 2.2.5 we provide anon-asymptotic bound on the bias of the online debiased estima- tor b θ on . Theorem 2.2.7. (Non-asymptotic bound on bias) Under Assumption 2.2.5, there exists universal constants C 1 ,C 2 ,C 3 so that, whenn≥ C 1 κ 4 s 2 0 logp/ϕ 2 0 andn 1 ∧n 2 ≥ C 1 (Λ 0 /κ 2 +κ 2 /Λ 0 )logp, we have that √ n( b θ on − θ 0 )=W n +∆ n , (2.24) whereE{W n }=0 and P n ∥∆ n ∥ ∞ ≥ C 2 κ 2 Λ 0 3/2 σs 0 logp √ n o ≤ p − 3 . (2.25) Further we have ∥E{ b θ on − θ 0 }∥ ∞ ≤ C 2 κ 2 Λ 0 3/2 σs 0 logp n + C 3 ∥θ 0 ∥ 1 p 2 . (2.26) The proof of Theorem 2.2.7 is given in Appendix A.8.2. Note that, in the high-dimensional setting of n ≪ p, the term ∥θ 0 ∥ 1 /p 2 will be of lower order as compared to s 0 logp/n. Therefore, when the parameters Λ 0 ,σ,κ are of order one, the theorem shows that the bias of the online debiased estimator is of order s 0 logp/n, This may be compared with the LASSO estimator b θ L whose bias is typically of orderλ ≍ σ p logp/n. In particular, in the regime whens 0 = o( p n/logp), this bias is asymptotically dominated by the variance, which is of orderσ/ √ n. In order to establish asymptotic Gaussian behavior of the online debiased estimate b θ on , we consider a specific asymptotic regime for the problem instances. 58 Assumption2.2.8. (Asymptoticregime)Weconsiderprobleminstancesindexedbythesamplesizen,where n,p,s 0 satisfy the following: 1. liminf n→∞ n 1 ∧n 2 n ≥ c,forapositiveuniversalconstantc∈(0,1]. Inotherwords,bothbatchescontain at least a fixed fraction of data points. 2. The parameters satisfy: lim n→∞ 1 Λ 0 s 0 r logp n L 2 K∨ r logp Λ 0 ! =0. (2.27) The following proposition establishes that in the asymptotic regime, the unbiased componentW n has a Gaussian limiting distribution. 
The key underlying technical idea is to ensure that the martingale sum inW n is stable in an appropriate sense. Proposition2.2.9. Suppose that Assumption 2.2.5 holds and consider the asymptotic regime of Assumption 2.2.8. Leta = a(n) ∈ [p] be a fixed sequence of coordinates. Define the conditional variance V n,a of thea θ coordinate as V n,a =σ 2 n 1 n ⟨m (1) a , b Σ (1) m (1) a ⟩+ n 2 n ⟨m (2) a , b Σ (2) m (2) a ⟩ . (2.28) Then, for any bounded continuousφ:R→R lim n→∞ E n φ W n,a p V n,a o =E{φ(ξ )}, whereξ ∼ N(0,1). 59 The proof of Proposition 2.2.9 is deferred to Appendix A.8.3. The combination of Theorem 2.2.7 and Proposition 2.2.9 immediately yields the following distributional characterization for b θ on . Theorem2.2.10. UnderAssumptions2.2.5and2.2.8,theconclusionofProposition2.2.9holdswith √ n( b θ on a − θ 0 ) in place ofW n . In particular, lim n→∞ P n r n V n,a ( b θ on a − θ 0,a )≤ x o =Φ( x), (2.29) whereV n,a is defined as in Proposition 2.2.9. To compare the sample size requirements made forℓ 1 -consistent estimation and those in Assumption 2.2.8, it is instructive to simplify to the case whenκ,ϕ 0 ,Λ 0 are of order one. Thenℓ 1 -consistency (Theorem 2.2.2 in Appendix A.8) requires thatn 1 ∨n 2 = Ω( s 2 0 logp), i.e. at least one of the batches is larger than s 2 0 logp. However, Theorem 2.2.10 makes the same assumption onn 1 ∧n 2 , or both batches exceeds 2 0 logp in size. For online debiasing, this is the case of interest. Indeed ifn 1 ≫ n 2 (or vice versa), we can apply offline debiasing to the larger batch to obtain a debiased estimate. Conversely, whenn 1 andn 2 are comparable as in Assumption 2.2.8, this ‘sample-splitting’ approach leads to loss of power corresponding to a constant factor reduction in the sample size. This is the setting addressed in Theorem 2.2.10 via online debiasing. 2.2.1.1 RevisitingthenumericalexamplefromSection1.1.1.1. 
In the batched data example discussed in Section 1.1.1.1, we observed that the classical offline debiasing fails in providing unbiased estimate of the true parameters. Here, we will repeat the same experiment and numerically characterize the distribution of the proposed online debiased estimator. 60 Figure 2.4 (left panel) shows the histogram of the entries of online debiased estimator b θ on on the support ofθ 0 (blue) along with the corresponding histogram of entries of the debiased estimator b θ off (red). As we see for both choices of b θ 1 (debiased LASSO and ridge estimate on the first batch), the online debiased estimator b θ on is appropriately centered around the true coefficients. One can also split samples in the following way. Since the second batch of data was adaptively collected while the first batch was not, we can compute a debiased estimate using only the first, non-adaptive batch: b θ off ,1 ≡ b θ L (y (1) ,X 1 )+ 1 n Ω X T 1 (y (1) − X 1 b θ L (y (1) ,X 1 )). (2.30) Figure 2.4 (right panel) shows the histogram of the entries of b θ off ,1 restricted to the support of θ 0 , and the comparison with b θ on . As can be expected, both b θ off ,1 and b θ on are appropriately centered around the true coefficient 1. However, as is common with sample-splitting, b θ off ,1 displays a larger variance and correspondingly loses power in comparison with b θ on since it uses only half of the data. The power loss becomes even more pronounced when there are more than two phases of data collection, or if the phases are particularly imbalanced. Comparison with ridge-type debiasing approach of [44]. We would like to discuss two points in comparing the contribution of [44] with ours: 1. The method of [44] is tailored to low-dimensional setting where the number of covariatesp is less than the sample size (p < n). Specifically, denoting by λ min (n) the minimum eigenvalue of X T X, [44] considers a setting where λ min (n) → ∞ almost surely. 
Note that for the batched data example, this amounts to √ n− √ p→∞. 61 (a) (b) (c) (d) Figure 2.4: (a,c) Histograms of the online debiased estimate b θ on and the ridge debiased estimator [44], restricted to the support ofθ 0 . (b,d) Histograms of the offline debiased estimate onlyusingthefirstbatch , b θ off ,1 given by (2.30) and the online debiased estimate b θ on . In(a,b), we use b θ 1 the debiased Lasso on the first batch and in (c,d) we use b θ 1 the ridge estimate on the first batch. The dashed line indicates the true coefficient size. Offline debiasing b θ off ,1 using only the first batch works well (green histograms called offlineFirstBatch), but then loses power in comparison. Online debiasing is cognizant of the adaptivity and debiases without losing power even in the presence of adaptivity. 2. The work [44] proposes a different method of debiasing which albeit being valid in low-dimensional setting it comes with fundamental challenges to be generalized to high-dimensional setting. Letting b θ OLS the least square estimator, [44] constructs a debiased estimator b θ d as follows: b θ d = b θ OLS +W n (y− X b θ OLS ), (2.31) where the matrixW n is constructed recursively asW n =[W n− 1 |w n ] andX n =[X n− 1 |x n ] with w n =arg min w∈R p ∥I− W n− 1 X n− 1 − wx T n ∥ 2 F +λ ∥w∥ 2 2 . (2.32) Therefore, the decorrelating matrixW n is constructed in an online way as it is a predictable sequence according to Definition 1.1.1. Note that w i corresponds toM i x i in our notation. One can potentially think of using the ridge-type debiased estimator (2.32) in high-dimensional setting with using b θ L instead of b θ OLS . In Figure 2.4, we include the histogram of such estimate (gray histogram under the name “ridgeOnline"). As we see the corresponding histogram is biased and deviates from a normal distribution which implies that this approach does not extend to high-dimensional setting. 62 Some intuition for this may be seen by following the argument of [44]. 
Considering the bias-variance decomposition $\hat\theta^{\rm d} - \theta_0 = b + v$ with $b = (I - W_n X_n)(\hat\theta^{\rm OLS} - \theta_0)$ and $v = W_n\varepsilon_n$, the above optimization aims at minimizing a weighted sum of the bias and the variance of $\hat\theta^{\rm d}$ in an online manner. The analysis of [44] controls the bias as follows:

$$\|b\|\le\|I - W_n X_n\|_{\rm op}\,\|\hat\theta^{\rm OLS} - \theta_0\|_2\le\|I - W_n X_n\|_F\,\|\hat\theta^{\rm OLS} - \theta_0\|_2.$$

However, in high dimension this bound is vacuous. Since $W_n X_n\in\mathbb R^{p\times p}$ is of rank at most $n < p$, the matrix $I - W_n X_n$ has eigenvalue 1 with multiplicity at least $p - n$. Therefore $\|I - W_n X_n\|_F\ge\sqrt{p-n}\to\infty$ and $\|I - W_n X_n\|_{\rm op}\ge1$. Thus, even a refinement of [44] would only yield an insufficient bias bound of the type

$$\|b\|_2\le\|\hat\theta^{\rm L} - \theta_0\|_2\approx\sigma\sqrt{\frac{s_0\log p}{n}},$$

which dominates the variance component ${\rm Var}(v) = O(1/\sqrt n)$. Our scheme of online debiasing overcomes this obstacle by adapting to the geometry of the high-dimensional regime. In particular, it yields a bias bound of order $\|\mathbb E\{\hat\theta^{\rm on} - \theta_0\}\|_\infty = O(s_0(\log p)/n)$, which is dominated by the noise term provided that $s_0 = o(\sqrt n/\log p)$.

2.3 Statistical inference

An immediate use of the distributional characterizations (2.17) or (2.29) is to construct confidence intervals and to provide valid p-values for hypothesis testing regarding the model coefficients. Throughout, we make the sparsity assumption $s_0 = o(\sqrt n/\log p_0)$, with $p_0$ the number of model parameters (for the batched data collection setting $p_0 = p$, and for the VAR(d) model $p_0 = dp$).

Confidence intervals. For fixed $a\in[p_0]$ and significance level $\alpha\in(0,1)$, let

$$J_a(\alpha)\equiv\big[\hat\theta^{\rm on}_a - \delta(\alpha, n),\ \hat\theta^{\rm on}_a + \delta(\alpha, n)\big], \qquad \delta(\alpha, n)\equiv\Phi^{-1}(1-\alpha/2)\sqrt{V_{n,a}/n}.$$

Note that the length of the confidence interval $J_a(\alpha)$ is of order $O(\sigma/\sqrt n)$ (using Lemma A.8.8 for the batched data collection setting and Lemma 2.1.6 for the time series).
It is worth noting that this is the minimax optimal rate [91, 83] and is of the same order as the length of confidence intervals obtained by the least-squares estimator in the classical regime $n > p$ with i.i.d. samples.

Hypothesis testing. Another consequence of Theorem 2.1.8 is that it allows for testing hypotheses of the form $H_0:\theta_{0,a} = 0$ versus the alternative $H_A:\theta_{0,a}\ne0$, and provides valid p-values. Recall that $\theta_0$ denotes the model parameters, either for the batched data collection setting or for the VAR(d) model (where it encodes the entries $A^{(\ell)}_{i,j}$ in model (1.5)). Such a testing mechanism is of crucial importance in practice, as it allows one to diagnose the covariates that are significantly relevant to the outcome. In the case of time series, it translates to understanding the effect of a covariate $z_{t-\ell,j}$ on a covariate $z_{t,i}$, and to providing valid statistical measures (p-values) for such associations. We construct two-sided p-values for testing $H_0$, using our test statistic, as follows:

$$P_a = 2\Big(1 - \Phi\Big(\frac{\sqrt n\,|\hat\theta^{\rm on}_a|}{\sqrt{V_{n,a}}}\Big)\Big). \quad (2.33)$$

Our testing (rejection) rule, given the p-value $P_a$, is:

$$R(a) = \begin{cases} 1 & \text{if } P_a\le\alpha\ (\text{reject } H_0),\\ 0 & \text{otherwise (fail to reject } H_0).\end{cases} \quad (2.34)$$

Employing the distributional characterizations (2.29) or (2.17), it is easy to verify that the constructed p-value $P_a$ is valid in the sense that under the null hypothesis it admits a uniform distribution: $\mathbb P_{\theta_{0,a}=0}(P_a\le u) = u$ for all $u\in[0,1]$.

Group inference. In many applications, one may want to perform inference for a group of model parameters $\theta_{0,G}\equiv(\theta_{0,a})_{a\in G}$ simultaneously, rather than individual inference. This is the case particularly when the model covariates are highly correlated with each other, or when they are likely to affect the outcome (in the time series application, the future covariate vectors) jointly. To address group inference, we focus on the time series setting. The setting of batched data collection can be handled in a similar way.
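The confidence interval $J_a(\alpha)$ and the p-value (2.33) reduce to a few lines of code once $\hat\theta^{\rm on}_a$ and the conditional variance $V_{n,a}$ are in hand. The sketch below uses scipy for $\Phi$ and $\Phi^{-1}$; the numerical inputs are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def ci_and_pvalue(theta_on_a, V_na, n, alpha=0.05):
    """Confidence interval J_a(alpha) and two-sided p-value (2.33) for a
    single coordinate, given the online debiased estimate theta_on_a and
    the conditional variance V_{n,a} (sigma^2 assumed absorbed in V_{n,a})."""
    delta = norm.ppf(1 - alpha / 2) * np.sqrt(V_na / n)
    ci = (theta_on_a - delta, theta_on_a + delta)
    pval = 2 * (1 - norm.cdf(np.sqrt(n) * abs(theta_on_a) / np.sqrt(V_na)))
    return ci, pval

ci, pval = ci_and_pvalue(theta_on_a=0.8, V_na=1.2, n=400, alpha=0.05)
print(ci, pval, pval <= 0.05)   # reject H_0: theta_{0,a} = 0 iff pval <= alpha
```

By duality, the rejection rule (2.34) at level $\alpha$ is equivalent to rejecting $H_0$ exactly when $0\notin J_a(\alpha)$.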
We first state a simple generalization of Theorem 2.1.8 to a group of coordinates of finite size as $n,p\to\infty$. The proof is very similar to that of Theorem 2.1.8 and is omitted.

Lemma 2.3.1. Let $G = G(n)$ be a sequence of sets $G(n)\subset[dp]$ with $|G(n)| = k$ fixed as $n,p\to\infty$. Also, let the conditional variance $V_n\in\mathbb R^{dp\times dp}$ be defined by (2.14) for the VAR(d) model, that is:

$$V_n\equiv\frac{\sigma^2}{n}\sum_{\ell=1}^{K-1}\sum_{t\in E_\ell}(M^{(\ell)} x_t)(M^{(\ell)} x_t)^{\sf T}. \quad (2.35)$$

Under the assumptions of Theorem 2.1.8, for all $u = (u_1,\dots,u_k)\in\mathbb R^k$ we have

$$\lim_{n\to\infty}\Big|\,\mathbb P\big\{\sqrt n\,(V_{n,G})^{-1/2}(\hat\theta^{\rm on}_G - \theta_{0,G})\le u\big\} - \Phi_k(u)\Big| = 0, \quad (2.36)$$

where $V_{n,G}\in\mathbb R^{k\times k}$ is the submatrix obtained by restricting $V_n$ to the rows and columns in $G$. Here $(a_1,\dots,a_k)\le(b_1,\dots,b_k)$ indicates that $a_i\le b_i$ for $i\in[k]$, and $\Phi_k(u) = \Phi(u_1)\cdots\Phi(u_k)$.

Much in the same way as individual inference, we can use Lemma 2.3.1 for simultaneous inference on a group of parameters. Concretely, let $S_{k,\alpha}\subseteq\mathbb R^k$ be any Borel set with $k$-dimensional Gaussian measure at least $1-\alpha$. Then for a group $G\subset[dp]$ of size $|G| = k$, we construct the confidence set $J_G(\alpha)\subseteq\mathbb R^k$ as follows:

$$J_G(\alpha)\equiv\hat\theta^{\rm on}_G + \frac1{\sqrt n}(V_{n,G})^{1/2} S_{k,\alpha}. \quad (2.37)$$

Then, using Lemma 2.3.1, we conclude that $J_G(\alpha)$ is a valid confidence region, namely

$$\lim_{n\to\infty}\mathbb P\big(\theta_{0,G}\in J_G(\alpha)\big) = 1-\alpha.$$

Chapter 3

Model-X Conditional Independence Testing

3.1 Pearson Chi-squared randomization (PCR) test

Motivated by the issues of the CRT discussed in Section 1.1.2.1, in this chapter we propose a novel test, called the Pearson $\chi^2$ conditional randomization (PCR) test. We start by describing the PCR test and its test statistic. We then characterize the null distribution of the statistic, based on which we propose two rejection thresholds, for the finite and infinite sample regimes.

3.1.1 PCR test statistic

We construct the PCR test statistic in four main steps:

Data grouping. We first split the entire data set $D = \{(X_j, Z_j, Y_j)\}_{j=1:n}$ into $n_g$ groups of equal size $\{G_i\}_{i=1:n_g}$. This means that $|G_i| = n/n_g$ for $i\in[n_g]$.
In this step, $n_g$ is an input value that is known upfront, and for simplicity we assume that $n$ is divisible by $n_g$ (otherwise, remove the extra samples). Ideally, we want a moderately large value for $n_g$, as it will later serve as the number of samples for the uniformity testing problem in the multinomial model with the Pearson chi-squared test statistic.

Counterfeit sampling. This is a common step in model-X conditional independence testing methods: for each group of data points $G_i$, say $G_i = \{(X_j, Z_j, Y_j)\}_{j=1:n/n_g}$, several counterfeits of the form $\tilde G_i = \{(\tilde X_j, Z_j, Y_j)\}_{j=1:n/n_g}$ are constructed by sampling $\tilde X_j\sim\mathcal L_{X|Z}(\cdot|Z_j)$ while keeping $Y_j, Z_j$ intact. As we will discuss, a main distinction of our PCR test from other CRT approaches is that the PCR test works with a small number of counterfeits, whereas the CRT approach requires a large number of counterfeits (at least of order $1/\alpha$), given that the normalized rank statistic (1.11) is intrinsically discrete.

Score and label. Given a score function $T$, we first score each group and its counterfeits, and then label the groups based on the relative ranking of the score of the original group among the scores of its counterfeits. Specifically, we partition the range of possible ranks into $L$ subsets $S_1,\dots,S_L$ of equal size and assign label $\ell$ to groups whose score rank falls in $S_\ell$. Special cases of this idea (with $L = 2$ labels and unbalanced groups) can be traced back to the conformal inference literature [188, 108, 109, 151], where the sample quantile of the non-conformity scores is compared to a certain threshold to construct prediction intervals.

Uniformity testing in a multinomial model. Under the null hypothesis (1.10), using the exchangeability of the data scores and their counterfeit scores, it is straightforward to see that each label occurs with equal frequency (with the expected count of each label being $n_g/L$).
In this step, we use the Pearson chi-squared test statistic $U_{n_g,L}$ to test uniformity of the label occurrences in a multinomial model with $n_g$ samples and $L$ labels. Note that, in general, $L$ can scale with $n_g$, and as discussed in [5], the $\chi^2$ test can have poor power because the variance of the $\chi^2$ statistic is dominated by small entries of the multinomial. A truncated version of the $\chi^2$ statistic was proposed by [5] to mitigate this issue by limiting the contribution to the variance from each label. However, when testing for a uniform distribution, as in our case, the truncation becomes superfluous. This implies that in this case the usual $\chi^2$ statistic inherits several appealing properties of the truncated $\chi^2$ statistic. In particular, [5] showed that the truncated $\chi^2$ test is globally minimax optimal for the multinomial problem. It is worth noting that for the multinomial testing problem in high dimension ($L$ growing with $n_g$), upper and lower bounds on the critical radius $\varepsilon$ have been established in [135, 184]. Concretely, it is shown that $O(\sqrt L/\varepsilon^2)$ samples are sufficient, and information-theoretically necessary, for distinguishing the uniform distribution from alternatives that are $\varepsilon$-far in the $\ell_1$-ball, with success probability larger than $2/3$.

Generalization to the CRT extensions. It is worth noting that, in cases where the groups of the PCR test statistic have moderately large size, more complex functions such as fitted LASSO coefficients or a fitted neural network loss can be used as score functions. This means that many of the extensions made to improve the computational complexity or robustness of the CRT, e.g., the hold-out randomization test (HRT) [175], the distilled CRT [117], or the conditional permutation test (CPT) [21], can also be used to score groups of data points and therefore enhance the overall performance of the PCR test. A detailed description of the construction of the PCR statistic is given in Algorithm 1.
3.1.2 Decision rule

We introduce two rejection thresholds for the hypothesis testing problem (1.10) with the statistic U_{n_g,L} given in Algorithm 1. At significance level α, the decision rule is based on the test statistic:

    φ(X, Z, Y) = 1 if U_{n_g,L} ≥ θ_{L,α} (reject H_0), and 0 otherwise (accept H_0).   (3.1)

For the threshold θ_{L,α} we consider two proposals:

    θ^asym_{L,α} := χ²_{L−1}(1−α),    θ^finite_{L,α} := L + √(2L/α),   (3.2)

where χ²_{L−1}(1−α) denotes the 1−α quantile of a χ² distribution with L−1 degrees of freedom. As we show in the next section, the size of the PCR test is controlled asymptotically (as n → ∞) when using θ^asym_{L,α}. In addition, using θ^finite_{L,α}, we prove that the size is controlled in finite-sample settings. As is clear from its description, and similar to the CRT, the PCR test looks for statistically significant deviations between the distribution of the ranks of the original scores and the uniform distribution. While the CRT only examines the tails of the distributions, the PCR test examines the entire support by comparing the two distributions on L bins (corresponding to labels) of equal size, and is able to capture deviations occurring in the middle range as well as at the tails.

Input: n data points (X_j, Z_j, Y_j) ∈ R × R^{1×q} × R, a positive integer n_g as the number of groups (let s = n/n_g ∈ Z), a real-valued score function T : R^s × R^{s×q} × R^s → R, and integers K, L ≥ 1 (let M = KL − 1).
Output: Test statistic U_{n_g,L} for testing the conditional independence hypothesis (1.10).
• Split the data into n_g groups {G_j = (X_j, Z_j, Y_j)}_{j=1:n_g} of equal sizes, where G_j ∈ R^s × R^{s×q} × R^s.
for j ∈ [n_g] do
  • Draw M i.i.d. samples X̃^{(1)}_j, ..., X̃^{(M)}_j from L_{X|Z}(·|Z_j).
  • Construct M counterfeit groups {G̃^{(i)}_j = (X̃^{(i)}_j, Z_j, Y_j)}_{i=1:M}.
  • Use T to score the initial group G_j and its M counterfeits G̃^{(1:M)}_j:
      T_j = T(G_j),  T̃^{(i)}_j = T(G̃^{(i)}_j), for i ∈ [M].
  • Let R_j denote the rank of T_j among {T_j, T̃^{(1)}_j, ..., T̃^{(M)}_j}:
      R_j = 1 + Σ_{i=1}^{M} I{T_j ≥ T̃^{(i)}_j}.
  • Partition [M+1] = S_1 ∪ ... ∪ S_L with S_ℓ := {(ℓ−1)K + 1, ..., ℓK}. Assign label ℓ_j ∈ {1, 2, ..., L} to group G_j if R_j ∈ S_{ℓ_j}.
end
for ℓ ∈ {1, 2, ..., L} do
  • Let W_ℓ be the number of groups with label ℓ:
      W_ℓ := |{j ∈ {1, 2, ..., n_g} : ℓ_j = ℓ}|.
end
• Define the test statistic U_{n_g,L} as follows:
      U_{n_g,L} = (L/n_g) Σ_{ℓ=1}^{L} (W_ℓ − n_g/L)².
Algorithm 1: PCR test statistic

3.1.3 Size of the PCR test

Under the null hypothesis, the original and counterfeit scores come from the same population. Our next assumption, on the continuity of the relevant random variables, ensures that different data points achieve distinct score values with probability one. This symmetry over distinct values implies that each data point gets label ℓ ∈ [L] uniformly at random. In short, we convert the problem of conditional independence testing into a uniformity testing problem on data points coming from a multinomial distribution.

Assumption 3.1.1. For a score function T : R^s × R^{s×q} × R^s → R, assume that the following conditional cumulative distribution functions are continuous, for every pair (z, y) ∈ R^{s×q} × R^s:

    F_{T|ZY}(t; z, y) := P_{X|ZY}(T(X, z, y) ≤ t | Z = z, Y = y),   (3.3)
    F_{T|Z}(t; z, y) := P_{X|Z}(T(X, z, y) ≤ t | Z = z, Y = y).    (3.4)

Note that both F_{T|ZY} and F_{T|Z} are conditional on (Z, Y), and the randomness comes from X. The difference is that in F_{T|ZY} we have X ∼ L(X|Z, Y), while in F_{T|Z} we have X ∼ L(X|Z). It is worth noting that the above assumption, which is used to transform the conditional independence testing problem into a multinomial uniformity testing problem, is indeed a weak one. It is used to avoid ties when ranking the scores; alternatively, one can use a random tie-breaking rule and remove this assumption. In the next theorem, we show that using θ^asym_{L,α} in the decision rule (3.1) guarantees asymptotic control of the type I error.
This is an immediate consequence of characterizing the asymptotic distribution of the statistic U_{n_g,L} in Algorithm 1. Furthermore, we show that deploying the rejection threshold θ^finite_{L,α} results in finite-sample control of the type I error.

Theorem 3.1.2. Under the null hypothesis (1.10) and Assumption 3.1.1, the statistic U_{n_g,L} constructed in Algorithm 1 converges to a χ² distribution with L−1 degrees of freedom, for L ≥ 2:

    U_{n_g,L} ⇒ χ²_{L−1}, as n_g → ∞.

In addition, for every n_g ≥ 1, we have

    P(U_{n_g,L} ≥ θ^finite_{L,α}) ≤ α, with θ^finite_{L,α} = L + √(2L/α).

We refer to Section B.1.4 for the proof of Theorem 3.1.2. Based on the above characterization of the null distribution, in the finite-sample and asymptotic regimes, we can construct the following p-values for the testing problem (1.10):

    P^finite_{n_g,L} = 1 if U_{n_g,L} ≤ L, and min{2L/(U_{n_g,L} − L)², 1} otherwise,   (3.5)
    P^asym_{n_g,L} = 1 − F_{L−1}(U_{n_g,L}),   (3.6)

where F_k is the cdf of a chi-squared random variable with k degrees of freedom. Note that under the null hypothesis, the p-value P^asym_{n_g,L} is asymptotically uniform, whereas P^finite_{n_g,L} is super-uniform for finite n_g. Formally, for all t ∈ [0, 1]:

    lim_{n_g→∞} P(P^asym_{n_g,L} ≤ t) = t,   P(P^finite_{n_g,L} ≤ t) ≤ t, ∀ n_g ≥ 1.

3.2 A power analysis of the PCR test

We next provide a power analysis of the PCR test. To this end, we need a notion of distance between a probability density function p_{XZY}(x, z, y) and its corresponding conditional independence density p_X(x) p_{Z|X}(z|x) p_{Y|Z}(y|z), where p_{Y|Z}(y|z) is obtained by marginalizing out X, i.e., p_{Y|Z}(y|z) = ∫ p_{Y|XZ}(y|x, z) p_{X|Z}(x|z) dx. As expected, the larger this distance, the easier it is to discern the conditional dependency. The metric we use here to analyze the power of the PCR test is a generalization of the notion of the ordinal dominance curve (ODC) [77, 6]. For two densities p and q defined on the real line, the ODC is given by F_p(F_q^{-1}(t)), where F_p and F_q respectively denote the cdfs corresponding to p and q.
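The ODC just defined has a simple empirical analog: compose an empirical cdf of one sample with the empirical quantile function of the other. A small sketch (the samples and grid here are arbitrary stand-ins, not from the original text):

```python
import numpy as np

def empirical_odc(p_sample, q_sample, grid):
    """Empirical ordinal dominance curve t -> F_p(F_q^{-1}(t))."""
    p_sorted = np.sort(p_sample)
    # F_q^{-1}(t): empirical quantiles of the q-sample on the grid
    q_quantiles = np.quantile(q_sample, grid)
    # F_p evaluated at those quantiles (empirical cdf via searchsorted)
    return np.searchsorted(p_sorted, q_quantiles, side="right") / len(p_sorted)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
odc = empirical_odc(rng.normal(size=2000), rng.normal(size=2000), grid)
# when p = q the curve hugs the diagonal, and it is always non-decreasing
```

With two samples from the same distribution the curve tracks the diagonal, mirroring the fact that under the null the rank distribution in the PCR test is uniform.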
In other words, the ODC is the population analog of the P-P plot. The derivative of the ODC (if it exists) is given by f_p(F_q^{-1}(t))/f_q(F_q^{-1}(t)) and is called the relative density function ([177], Section 2.4). We next define the conditional ODC and the conditional relative density function, along with two assumptions. Let us emphasize that the upcoming assumptions are made to facilitate the power analysis; the validity of the PCR test (control of the type I error) holds even without them.

Definition 3.2.1 (Conditional ODC and relative density function). For a score function T as in Assumption 3.1.1, recall the conditional cdfs F_{T|ZY}(t; z, y) and F_{T|Z}(t; z, y) given by equations (3.3) and (3.4). We define the conditional ordinal dominance curve R_T : [0,1] → [0,1] as follows:

    R_T(u) = E_{(Z,Y)∼L(Z,Y)}[ F_{T|ZY}( F_{T|Z}^{-1}(u; Z, Y); Z, Y ) ].

For differentiable R_T, we call its derivative the conditional relative density function:

    r_T(u) := (∂/∂u) R_T(u), for u ∈ (0,1).

Assumption 3.2.2. Assume that the conditional relative density function r_T(u) is C-Lipschitz continuous. This also implies that r_T(u) is uniformly bounded, i.e., sup_{u∈[0,1]} |r_T(u)| ≤ B, for some constant B > 0.

Our next assumption is a sufficient condition for exchanging the order of the expectation and the derivative in the definition of r_T(u) (see (B.16)).

Assumption 3.2.3. We assume that

    ∫₀¹ E_{(Z,Y)∼L(Z,Y)}[ (∂/∂u) F_{T|ZY}( F_{T|Z}^{-1}(u; Z, Y); Z, Y ) ] du < ∞.

We are now ready to define a distance between the distributions of (X, Z, Y) and (X̃, Z, Y), where X̃ ∼ L(X|Z), independently of Y. Note that the two densities match under the null hypothesis (1.10).

Definition 3.2.4. For a score function T and its relative density function r_T(·), define the conditional dependency power as

    Δ_T(L(X, Z, Y)) = ∫₀¹ |r_T(u) − 1| du.

We next state some properties of the measure Δ_T(L(X, Z, Y)).
Recall that for two random variables U, V with density functions p, q (with respect to the Lebesgue measure), the total variation distance is defined as

    d_TV = (1/2) ∫_{−∞}^{∞} |p(t) − q(t)| dt.

Remark 3.2.5. The following hold for the measure Δ_T(L(X, Z, Y)).
(a) Under the null hypothesis (1.10), for any score function T satisfying Assumption 3.1.1, we have Δ_T(L(X, Z, Y)) = 0.
(b) The following upper bound holds in general:

    Δ_T(L(X, Z, Y)) ≤ E_{(Z,Y)∼L(Z,Y)}[ 2 d_TV( (T(X̃, Z, Y)|Z, Y), (T(X, Z, Y)|Z, Y) ) ],

with X ∼ L(X|Z, Y) and X̃ ∼ L(X|Z).

We refer to Section B.1.5 for the proof of Remark 3.2.5. As discussed earlier, the PCR test transforms the conditional independence problem into the problem of uniformity testing under a multinomial model. That said, in order to analyze the power of the PCR test, we focus on the latter problem. We use the results of [5], which characterize the power of the truncated χ²-test for a high-dimensional multinomial model in terms of the ℓ₁ distance between the nominal probabilities and the uniform distribution over the categories. However, it is not clear how the nominal probabilities in the multinomial model are related to the distribution of (X, Z, Y) in the original conditional independence testing problem. Our next proposition answers this question and relates the ℓ₁ distance between the nominal probabilities and the discrete uniform distribution, in the multinomial problem, to the measure Δ_T(L(X, Z, Y)) given in Definition 3.2.4.

Proposition 3.2.6. Under Assumption 3.2.3, in Algorithm 1, each group G_i ∼ L(X, Z, Y) admits label t ∈ {1, 2, ..., L}, independently from the other data points, with probability

    p_t = Σ_{j=(t−1)K}^{tK−1} (M choose j) ∫₀¹ u^j (1−u)^{M−j} r_T(u) du,   (3.7)

where r_T(·) is the conditional relative density function given by Definition 3.2.1. Under the null hypothesis (1.10), we have p_t = 1/L.
In addition, under Assumption 3.2.2, the partial sums of {p_t} satisfy the following bounds:
i) For every ℓ ∈ [L], we have

    Σ_{t=1}^{ℓ} p_t ≥ R_T(ℓ/L),   (3.8)

where R_T(u) is the conditional dominance curve given by Definition 3.2.1.
ii) Let D = C/2 + 2B with B, C given according to Assumption 3.2.2, and introduce ν_K := 2(9D² log K / √K)^{2/5}. Then, for K sufficiently large so that ν_K < 1, we have

    Σ_{t=1}^{ℓ} p_t ≤ R_T(ℓ/L) + ν_K.   (3.9)

iii) We have

    Σ_{ℓ=1}^{L} |p_ℓ − 1/L| ≥ Δ_T(L(X, Z, Y)) − L ν_K − C/L.   (3.10)

The proof of Proposition 3.2.6 is given in Section B.1.6. With Proposition 3.2.6 in place, we are now ready to state the main result about the statistical power of our PCR test. We start by analyzing the power of the PCR test when it is used with the finite-sample threshold θ^finite_{L,α}.

Theorem 3.2.7. Let U_{n_g,L} be the PCR test statistic (the output of Algorithm 1) with L labels, n_g groups, and M = KL − 1 counterfeits per sample, and a score function T that satisfies Assumptions 3.2.2 and 3.2.3 with parameters B, C. Suppose that for some β > 0, the conditional dependency power Δ_T(L(X, Z, Y)) satisfies the following:

    Δ_T(L(X, Z, Y)) ≥ (32 L^{1/4}/√n_g) (1/√α ∨ 1/β)^{1/2} + C/L + L ν_K,   (3.11)

with ν_K = 2(9(C/2 + 2B)² log K / √K)^{2/5}, for K sufficiently large so that ν_K < 1. Then the PCR test, used with the finite-sample threshold θ^finite_{L,α}, achieves power of at least 1 − β. More precisely, for all distributions L(X, Z, Y) satisfying (3.11), we have

    P( U_{n_g,L} ≥ L + √(2L/α) ) ≥ 1 − β.

The proof of Theorem 3.2.7 follows from Proposition 3.2.6 and is given in Section B.1.7. We next analyze the power of the PCR test when it is employed with the asymptotic threshold θ^asym_{L,α}.

Theorem 3.2.8. Let U_{n_g,L} be the PCR test statistic (the output of Algorithm 1) with L labels and M = KL − 1 counterfeits per sample, and a score function T that satisfies Assumptions 3.2.2 and 3.2.3 with parameters B, C.
In addition, suppose that the following lower bound holds for the conditional dependency power Δ_T(L(X, Z, Y)):

    Δ_T(L(X, Z, Y)) ≥ (L^{1/4}/√n_g) · max{ ( √(3 log(1/β)) + 3 log(1/β) + 2√(log(1/α)) + 2 log(1/α) )^{1/2}, 1 } + C/L + L ν_K,   (3.12)

with ν_K = 2(9(C/2 + 2B)² log K / √K)^{2/5}, for K sufficiently large so that ν_K < 1. Then the PCR test deployed with the asymptotic threshold θ^asym_{L,α} has asymptotic statistical power at least 1 − β. Formally, for all distributions L(X, Z, Y) satisfying (3.12), the following holds:

    lim_{n_g→∞} P( U_{n_g,L} ≥ χ²_{L−1}(1−α) ) ≥ 1 − β.

The proof of Theorem 3.2.8 also uses the results of Proposition 3.2.6 and is deferred to Section B.1.8. The next result provides guidelines on the choice of the number of labels L as the number of samples n (and so the number of groups n_g) grows to infinity.

Remark 3.2.9. Note that the lower bound (3.11) on Δ_T(L(X, Z, Y)) is minimized for L ≍ n_g^{2/5}. This suggests that the optimal scaling for the number of labels L in the PCR test with the finite-sample threshold is L ≍ n_g^{2/5}, which results in non-trivial power as long as Δ(L(X, Z, Y)) ≳ n_g^{−2/5}.

3.3 Parameter-free PCR test

The PCR test statistic described in Algorithm 1 takes the parameters K, L as input. In general, a large K (for a fixed value of L) results in a large value of M (the number of counterfeits) and hence increases the statistical power of the test, because we can better discern the discrepancy between the distribution of the ranks and the discrete uniform distribution. This benefit of course comes at a higher computational cost for constructing the test statistic. The choice of L (the total number of labels) is, however, more subtle. On the one hand, a large value of L implies that many of the labels occur rarely, which makes it challenging to detect significant deviations from the discrete uniform distribution (too many weak effects). On the other hand, a small value of L results in only a few bins over which we compare the test statistic with the discrete uniform distribution.
In this case the test may miss sharp deviations, as they are aggregated with the relatively large number of other points in the same bin. A similar observation can be made from the results of Theorem 3.2.7 (and Theorem 3.2.8), where the right-hand side of (3.11) (and (3.12)) has one term decreasing in L and one term increasing in L. Therefore, the parameter L should be perceived as a tuning parameter in Algorithm 1. As we showed in Theorem 3.1.2, any choice of L results in a test with type I error control; however, different choices of L give different statistical powers. A natural approach is to run the PCR test multiple times, each time with a different value of L, and then 'pick' the one that results in the smallest (most significant) p-value. However, this approach clearly violates the validity of the reported p-value, as we should account for the 'cherry-picking'. Also, note that the obtained p-values (with different choices of L) are dependent, as they are constructed from a common data set. To properly combine the p-values, we use Bonferroni's method. Algorithm 2 describes this idea and presents a parameter-free version of Algorithm 1.

Input: n data points (X_j, Z_j, Y_j) ∈ R × R^{1×q} × R, significance level α ∈ (0,1), a real-valued score function T : R^s × R^{s×q} × R^s → R, K ≥ 1, and a grid of N values {L_1, ..., L_N}.
Output: Decision on the conditional independence hypothesis (1.10).
for i ∈ [N] do
  • Run Algorithm 1 with L = L_i labels to get the test statistic U_{n_g,L_i}.
  • Construct the p-value P_i using (3.5) (for the finite-sample case) or (3.6) (for the asymptotic case).
end
• Reject the null hypothesis if P* := N min_{i∈[N]} P_i ≤ α.
Algorithm 2: Parameter-free PCR test

The next theorem follows readily from Theorem 3.1.2 along with a union bound for the Bonferroni correction.

Theorem 3.3.1. Under the null hypothesis (1.10), the p-value P* constructed in Algorithm 2 is super-uniform, i.e., P(P* ≤ t) ≤ t, for all t ∈ [0,1].
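The combination step of Algorithm 2 is a one-liner once the per-L p-values are in hand. A small sketch (the p-values below are hypothetical, standing in for runs of Algorithm 1 on a grid of L values):

```python
def bonferroni_pcr(p_values, alpha):
    """Combine PCR p-values obtained on a grid of L values (Algorithm 2 sketch).

    Returns the Bonferroni-adjusted p-value P* and the rejection decision.
    """
    N = len(p_values)
    p_star = min(1.0, N * min(p_values))  # capped at 1 so it remains a valid p-value
    return p_star, p_star <= alpha

# hypothetical p-values from runs with L in {2, 4, 8}:
p_star, reject = bonferroni_pcr([0.004, 0.09, 0.31], alpha=0.05)
# p_star = 3 * 0.004 = 0.012 -> reject at level 0.05
```

Since each P_i is super-uniform under the null (Theorem 3.1.2), the union bound gives P(P* ≤ t) ≤ Σ_i P(P_i ≤ t/N) ≤ t, which is the content of Theorem 3.3.1.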
3.4 Robustness of the PCR test

In this section, we investigate the conditional independence problem when the exact conditional distribution P_{X|Z} is not available; rather, we use P̂_{X|Z}(·|Z), an estimate of P_{X|Z}(·|Z), for sampling the counterfeits. We would like to modify the PCR test so that it still controls the type I error when access to the exact conditional law P_{X|Z}(·|Z) is not feasible. To this end, the next theorem introduces a new test statistic, based on the discrepancy between the conditional laws P_{X|Z}(·|Z) and P̂_{X|Z}(·|Z), along with rejection thresholds for both the asymptotic setting and the finite-sample setting. We use the expected total variation metric to assess the distance between the conditional laws.

Theorem 3.4.1. Let W_ℓ, for ℓ ∈ [L], be the number of groups with label ℓ, as defined in Algorithm 1. For δ such that E_Z[ d_TV( P_{X|Z}(·|Z), P̂_{X|Z}(·|Z) ) ] ≤ δ, introduce

    U_{n_g,L}(δ) := min_{ {p_ℓ}_{ℓ∈[L]} } (L / (n_g (1 + Lδ))) Σ_{ℓ=1}^{L} (W_ℓ − n_g p_ℓ)²
    s.t. p_ℓ ≥ 0, |p_ℓ − 1/L| ≤ δ for ℓ ∈ [L], and Σ_{ℓ=1}^{L} p_ℓ = 1.   (3.13)

Recall the thresholds θ^finite_{L,α} and θ^asym_{L,α} from (3.2). Under the null hypothesis, we have the following relations:

    P( U_{n_g,L}(δ) ≥ θ^finite_{L,α} ) ≤ α,   (3.14)
    lim_{n_g→∞} P( U_{n_g,L}(δ) ≥ θ^asym_{L,α} ) ≤ α.   (3.15)

We refer to Section B.1.11 for the proof of Theorem 3.4.1. Note that the optimization (3.13) is a quadratic program and can be solved efficiently. Also, the statistic U_{n_g,L}(δ), given as the optimal value of this optimization, is decreasing in δ, and when there is no mismatch between the true and the approximate conditional laws (δ = 0) we recover the original statistic U_{n_g,L} given by Algorithm 1. As an immediate corollary of Theorem 3.4.1, we can construct valid p-values for testing conditional independence (i.e., super-uniform under the null hypothesis (1.10)) following the same recipe given by (3.5)-(3.6), but using U_{n_g,L}(δ) instead of U_{n_g,L}.
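The quadratic program (3.13) is the Euclidean projection of the empirical label frequencies W/n_g onto a box intersected with the probability simplex, which can be solved by bisection on the simplex multiplier. The sketch below takes this projection route (one of several valid ways to solve the QP); the label counts are hypothetical.

```python
import numpy as np

def robust_pcr_statistic(W, delta):
    """U_{n_g,L}(delta) from (3.13), solved as a box-constrained projection of
    W/n_g onto the simplex via bisection on the multiplier (a sketch)."""
    W = np.asarray(W, dtype=float)
    L, n_g = len(W), W.sum()
    lo, hi = max(0.0, 1.0 / L - delta), 1.0 / L + delta
    w = W / n_g                                   # unconstrained minimizer
    # find lam with sum(clip(w - lam, lo, hi)) = 1; the sum is monotone in lam
    a, b = w.min() - hi, w.max() - lo
    for _ in range(100):
        mid = 0.5 * (a + b)
        if np.clip(w - mid, lo, hi).sum() > 1.0:
            a = mid
        else:
            b = mid
    p = np.clip(w - 0.5 * (a + b), lo, hi)
    return L / (n_g * (1 + L * delta)) * np.sum((W - n_g * p) ** 2)

W = [30, 20, 25, 25]                              # hypothetical label counts, L = 4
u0 = robust_pcr_statistic(W, delta=0.0)           # delta = 0 forces p = 1/L
# u0 = (4/100) * ((30-25)^2 + (20-25)^2) = 2.0, the plain statistic U_{n,L}
```

With δ = 0 the box collapses to the single point p_ℓ = 1/L and the plain statistic is recovered; for larger δ the feasible set grows and the statistic can only decrease, as stated in the text.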
In the next theorem, we provide an upper bound on the type I error inflation for the case where the standard test statistic U_{n,L} is adopted while the randomizations are drawn from the estimated conditional law P̂_{X|Z}.

Theorem 3.4.2. Under the null hypothesis (1.10), consider the test statistic U_{n_g,L} constructed in Algorithm 1 with the approximate conditional law P̂_{X|Z}. The following hold:

    P(U_{n_g,L} ≥ θ^finite_{L,α}) ≤ α + E[ d_TV( P^n_{X|Z}, P̂^n_{X|Z} ) ],
    lim sup_{n_g→∞} P(U_{n_g,L} ≥ θ^asym_{L,α}) ≤ α + lim sup_{n→∞} E_Z[ d_TV( P^n_{X|Z}(·|Z), P̂^n_{X|Z}(·|Z) ) ],

where θ^finite_{L,α}, θ^asym_{L,α} are given by (3.2).

The proof of Theorem 3.4.2 is deferred to Section B.1.10. It is worth noting that in the model-X setup, p̂_{X|Z} is often approximated via a set of unlabeled samples {(X̃_j, Z̃_j)}_{j=1:N}. Specifically, when P_{X|Z} belongs to a parametric family with k parameters and N ≫ kn, the aforementioned total variation distance is of order o_p(1). We refer to [21, Section 5.1] for a detailed discussion of conditions under which E[d_TV(P^n_{X|Z}, P̂^n_{X|Z})] = o_p(1).

3.5 Numerical experiments

In this section, we evaluate the performance of the PCR test and its extensions on synthetic datasets. We consider groups each of size 1 (n_g = n), unless otherwise stated.

Size of the PCR test. We start by showing that the size of the PCR test is controlled at the desired level under various choices of the input parameters L and K. Assume n = 100 data points {(X_i, Z_i, Y_i)}_{i=1}^n are generated i.i.d. from the following model: first draw two vectors v, u ∈ R^p with i.i.d. standard normal entries and p = 20. Then,

    Z ∼ N(0, I_p), for Z ∈ R^p,
    X|Z ∼ N(v^T Z, 1), for X ∈ R,   (3.16)
    Y|X,Z ∼ N((u^T Z)², 1).

Clearly X ⊥⊥ Y | Z and the null hypothesis holds. We assume that the dependency rule X|Z and the vector v are known, and therefore for every given Z we can easily sample from N(v^T Z, 1) to construct the counterfeit variables.
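The data-generating model (3.16) and the known counterfeit sampler can be reproduced in a few lines; a small sketch (the seeds and sizes are incidental choices):

```python
import numpy as np

def generate_null_data(n=100, p=20, rng=None):
    """Draw (X, Z, Y) from model (3.16), under which X is independent of Y given Z."""
    rng = np.random.default_rng(rng)
    v, u = rng.standard_normal(p), rng.standard_normal(p)
    Z = rng.standard_normal((n, p))               # Z ~ N(0, I_p)
    X = Z @ v + rng.standard_normal(n)            # X | Z ~ N(v'Z, 1)
    Y = (Z @ u) ** 2 + rng.standard_normal(n)     # Y | X, Z ~ N((u'Z)^2, 1)
    return X, Z, Y, v

def sample_counterfeits(Z, v, M, rng=None):
    """M counterfeit draws per row, from the known law X | Z ~ N(v'Z, 1)."""
    rng = np.random.default_rng(rng)
    return (Z @ v)[:, None] + rng.standard_normal((len(Z), M))

X, Z, Y, v = generate_null_data(rng=0)
Xcf = sample_counterfeits(Z, v, M=7, rng=1)
```

Because the conditional law of X given Z is known exactly here, the counterfeits are exchangeable with X under the null, which is what drives the size control reported below.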
Figures 3.1 and 3.2 exhibit the performance of the PCR test with the thresholds θ^finite_{L,α} and θ^asym_{L,α}, respectively. As expected, the θ^finite_{L,α} threshold is conservative and controls the size at a level lower than α. The θ^asym_{L,α} threshold also controls the size, even though n is only 100.

Statistical power of the PCR test. Consider a setup similar to (3.16), but with n = 1000 data points and the conditional law

    Y|X,Z ∼ N((u^T Z)² + 2X, 1).   (3.17)

Our power analysis in Section 3.2 suggests that larger values of M = KL − 1 result in higher power. We fix K = 100 and let L vary in the set L = {2, 3, ..., 30}. The significance level is fixed at α = 0.1. Figure 3.3 showcases the power of the PCR test with both choices of rejection thresholds θ^finite_{L,α} and θ^asym_{L,α}. As we see, when n doubles, not only does the power increase, but it also becomes more stable with respect to the choice of L.

[Figure 3.1: three panels, (a) α = 0.05, (b) α = 0.1, (c) α = 0.15, each plotting the rejection rate against the number of labels L for K ∈ {1, 4, 20, 100}.] Figure 3.1: Size of the PCR test applied to a dataset consisting of n = 100 samples generated from model (3.16), where X ⊥⊥ Y|Z. Three significance levels α = 0.05, 0.1, and 0.15 are considered. The statistic U_{n,L} is obtained from Algorithm 1 using the score function T(x, z, y) = (y − x − z^T 1)², and the decision rule (3.1) is employed with the threshold θ^finite_{L,α}. Reported numbers are averaged over 10,000 independent realizations.
[Figure 3.2: three panels, (a) α = 0.05, (b) α = 0.1, (c) α = 0.15, each plotting the rejection rate against the number of labels L for K ∈ {1, 4, 20, 100}.] Figure 3.2: Size of the PCR test applied to a dataset consisting of n = 100 samples generated from model (3.16), where X ⊥⊥ Y|Z. Three significance levels α = 0.05, 0.1, and 0.15 are considered. The statistic U_{n,L} is obtained from Algorithm 1 using the score function T(x, z, y) = (y − x − z^T 1)², and the decision rule (3.1) is employed with the threshold θ^asym_{L,α}. Reported numbers are averaged over 10,000 independent realizations.

Parameter-free PCR test. We consider a setup similar to the previous experiment (3.17) and run the PCR test with different choices of L ∈ {2, 4, 8, 16, 32}. We combine the obtained p-values using the Bonferroni correction, as described in Algorithm 2. With n = 1000 data points, we get a statistical power of 0.192 (with the finite-sample threshold) and 0.815 (with the asymptotic threshold). Note that in this case, the power of the PCR test with different individual choices of L (without combining the p-values) ranges over (0.13, 0.53) for the finite-sample threshold, and over (0.576, 0.887) for the asymptotic threshold.

[Figure 3.3: two panels, (a) n = 1000 data points and (b) n = 2000 data points, each plotting the rejection rate against the number of labels.] Figure 3.3: Power of the PCR test for (left) n = 1000 and (right) n = 2000 data points. Data points are generated under the setup (3.16) and the conditional law (3.17). We consider the score function (y − x − z^T 1)² and choose the significance level α = 0.1. We consider the decision rule (3.1) with both of the rejection thresholds θ^asym_{L,α} and θ^finite_{L,α}. Each reported power is obtained by averaging over 1000 trials.
For n = 2000 data points, and with the Bonferroni correction, we get a power of 0.613 with the finite-sample threshold, and a power of 0.972 with the asymptotic threshold. Here, the power of the PCR test with individual choices of L ranges over (0.477, 0.83) for the finite-sample threshold, and over (0.856, 0.981) for the asymptotic threshold.

Robustness of the PCR test. In this part, we consider cases where the exact dependency law P_{X|Z} is not available, and we use an estimate of it, denoted by P̂_{X|Z} (see Section 3.4 for the details and the description of the robust PCR test). Consider a setup similar to (3.16), but with n = 5000 data points and the conditional law

    Y|X,Z ∼ N((u^T Z)² + aX, 1).   (3.18)

When a = 0, the null hypothesis is true (X ⊥⊥ Y|Z) and the rejection rate amounts to the type I error. For a ≠ 0, the null hypothesis is false and the rejection rate amounts to the power of the test. In the current experiment, we assume that the counterfeits are sampled from P̂_{X|Z} with

    X̂|Z ∼ N(v^T Z, (1+η)²).   (3.19)

Note that when η = 0, we recover the true distribution P_{X|Z} defined in (3.16). We use Pinsker's inequality* to bound the expected total variation distance E_Z[d_TV(P_{X|Z}(·|Z), P̂_{X|Z}(·|Z))]. Note that for two one-dimensional Gaussian distributions we have

    d_KL(N(μ, σ₁²), N(μ, σ₂²)) = log(σ₂/σ₁) + σ₁²/(2σ₂²) − 1/2,

which combined with Pinsker's inequality implies that

    E_Z[d_TV(P_{X|Z}(·|Z), P̂_{X|Z}(·|Z))] ≤ δ := (1/√2) (log(1+η) + 1/(2(1+η)²) − 1/2)^{1/2}.   (3.20)

The results for a = 0 and a = 4 are summarized in Table 3.1. As we see, the robust PCR test controls the type I error under the level α = 0.1 for different choices of η. In addition, it achieves high power for a = 4. It is also worth noting that if we use the test statistic U_{n,L} (instead of U_{n,L}(δ)) we observe an inflation in type I errors.
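The bound (3.20) is straightforward to evaluate numerically; a small sketch (the η grid mirrors the values used in the experiment):

```python
import math

def pinsker_delta(eta):
    """Upper bound (3.20) on E_Z[d_TV] for the inflated-variance estimate (3.19):
    KL between N(mu, 1) and N(mu, (1+eta)^2), then Pinsker's inequality."""
    kl = math.log(1.0 + eta) + 1.0 / (2.0 * (1.0 + eta) ** 2) - 0.5
    return math.sqrt(kl / 2.0)

# delta is 0 at eta = 0 and grows with the variance misspecification eta
deltas = [pinsker_delta(eta) for eta in (0.0, 0.01, 0.02, 0.04)]
```

These δ values are exactly what gets plugged into the robust statistic U_{n,L}(δ) of Theorem 3.4.1 for each misspecification level η.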
Concretely, when η = 0.04 we obtain an inflated type I error of 0.595 (with the finite-sample threshold θ^finite_{L,α}) and an inflated type I error of 0.186 (with the asymptotic threshold θ^asym_{L,α}), while the target level is α = 0.1. This highlights the importance of adjusting for the errors in estimating the model-X conditional distribution, as proposed in Section 3.4.

                                        a = 0                       a = 4
    η                          0      0.01   0.02   0.04    0      0.01   0.02   0.04
    U_{n,L}(δ) with θ^finite   0.008  0      0      0       1      0.998  0.973  0.63
    U_{n,L}(δ) with θ^asym     0.105  0.003  0      0       1      1      0.995  0.879

Table 3.1: Size (a = 0) and power (a = 4) of the robust PCR test for the setting (3.18), with the approximate distribution (3.19) available for sampling the counterfeits. We consider a = 0, 4 and n = 5000 data points. The PCR test is run with L = 4 labels. For each value of the discrepancy level η, we use (3.20) to get an upper bound δ on the expected total variation distance E_Z[d_TV(P_{X|Z}(·|Z), P̂_{X|Z}(·|Z))], and use it in constructing the robust statistic U_{n,L}(δ). Note that η = 0 implies δ = 0, and therefore the robust statistic U_{n,L}(δ) matches the statistic U_{n,L} of Algorithm 1. For a = 0 (true null hypothesis), the size is controlled at the significance level α = 0.1. Reported numbers are obtained by averaging over 1000 trials.

* Pinsker's inequality: d_TV(P, Q) ≤ √((1/2) d_KL(P, Q)).

3.5.1 Real data experiment: Capital Bikeshare dataset

In this section, we evaluate the performance of the PCR test on real data from Capital Bikeshare†. Capital Bikeshare is a bike-sharing system in Washington, D.C., which releases its trip data on a quarterly basis. The data include, for each trip taken, the start date and time, end date and time, start and end stations, bike ID, and the user type, indicating whether the rider was a registered member or whether it was a casual ride (a one-time rental or a short pass).
In this experiment, we use our proposed PCR test to study the independence of the trip duration (X) and other variables, such as the user type (Y), and provide p-values for their associations. Similar data and a similar question have been studied by [21] using the Conditional Permutation Test (CPT). As can be imagined, the trip duration (X) depends heavily on the route (length of the route, elevation, etc.) and the time of day at the start of the ride (due to varying traffic and rush hours). To control for the effect of such variables, we condition on the start and end locations and the hour of the day, Z = (Z_start-loc, Z_end-loc, Z_hour).

† The dataset is publicly available at https://www.capitalbikeshare.com/system-data.

In order to implement the PCR test, we use the conditional normal distribution X|Z ∼ N(μ(Z), σ²(Z)) as an approximation of P_{X|Z}. We follow the procedure of [21] to estimate the mean μ(z) and variance σ²(z), and outline it here for the reader's convenience. We consider test data consisting of the rides taken on weekdays in October 2011, and training data consisting of the rides taken on weekdays in September 2011 and November 2011. The test data are used for testing conditional independence between the factors of interest, and the training data are used to estimate the conditional mean and variance (μ(z), σ²(z)). To obtain reliable estimates, we eliminate the records in the test data for which the corresponding route has fewer than 20 rides in the training data. After this preprocessing step, the test data include 7,346 samples. Finally, the conditional functions μ(z) and σ²(z) are estimated on the training data using a Gaussian kernel with a bandwidth of 20 minutes. (See [21, Appendix B] for further details on this part.)
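The kernel estimates of μ(z) and σ²(z) can be sketched as Nadaraya-Watson style weighted moments. The snippet below is a simplified stand-in that conditions only on a single scalar (e.g., the start time in minutes) rather than the full Z; the toy data are hypothetical.

```python
import numpy as np

def kernel_mean_var(t_train, x_train, t_query, bandwidth=20.0):
    """Gaussian-kernel estimates of mu(t) and sigma^2(t) at a query point t.

    t_train: 1-d conditioning variable (e.g., ride start time, in minutes)
    x_train: response (e.g., trip duration)
    """
    w = np.exp(-0.5 * ((t_train - t_query) / bandwidth) ** 2)
    w = w / w.sum()                         # normalized kernel weights
    mu = np.sum(w * x_train)                # weighted mean
    var = np.sum(w * (x_train - mu) ** 2)   # weighted variance
    return mu, var

t = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
mu, var = kernel_mean_var(t, x, t_query=20.0)
# by symmetry of the weights around t = 20, mu = 7 here
```

The fitted (μ̂(Z), σ̂²(Z)) then define the approximate counterfeit sampler N(μ̂(Z), σ̂²(Z)) used by the test.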
We test the null hypothesis (1.10) with X being the duration of the ride and three different response variables Y: (1) user type: registered members are familiar with the routes and are likely to have lower trip durations; (2) date of the month (a continuous variable from 1 to 30): this can capture the effect of factors such as weather and sunlight hours; (3) weekday (a categorical variable from Monday to Friday): rides on the early days of the week are likely to be more work-related. For the score function to be used in the PCR test, we consider the squared residual from regressing Y on X. As an example, when Y is the user type, we encode it as a binary variable Y = I{the user is a registered member} and fit the linear model Y = b_0 + b_1 X to the training data to obtain the estimates b̂_0, b̂_1.

In this experiment, we use the PCR test with L = 10 labels and the counterfeit ratio K = 200, and therefore M = 1999. Further, in order to reduce the variation between the true distribution P_{X|Z} and its estimate N(μ̂(Z), σ̂²(Z)), we use the PCR test with groups each of size 4. This means that in Algorithm 1 the number of groups is n_g = ⌊n/4⌋. In this case, for each group of data points G = (X, Y) we use the following test statistic:

    T(G) = (1/4) ‖b̂_0 1 + b̂_1 X − Y‖²₂,

with 1 being the all-ones vector in R⁴.

We calculate the p-values for each of the conditional independence tests using (3.5). The results are summarized in Table 3.2. As we see, among the three response variables considered in this experiment, user type has the most significant (conditional) dependence on the duration of the ride.

    Response Y    p-value (finite)    p-value (asym)
    User type     0.0014              0
    Date          0.3855              0.0456
    Weekday       0.2094              0.0194

Table 3.2: p-values computed from the PCR test on the Capital Bikeshare dataset. The null hypothesis (1.10) is considered with X being the duration of the ride, and the confounder variable Z encoding the start and end locations, as well as the time of day at the start of the ride. We consider three different response variables Y: (1) user type, (2) date of the month, (3) weekday. The p-values are obtained as per (3.5) with L = 10 labels and the counterfeit ratio K = 200.

3.6 Comparison with other conditional randomization tests

In this section, we explain the advantages of our PCR test over other conditional randomization tests by providing a set of theoretical results as well as a thorough numerical study with various choices of score functions. Before proceeding, let us stress that we do not claim the PCR test to be always more powerful than other CRTs; rather, we would like to explore settings where the PCR test offers clear advantages over the other competitors.

We first discuss a regression setting where the CRT with the marginal covariance score function provably gets power less than c_0 α, for an arbitrarily small but fixed c_0 > 0. This result holds even with infinite sample size and an infinite number of counterfeits.

Theorem 3.6.1. Consider the following model between the response variable Y and the covariate X:

    X ∼ N(0,1),  Y = g(X) + ε,  ε ∼ N(0,1),   (3.21)

where the regression function g(x) is an even function. Define

    η_g := ( (1 + E[X² g(X)²]) / (1 + E[g(X)²]) )^{1/2}.

Then, the following hold:
(a) For any α ∈ (0, 1/2), the two-sided CRT at significance level α (rejecting the α/2 upper and lower quantiles) has power smaller than (8/π)(η_g² + 2η_g). Formally,

    lim_{M→∞} lim_{n→∞} P( |p_n^{(M)} − 1/2| ≥ (1−α)/2 ) ≤ (8/π)(η_g² + 2η_g).

(b) For any α ∈ (0, 1/2 − γ) with γ > 0, the one-sided CRT at significance level α (rejecting either the α upper or lower quantile) has power smaller than (η_g² + 2η_g)/(2πγ²). Formally,

    lim_{M→∞} lim_{n→∞} P( p_n^{(M)} ≥ 1−α ) ≤ (η_g² + 2η_g)/(2πγ²),
    lim_{M→∞} lim_{n→∞} P( p_n^{(M)} ≤ α ) ≤ (η_g² + 2η_g)/(2πγ²).

We refer to Section B.1.2 for the proof of Theorem 3.6.1. The next corollary follows from the above theorem.

Corollary 3.6.2.
For g(x) = 1/√(θ² + x²), a simple algebraic calculation shows that η_g ≤ 5θ/√(2π). Therefore, by taking θ small enough (depending on α), the power of the CRT is less than α/2.

To illustrate the advantage of the PCR test in this case, we generate a dataset (X, Y) with n = 1000 data points according to (3.21) with g(x) = 1/√(10⁻⁶ + x²). We run the two-sided CRT at significance level α = 0.1 with M = 1000 counterfeits. The statistical power of the CRT, averaged over N = 10,000 experiments, turns out to be zero. We also run the PCR test on the same example, with L = 5 and different values of K, and with the same score function. The number of counterfeits per sample is therefore M = 5K − 1. In this experiment, the PCR test is considered with groups of size 4 (so n_g = 250) at significance level α = 0.1. We consider both of the rejection thresholds θ^asym, θ^finite for the decision rule (3.1). The PCR test achieves perfect power for different choices of K ≥ 4 (and hence different numbers of counterfeits), for both rejection thresholds.

Before proceeding to the next result, it is useful to remark on a few points regarding Theorem 3.6.1.

Remark 3.6.3. The marginal covariance statistic T(X, Y) = n^{−1} X^T Y used as a score function for the CRT is a popular choice in many applications [196], [121]. In particular, [190] analyzed the statistical power of the CRT with this score function in the high-dimensional linear regression setting. However, let us stress that the CRT with other choices of score function may have high power even in the setting of Theorem 3.6.1. The main message of the above theorem is to point out alternatives where the deviation of the normalized scores from the uniform distribution occurs in the middle range rather than at extreme values, and to show that in these cases the design of our PCR test helps to discern this deviation.

Remark 3.6.4.
Theorem 3.6.1 is not in conflict with the result of [97], which states that in model-X CI problems the CRT equipped with a likelihood-based score function is the most powerful test against a point alternative. More precisely, for the data setting of Theorem 3.6.1 the optimal score function for the CRT is given by
\[
T_{\rm opt}(X,Y) = \prod_{i=1}^{n} L(y_i \mid x_i) = (2\pi)^{-n/2} \exp\Big( -\frac12 \sum_{i=1}^{n} \big(y_i - g(x_i)\big)^2 \Big).
\]
Note that $T_{\rm opt}(X,Y)$ requires knowledge of the link function $g(\cdot)$, which is a strong assumption in the model-X framework, as that framework aims to make no assumption on the conditional law $\mathcal{L}(Y \mid X)$.

We next prove a (stronger) converse of Theorem 3.6.1. Informally, it states that whenever the CRT has non-trivial power, the PCR test will also have non-trivial power (it achieves any power $1-\beta$, provided a large enough sample size). Note that this result holds for any alternative hypothesis and any choice of score function.

Theorem 3.6.5. Consider an alternative hypothesis under which the CRT achieves non-trivial power, with a proper choice of score function. This in particular implies that the distribution of the normalized rank deviates from the uniform distribution, i.e.,
\[
\mathbb{P}\Bigg( \frac{1}{M+1}\Big( 1 + \sum_{j=1}^{M} \mathbb{I}\big\{T(X,Z,Y) \ge T(\widetilde X_j, Z, Y)\big\} \Big) \le \alpha \Bigg) \ge \alpha + \delta, \qquad (3.22)
\]
for some $\delta > 0$. Consider the PCR test with $L$ labels (with $L \ge 1/\alpha$) and $n_g$ groups, each of size $n$ (so the total sample size is $n n_g$). Under the following condition,
\[
\delta \ge \frac{L^{1/4}}{\sqrt{n_g}} \max\Bigg\{ \Big( \sqrt{3\log\tfrac{1}{\beta}} + 3\log\tfrac{1}{\beta} + 2\sqrt{\log\tfrac{1}{\alpha}} + 2\log\tfrac{1}{\alpha} \Big)^{1/2},\; 1 \Bigg\}, \qquad (3.23)
\]
the PCR test asymptotically achieves statistical power larger than $1-\beta$:
\[
\liminf_{n_g\to\infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge 1-\beta.
\]
In addition, if the gap value $\delta$ satisfies the following lower bound,
\[
\delta \ge \frac{32\, L^{1/4}}{\sqrt{n_g}} \Big( \frac{1}{\sqrt{\alpha}} \vee \frac{1}{\beta} \Big)^{1/2}, \qquad (3.24)
\]
then the PCR test with the finite-sample threshold $\theta^{\rm finite}_{L,\alpha}$ achieves statistical power larger than $1-\beta$:
\[
\mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm finite}_{L,\alpha} \big) \ge 1-\beta.
\]
We refer to Section B.1.9 for the proof of Theorem 3.6.5.

3.6.1 Numerical examples

In this section, we compare the performance of PCR with other model-X CI tests.
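Before the comparisons, the failure mode behind Theorem 3.6.1 can be reproduced in a few lines. The following Monte Carlo sketch is illustrative only (the sample size, $\theta$, number of counterfeits, and repetition count are hypothetical choices, not the dissertation's experiment); the CRT here resamples $X$ from its known $\mathsf{N}(0,1)$ law and uses the marginal covariance score:

```python
import math
import random

random.seed(0)

def crt_two_sided_power(theta, n=300, M=49, alpha=0.1, reps=40):
    # Monte Carlo power of the two-sided CRT with the marginal covariance
    # score T(X, Y) = n^{-1} X^T Y, in the model Y = g(X) + eps with the
    # even link g(x) = 1 / sqrt(theta^2 + x^2) and X ~ N(0, 1).
    rejections = 0
    for _ in range(reps):
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [1 / math.sqrt(theta ** 2 + xi ** 2) + random.gauss(0, 1) for xi in x]
        t_obs = sum(xi * yi for xi, yi in zip(x, y)) / n
        t_fake = []
        for _ in range(M):  # counterfeits: redraw X from its known N(0,1) law
            xf = [random.gauss(0, 1) for _ in range(n)]
            t_fake.append(sum(xi * yi for xi, yi in zip(xf, y)) / n)
        p = (1 + sum(t_obs >= tf for tf in t_fake)) / (M + 1)
        if p <= alpha / 2 or p >= 1 - alpha / 2:  # two-sided rejection region
            rejections += 1
    return rejections / reps

power = crt_two_sided_power(theta=1e-3)
```

The mechanism is the one the theorem points at: the counterfeit scores have variance roughly $(1+\mathbb{E}[g(X)^2])/n$, which blows up as $\theta \to 0$, while the true score stays $O(n^{-1/2})$; the normalized rank therefore piles up near $1/2$, a deviation in the middle range that the two-sided CRT cannot see.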
To this end, we consider the CRT, dCRT (distilled CRT) [117], and HRT (holdout randomization test) [175]. We focus on the following data-generating law
\[
Y \mid X, Z \sim \mathsf{N}\Big( \frac{\nu}{\sqrt{X^2+c^2}} + \nu\beta X + \gamma^{\sf T} Z,\; 1 \Big), \qquad (3.25)
\]
for $\beta \in \mathbb{R}$, $\gamma \in \mathbb{R}^p$, $c = 0.001$, and $X, Z$ with i.i.d. standard normal entries. We focus on two different settings: (i) low-dimensional ($n > p$), and (ii) high-dimensional ($n < p$). […]
\[
H_0: \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] \le \tau
\quad \text{vs.} \quad
H_A: \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] > \tau, \qquad (4.2)
\]
with $H_0$ representing the null hypothesis and $H_A$ the alternative.

Lemma 4.1.2. The following hold:

1. By choosing $f(t) = \frac12|t-1|$ (total variation distance), we get
\[
\mathbb{E}\big[ D_{\rm TV}\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] = \mathbb{E}\big[ |\hat\eta(X) - \eta(X)| \big].
\]

2. For the choice $f(t) = t\log t$ (KL divergence), we get
\[
\mathbb{E}\big[ D_{\rm KL}\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] = \mathrm{CE}(\hat\eta) - \mathrm{CE}(\eta),
\]
where for a model $\hat\eta(x): \mathcal{X} \to [0,1]$, its cross-entropy loss is given by
\[
\mathrm{CE}(\hat\eta) = -\mathbb{E}\big[ \eta(X)\log\hat\eta(X) + (1-\eta(X))\log(1-\hat\eta(X)) \big].
\]
The minimum cross-entropy loss is achieved when $\hat\eta = \eta$.

3. For the choice $f(t) = (\sqrt{t}-1)^2$ (Hellinger distance), we get
\[
\mathbb{E}\big[ D_{\rm H}\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big]
= \mathbb{E}\Big[ \big( \sqrt{\eta(X)} - \sqrt{\hat\eta(X)} \big)^2 + \big( \sqrt{1-\eta(X)} - \sqrt{1-\hat\eta(X)} \big)^2 \Big].
\]

The proof of Lemma 4.1.2 is given in Section C.1.1. It is worth noting that by considering different $f$, our framework allows for a variety of distance measures between the oracle model $\eta$ and the estimated model $\hat\eta$. We conclude this section with two definitions that will be used later, in Section 4.3.

Definition 4.1.3 (subdifferential). For a convex function $f: \mathbb{R}\to\mathbb{R}$, the subdifferential $\partial f(t)$ at a point $t$ is given by the following set of real values:
\[
\partial f(t) = \big\{ u \in \mathbb{R} : f(s) - f(t) \ge u(s-t),\; \forall s \in \mathbb{R} \big\}.
\]
In addition, for differentiable $f$, we have $\partial f(t) = \{f'(t)\}$.

Definition 4.1.4 (conjugate dual). The conjugate dual function of $f: \mathbb{R}\to\mathbb{R}$ is defined as
\[
f^*(t) = \sup_{s\in\mathbb{R}} \big( st - f(s) \big).
\]
In addition, for convex lower semi-continuous $f$, we can write $f$ in terms of its conjugate dual as
\[
f(s) = \sup_{t\in\mathbb{R}} \big( st - f^*(t) \big).
\]
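The three identities of Lemma 4.1.2 can be checked numerically for a single pair of Bernoulli parameters. A small sketch, using the standard convention $D_f(P\|Q) = \sum_x Q(x)\, f(P(x)/Q(x))$; the probabilities $0.8$ and $0.6$ below are arbitrary illustration values:

```python
import math

def bern_f_divergence(f, eta, eta_hat):
    # D_f(Bern(eta) || Bern(eta_hat)) = sum over x of q(x) * f(p(x) / q(x))
    return (eta_hat * f(eta / eta_hat)
            + (1 - eta_hat) * f((1 - eta) / (1 - eta_hat)))

f_tv = lambda t: 0.5 * abs(t - 1)          # total variation
f_kl = lambda t: t * math.log(t)           # KL divergence
f_hel = lambda t: (math.sqrt(t) - 1) ** 2  # squared Hellinger

eta, eta_hat = 0.8, 0.6
tv = bern_f_divergence(f_tv, eta, eta_hat)
kl = bern_f_divergence(f_kl, eta, eta_hat)
hel = bern_f_divergence(f_hel, eta, eta_hat)

# cross-entropy of a predicted probability q against the true probability eta
ce = lambda q: -(eta * math.log(q) + (1 - eta) * math.log(1 - q))
```

Here `tv` equals $|\hat\eta-\eta| = 0.2$ (item 1), `kl` equals the excess cross-entropy $\mathrm{CE}(\hat\eta) - \mathrm{CE}(\eta)$ (item 2), and `hel` matches the squared-Hellinger expression of item 3.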
4.2 Distribution-free setting

In this section, we develop a new methodology for testing the null hypothesis in (4.2) without imposing any structure on the conditional law $\eta(x)$, the covariate distribution, or the predictive model $\hat\eta(x)$. In order to provide high-level intuition for the main idea of the procedure, we first introduce a sampling scheme which characterizes the $f$-divergence between the models $\eta$ and $\hat\eta$ as the conditional $f$-divergence between a sampled value and the $\mathrm{Unif}[0,1]$ distribution.

Proposition 4.2.1. For $(x,y) \sim \mathbb{P}$, let
\[
w = \begin{cases} u_1 \sim \mathrm{Unif}[0, \hat\eta(x)], & y = 1, \\ u_2 \sim \mathrm{Unif}[\hat\eta(x), 1], & y = 0. \end{cases} \qquad (4.3)
\]
Then we have
\[
\mathbb{E}_X\big[ D_f\big( \mathcal{L}_{W|X} \,\|\, \mathrm{Unif}([0,1]) \big) \big]
= \mathbb{E}_X\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big].
\]
As a special case, if $\hat\eta = \eta$ then $W \mid X \sim \mathrm{Unif}([0,1])$.

The proof of Proposition 4.2.1 is given in Section C.1.2. Proposition 4.2.1 implies that deviation of the sampled $w$ from the uniform distribution on $[0,1]$ can be counted as evidence for a large distance between $\eta$ and $\hat\eta$. In the next section we describe the formal process for constructing a set of statistics that will be used later to define the decision rules for testing the null hypothesis (4.2).

4.2.1 Test statistic

Algorithm 3 describes the procedure for constructing the test statistic $V_{n,L}$. The construction consists of two main steps:

Counterfeit sampling. For each data point $(x,y)$, we follow the procedure described in (4.3) and construct the sampled value $w$. We then construct randomizations $\widetilde W^{(1)}, \dots, \widetilde W^{(M)}$ from the uniform distribution on $[0,1]$.

Score and label. In this step, using a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$, we score the original sample, $T(x,w)$, and the corresponding values $T(x, \widetilde W^{(j)})$. The original data point $(x,y)$ is then labeled based on the relative location of $T(x,w)$ among the counterfeit values $\{T(x, \widetilde W^{(j)})\}_{j=1:M}$. The final output statistic $V_{n,L} \in \mathbb{R}^L$ records the count of each label among the whole $n$ samples. Note that the number of labels $L$ is an input parameter of Algorithm 3.
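The two steps above can be sketched in a few lines of code. This is a minimal illustration, not the dissertation's implementation: the logistic $\eta$ below is hypothetical and is used both as the truth and as $\hat\eta$, so the null holds exactly and the resulting label counts should be close to uniform; $M = 19$, $L = 5$ (so $K = 4$) are arbitrary:

```python
import math
import random

random.seed(1)

def grasp_statistic(data, eta_hat, score, M, L, rng=random):
    # Counterfeit sampling, scoring, and labeling (Algorithm 3 sketch).
    # Requires M + 1 = K * L for an integer K.
    assert (M + 1) % L == 0
    K = (M + 1) // L
    V = [0] * L
    for x, y in data:
        p = eta_hat(x)
        # sampling scheme (4.3)
        w = rng.uniform(0, p) if y == 1 else rng.uniform(p, 1)
        t = score(x, w)
        # rank of the original score among M counterfeit scores
        rank = 1 + sum(t >= score(x, rng.uniform(0, 1)) for _ in range(M))
        # label L_j satisfies (L_j - 1) * K + 1 <= R_j <= K * L_j
        V[(rank - 1) // K] += 1
    return V

eta = lambda x: 1 / (1 + math.exp(-x))   # hypothetical true conditional law
data = []
for _ in range(2000):
    x = random.gauss(0, 1)
    data.append((x, 1 if random.random() < eta(x) else 0))

V = grasp_statistic(data, eta, lambda x, w: w, M=19, L=5)
```

Under a perfect fit the rank is uniform on $\{1,\dots,M+1\}$, so `V` is multinomial with equal cell probabilities $1/L$; here each of the five entries should land near $400$.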
The construction of the test statistic $V_{n,L}$ is inspired by the PCR test proposed by [87], where a similar structure (counterfeit sampling, scoring, labeling) is used for a different task, namely testing conditional independence between variables. Also, in [87] the counterfeits are drawn from a conditional distribution (depending on the variables under test), while here the counterfeits are drawn from $\mathrm{Unif}([0,1])$, regardless of the features or labels.

Input: $n$ data points $(x_j,y_j) \in \mathcal{X}\times\{0,1\}$, the model $\hat\eta: \mathcal{X}\to[0,1]$, a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$, and integers $M, L \ge 1$ such that $M+1 = KL$ for some integer $K$.
Output: Test statistic $V_{n,L} = [V_1,\dots,V_L] \in \mathbb{R}^L$.

For $j \in \{1,2,\dots,n\}$:
- Let
\[
w_j = \begin{cases} u_1 \sim \mathrm{Unif}[0,\hat\eta(x_j)], & y_j = 1, \\ u_2 \sim \mathrm{Unif}[\hat\eta(x_j),1], & y_j = 0. \end{cases}
\]
- Draw $M$ i.i.d. samples $\widetilde W^{(1)}_j, \dots, \widetilde W^{(M)}_j$ from $\mathrm{Unif}[0,1]$.
- Use the score function $T$ to score the initial sample $(x_j,w_j)$ and its $M$ constructed counterfeits $\{(x_j,\widetilde W^{(1)}_j), \dots, (x_j,\widetilde W^{(M)}_j)\}$:
\[
T_j = T(x_j,w_j), \qquad \widetilde T^{(i)}_j = T(x_j,\widetilde W^{(i)}_j), \quad \text{for } i \in [M].
\]
- Let $R_j$ denote the rank of $T_j$ among $\{T_j, \widetilde T^{(1)}_j, \dots, \widetilde T^{(M)}_j\}$:
\[
R_j = 1 + \sum_{i=1}^{M} \mathbb{I}\big\{ T_j \ge \widetilde T^{(i)}_j \big\}.
\]
- Assign label $L_j \in \{1,2,\dots,L\}$ to sample $j$ if $(L_j-1)K+1 \le R_j \le K L_j$.

For $\ell \in \{1,2,\dots,L\}$:
- Let $V_\ell$ be the number of samples with label $\ell$, i.e., $V_\ell = |\{j \in \{1,\dots,n\} : L_j = \ell\}|$.

Algorithm 3: Construction of the GRASP test statistic (distribution-free setting)

4.2.2 Decision rule

We introduce two decision rules based on the statistic $V_{n,L} = [V_1,\dots,V_L]$ given by Algorithm 3. To this end, we first construct the following two test statistics that will be used later for the decision rules:
\[
U^{\rm finite}_\tau(V_{n,L}) = \min_{p\in\mathbb{R}^L} \; \frac{1}{n}\sum_{\ell=1}^{L} \frac{(V_\ell - n p_\ell)^2}{p_\ell + \frac{1}{L}}
\quad \text{s.t.} \quad p_\ell \ge 0 \;\; \forall \ell \in [L], \qquad \sum_{\ell=1}^{L} p_\ell = 1, \qquad \frac{1}{L}\sum_{\ell=1}^{L} f(L p_\ell) \le \tau,
\]
\[
U^{\rm asym}_\tau(V_{n,L}) = \min_{p\in\mathbb{R}^L} \; \frac{1}{n}\sum_{\ell=1}^{L} \frac{(V_\ell - n p_\ell)^2}{p_\ell}
\quad \text{s.t.} \quad p_\ell \ge 0 \;\; \forall \ell \in [L], \qquad \sum_{\ell=1}^{L} p_\ell = 1, \qquad \frac{1}{L}\sum_{\ell=1}^{L} f(L p_\ell) \le \tau.
\]
(4.4)

Note that the choice of the function $f$ and the value of $\tau$ in the test statistics $U^{\rm asym}_\tau(V_{n,L})$ and $U^{\rm finite}_\tau(V_{n,L})$ are determined by the null hypothesis (4.2). We next consider the following two decision rules based on (4.4). At significance level $\alpha$, the first decision rule is given by
\[
\Phi^{\rm asym}_{n,L,\alpha,\tau} = \begin{cases} 1 \;(\text{reject } H_0), & \text{if } U^{\rm asym}_\tau(V_{n,L}) \ge \chi^2_{L-1}(1-\alpha), \\ 0 \;(\text{fail to reject } H_0), & \text{otherwise}. \end{cases} \qquad (4.5)
\]
The next decision rule is given by
\[
\Phi^{\rm finite}_{n,L,\alpha,\tau} = \begin{cases} 1 \;(\text{reject } H_0), & \text{if } U^{\rm finite}_\tau(V_{n,L}) \ge L + \sqrt{\frac{2L}{\alpha}}, \\ 0 \;(\text{fail to reject } H_0), & \text{otherwise}. \end{cases} \qquad (4.6)
\]
The names asym and finite reflect the fact, shown later, that these rules have asymptotic and finite-sample control guarantees for the type I error, respectively. As we show in the next lemma, the asymptotic rule is less conservative than the finite rule.

Proposition 4.2.2. For $L$ sufficiently large, the asymptotic decision rule (4.5) rejects more often than the finite decision rule (4.6). More precisely, for $L \ge 55$ and every $\alpha \in (0,1)$, if $\Phi^{\rm finite}_{n,L,\alpha,\tau} = 1$ then $\Phi^{\rm asym}_{n,L,\alpha,\tau} = 1$.

We refer to Section C.1.3 for the proof of Proposition 4.2.2. We conclude this section by providing some insight behind the test statistics. Consider the case of perfect alignment ($\tau = 0$), for which we showed that the $w_j$ are uniform on $[0,1]$. Therefore, under the null, each sample is identically distributed as its counterfeits, and its label, assigned in the last step of Algorithm 3, follows the $\mathrm{multi}(L)$ distribution. The test statistics in (4.4) then reduce to Pearson's chi-square test statistic. When $\tau > 0$, we will show in Theorem 4.2.5 that the labels follow a multinoulli distribution $\mathrm{multi}(p_1,\dots,p_L)$ which should be close to the uniform multinoulli distribution $\mathrm{multi}(L)$ in the sense that $\frac{1}{L}\sum_{\ell=1}^{L} f(L p_\ell) \le \tau$.
However, the nominal probabilities $p_\ell$ are unknown, and so in our construction of the test statistics (4.4) we form an optimization problem over $p_\ell$ to impose this constraint, and then consider the minimum Pearson chi-square value over the feasible probability vectors $(p_1,\dots,p_L)$.

4.2.3 Size of the test

We will show that the proposed decision rules control the size of the test at the pre-assigned level $\alpha$. We first state the following technical assumption, which posits density functions for specific conditional laws of $T(X,W)$ and $T(X,\widetilde W)$.

Assumption 4.2.3. Consider the following conditional cumulative distribution functions:
\[
F_o(t;x) = \mathbb{P}\big( T(X,W) \le t \mid X = x \big), \qquad F_c(t;x) = \mathbb{P}\big( T(X,\widetilde W) \le t \mid X = x \big),
\]
where $(X,W)$ is given by (4.3), and $\widetilde W$ is drawn independently from $\mathrm{Unif}[0,1]$. Assume that the density functions of these cdfs exist, and denote them by $f_o(t;x)$ and $f_c(t;x)$. In addition, a new measure on the interval $[0,1]$ can be defined with cdf $\omega(u;x) := F_o(F_c^{-1}(u;x);x)$. Suppose that this measure is absolutely continuous with respect to the Lebesgue measure, and denote its density function (Radon–Nikodym derivative) by $\psi(u;x)$.

Proposition 4.2.4 (distribution-free). Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of Algorithm 3; then $V_{n,L}$ has a multinomial distribution with $L$ outcomes and nominal probabilities $p_1,\dots,p_L$. In addition, under Assumption 4.2.3 the following holds:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)
\le \mathbb{E}\Big[ D_f\big( \mathcal{L}(T(X,W)\mid X) \,\|\, \mathcal{L}(T(X,\widetilde W)\mid X) \big) \Big]
\le \mathbb{E}\Big[ D_f\big( \mathcal{L}_{W|X} \,\|\, \mathcal{L}_{\widetilde W|X} \big) \Big]
= \mathbb{E}\Big[ D_f\big( \mathcal{L}_{W|X} \,\|\, \mathrm{Unif}([0,1]) \big) \Big].
\]
The proof of this proposition is given in Section C.1.4. It is easy to observe that the second inequality in Proposition 4.2.4 becomes an equality when using the score function $T(x,w) = w$. The next theorem is an immediate consequence of Propositions 4.2.1 and 4.2.4.

Theorem 4.2.5 (distribution-free). Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of Algorithm 3, which has a multinomial distribution with nominal probabilities $p_1,\dots,p_L$ as per Proposition 4.2.4.
Then for a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$ satisfying Assumption 4.2.3, the following upper bound holds:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big].
\]
Note that by the definition of $f$-divergence for discrete distributions we have
\[
D_f\big( \mathrm{multi}(p_1,\dots,p_L) \,\|\, \mathrm{multi}(L) \big) = \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell).
\]
Therefore, Theorem 4.2.5 implies that under the null hypothesis (4.2), the $f$-divergence between the multinoulli distribution $\mathrm{multi}(p_1,\dots,p_L)$ and the uniform multinoulli distribution $\mathrm{multi}(L)$ should be bounded by $\tau$. As can be observed from (4.4), our decision rules are based on optimization problems over probability vectors $\{p_\ell\}$, which minimize a chi-squared-type statistic subject to the constraint $\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le \tau$. In other words, by virtue of Theorem 4.2.5, we can reduce the null hypothesis regarding the quantity of interest $\mathbb{E}[D_f(\mathrm{Bern}(\eta(X))\|\mathrm{Bern}(\hat\eta(X)))]$ to a hypothesis on the quantity $\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)$, which is more amenable to testing.

Algorithm 4 outlines a simple version of Algorithm 3 with the score function $T(x,w) = w$. In this case, the scores of the counterfeits are uniform random variables on $[0,1]$, and we label the original scores simply by partitioning the $[0,1]$ interval into $L$ subintervals of equal size. It is worth noting that the second inequality of Proposition 4.2.4 becomes an equality for the choice of score function $T(x,w) = w$. At first glance, this would be expected to always result in higher power compared to other choices of $T$. This argument is not valid, though, since the gap in the first inequality of Theorem 4.2.5 is undetermined, and so we keep both versions (the score function $T(x,w) = w$, and a general score function) in this chapter.

The next result provides control over the size of our test with the two decision rules (4.6) and (4.5), in the finite-sample and asymptotic settings.

Input: $n$ data points $(x_j,y_j) \in \mathcal{X}\times\{0,1\}$, the model $\hat\eta: \mathcal{X}\to[0,1]$, and an integer $L \ge 1$.
Output: Test statistic $V_{n,L} = [V_1,\dots,V_L] \in \mathbb{R}^L$.
For $j \in \{1,2,\dots,n\}$:
- Let
\[
w_j = \begin{cases} u_1 \sim \mathrm{Unif}[0,\hat\eta(x_j)], & y_j = 1, \\ u_2 \sim \mathrm{Unif}[\hat\eta(x_j),1], & y_j = 0. \end{cases}
\]
- Assign label $L_j \in \{1,2,\dots,L\}$ to sample $j$ if $\frac{L_j-1}{L} \le w_j \le \frac{L_j}{L}$.

For $\ell \in \{1,2,\dots,L\}$:
- Let $V_\ell$ be the number of samples with label $\ell$, i.e., $V_\ell = |\{j \in \{1,\dots,n\} : L_j = \ell\}|$.

Algorithm 4: GRASP test statistic (distribution-free setting, with the score function $T(x,w) = w$)

Theorem 4.2.6. Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of either Algorithm 3 or Algorithm 4. Consider the decision rules $\Phi^{\rm asym}_{n,L,\alpha,\tau}$ and $\Phi^{\rm finite}_{n,L,\alpha,\tau}$, given respectively in (4.5) and (4.6), with the test statistics $U^{\rm asym}_\tau(V_{n,L})$ and $U^{\rm finite}_\tau(V_{n,L})$ as per (4.4). Under the null hypothesis (4.2), we have:
\[
\mathbb{P}\big( \Phi^{\rm finite}_{n,L,\alpha,\tau} = 1 \big) \le \alpha, \qquad \limsup_{n\to\infty} \mathbb{P}\big( \Phi^{\rm asym}_{n,L,\alpha,\tau} = 1 \big) \le \alpha.
\]
We refer to Section C.1.5 for the proof of Theorem 4.2.6. This implies that deploying the decision rule $\Phi^{\rm finite}_{n,L,\alpha,\tau}$ controls the type I error at level $\alpha$ for every finite $n$. In addition, the decision rule $\Phi^{\rm asym}_{n,L,\alpha,\tau}$ has asymptotic control of the type I error at significance level $\alpha$. Let us stress that the result of Theorem 4.2.6 is valid for every choice of the parameters $K, L$ and score function $T$.

4.2.4 P-values and one-sided confidence intervals

Considering the definition of the rules $\Phi^{\rm asym}_{n,L,\alpha,\tau}$ and $\Phi^{\rm finite}_{n,L,\alpha,\tau}$ and the results of Theorem 4.2.6, we construct the following $p$-values for the hypothesis testing problem (4.2):
\[
p^{\rm finite}_{n,L,\tau} = \begin{cases} 1, & U^{\rm finite}_\tau(V_{n,L}) \le L, \\[2pt] 1 \wedge \dfrac{2L}{\big( U^{\rm finite}_\tau(V_{n,L}) - L \big)^2}, & \text{otherwise}, \end{cases}
\qquad
p^{\rm asym}_{n,L,\tau} = 1 - F_{L-1}\big( U^{\rm asym}_\tau(V_{n,L}) \big),
\]
where $F_{L-1}(t)$ denotes the cdf of a chi-squared random variable with $L-1$ degrees of freedom. Super-uniformity of these $p$-values under the null hypothesis follows simply from Theorem 4.2.6. Formally, for every $t \in [0,1]$ we have
\[
\mathbb{P}\big( p^{\rm finite}_{n,L,\tau} \le t \big) \le t, \quad \forall n, L \ge 1, \qquad
\limsup_{n\to\infty} \mathbb{P}\big( p^{\rm asym}_{n,L,\tau} \le t \big) \le t, \quad \forall L \ge 1.
\]
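For the special case $\tau = 0$, any strictly convex $f$ forces $p_\ell = 1/L$ in the constraint of (4.4), so both statistics reduce to a Pearson-type chi-square with uniform expected counts, and the rules and $p$-values take closed forms. The sketch below is illustrative only: the label counts are made up, $L = 5$ and $\alpha = 0.1$ are arbitrary, and the chi-square quantile $\chi^2_4(0.9) \approx 7.779$ is hard-coded to keep the snippet dependency-free (at the uniform point, the finite-sample denominator $p_\ell + 1/L$ becomes $2/L$):

```python
import math

def u_tau0(V, finite=False):
    # (4.4) at tau = 0: the only feasible vector is uniform, p_l = 1/L
    L, n = len(V), sum(V)
    denom = 2 / L if finite else 1 / L       # p_l + 1/L  vs  p_l
    return sum((v - n / L) ** 2 for v in V) / (n * denom)

def p_value_finite(u, L):
    # p^finite: 1 if U <= L, otherwise min(1, 2L / (U - L)^2)
    return 1.0 if u <= L else min(1.0, 2 * L / (u - L) ** 2)

def chi2_cdf_4df(x):
    # closed-form chi-square cdf with 4 degrees of freedom
    return 1 - math.exp(-x / 2) * (1 + x / 2)

V = [210, 195, 188, 204, 203]        # hypothetical label counts, n = 1000
u_asym = u_tau0(V)                   # = 294 / 200 = 1.47
u_finite = u_tau0(V, finite=True)    # = 294 / 400 = 0.735

alpha, L = 0.1, 5
reject_asym = u_asym >= 7.779                              # rule (4.5)
reject_finite = u_finite >= L + math.sqrt(2 * L / alpha)   # rule (4.6): 5 + 10 = 15
p_asym = 1 - chi2_cdf_4df(u_asym)
```

With these (nearly uniform) counts, neither rule rejects and the asymptotic $p$-value is large, as expected under a well-calibrated model.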
We next use the duality between confidence intervals and hypothesis tests to construct a one-sided confidence interval for the quantity of interest $\mathbb{E}_X[D_f(\mathrm{Bern}(\eta(X))\|\mathrm{Bern}(\hat\eta(X)))]$. The intuition behind this construction comes from the fact that, for a fixed value of $V_{n,L}$, the test statistics $U^{\rm asym}_\tau(V_{n,L})$ and $U^{\rm finite}_\tau(V_{n,L})$ are nonincreasing in $\tau$.

Proposition 4.2.7. For $\alpha \in (0,1)$, let
\[
\tau^{\rm finite}_{n,L,\alpha} = \sup\Big\{ \tau \ge 0 : U^{\rm finite}_\tau(V_{n,L}) \ge L + \sqrt{2L/\alpha} \Big\}, \qquad
\tau^{\rm asym}_{n,L,\alpha} = \sup\Big\{ \tau \ge 0 : U^{\rm asym}_\tau(V_{n,L}) \ge \chi^2_{L-1}(1-\alpha) \Big\}.
\]
Then the following hold:
\[
\mathbb{P}\Big( \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] \ge \tau^{\rm finite}_{n,L,\alpha} \Big) \ge 1-\alpha,
\qquad
\liminf_{n\to\infty} \mathbb{P}\Big( \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big] \ge \tau^{\rm asym}_{n,L,\alpha} \Big) \ge 1-\alpha.
\]
Note that the probabilities in the above equations are with respect to the randomness in $\tau^{\rm finite}_{n,L,\alpha}$ and $\tau^{\rm asym}_{n,L,\alpha}$, which stems from the randomness in $V_{n,L}$. We refer to Section C.1.7 for the proof of Proposition 4.2.7.

4.2.5 Choice of the score function

In this section, we provide some insight into the choice of the score function $T$ in Algorithm 3. First, from Theorem 4.2.5 we have
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big].
\]
Note that in Algorithm 3 we test for $\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le \tau$ (which holds under the null), and therefore the statistical power of our method depends in the first place on the gap between the quantity of primary interest $\mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big]$ and $\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)$.

We next explore scenarios under which the gaps in the chain of inequalities presented in Proposition 4.2.4 can be tightened. Recall the chain of inequalities, summarized below:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)
\le \mathbb{E}\Big[ D_f\big( \mathcal{L}(T(X,W)\mid X) \,\|\, \mathcal{L}(T(X,\widetilde W)\mid X) \big) \Big]
\le \mathbb{E}\Big[ D_f\big( \mathcal{L}_{W|X} \,\|\, \mathcal{L}_{\widetilde W|X} \big) \Big]
= \mathbb{E}\Big[ D_f\big( \mathcal{L}_{W|X} \,\|\, \mathrm{Unif}([0,1]) \big) \Big]
= \mathbb{E}_X\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big], \qquad (4.7)
\]
where the last step is proved in Proposition 4.2.1. Our goal is to develop a score function that narrows the gap between the left-hand and right-hand sides in the above chain of inequalities.
Note that the second inequality becomes tight when $T(x,w) = w$, the simple score function outlined in Algorithm 4. While it is a straightforward choice, it does not account for possible dependence between $x$ and $w$, and it is also not clear how this choice impacts the first inequality. To examine the first inequality, we consider the limiting case $K, L \to \infty$ and ask whether the gap in the first inequality vanishes in this asymptotic regime. Our next result answers this question in the negative. In contrast, we later answer the analogous question in the positive for the model-X setup; further details can be found in Proposition 4.3.3.

Proposition 4.2.8. Recall the density function $\psi(u;X)$ from Assumption 4.2.3. Consider the setup of Proposition 4.2.4 with the following two additional assumptions: (i) for a positive value $C$ we have $\psi(u;X) \le C$ a.s. over $[0,1]$, and (ii) $\psi(u;X)$ is differentiable on $(0,1)$ and there exists a positive constant $B$ such that a.s. we have $\big| \frac{\partial}{\partial u}\psi(u;X) \big| \le B$. Then the following holds as $K, L$ grow to infinity:
\[
\lim_{L\to\infty}\lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)
= \int_0^1 f\big( \mathbb{E}_X[\psi(u;X)] \big)\,du
\le \mathbb{E}_X\Big[ \int_0^1 f\big( \psi(u;X) \big)\,du \Big]
= \mathbb{E}_X\Big[ D_f\big( \mathcal{L}(T(X,W)\mid X) \,\|\, \mathcal{L}(T(X,\widetilde W)\mid X) \big) \Big]. \qquad (4.8)
\]
We refer to Section C.1.8 for the proof of Proposition 4.2.8. The above proposition highlights an important issue: even with $K, L \to \infty$, we can still have a gap in the first inequality in (4.7) for any strictly convex divergence function $f$, no matter the choice of score function $T$. One can try to minimize this gap over $T$, but apart from being a challenging question, it is not clear how this choice would impact the second inequality in (4.7).

In Section 4.4, Experiment 4.4.2, we evaluate the performance of our test using $T(x,w) = w$. We repeat the same numerical study in Experiment 4.4.3 using another score function which also depends on $x$: concretely, we regress $w$ on $x$ and take the absolute residual $T(x,w) = |w - x^{\sf T}\hat\theta|$.
As we see, the former choice of $T(x,w)$ slightly outperforms the regression-type one in power, for different choices of $f$-divergence. These experiments indicate that the simple choice $T(x,w) = w$ can be competitive, and that in general more complicated score functions which take into account the dependence between $w$ and $x$ may achieve lower power.

4.2.6 Testing perfect fit of the model

By setting $\tau = 0$ in hypothesis (1.12), our framework allows us to test whether the model at hand, $\hat\eta$, is a perfect fit. We next provide a modification of our test tailored to this special case, which has demonstrated higher power in our numerical experiments. The main adjustment to the framework is that the score functions are now defined at the dataset level, rather than the sample level. Let $\mathbf{x}, \mathbf{w}$ respectively stand for $\{x_i\}_{i=1:n}$, $\{w_i\}_{i=1:n}$, with the sampling process for $w_i$ described in (4.3). In addition, for $j \in [M]$ we define $\mathbf{w}_j$ to be a vector of size $n$ with i.i.d. samples drawn from $\mathrm{Unif}([0,1])$. By Proposition 4.2.1, under the null we have $W \mid X \sim \mathrm{Unif}([0,1])$, and therefore $T(\mathbf{x},\mathbf{w}), T(\mathbf{x},\mathbf{w}_1), \dots, T(\mathbf{x},\mathbf{w}_M)$ are exchangeable. Using this observation, we construct the following $p$-value for the null:
\[
p = \frac{1 + \sum_{j=1}^{M} \mathbb{I}\big( T(\mathbf{x},\mathbf{w}) \ge T(\mathbf{x},\mathbf{w}_j) \big)}{M+1}. \qquad (4.9)
\]
Using the exchangeability property, in the next proposition we show that the above $p$-value is super-uniform under the null, and hence we have control on the size of the test.

The above construction of the $p$-value is inspired by the Conditional Randomization Test (CRT) [29]. Although the choice of the score function $T$ is flexible, it is recommended to use a score function that captures the variation between $\mathbf{w}$ and $\mathbf{x}$, so that $T(\mathbf{x},\mathbf{w})$ is smaller than most of its counterfeits $T(\mathbf{x},\mathbf{w}_j)$, resulting in a small $p$-value under the alternative hypothesis. One simple choice is to use the residual (e.g., mean squared error) of a linear regression model when regressing the variable $\mathbf{w}$ on $\mathbf{x}$.
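This dataset-level construction with the simple regression-residual score can be sketched as follows. Everything here is a hypothetical toy setup (a one-dimensional covariate, a logistic true law, and a deliberately misspecified constant model), not the dissertation's experiment:

```python
import math
import random

random.seed(2)

def perfect_fit_pvalue(dataset, eta_hat, M=199, rng=random):
    # p-value (4.9) with the dataset-level score T(x, w) = residual MSE of
    # regressing w on x (one-dimensional least squares)
    xs, ws = [], []
    for x, y in dataset:
        p = eta_hat(x)
        xs.append(x)
        ws.append(rng.uniform(0, p) if y == 1 else rng.uniform(p, 1))

    def residual_mse(w):
        n = len(w)
        mx, mw = sum(xs) / n, sum(w) / n
        sxx = sum((xi - mx) ** 2 for xi in xs)
        sxw = sum((xi - mx) * (wi - mw) for xi, wi in zip(xs, w))
        b = sxw / sxx
        return sum((wi - mw - b * (xi - mx)) ** 2
                   for xi, wi in zip(xs, w)) / n

    t_obs = residual_mse(ws)
    # counterfeit datasets: n i.i.d. Unif[0,1] scores with the same covariates
    count = sum(t_obs >= residual_mse([rng.uniform(0, 1) for _ in xs])
                for _ in range(M))
    return (1 + count) / (M + 1)

eta = lambda x: 1 / (1 + math.exp(-2 * x))   # hypothetical true law
dataset = []
for _ in range(400):
    x = random.gauss(0, 1)
    dataset.append((x, 1 if random.random() < eta(x) else 0))

p_bad = perfect_fit_pvalue(dataset, lambda x: 0.5)   # misspecified constant model
```

Under the constant model $\hat\eta \equiv 1/2$, the sampled $\mathbf{w}$ inherits the $x$-dependence of the labels, so its regression residual is visibly smaller than that of the uniform counterfeits and `p_bad` comes out near its minimum value $1/(M+1)$; swapping in the true $\eta$ would instead yield an approximately uniform $p$-value.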
Alternatively, more complex predictive methods such as the Lasso, random forests, or neural networks can be used to fit a model, with the residual of the fitted model reported as the score value.

Our next proposition shows the super-uniformity of the $p$-value (4.9) under the null; its proof is deferred to Section C.1.9.

Proposition 4.2.9. Under the null hypothesis (4.2) with $\tau = 0$, the $p$-value (4.9) is super-uniform, i.e., for every $t \in [0,1]$ we have $\mathbb{P}(p \le t) \le t$.

In Example 4.4.5, we evaluate the performance of our test in a setting where the size of the test data is significantly smaller than the training size, and is comparable to the feature dimension. The modification made in defining the test score at the dataset level allows us to achieve non-trivial power even for test data of small size. For the score function, we regress $\mathbf{w}$ on $\mathbf{x}$ using a three-layer neural network and define the score function as the mean squared error of this model across the dataset. As we discuss, although the test error is not an indicative measure of goodness of fit, our test returns a significantly smaller $p$-value for the model that is closer to the underlying conditional law. We refer to Example 4.4.5 for further details.

4.3 Model-X setting

In many applications, we may have access to a large amount of unsupervised data (covariate data without corresponding labels) in addition to a limited number of labeled observations. Motivated by this trend, [29] proposed the model-X setup, which assumes no knowledge of the conditional distribution of labels $Y \mid X$, but assumes that the joint distribution of the covariates is known, or can be well approximated. The model-X setup is also relevant in experimental design, where we control the covariate distribution and so it is known to us, e.g., in randomized controlled trials, as well as in sensitivity analysis of quantitative models [155].
This modeling assumption has paved the way to addressing several statistical problems which are open or notoriously hard otherwise. For example, [29] used this setup to extend the knockoff framework of [7] to the high-dimensional regime and provided a methodology for variable selection with controlled false discovery rate in this regime. It also develops a conditional randomization test using the knowledge of the covariate distribution (see [176, 21, 87] for other related work on this topic).

Using similar ideas as in GRASP, in this section we propose model-X GRASP, which uses the covariate distribution to improve the power in flagging the differences between the estimated model $\hat\eta$ and the oracle model $\eta$. Our next proposition is analogous to Proposition 4.2.1 and extends it to the model-X setup.

Proposition 4.3.1. For $(x,y) \sim \mathbb{P}$, let
\[
w = \begin{cases} u_1 \sim \mathrm{Unif}[0, \hat\eta(x)], & y = 1, \\ u_2 \sim \mathrm{Unif}[\hat\eta(x), 1], & y = 0. \end{cases} \qquad (4.10)
\]
Then the following holds:
\[
D_f\big( \mathcal{L}_{X,W} \,\|\, \mathbb{P}_X \times \mathrm{Unif}([0,1]) \big) = \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big].
\]
The proof of Proposition 4.3.1 is given in Section C.1.6. Proposition 4.3.1 implies that the distance of interest $\mathbb{E}[D_f(\mathrm{Bern}(\eta(X))\|\mathrm{Bern}(\hat\eta(X)))]$ can be seen as the $f$-divergence between the multivariate random variables $(X,W)$ and $(\widetilde X, \widetilde W)$, with $X, \widetilde X \sim \mathbb{P}_X$ independently and $\widetilde W \sim \mathrm{Unif}([0,1])$.

4.3.1 Test statistic and decision rule

Algorithm 5 describes the procedure for constructing the statistic $V_{n,L}$ under the model-X setup for testing the null hypothesis in (4.2). The overall procedure is similar to the distribution-free counterpart proposed in Algorithm 3. The major difference is that here the counterfeits are of the form $(\widetilde x, \widetilde W)$, with the covariate component $\widetilde x$ drawn from $\mathbb{P}_X$ and $\widetilde W$ drawn from $\mathrm{Unif}[0,1]$. In contrast, in the distribution-free version (cf. Algorithm 3) the covariate $x$ was fixed for an original sample and its counterfeits.
Input: $n$ data points $(x_j,y_j) \in \mathcal{X}\times\{0,1\}$, the model $\hat\eta: \mathcal{X}\to[0,1]$, a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$, and integers $M, L \ge 1$ such that $M+1 = KL$ for some integer $K$.
Output: Test statistic $V_{n,L} = [V_1,\dots,V_L] \in \mathbb{R}^L$.

For $j \in \{1,2,\dots,n\}$:
- Let
\[
w_j = \begin{cases} u_1 \sim \mathrm{Unif}[0,\hat\eta(x_j)], & y_j = 1, \\ u_2 \sim \mathrm{Unif}[\hat\eta(x_j),1], & y_j = 0. \end{cases}
\]
- Draw $M$ i.i.d. samples $\widetilde W^{(1)}_j, \dots, \widetilde W^{(M)}_j$ from $\mathrm{Unif}[0,1]$, and $\widetilde x^{(1)}_j, \dots, \widetilde x^{(M)}_j$ from $\mathbb{P}_X$.
- Use the score function $T$ to score the initial sample $(x_j,w_j)$ and its $M$ constructed counterfeits $\{(\widetilde x^{(1)}_j, \widetilde W^{(1)}_j), \dots, (\widetilde x^{(M)}_j, \widetilde W^{(M)}_j)\}$:
\[
T_j = T(x_j,w_j), \qquad \widetilde T^{(i)}_j = T\big(\widetilde x^{(i)}_j, \widetilde W^{(i)}_j\big), \quad \text{for } i \in [M].
\]
- Let $R_j$ denote the rank of $T_j$ among $\{T_j, \widetilde T^{(1)}_j, \dots, \widetilde T^{(M)}_j\}$:
\[
R_j = 1 + \sum_{i=1}^{M} \mathbb{I}\big\{ T_j \ge \widetilde T^{(i)}_j \big\}.
\]
- Assign label $L_j \in \{1,2,\dots,L\}$ to sample $j$ if $(L_j-1)K+1 \le R_j \le K L_j$.

For $\ell \in \{1,2,\dots,L\}$:
- Let $V_\ell$ be the number of samples with label $\ell$, i.e., $V_\ell = |\{j \in \{1,\dots,n\} : L_j = \ell\}|$.

Algorithm 5: Construction of the GRASP test statistic (model-X setting)

Decision rule. Let $V_{n,L}$ be the statistic returned by Algorithm 5. We construct the test statistics $U^{\rm asym}_\tau(V_{n,L})$ and $U^{\rm finite}_\tau(V_{n,L})$ following the same formulation as in (4.4). The obtained test statistics are then used in (4.6) and (4.5) to give the decision rules $\Phi^{\rm finite}_{n,L,\alpha,\tau}$ and $\Phi^{\rm asym}_{n,L,\alpha,\tau}$. In the next section, we show that these rules control the size of our test ($\Phi^{\rm finite}_{n,L,\alpha,\tau}$ in finite-sample settings and $\Phi^{\rm asym}_{n,L,\alpha,\tau}$ in asymptotic regimes).

4.3.2 Size of the model-X GRASP test

We start with the following assumption, which posits a density function for the random variable $T(X,W)$. This is the unconditional version of Assumption 4.2.3, and is used in our analysis of the test in the model-X setting.

Assumption 4.3.2. Consider the following cumulative distribution functions:
\[
F_o(t) = \mathbb{P}\big( T(X,W) \le t \big), \qquad F_c(t) = \mathbb{P}\big( T(\widetilde X, \widetilde W) \le t \big),
\]
where $x, \widetilde x \sim \mathbb{P}_X$ independently, $w$ is given by (4.10), and $\widetilde W \sim \mathrm{Unif}[0,1]$.
Assume that the density functions of the cdfs $F_o$ and $F_c$ exist, and denote them by $f_o(t)$ and $f_c(t)$, respectively. In addition, consider a measure on the interval $[0,1]$ defined by the cdf $\omega(u) := F_o(F_c^{-1}(u))$. Suppose that this measure is absolutely continuous with respect to the Lebesgue measure, and denote its density function (Radon–Nikodym derivative) by $\psi(u)$.

In our next result we show that the $f$-divergence between the multinomial distribution of $V_{n,L}$ (the output of Algorithm 5) and the uniform multinomial distribution is bounded by the distance between the random variables $T(X,W)$ and $T(\widetilde X, \widetilde W)$.

Proposition 4.3.3. Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of Algorithm 5; then $V_{n,L}$ has a multinomial distribution with $L$ outcomes and nominal probabilities $p_1,\dots,p_L$. Under the setting of Assumption 4.3.2, for a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$, the following holds:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le D_f\big( \mathcal{L}(T(X,W)) \,\|\, \mathcal{L}(T(\widetilde X,\widetilde W)) \big).
\]
In addition, if the function $\psi(u)$ from Assumption 4.3.2 is continuous, the above inequality becomes an equality as $K, L$ grow to infinity:
\[
\lim_{L\to\infty}\lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) = D_f\big( \mathcal{L}(T(X,W)) \,\|\, \mathcal{L}(T(\widetilde X,\widetilde W)) \big).
\]
The proof of this proposition is given in Section C.1.10. The next theorem follows from the combination of Propositions 4.3.1 and 4.3.3, along with the data processing inequality for $f$-divergences.

Theorem 4.3.4 (model-X). Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of Algorithm 5, which has a multinomial distribution with nominal probabilities $p_1,\dots,p_L$ as per Proposition 4.3.3. Then for a score function $T: \mathcal{X}\times[0,1]\to\mathbb{R}$ satisfying Assumption 4.3.2, the following holds:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) \le \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big].
\]
The proof of this theorem is given in Section C.1.11. The result of this theorem allows us to perform statistical inference on the complex expression $\mathbb{E}[D_f(\mathrm{Bern}(\eta(X))\|\mathrm{Bern}(\hat\eta(X)))]$ by focusing on the deviation of $V_{n,L}$ from the uniform multinomial distribution, similarly to the distribution-free version.
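The only change relative to the distribution-free construction is that each counterfeit also redraws its covariate from $\mathbb{P}_X$. A minimal sketch of Algorithm 5 under a perfect fit (the logistic $\eta$ and the score $T(x,w) = w(1+x)$ are hypothetical illustration choices):

```python
import math
import random

random.seed(4)

def modelx_grasp(data, eta_hat, score, sample_x, M, L, rng=random):
    # Algorithm 5 sketch: counterfeits (x~, w~) with x~ ~ P_X, w~ ~ Unif[0,1]
    K = (M + 1) // L
    V = [0] * L
    for x, y in data:
        p = eta_hat(x)
        w = rng.uniform(0, p) if y == 1 else rng.uniform(p, 1)
        t = score(x, w)
        rank = 1 + sum(t >= score(sample_x(), rng.uniform(0, 1))
                       for _ in range(M))
        V[(rank - 1) // K] += 1
    return V

eta = lambda x: 1 / (1 + math.exp(-x))
data = []
for _ in range(1500):
    x = random.gauss(0, 1)
    data.append((x, 1 if random.random() < eta(x) else 0))

V_mx = modelx_grasp(data, eta, lambda x, w: w * (1 + x),
                    lambda: random.gauss(0, 1), M=19, L=5)
```

When $\hat\eta = \eta$, Proposition 4.3.1 gives $\mathcal{L}_{X,W} = \mathbb{P}_X \times \mathrm{Unif}([0,1])$, the same law as the counterfeits, so the ranks are uniform and `V_mx` should again be close to uniform over the $L$ bins.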
We are now ready to show that the two decision rules (4.6) and (4.5) control the size of our test in the model-X setup, respectively in the finite-sample and asymptotic regimes.

Theorem 4.3.5. Let $V_{n,L} = [V_1,\dots,V_L]$ be the output of Algorithm 5. Consider the decision rules $\Phi^{\rm asym}_{n,L,\alpha,\tau}$ and $\Phi^{\rm finite}_{n,L,\alpha,\tau}$, given respectively by (4.5) and (4.6), with the test statistics $U^{\rm asym}_\tau(V_{n,L})$ and $U^{\rm finite}_\tau(V_{n,L})$ in (4.4). Then, under the null hypothesis (4.2), the following holds:
\[
\mathbb{P}\big( \Phi^{\rm finite}_{n,L,\alpha,\tau} = 1 \big) \le \alpha, \qquad \limsup_{n\to\infty} \mathbb{P}\big( \Phi^{\rm asym}_{n,L,\alpha,\tau} = 1 \big) \le \alpha.
\]
The proof of this theorem follows along the same lines as the proof of Theorem 4.2.6.

4.3.3 Choice of the score function

In summary, our analysis in the previous section showed the following chain of inequalities:
\[
\frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell)
\le D_f\big( \mathcal{L}(T(X,W)) \,\|\, \mathcal{L}(T(\widetilde X,\widetilde W)) \big)
\le D_f\big( \mathcal{L}(X,W) \,\|\, \mathcal{L}(\widetilde X,\widetilde W) \big)
= \mathbb{E}\big[ D_f\big( \mathrm{Bern}(\eta(X)) \,\|\, \mathrm{Bern}(\hat\eta(X)) \big) \big], \qquad (4.11)
\]
where the first inequality follows from Proposition 4.3.3, the second one is an application of the data processing inequality, and the third line is the claim proved in Proposition 4.3.1. In order to increase the power of the test (make it less conservative), our guide is to choose score functions for which the gap between the leftmost and rightmost sides of the chain of inequalities is minimal.

For the first inequality, as shown in Proposition 4.3.3 (second part), the gap vanishes for large values of $L$ and $K$. Focusing on the second inequality, we seek score functions $T$ such that
\[
D_f\big( \mathcal{L}(T(X,W)) \,\|\, \mathcal{L}(T(\widetilde X,\widetilde W)) \big) = D_f\big( \mathcal{L}(X,W) \,\|\, \mathcal{L}(\widetilde X,\widetilde W) \big). \qquad (4.12)
\]
A trivial choice is $T(x,w) = (x,w)$; however, we would like scores with a one-dimensional range, since we need to compare the score values of the original samples and the counterfeits. We continue our discussion by providing brief background on the variational representation of $f$-divergences, with the following statement borrowed from [130, Lemma 1].

Lemma 4.3.6.
For two probability density functions $q, p$ on $\mathcal{X} \subset \mathbb{R}^d$ and a set of measurable functions $\mathcal{G}$, we have
\[
D_f(p \,\|\, q) \ge \sup_{\varphi\in\mathcal{G}} \int \big( \varphi p - f^*(\varphi) q \big)\,d\mu, \qquad (4.13)
\]
where $f^*$ is the conjugate dual of $f$. In addition, equality is achieved if the subdifferential $\partial f(p/q)$ contains an element of $\mathcal{G}$.

For the reader's convenience, we provide the proof of Lemma 4.3.6 in Section C.1.12.

Proposition 4.3.7. For two probability density functions $p, q$ over $\mathcal{X} \subset \mathbb{R}^d$ and a function $f$ as per Definition 4.1.1, let $\sigma \in \partial f(p/q)$. Then, for $X_p \sim p$ and $X_q \sim q$, the $f$-divergence between the distributions of $\sigma(X_p)$ and $\sigma(X_q)$ is equal to the $f$-divergence between the distributions $p$ and $q$. Formally,
\[
D_f\big( \mathcal{L}(\sigma(X_p)) \,\|\, \mathcal{L}(\sigma(X_q)) \big) = D_f(p \,\|\, q).
\]
The proof of this proposition is given in Section C.1.13.

Going back to our discussion on the choice of the score function $T(x,w)$, observe that by an application of Proposition 4.3.7, condition (4.12) is satisfied for
\[
T(x,w) \in \partial f\big( p_{X,W}(x,w) \,/\, p_{\widetilde X,\widetilde W}(x,w) \big).
\]
Since $\widetilde W \sim \mathrm{Unif}[0,1]$, independent from $\widetilde X \sim \mathbb{P}_X$, we have
\[
\frac{p_{X,W}(x,w)}{p_{\widetilde X,\widetilde W}(x,w)} = p_{W|X}(w \mid x).
\]
In addition, from the construction of $w$ (cf. Algorithm 5), we have
\[
\mathbb{P}(w \mid x) = \mathbb{P}(y=1 \mid x)\,\mathbb{P}(w \mid x, y=1) + \mathbb{P}(y=0 \mid x)\,\mathbb{P}(w \mid x, y=0)
= \frac{\eta(x)}{\hat\eta(x)} \mathbb{I}\big( w \le \hat\eta(x) \big) + \frac{1-\eta(x)}{1-\hat\eta(x)} \mathbb{I}\big( \hat\eta(x) \le w \big). \qquad (4.14)
\]
Putting things together, the optimal score function is given by
\[
T(x,w) \in \partial f\Big( \frac{\eta(x)}{\hat\eta(x)} \mathbb{I}\big( w \le \hat\eta(x) \big) + \frac{1-\eta(x)}{1-\hat\eta(x)} \mathbb{I}\big( \hat\eta(x) \le w \big) \Big).
\]
We next note that our test statistic in Algorithm 5 is based on the relative ranking of the score values, not their absolute values. For an increasing function $g$, the rank values obtained by the score functions $T$ and $g \circ T$ are the same. In particular, since $f$ is convex, its derivative is a non-decreasing function, and so we can simply consider the ratio of densities. When $f$ is not strictly convex, we may have ties in the ordering with respect to the above derivation. We use the following rule, which also breaks ties:
\[
T_{\rm opt}(x,w) = \frac{\eta(x)}{\hat\eta(x)} \mathbb{I}\big( w \le \hat\eta(x) \big) + \frac{1-\eta(x)}{1-\hat\eta(x)} \mathbb{I}\big( \hat\eta(x) \le w \big). \qquad (4.15)
\]
As can be seen, $T_{\rm opt}(x,w)$ involves the true model $\eta(x)$, which is unknown.
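For reference, the oracle score (4.15) is one line of code: it is exactly the conditional density (4.14), so for each fixed $x$ it is piecewise constant in $w$ and integrates to one over $[0,1]$. A sketch, with the constant probabilities $0.7$ and $0.4$ chosen arbitrarily for illustration:

```python
def t_opt(x, w, eta, eta_hat):
    # oracle likelihood-ratio score (4.15); requires the true eta(x)
    p, q = eta_hat(x), eta(x)
    return q / p if w <= p else (1 - q) / (1 - p)

def t_agnostic(x, w, eta_hat):
    # model-agnostic variant: the unknown eta(x) is replaced by 1/2
    return t_opt(x, w, lambda _: 0.5, eta_hat)

eta = lambda x: 0.7        # hypothetical true conditional probability
eta_hat = lambda x: 0.4    # hypothetical fitted model

# total mass of the piecewise-constant density (4.14) over w in [0, 1]:
# eta_hat * (eta / eta_hat) + (1 - eta_hat) * ((1 - eta) / (1 - eta_hat)) = 1
mass = (eta_hat(0) * t_opt(0, 0.2, eta, eta_hat)
        + (1 - eta_hat(0)) * t_opt(0, 0.9, eta, eta_hat))
```

The model-agnostic variant keeps the useful monotone structure of $T_{\rm opt}$, namely up-weighting $w$ on whichever side of $\hat\eta(x)$ is shorter, without needing $\eta$.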
To cope with this issue, we consider two routes:
• Model-agnostic approach: We use T_opt(x,w) with η(x) replaced by 1/2.
• GAN-based approach: We propose an optimization inspired by generative adversarial networks (GANs) to approximate the ratio of densities p_{X,W}(x,w)/p_{X̃,W̃}(x,w).

Our next subsection gives a brief background overview of GANs and the details of our GAN-based approach.

4.3.3.1 Generative Adversarial Nets (GANs)

The GAN framework was introduced by [65] to generate samples from the population underlying given data samples x ∈ X ⊂ R^d. This framework can be perceived as a game between a generator, which tries to learn the data distribution p_X and produce fake samples that are statistically close to the original data, and a discriminator, which tries to distinguish the generator's samples from the original data samples. It is a repeated game in which each player aims to improve her model with respect to her objective. Starting from a noise sample z ∈ Z ⊂ R^k with density function p_Z, the generator produces fake samples in the data space X via a mapping G(z; θ_g), where G : Z → X is a differentiable neural network with parameters θ_g. The discriminator, on the other side, has access to another neural network D(x; θ_d) that represents the likelihood of data x coming from the original data distribution p_X. This game can be cast as the following min-max optimization problem:

min_{θ_g} max_{θ_d} { E_X[log D(X; θ_d)] + E_Z[log(1 − D(G(Z; θ_g); θ_d))] }.    (4.16)

As evident from the above formulation, the discriminator aims to increase the likelihood of the original samples coming from p_X, while simultaneously lowering the likelihood of the fake samples G(z; θ_g) coming from the same distribution. The generator, on the other hand, aims to find a model G(z; θ_g) that generates samples with a high likelihood of coming from p_X. As shown in [65], the above optimization problem has a global optimum at p_g = p_X, where p_g stands for the distribution of samples G(z; θ_g) with z ∼ p_Z.
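To make the discriminator side of this game concrete, the following sketch trains a logistic-regression discriminator by stochastic gradient ascent with the generator held fixed. The linear discriminator and all parameters here are illustrative stand-ins for the deep networks used in practice, not the architecture of the method.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_discriminator(real, fake, steps=3000, beta=0.1):
    """Stochastic gradient ascent on the discriminator part of (4.16), with
    the generator held fixed: maximize log D(z_real) + log(1 - D(z_fake))
    for a logistic-regression discriminator D(z; theta) = sigmoid(theta . z)."""
    dim = len(real[0])
    theta = [0.0] * dim
    for _ in range(steps):
        zr, zf = random.choice(real), random.choice(fake)
        dr = sigmoid(sum(a * b for a, b in zip(theta, zr)))
        df = sigmoid(sum(a * b for a, b in zip(theta, zf)))
        for j in range(dim):
            # Gradient of log D(zr) + log(1 - D(zf)) in theta_j.
            theta[j] += beta * ((1.0 - dr) * zr[j] - df * zf[j])
    return theta

random.seed(0)
# Toy 1-d data plus an intercept coordinate: "real" samples are mean-shifted.
real = [(random.gauss(+1.0, 1.0), 1.0) for _ in range(500)]
fake = [(random.gauss(-1.0, 1.0), 1.0) for _ in range(500)]
theta = train_discriminator(real, fake)
d_real = sigmoid(theta[0] * 1.0 + theta[1])
d_fake = sigmoid(theta[0] * -1.0 + theta[1])
# The trained discriminator assigns a higher "real" likelihood near the real mean.
assert d_real > d_fake
```

The ratio D/(1 − D) of such a trained discriminator then serves as a monotone proxy for the density ratio between the two populations.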
We follow the GAN framework by considering L(X,W) as the original data distribution. In addition, we assume that the generator at each round produces the randomizations (x̃, w̃), and we update the discriminator to distinguish (x,w) from (x̃, w̃). The major distinction from the original GAN setup is that we drop the separate generator network, since the generated samples always come from L(X̃, W̃). Optimization problem (4.16) then reduces to:

max_{θ_d} { E_{(X,W)}[log D((X,W); θ_d)] + E_{(X̃,W̃)}[log(1 − D((X̃,W̃); θ_d))] }.    (4.17)

The next result connects the optimal discriminator (the solution of (4.17)) to the likelihood-ratio function. This proposition is borrowed from [65, Proposition 1].

Proposition 4.3.8. Let D*(x,w) be the maximizer of optimization problem (4.17). Then we have

D*(x,w) = p_{X,W}(x,w) / ( p_{X,W}(x,w) + p_{X̃,W̃}(x,w) ).

Since we only have one network, we only update the parameters of the discriminative network D. To this end, we adopt the same optimization procedure used in [65, 133], where at each step, given mini-batches of samples (x_1,w_1),...,(x_m,w_m) and (x̃_1,w̃_1),...,(x̃_m,w̃_m), we use the following stochastic gradient ascent update:

θ_d^{(t+1)} = θ_d^{(t)} + (β_t/m) Σ_{i=1}^{m} ∇_{θ_d}[ log D((x_i,w_i); θ_d^{(t)}) + log(1 − D((x̃_i,w̃_i); θ_d^{(t)})) ],

with step size β_t. The gradients of the above optimization problem can be efficiently computed using back-propagation algorithms. Let θ̂_d be the resulting estimate after convergence. Considering Proposition 4.3.8 and our discussion below Proposition 4.3.7, we consider the following score function:

T(x,w) = D((x,w); θ̂_d) / (1 − D((x,w); θ̂_d)).

4.4 Numerical Experiments

In this section, we evaluate the performance of our proposed methodology on various synthetic datasets. Consider a binary classification problem under the logistic regression setting.
Throughout the experiments, we let the feature vectors have an isotropic Gaussian distribution and the conditional probability be given by

η(x) = 1 / (1 + exp(−x^T θ_0)),    x ∼ N(0, I_d).    (4.18)

We construct θ_0 by drawing a single realization from the N(0, σ² I_d) distribution, and this θ_0 value is fixed for the rest of the experiments. We set the feature dimension d = 200 and the standard deviation σ = 0.25. Let η̂(x) denote the model on which we want to perform goodness-of-fit testing. We consider η̂(x) to be another logistic model with parameter θ_1; concretely, η̂(x) = 1/[1 + exp(−x^T θ_1)]. Obviously η̂(x) = η(x) if θ_1 = θ_0.

Example 4.4.1. (Size of the test) We begin by studying the size of our proposed test statistics. We focus on the θ_1 = θ_0 setting, which implies that E[D_f(Bern(η(X)) ‖ Bern(η̂(X)))] = 0 for every divergence function f. We set τ = 0 in the null hypothesis (4.2) and consider three significance levels α = 0.05, 0.1, and 0.15. The rejection rate of H_0 is reported as the average over 200 independent experiments. Since τ = 0, the optimization problem in (4.4) is independent of the choice of f. Its only feasible solution is given by p_1 = ... = p_L = 1/L, and so the test statistics amount to

U^asym_0(V_{n,L}) = (L/n) Σ_{ℓ=1}^{L} (V_ℓ − n/L)²,    U^finite_0(V_{n,L}) = (L/(2n)) Σ_{ℓ=1}^{L} (V_ℓ − n/L)².

Table 4.1 summarizes the rejection rates for the distribution-free Algorithm 4 for different values of (n, L). Each cell of the table contains an ordered pair, whose first and second entries respectively correspond to the asymptotic decision rule (4.5) and the finite decision rule (4.6).

We next consider the model-X setting for a similar experiment. We adopt the model-agnostic approach in choosing the score function T, discussed at the end of Section 4.3.3, namely,

T(x,w) = I(w ≤ η̂(x)) · 1/(2η̂(x)) + I(η̂(x) ≤ w) · 1/(2(1−η̂(x))).    (4.19)

Table 4.2 presents the rejection rate of the model-X Algorithm 5 for three significance levels α = 0.05, 0.1, 0.15 and for different numbers of samples (n), numbers of labels (L), and randomizations per label (K).
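Note that at τ = 0 the asymptotic statistic above is simply Pearson's chi-square statistic testing uniformity of the L label counts. A quick sketch of computing it, with simulated uniform labels standing in for the output of the algorithm:

```python
import random

def u_asym_0(counts, n):
    """U^asym_0 at tau = 0: (L/n) * sum (V_l - n/L)^2, i.e. Pearson's
    chi-square statistic for uniformity of the L label counts."""
    L = len(counts)
    return (L / n) * sum((v - n / L) ** 2 for v in counts)

random.seed(1)
n, L = 5000, 50
labels = [random.randrange(L) for _ in range(n)]   # uniform labels under H0
counts = [0] * L
for lab in labels:
    counts[lab] += 1
stat = u_asym_0(counts, n)
# Under H0 the statistic is approximately chi-square with L - 1 = 49 degrees
# of freedom, so values near 49 are typical and very large values are rare.
assert stat >= 0.0
```

This is consistent with the asymptotic decision rule, which compares the statistic against a chi-square-type threshold.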
Each cell of Table 4.2 consists of an ordered pair, where the first and second numbers respectively correspond to the asymptotic decision rule (4.5) and the finite decision rule (4.6). Reported numbers are averaged over 200 independent experiments.

             α = 0.05                  α = 0.1                   α = 0.15
  n \ L    50          100          50          100           50          100
  5000     (0.055, 0)  (0.03, 0)    (0.105, 0)  (0.075, 0)    (0.16, 0)   (0.1, 0)
  20000    (0.07, 0)   (0.05, 0)    (0.1, 0)    (0.1, 0)      (0.185, 0)  (0.16, 0)
  50000    (0.06, 0)   (0.05, 0)    (0.105, 0)  (0.105, 0)    (0.15, 0)   (0.145, 0)

Table 4.1: Size of the distribution-free GRASP, cf. Algorithm 4. In this experiment, η̂(x) = η(x) is considered, with η(x) being the logistic model (4.18). We consider both decision rules, asymptotic (4.5) and finite (4.6); the first and second entries of each pair represent the rejection rates of these decision rules, respectively. The numbers are averaged over 200 independent experiments for three significance levels α = 0.05, 0.1, 0.15 and for different sample sizes (n) and numbers of labels (L).

                α = 0.05                  α = 0.1                   α = 0.15
  (n, K) \ L  50          100          50          100           50          100
  (5000, 1)   (0.055, 0)  (0.04, 0)    (0.115, 0)  (0.08, 0)     (0.165, 0)  (0.11, 0)
  (5000, 5)   (0.025, 0)  (0.07, 0)    (0.12, 0)   (0.12, 0)     (0.185, 0)  (0.125, 0)
  (10000, 1)  (0.085, 0)  (0.035, 0)   (0.125, 0)  (0.095, 0)    (0.18, 0)   (0.105, 0)
  (10000, 5)  (0.065, 0)  (0.06, 0)    (0.07, 0)   (0.095, 0)    (0.17, 0)   (0.125, 0)

Table 4.2: Size of the model-X GRASP, cf. Algorithm 5. In this experiment, η̂(x) = η(x) is considered, with η(x) being the logistic model (4.18). Algorithm 5 is run with the score function T(x,w) given in (4.19). We consider both decision rules, asymptotic (4.5) and finite (4.6); the first and second entries of each pair represent the rejection rates of these decision rules, respectively. The numbers are averaged over 200 independent experiments for three significance levels α = 0.05, 0.1, 0.15 and for different sample sizes (n), numbers of labels (L), and numbers of randomizations per label (K).

Example 4.4.2.
(Power of the test) In this experiment, we consider the null hypothesis (4.2) for a logistic model η̂(x) with parameter θ_1 = −θ_0. Formally, we have

η̂(x) = 1 / (1 + exp(−θ_1^T x)),    with θ_1 = −θ_0.    (4.20)

In this experiment, we consider three divergence functions (KL, TV, and Hellinger, given in Lemma 4.1.2) between the true model η(·) and the test model η̂(·). The computed nominal values of these divergences are τ^KL_0 = 2.7819, τ^TV_0 = 0.7330, and τ^H_0 = 0.9576. More precisely, we have

τ^KL_0 = E[D_KL(Bern(η(X)) ‖ Bern(η̂(X)))],    (4.21)
τ^TV_0 = E[D_TV(Bern(η(X)) ‖ Bern(η̂(X)))],    (4.22)
τ^H_0 = E[D_H(Bern(η(X)) ‖ Bern(η̂(X)))].    (4.23)

To analyze the power of the proposed methods, we choose values of τ that are smaller than the τ_0 values; consequently, we expect the null hypothesis (4.2) to be rejected. For each divergence function, we consider four τ values and report the fraction of times (out of 50 independent experiments) that the proposed method rejects the null hypothesis H_0. Table 4.3 reports the results for distribution-free GRASP, outlined in Algorithm 4, for a variety of choices of (n, L). Each cell of Table 4.3 consists of an ordered pair, where the first and second coordinates respectively stand for the asymptotic decision rule (4.5) and the finite decision rule (4.6) at significance level α = 0.1.

To analyze the statistical power of model-X GRASP, as outlined in Algorithm 5, we adopt a similar setup with the test model η̂(x) given by (4.20), and run model-X GRASP with the score function (4.19). Table 4.4 presents the results for different values of the sample size (n), number of labels (L), and randomizations per label (K), at significance level α = 0.1. Each cell of Table 4.4 contains an ordered pair, with the first number indicating the rejection rate under the asymptotic decision rule (4.5) and the second number the rejection rate under the finite decision rule (4.6). Reported numbers are averaged over 50 independent experiments.
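The nominal values (4.21)–(4.23) are population expectations over x ∼ N(0, I_d); they can be approximated by Monte Carlo. The sketch below does this in a toy dimension with a freshly drawn θ_0, so the resulting estimates illustrate the computation rather than reproduce the paper's specific values.

```python
import math
import random

def kl_bern(p, q):
    """KL divergence between Bern(p) and Bern(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def tv_bern(p, q):
    """Total-variation distance between Bern(p) and Bern(q)."""
    return abs(p - q)

random.seed(2)
d, sigma = 5, 0.25                                  # toy dimension, not d = 200
theta0 = [random.gauss(0.0, sigma) for _ in range(d)]
theta1 = [-t for t in theta0]                       # test model, as in (4.20)

def eta(theta, x):
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(theta, x))))

n_mc, kl_sum, tv_sum = 20000, 0.0, 0.0
for _ in range(n_mc):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    p, q = eta(theta0, x), eta(theta1, x)
    kl_sum += kl_bern(p, q)
    tv_sum += tv_bern(p, q)
tau_kl, tau_tv = kl_sum / n_mc, tv_sum / n_mc       # Monte Carlo tau_0 estimates
assert tau_kl > 0.0 and 0.0 < tau_tv < 1.0
```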
KL (τ^KL_0 = 2.7819):
  (n, L) \ τ     0.72     0.82      0.96      1.02
  (5000, 50)     (1,1)    (1,0.02)  (0,0)     (0,0)
  (20000, 50)    (1,1)    (1,1)     (0,0)     (0,0)
  (50000, 50)    (1,1)    (1,1)     (0.54,0)  (0,0)
  (5000, 100)    (1,1)    (1,0)     (0,0)     (0,0)
  (20000, 100)   (1,1)    (1,1)     (1,0)     (0,0)
  (50000, 100)   (1,1)    (1,1)     (1,1)     (0.44,0)

TV (τ^TV_0 = 0.7330):
  (n, L) \ τ     0.4      0.44      0.48      0.52
  (5000, 50)     (1,1)    (1,0.52)  (0.18,0)  (0,0)
  (20000, 50)    (1,1)    (1,1)     (1,0.86)  (0,0)
  (50000, 50)    (1,1)    (1,1)     (1,1)     (0,0)
  (5000, 100)    (1,1)    (1,0)     (0.02,0)  (0,0)
  (20000, 100)   (1,1)    (1,1)     (1,0)     (0,0)
  (50000, 100)   (1,1)    (1,1)     (1,1)     (0,0)

Hellinger (τ^H_0 = 0.9576):
  (n, L) \ τ     0.28      0.32      0.36      0.4
  (5000, 50)     (1,1)     (1,0.08)  (0.08,0)  (0,0)
  (20000, 50)    (1,1)     (1,1)     (1,0.04)  (0,0)
  (50000, 50)    (1,1)     (1,1)     (1,0.04)  (0,0)
  (5000, 100)    (1,0.86)  (1,0)     (0.06,0)  (0,0)
  (20000, 100)   (1,1)     (1,1)     (1,0)     (0,0)
  (50000, 100)   (1,1)     (1,1)     (1,1)     (0.06,0)

Table 4.3: Statistical power of distribution-free GRASP, as outlined in Algorithm 4, with the true model η(x) given by (4.18) and the test model η̂(x) in (4.20). In this experiment, we consider three f-divergences (KL, TV, and Hellinger, as per Lemma 4.1.2) for different choices of sample size (n) and number of labels (L). Each cell has two numbers, where the first and second respectively refer to the rejection rates under the asymptotic decision rule (4.5) and the finite decision rule (4.6), at significance level α = 0.1. The reported numbers are averaged over 50 independent experiments.

Since the problem settings for the last two power-analysis experiments (distribution-free GRASP and model-X GRASP) are the same, a simple comparison between Table 4.3 and Table 4.4 shows that, for each of the f-divergence functions, the model-X GRASP algorithm achieves a higher statistical power than the distribution-free GRASP procedure. This is expected, as model-X GRASP utilizes the covariate distribution to obtain higher statistical power.
KL (τ^KL_0 = 2.7819):
  (n, K, L) \ τ     1.5      1.7       1.9       2
  (5000, 5, 50)     (1,1)    (1,0.12)  (0.82,0)  (0,0)
  (5000, 5, 100)    (1,1)    (1,0.38)  (1,0)     (0.28,0)
  (10000, 5, 50)    (1,1)    (1,1)     (1,0)     (0.1,0)
  (10000, 5, 100)   (1,1)    (1,1)     (1,0)     (1,0)
  (5000, 1, 50)     (1,1)    (1,0)     (0,0)     (0,0)
  (5000, 1, 100)    (1,1)    (1,0)     (0.38,0)  (0,0)
  (10000, 1, 50)    (1,1)    (1,0.1)   (0.02,0)  (0,0)
  (10000, 1, 100)   (1,1)    (1,1)     (1,0)     (0.06,0)

TV (τ^TV_0 = 0.7330):
  (n, K, L) \ τ     0.6       0.63      0.66      0.7
  (5000, 5, 50)     (1,1)     (1,1)     (1,0)     (0.04,0)
  (5000, 5, 100)    (1,1)     (1,0)     (1,0)     (0,0)
  (10000, 5, 50)    (1,1)     (1,1)     (1,1)     (0.84,0)
  (10000, 5, 100)   (1,1)     (1,1)     (1,0)     (0,0)
  (5000, 1, 50)     (1,1)     (1,0.78)  (1,0)     (0,0)
  (5000, 1, 100)    (1,0.96)  (1,0)     (0.96,0)  (0,0)
  (10000, 1, 50)    (1,1)     (1,1)     (1,0.38)  (0.04,0)
  (10000, 1, 100)   (1,1)     (1,1)     (1,0)     (0,0)

Hellinger (τ^H_0 = 0.9576):
  (n, K, L) \ τ     0.58      0.62      0.70    0.8
  (5000, 5, 50)     (1,1)     (1,0.98)  (1,0)   (0.5,0)
  (5000, 5, 100)    (1,1)     (1,0)     (1,0)   (0.02,0)
  (10000, 5, 50)    (1,1)     (1,1)     (1,0)   (1,0)
  (10000, 5, 100)   (1,1)     (1,1)     (1,0)   (1,0)
  (5000, 1, 50)     (1,1)     (1,0)     (1,0)   (0,0)
  (5000, 1, 100)    (1,0.44)  (1,0)     (1,0)   (0,0)
  (10000, 1, 50)    (1,1)     (1,1)     (1,0)   (0.74,0)
  (10000, 1, 100)   (1,1)     (1,1)     (1,0)   (0.76,0)

Table 4.4: Statistical power of model-X GRASP, as outlined in Algorithm 5, with the true model η(x) given by (4.18) and the test model η̂(x) in (4.20). We consider three f-divergences (KL, TV, and Hellinger, as per Lemma 4.1.2) for different choices of sample size (n), number of labels (L), and number of randomizations per label (K). The model-X algorithm is run with the score function T(x,w) given in (4.19). Each cell contains an ordered pair, where the first and second numbers respectively refer to the rejection rates under the asymptotic decision rule (4.5) and the finite decision rule (4.6), at significance level α = 0.1. The reported numbers are averaged over 50 independent experiments.

Example 4.4.3. (Score function for distribution-free GRASP) In this experiment, we evaluate the performance of distribution-free GRASP with other choices of the score function. So far, we have used the score function T(x,w) = w. However, we want to examine whether one can achieve higher statistical power by selecting a score function that accounts for variability between x and w.
To this end, we perform an experiment similar to the one reported in Table 4.3 for (n, L) = (5000, 50), but with a new score function. We fit a linear model with response value w and feature vectors x on an auxiliary dataset of size 4000. We denote the fitted linear model by θ̂ and consider the score function T(x,w) = |w − x^T θ̂|. We report the results for the three f-divergence functions (TV, KL divergence, and Hellinger distance) and for both the asymptotic and finite-sample decision rules in Figure 4.1. The reported numbers are averaged over 50 experiments at a significance level of α = 0.1. Comparing the rejection rates with the numbers reported in the first row of Table 4.3, we find that the previous score function T(x,w) = w performs slightly better than the fitted model. For instance, for the KL divergence, using T(x,w) = w with both the asymptotic and finite decision rules achieves full power (rejection rate 1.00) at τ = 0.72, while the fitted score function already has trivial power (rejection rate 0) at τ = 0.7. Similar observations can be made for the other divergence functions: for the TV and Hellinger divergences, the score function T(x,w) = w also slightly outperforms the regression-based score function.

Figure 4.1: Statistical power of distribution-free GRASP for the trained score function T(x,w) = |w − x^T θ̂|. The linear model θ̂ is trained on an auxiliary dataset of size 4000. The true model η(x) is given in (4.18) and the test model η̂(x) is given in (4.20). In this case, the nominal values are τ^KL_0 = 2.7819, τ^TV_0 = 0.7330, and τ^H_0 = 0.9576 (see (4.21)). For the three f-divergences (KL, TV, Hellinger), we run distribution-free GRASP with the two decision rules Φ^asym_{n,L,α,τ} and Φ^finite_{n,L,α,τ} given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with different τ values. Here, the sample size is n = 5000 with L = 50 labels, and the significance level is α = 0.1. Reported rates are averaged over 50 experiments.

Example 4.4.4. (The choice of the number of labels L) In this experiment, we analyze the statistical power of the proposed goodness-of-fit testing methods while allowing a wide range of values for the internal parameter L (the number of labels). We consider the previous setting with the true model η(x) as in (4.18) and the test model η̂(x) as in (4.20). We consider the null hypothesis (4.2) for the three f-divergence functions KL, TV, and H, and deploy the two decision rules (4.5) and (4.6). In this setting, the nominal values of the divergence functions are τ^KL_0 = 2.7819, τ^TV_0 = 0.7330, and τ^H_0 = 0.9576, where the τ_0 values are given in (4.21). We consider the same range of L values for the different f-divergences (KL, TV, Hellinger) and decision rules (Φ^asym_{n,L,τ} and Φ^finite_{n,L,τ}). However, we choose different values of τ in the hypothesis (4.2) depending on the setting, to better capture the effect of L. For distribution-free GRASP we consider the null hypothesis (4.2) with τ^KL_asym = 0.84, τ^KL_finite = 0.73, τ^TV_asym = 0.46, τ^TV_finite = 0.42, τ^H_asym = 0.34, τ^H_finite = 0.30. Figure 4.2 depicts the statistical power curves under these settings versus the number of labels L. The sample size is set to n = 5000 and the significance level to α = 0.1. The obtained numbers are averaged over 50 independent experiments.

Figure 4.2: Statistical power of distribution-free GRASP versus the number of labels L. The true model η(x) is given in (4.18) and the test model η̂(x) is given in (4.20). In this case, the nominal values are τ^KL_0 = 2.7819, τ^TV_0 = 0.7330, and τ^H_0 = 0.9576 (see (4.21)). For the three f-divergences (KL, TV, Hellinger), we run distribution-free GRASP with the two decision rules Φ^asym_{n,L,α,τ} and Φ^finite_{n,L,α,τ} given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with the following τ values: τ^KL_asym = 0.84, τ^KL_finite = 0.73, τ^TV_asym = 0.46, τ^TV_finite = 0.42, τ^H_asym = 0.34, and τ^H_finite = 0.30. Here, the sample size is n = 5000, the significance level is α = 0.1, and the reported rates are averaged over 50 experiments.

We repeat the same experiment for model-X GRASP with the score function (4.19). Here, we consider τ^KL_asym = 1.87, τ^KL_finite = 1.6, τ^TV_asym = 0.67, τ^TV_finite = 0.62, τ^H_asym = 0.75, and τ^H_finite = 0.6. The sample size is set to n = 5000 with K = 1 randomizations per label. The significance level is set to α = 0.1 and the reported numbers are averaged over 50 independent experiments. The behavior of the statistical power with respect to the number of labels (L) can be seen in Figure 4.3. As observed from the curves in Figures 4.2 and 4.3, the power favors a middle range of L values, before and after which the power starts to decline.

Example 4.4.5. (Statistical evidence for misfitted models) In this experiment, we test for a perfect fit of models to the underlying conditional law, by considering the hypothesis testing problem (4.2) with τ = 0. We consider feature vectors drawn independently from N(0, I_d), where d = 100. We use a planted model setup, where the data-generating law is given by a two-layer neural network. Specifically, we consider a fully-connected neural network with N = 400 ReLU neurons in

Figure 4.3: Statistical power of model-X GRASP versus the number of labels L. The true model η(x) is given in (4.18) and the test model η̂(x) is given in (4.20). In this case, the nominal values are τ^KL_0 = 2.7819, τ^TV_0 = 0.7330, and τ^H_0 = 0.9576 (see (4.21)).
For the three f-divergences (KL, TV, Hellinger), we run model-X GRASP with the score function T(x,w) as per (4.19) and the two decision rules Φ^asym_{n,L,τ} and Φ^finite_{n,L,τ} given in (4.5) and (4.6). For each setting, we consider the null hypothesis (4.2) with the following τ values: τ^KL_asym = 1.87, τ^KL_finite = 1.6, τ^TV_asym = 0.67, τ^TV_finite = 0.62, τ^H_asym = 0.75, and τ^H_finite = 0.6. Here, the sample size is n = 5000, with K = 1 randomizations per sample, and the significance level is α = 0.1. The reported rates are averaged over 50 experiments.

the first layer and a sigmoid activation function as the output unit. The conditional distribution L(Y|X) is given by:

Y|X ∼ Bern( Sigmoid( θ_0^T ReLU(W_0 X) ) ),

where W_0 ∈ R^{N×d} is the weight matrix, θ_0 ∈ R^N is the weight vector of the second layer, and ReLU is the ReLU activation function applied entry-wise. The entries of the planted parameters (W_0, θ_0) are drawn independently from N(0,1). We compare two neural networks, N_1 and N_2, trained on the same dataset D_train = {(x_i, y_i)}_{i=1:T} but with different weight-initialization schemes. Specifically, we initialize the weights of N_1 using the Kaiming initialization method, while the weights of N_2 are initialized by adding independent noise from a normal distribution with mean 0 and standard deviation 10^{−2} to each entry of W_0 and θ_0. We train both models on a training dataset of size T = 2^{16} (approximately 65K samples) and evaluate them on a test dataset of size 128. We observe that N_2 achieves a higher accuracy (0.922) than N_1 (0.875). However, the difference in their empirical accuracies is not significant enough to distinguish between the two models with confidence, given the small size of the test dataset. On the other hand, by computing the ℓ_2 distance between the estimated weights of N_1 and N_2 and the actual parameters W_0 and θ_0, we find that the distance is much larger for N_1 (212.33) than for N_2 (43.74).
This indicates that the perturbed initialization of N_2 has resulted in the model finding a better local minimum in the optimization landscape compared to N_1, despite their similar empirical accuracies; therefore, N_2 is better fitted to L(Y|X) than N_1. This further highlights the fact that solely focusing on predictive performance is not sufficient to differentiate between the models. We evaluate the performance of our goodness-of-fit test to determine the significance of the evidence against the assumption that N_1 and N_2 are perfectly fitted to L(Y|X). We construct the p-values for the two models according to (4.9), with M = 500. For the score function, we fit a three-layer neural network to regress w on x, namely v^T ReLU(A_2 ReLU(A_1 x)), with v ∈ R^{100×1}, A_2 ∈ R^{100×300}, A_1 ∈ R^{300×100}. We define the score function T(x,w) as the MSE of this fitted model over the evaluation dataset of size 128 (which was used earlier for the model accuracies). The p-values for N_1 and N_2 work out to 0.0039 and 0.6566, respectively. Our test correctly indicates that N_2 is the better fit to L(Y|X).

Example 4.4.6. (GAN-based versus model-agnostic score functions) In this experiment, we compare the performance of the model-X GRASP algorithm for two choices of score functions: the GAN-based approach and the model-agnostic score given in (4.19). We consider the hypothesis testing problem (4.2) for three f-divergence functions (TV, KL, and the Hellinger distance) and two decision rules, asymptotic (4.5) and finite (4.6).
For each setting (6 overall), we run the experiment with one of the score functions (12 experiments in total). The true model η(x) is given by (4.18) and the test model η̂(x) is given by (4.20), with θ_0 a draw from N(0, I_d), feature dimension d = 200, and θ_1 = −2θ_0. We let the number of samples be n = 5000, the number of labels L = 50, and K = 1 randomizations per label. Figure 4.4 depicts the performance of model-X GRASP with the GAN-based score function and the agnostic (AGN) score function for the three f-divergence functions and the two decision rules, asym and finite. The reported rates are averaged over 50 instances. It is observed that in each setting, and for both the asymptotic and the finite decision rules, the GAN-based score function achieves a higher statistical power for larger τ values than its counterpart run with the agnostic score function.

Figure 4.4 (panels: (a) KL, (b) TV, (c) Hellinger): Comparison between the statistical power of model-X GRASP with a GAN-based score function and the agnostic score function (4.19). In this experiment, the test size is n = 5000, the number of labels is L = 50, and the number of randomizations per label is K = 1. The hypothesis testing problem (4.2) is considered for a variety of τ values, three f-divergence functions, and the two decision rules, asymptotic (4.5) and finite (4.6). In each setting, it can be observed that the GAN-based score function achieves a higher statistical power than the agnostic score function. The reported numbers are averaged over 50 experiments.

In this experiment, for the discriminator of the GAN-based score function we use a neural network with input dimension 201 (as the inputs are (x,w) with x ∈ R^200 and w ∈ R).
The first hidden layer consists of 256 neurons with ReLU activation; the second and third hidden layers have 128 and 64 neurons, respectively, also with ReLU activations. The output is a single neuron with a sigmoid activation function. The cross-entropy loss is used for training, and the network is trained on 64k inputs. To prevent overfitting, a dropout probability of 0.3 is applied to the neural connections.

4.5 Solving the optimization problem for GRASP decision rules

The optimization problems (4.4) have convex objectives and convex constraints (an f-divergence ball), and are therefore convex optimization problems. We use an iterative optimization procedure to find the optimal solution. It is worth highlighting that vanilla projected gradient descent requires projection onto the f-divergence ball, which can be computationally complex for large L or for certain choices of the f-divergence function. To circumvent this issue, we explore an alternative approach leveraging conditional gradient methods (Frank-Wolfe [57]), which avoid the need for such projections. This adoption of the conditional gradient method is also motivated by the observation that minimizing a linear objective over an f-divergence ball has a dual formulation that is simpler to analyze [18, 128].

We first provide a brief overview of conditional gradient methods. Consider a convex differentiable objective function g : R^L → R and a compact convex set P ⊂ R^L. For the problem of minimizing g over P, conditional gradient methods use the following iterative rule:

p_{t+1} = γ_t q_t + (1 − γ_t) p_t,    q_t = argmin_{q∈P} q^T ∇g(p_t),    (4.24)

where γ_t is the step size and can be set as γ_t = 1/(t+2). Returning to the initial optimization problems (4.4), we have

U^finite_τ(V_{n,L}) = min_{p∈U_τ} (1/n) Σ_{ℓ=1}^{L} (V_ℓ − n p_ℓ)² / (p_ℓ + 1/L),
U^asym_τ(V_{n,L}) = min_{p∈U_τ} (1/n) Σ_{ℓ=1}^{L} (V_ℓ − n p_ℓ)² / p_ℓ,    (4.25)

where the constraint set U_τ is given by

U_τ = { p ∈ R^L : p ≥ 0, Σ_{ℓ=1}^{L} p_ℓ = 1, Σ_{ℓ=1}^{L} f(L p_ℓ) ≤ L τ }.

We next focus on solving the linearization of the objective functions over the f-divergence ball, which is needed in the update rule (4.24). This problem has been studied before by [18, 160] for general continuous distributions. For the reader's convenience, we state the result for the specific case of the problem over discrete distributions.

Proposition 4.5.1. For x ∈ R^L and τ ≥ 0, let

(λ*, η*) ∈ argmin_{λ≥0, η∈R} [ λτ + η + (λ/L) Σ_{ℓ=1}^{L} f*( (−η − x_ℓ)/λ ) ].

If q* ∈ R^L is such that

(−x_ℓ − η*)/(L λ*) ∈ ∂f(q*_ℓ),    for all ℓ ∈ [L],

then q* is the minimizer of q^T x over U_τ.

It is easy to observe that the objective function in Proposition 4.5.1 is defined over a half-space in a two-dimensional space with decoupled constraints, and can be solved efficiently. Combining the result of Proposition 4.5.1 with (4.24) completes the iterative procedure for solving the optimization problems (4.25). Since the objective functions in (4.25) differ between the asymptotic and finite decision rules, the gradients (the vector x in q^T x as per Proposition 4.5.1) differ as well. Namely, we have

∇g^asym(p)_ℓ = n − V_ℓ² / (n p_ℓ²),
∇g^finite(p)_ℓ = (n p_ℓ − V_ℓ)(n p_ℓ + V_ℓ + 2n/L) / (n (p_ℓ + 1/L)²),

where g^asym and g^finite are the associated objective functions in (4.25). Algorithms 6 and 7 summarize the iterative procedures for solving the finite and asymptotic optimization problems (4.25), respectively.

1: (initialization): p_ℓ(0) ← 1/L, for all ℓ ∈ [L].
2: for iteration h = 1, ..., H do
3:   x_ℓ(h) = (n p_ℓ(h) − V_ℓ)(n p_ℓ(h) + V_ℓ + 2n/L) / (n (p_ℓ(h) + 1/L)²)
4:   (λ(h), η(h)) = argmin_{λ≥0, η∈R} [ λτ + η + (λ/L) Σ_{ℓ=1}^{L} f*( (−η − x_ℓ(h))/λ ) ]
5:   let q_ℓ(h) be such that (−η(h) − x_ℓ(h))/(L λ(h)) ∈ ∂f(q_ℓ(h)), for all ℓ ∈ [L]
6:   p(h+1) = γ(h) q(h) + (1 − γ(h)) p(h)
7: return U^finite_τ(V_{n,L}) ← (1/n) Σ_{ℓ=1}^{L} (V_ℓ − n p_ℓ(H))² / (p_ℓ(H) + 1/L)
Algorithm 6: Iterative procedure to solve the optimization problem of U^finite_τ(V_{n,L})

1: (initialization): p_ℓ(0) ← 1/L, for all ℓ ∈ [L].
2: for iteration h = 1, ..., H do
3:   x_ℓ(h) = n − V_ℓ² / (n p_ℓ(h)²), for all ℓ ∈ [L]
4:   (λ(h), η(h)) = argmin_{λ≥0, η∈R} [ λτ + η + (λ/L) Σ_{ℓ=1}^{L} f*( (−η − x_ℓ(h))/λ ) ]
5:   let q_ℓ(h) be such that (−η(h) − x_ℓ(h))/(L λ(h)) ∈ ∂f(q_ℓ(h)), for all ℓ ∈ [L]
6:   p(h+1) = γ(h) q(h) + (1 − γ(h)) p(h)
7: return U^asym_τ(V_{n,L}) ← (1/n) Σ_{ℓ=1}^{L} (V_ℓ − n p_ℓ(H))² / p_ℓ(H)
Algorithm 7: Iterative procedure to solve the optimization problem of U^asym_τ(V_{n,L})

Part II
Adversarial Robustness

Chapter 5
Fundamental Tradeoffs in Distributionally Adversarial Training

In a classic supervised learning setting, a learner is given n pairs of data points {z_i := (x_i, y_i)}_{i=1:n}, with x_i ∈ R^d representing feature vectors and y_i the response variables (or labels). The common assumption in supervised machine learning is that the data points z_i are drawn independently and identically from some probability measure P_Z defined over the space Z := X × Y. Given this training data, the learner would like to fit a parametric function f_θ with θ ∈ R^d to predict the response (label) at new points x. A common approach to model fitting is empirical risk minimization:

θ̂ = argmin_{θ∈R^d} (1/n) Σ_{j=1}^{n} ℓ(θ; (x_j, y_j)),    (5.1)

with ℓ(θ; (x,y)) := ℓ̃(f_θ(x), y) and ℓ̃ a loss function that captures the discrepancy between the estimated value f_θ(x) and the true response value y. The performance of the model is then measured in terms of the standard risk (a.k.a. generalization error), defined as

SR(θ) := E_{z=(x,y)∼P_Z}[ℓ(θ; (x,y))].    (5.2)

The standard risk is a population risk and quantifies the expected error on new data points drawn from the same distribution as the training data. Although empirical risk minimization is a widely used approach for model learning, it is well known that the resulting models can be highly vulnerable to adversarial perturbations of their inputs, known as adversarial attacks. We next discuss the adversarial setting and two common adversary models considered in the literature.
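As a minimal illustration of (5.1)–(5.2), the sketch below fits a one-dimensional logistic model by gradient descent on the empirical risk and estimates its standard risk on fresh samples. All parameter choices (true coefficient, step size, sample sizes) are toy values for illustration, not taken from the text.

```python
import math
import random

def logistic_loss(theta, x, y):
    """l(theta; (x, y)) for a logistic model with labels y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * theta * x))

def erm(data, steps=500, lr=0.5):
    """Empirical risk minimization (5.1) by full-batch gradient descent (1-d)."""
    theta, n = 0.0, len(data)
    for _ in range(steps):
        # d/dtheta log(1 + exp(-y theta x)) = -y x (1 - sigmoid(y theta x))
        g = sum(-y * x * (1.0 - 1.0 / (1.0 + math.exp(-y * theta * x)))
                for x, y in data) / n
        theta -= lr * g
    return theta

random.seed(3)
def draw(m):
    # x ~ N(0,1); y = +1 with probability eta(x) under a true theta* = 2.
    out = []
    for _ in range(m):
        x = random.gauss(0.0, 1.0)
        y = 1 if random.random() < 1.0 / (1.0 + math.exp(-2.0 * x)) else -1
        out.append((x, y))
    return out

train, test = draw(2000), draw(2000)
theta_hat = erm(train)
# Monte Carlo estimate of the standard risk SR(theta_hat) as in (5.2).
sr = sum(logistic_loss(theta_hat, x, y) for x, y in test) / len(test)
assert 1.0 < theta_hat < 3.0           # recovers a value near the true theta* = 2
assert sr < math.log(2.0)              # beats the uninformative model theta = 0
```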
5.0.1 Adversarial setting

The adversarial setting can be perceived as a game between the learner and the adversary. Given access to the training data, drawn i.i.d. from a common distribution P_Z, the learner chooses a model θ. Depending on the adversary's budget ε, the adversary chooses a test data point (x̃, ỹ) that can deviate from a typical test point according to one of the following models. The performance of the model θ is then measured in terms of predicting ỹ given the perturbed input x̃.

Norm-bounded perturbations. In this setting, ỹ = y (no perturbation of the response) and x̃ = x + δ, where δ can be an arbitrary vector from the ℓ_r-ball of radius ε. The adversarial risk in this case is defined as

AR(θ) := E_{(x,y)∼P_Z}[ sup_{‖δ‖_{ℓ_r} ≤ ε} ℓ(θ; (x + δ, y)) ].    (5.3)

Distribution shift. In this setting, the adversary can shift the distribution of the test data, making it different from the training distribution P_Z. Specifically, (x̃, ỹ) ∼ Q, where Q ∈ U_ε(P_Z) denotes an ε-neighborhood of the distribution P_Z. A popular choice of this neighborhood is via the Wasserstein distance, which is formally defined below. In this case, the adversarial risk is defined as

AR(θ) := sup_{Q∈U_ε(P_Z)} E_{(x̃,ỹ)∼Q}[ℓ(θ; (x̃, ỹ))].    (5.4)

Note that this is a strong notion of adversary, as the perturbation is chosen after observing both the model θ and the data point (x,y) (in the norm-bounded perturbation model) or the training data distribution P_Z (in the distribution shift model). The distribution-perturbing adversary is a common model in a multitude of application domains and has already been adopted by several works, including [168, 48, 144].

Our primary focus in this work is on the distribution shift adversary model, with the Wasserstein metric used to measure the distance between distributions. The next section provides a brief background on the Wasserstein robust loss, which will be used later in our work.
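For linear models the inner supremum in (5.3) has a closed form. As a concrete example (the logistic loss and the ℓ∞ geometry below are our choices for illustration, not fixed by the text), with labels y ∈ {−1, +1} and an ℓ∞-bounded adversary, the worst-case perturbation is δ = −ε y sign(θ), which shrinks the margin y θ^T x by ε ‖θ‖_1:

```python
import math

def logistic_loss(margin):
    return math.log(1.0 + math.exp(-margin))

def adversarial_loss_linf(theta, x, y, eps):
    """sup over ||delta||_inf <= eps of the logistic loss at (x + delta, y).
    The maximizer delta = -eps * y * sign(theta) reduces the margin
    y * theta . x by exactly eps * ||theta||_1."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    return logistic_loss(margin - eps * sum(abs(t) for t in theta))

theta, x, y = [1.0, -2.0], [0.5, -1.0], 1
clean = logistic_loss(y * sum(t * xi for t, xi in zip(theta, x)))
adv = adversarial_loss_linf(theta, x, y, eps=0.1)
assert adv > clean                                         # the adversary only hurts
assert adversarial_loss_linf(theta, x, y, 0.0) == clean    # eps = 0 recovers SR
```

Averaging `adversarial_loss_linf` over test points gives a plug-in estimate of the norm-bounded adversarial risk (5.3) for this model class.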
5.0.2 Background on Wasserstein robust loss

Let Z be a metric space endowed with a metric d : Z × Z → ℝ_{≥0}. Denote by P(Z) the set of all Borel probability measures on Z. For a Q-measurable function f, the L_p(Q)-norm of f is defined as

  ‖f‖_{Q,p} := (∫_Z |f|^p dQ)^{1/p}   for p ∈ [1, ∞),
  ‖f‖_{Q,∞} := Q-ess sup_{z∈Z} |f(z)|   for p = ∞.   (5.5)

For two distributions P, Q ∈ P(Z), the Wasserstein distance of order p is given by

  W_p(P, Q) := inf_{π∈Cpl(P,Q)} ‖d‖_{π,p},   (5.6)

where the coupling set Cpl(P, Q) denotes the set of all probability measures π on Z × Z with first marginal π_1 := π(· × Z) = P and second marginal π_2 := π(Z × ·) = Q. We use the Wasserstein distance to define the neighborhood set U_ε in the distribution shift adversary model. Namely,

  U_ε(P_Z) := {Q ∈ P(Z) : W_p(P_Z, Q) ≤ ε}.   (5.7)

In this case we refer to AR(θ) given by (5.4) as the Wasserstein adversarial risk. Note that this notion involves a maximization over distributions Q ∈ P(Z), which can be daunting. However, an important result from distributionally robust optimization, which we also use in our characterization of AR(θ), is that strong duality holds for this problem under general conditions. The dual problem of (5.4) is given by

  min_{γ≥0} { γ ε^p + E_{P_Z}[φ_γ(θ; z)] },   for p ∈ [1, ∞),
  E_{z∼P_Z}[ sup_{z̃∈Z} { ℓ(θ; z̃) : d(z, z̃) ≤ ε } ],   for p = ∞.   (5.8)

Here φ_γ(θ; z) is the robust surrogate for the loss function ℓ(θ; z) and is defined as

  φ_γ(θ; z_0) := sup_{z∈Z} { ℓ(θ; z) − γ d^p(z, z_0) }.   (5.9)

For p ∈ [1, ∞), it is shown that strong duality holds if either P_Z has finite support or Z is a Polish space [61]. For p = ∞, Lemma EC.2 in [60] shows that strong duality holds if P_Z has finite support.

Remark 5.0.1. It is worth noting that the Wasserstein adversary model is stronger than, and generalizes, the norm-bounded perturbation model.
In particular, for any p ∈ [1, ∞],

  E_{(x,y)∼P_Z}[ sup_{‖δ‖_{ℓ_r} ≤ ε} ℓ(θ; (x + δ, y)) ] ≤ sup_{Q∈U_ε(P_Z)} E_{(x̃,ỹ)∼Q}[ℓ(θ; (x̃, ỹ))],

where U_ε(P_Z) is given by (5.7) with the Wasserstein distance defined with respect to the ℓ_r distance, which is also used in the definition of the norm-bounded perturbation model. Equality holds for p = ∞. We refer to [168, Proposition 3.1] and [144, Corollary 2.1, Theorem 1] for the proof and further explanation.

5.0.2.1 Regularization effect of the Wasserstein adversarial loss

It is clear from the definition that AR(θ) ≥ SR(θ) for any model θ. Understanding the tradeoff between standard and adversarial risks is intimately related to the gap AR(θ) − SR(θ). The gap between the Wasserstein adversarial loss and the standard loss has been studied in several settings in the context of distributionally robust optimization (DRO) [10, 60]. In particular, [10, 60] introduced the notion of variation of the loss, denoted by V(ℓ), as a measure of the magnitude of change in the expected loss when the data distribution is perturbed, and showed that the Wasserstein adversarial loss is closely related to regularizing the nominal loss by the variation V(ℓ). The formal definition of the variation of the loss, recalled from [60], is given below.

Definition 5.0.2 (Variation of the loss). Suppose that Z is a normed space with norm ‖·‖. Let ℓ be a continuous function on Z, and assume that ∇_z ℓ exists P-almost everywhere. The variation of the loss ℓ with respect to P is defined as

  V_{P,q}(ℓ) := ‖ ‖∇_z ℓ‖_* ‖_{P,q}   for q ∈ [1, ∞),
  V_{P,∞}(ℓ) := P_Z-ess sup_{z∈Z} sup_{z̃≠z} (ℓ(z̃) − ℓ(z))_+ / ‖z̃ − z‖   for q = ∞.   (5.10)

Here ‖·‖_* denotes the dual norm of ‖·‖, and we recall that ‖·‖_{P,q} is the L_q(P)-norm given by (5.5). The following proposition from [10, 60] states that the variation of the loss captures the first-order term of the gap between the Wasserstein adversarial risk and the standard risk for small ε.

Proposition 5.0.3.
Suppose that the loss ℓ(θ; z) is differentiable in the interior of Z for every θ, and ∇_z ℓ is continuous on Z. When p ∈ (1, ∞), assume that there exist M, L ≥ 0 such that for every θ and z, z̃ ∈ Z,

  ‖∇_z ℓ(θ; z̃) − ∇_z ℓ(θ; z)‖_* ≤ M + L ‖z̃ − z‖^{p−1}.

When p = ∞, assume instead that there exist M ≥ 0 and δ_0 > 0 such that for every θ and z, z̃ ∈ Z with ‖z̃ − z‖ < δ_0, we have ‖∇_z ℓ(θ; z̃) − ∇_z ℓ(θ; z)‖_* ≤ M. Then, there exists ε̄ such that for all 0 ≤ ε < ε̄ and all θ,

  AR(θ) − SR(θ) = ε V_{P_Z,q}(ℓ) + O(ε²),   (5.11)

where 1/p + 1/q = 1 and p is the order of the Wasserstein distance in defining the set U_ε(P_Z) in the adversarial risk (5.4).

By virtue of Proposition 5.0.3, the Wasserstein adversarial risk can be perceived as a regularized form of the standard risk, with regularization given by the variation of the loss. Nonetheless, note that this is only an approximation which captures the first-order terms for small adversary's power ε. (See also Remark 8 in [10] for an upper bound on the gap, up to second-order terms in ε.) In this chapter, we consider the settings of linear regression and binary classification. For these settings, only in the special case p = 1 (1-Wasserstein), and when the loss is Lipschitz and its derivative converges at ∞, is the gap (5.11) shown to be linear in ε and therefore precisely characterized as ε V_{P_Z,q}(ℓ). However, since we consider more common losses for these settings, namely the quadratic loss for linear regression and the 0-1 loss for classification, such a characterization does not apply, and a direct derivation of the adversarial risk is required. Later, in Section 5.1.3, we use the result of Proposition 5.0.3 to study the tradeoff between SR and AR in the problem of learning an unknown function over the d-dimensional sphere S^{d−1}.
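As a quick check of Definition 5.0.2, the variation V_{P,2}(ℓ) of the quadratic loss (with perturbations only in x, ℓ_2 norm) can be computed both by Monte Carlo and in closed form: since ∇_x ℓ = −2(y − x^T θ)θ, one gets V_{P,2}(ℓ) = 2‖θ‖_{ℓ_2} E[(y − x^T θ)²]^{1/2}. The sketch below verifies this under an assumed Gaussian linear data model (all parameter values illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200_000, 4, 0.5
theta0 = np.ones(d) / np.sqrt(d)       # true model (illustrative)
theta = 0.5 * theta0                   # model being evaluated (illustrative)
X = rng.normal(size=(n, d))
y = X @ theta0 + sigma * rng.normal(size=n)

# Monte Carlo: ||grad_x loss||_2 = 2 |y - x^T theta| ||theta||_2, then L_2(P) norm
res = y - X @ theta
V_mc = np.sqrt(np.mean((2 * np.abs(res) * np.linalg.norm(theta)) ** 2))

# Closed form: E[(y - x^T theta)^2] = sigma^2 + ||theta0 - theta||^2 for isotropic x
V_exact = 2 * np.linalg.norm(theta) * np.sqrt(
    sigma**2 + np.linalg.norm(theta0 - theta) ** 2)
print(abs(V_mc - V_exact) / V_exact)   # small relative error
```

By Proposition 5.0.3, ε · V_exact is then the first-order gap AR(θ) − SR(θ) for a weak 2-Wasserstein adversary.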
5.1 Main results

In this chapter, we focus on the distribution perturbing adversary and aim to understand the fundamental tradeoff between standard risk and adversarial risk, which holds regardless of computational power or the size of the training data. We consider the 2-Wasserstein distance (p = 2) with the metric d(z, z̃) defined as

  d(z, z̃) = ‖x − x̃‖_{ℓ_2} + ∞ · I{y ≠ ỹ},   (5.12)

for z = (x, y) and z̃ = (x̃, ỹ). Therefore, an adversary with finite power ε can only perturb the distribution of the input feature x, but not y. Otherwise, the distance d(z, z̃) becomes infinite, and so the Wasserstein distance between the data distribution P_Z and the adversary distribution Q, given by (5.6), also becomes infinite. It is worth noting that this choice of d is only for simplicity of presentation, and our results in this section can be derived in a straightforward manner for distances that also allow perturbations on the y component. The following remark relates the two types of adversary discussed in Section 5.0.1 and follows readily from the definition (5.6) and Equation (5.5).

Remark 5.1.1. For the distance d(·,·) given by (5.12), the adversary model with norm-bounded perturbations corresponds to the distribution shifting adversary model with the p = ∞ Wasserstein distance.

5.1.1 Linear regression

We consider the class of linear models fit to data with the quadratic loss ℓ(z; θ) = (y − x^T θ)². Our first result is a closed-form representation of the Wasserstein adversarial risk (5.4) in this case.

Proposition 5.1.2. Consider the quadratic loss ℓ(z; θ) = (y − x^T θ)² and the distribution perturbing adversary with U_ε(P_Z) given by (5.7), with p = 2 and the metric d given by (5.12). In this case the adversarial risk AR(θ) admits the following form:

  AR(θ) = ( √(E_{P_Z}[(y − x^T θ)²]) + ε ‖θ‖_{ℓ_2} )².   (5.13)

To prove Proposition 5.1.2 we exploit the dual problem (5.8). We refer to Section D.1.1 for the proof of Proposition 5.1.2.

Pareto optimal curve.
For the linear regression setting, note that the standard risk SR(θ) and the adversarial risk AR(θ) are convex functions of θ. (The latter is convex since E_Q[(y − x^T θ)²] is convex for any distribution Q, and maximization preserves convexity.) Therefore, we can find (almost) all Pareto optimal points by minimizing a weighted combination of the two risk measures as the weight λ varies:

  θ_λ := argmin_θ λ SR(θ) + AR(θ).   (5.14)

The Pareto optimal curve is then given by {(SR(θ_λ), AR(θ_λ)) : λ ≥ 0}.

Theorem 5.1.3. Consider the setting of Proposition 5.1.2 with v := E[yx], σ_y² := E[y²], and Σ := E[xx^T]. Then the solution θ_λ of optimization (5.14) is given either by (i) θ_λ = 0 or (ii) θ_λ = (Σ + γ*I)^{−1} v, with γ* the fixed point of the following two equations:

  γ = (ε² + εA) / (1 + λ + ε/A),   (5.15)
  A = ( σ_y² + ‖Σ^{1/2} (Σ + γI)^{−1} v‖²_{ℓ_2} − 2 v^T (Σ + γI)^{−1} v )^{1/2} / ‖(Σ + γI)^{−1} v‖_{ℓ_2}.   (5.16)

In case (i) we have SR(θ_λ) = AR(θ_λ) = σ_y². In case (ii) we have

  SR(θ_λ) = A*² ‖(Σ + γ*I)^{−1} v‖²_{ℓ_2},   AR(θ_λ) = (A* + ε)² ‖(Σ + γ*I)^{−1} v‖²_{ℓ_2},   (5.17)

where A* is given by (5.16) with γ = γ*. The proof of Theorem 5.1.3 is given in Section D.1.2.
Corollary 5.1.4. Suppose that data is generated according to the linear model y = x^T θ_0 + w, with w ∼ N(0, σ²) and isotropic features satisfying E[xx^T] = I_d. Then the solution θ_λ of optimization (5.14) is given either by (i) θ_λ = 0 or (ii) θ_λ = (1 + γ*)^{−1} θ_0, where γ* is the fixed point of the following two equations:

  γ = (ε² + εA) / (1 + λ + ε/A),   (5.18)
  A = ( γ² + (1 + γ)² σ² / ‖θ_0‖²_{ℓ_2} )^{1/2}.   (5.19)

In case (i) we have SR(θ_λ) = AR(θ_λ) = σ² + ‖θ_0‖²_{ℓ_2}. In case (ii) we have

  SR(θ_λ) = A*² (1 + γ*)^{−2} ‖θ_0‖²_{ℓ_2},   (5.20)
  AR(θ_λ) = (A* + ε)² (1 + γ*)^{−2} ‖θ_0‖²_{ℓ_2},   (5.21)

where A* is given by (5.19) with γ = γ*. The proof of Corollary 5.1.4 is provided in Section D.1.3.

Figure 5.1 shows the effect of various parameters on the Pareto optimal tradeoff between adversarial (AR) and standard (SR) risks in the linear regression setting. We consider data generated according to the linear model y = x^T θ_0 + w, with w ∼ N(0, 1) and features x_i sampled i.i.d. from N(0, Σ), where Σ_{i,j} = ρ^{|i−j|}. Figure 5.1a demonstrates the role of the feature dimension d on the Pareto optimal curve for the setting with ρ = 0 (identity covariance matrix), adversary's power ε = 1, and the entries of θ_0 generated independently from N(0, 1/40). Note that by Corollary 5.1.4, in the case of isotropic features, the standard and adversarial risks depend on θ_0 only through its ℓ_2 norm. The variation in the Pareto curves here is due to variation in ‖θ_0‖_{ℓ_2} as d changes.

[Figure 5.1: The effect of feature dimension (d), dependency across features (ρ), and adversary's power (ε) on the Pareto optimal tradeoff between adversarial (AR) and standard (SR) risks in the linear regression setting. Panels: (a) Pareto optimal curves for several feature dimensions d, with ρ = 0 and ε = 1; (b) Pareto optimal curves for several feature dependency values ρ, with d = 10 and ε = 1; (c) Pareto optimal curves for several values of the adversary's power ε, with ρ = 0 and d = 10.]

Figure 5.1b investigates the role of dependency across features (ρ) in the optimal tradeoff between standard and adversarial risks. In this setting d = 10, ε = 1, and θ_0 ∼ (1/√d) N(0, I). As we see, all the curves start from the same point. This can be easily verified by the result of Theorem 5.1.3: for the linear data model y = x^T θ_0 + w, we have v = Σθ_0, and at λ = ∞ the Pareto-optimal estimator is the minimizer of the standard risk, i.e., θ_{λ=∞} = θ_0. Also, by (5.15) we have γ* = 0, and by (5.16) we obtain A = σ/‖θ_0‖_{ℓ_2}. Plugging these values into (5.17), we get SR(θ_∞) = σ² and AR(θ_∞) = (σ + ε‖θ_0‖_{ℓ_2})². Therefore both metrics become independent of ρ at λ = ∞.
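The fixed-point characterization in Corollary 5.1.4 can be solved by simple iteration. The sketch below (with illustrative values for ε, λ, σ, and ‖θ_0‖) iterates (5.18)–(5.19), then cross-checks (5.20)–(5.21) against the closed form of Proposition 5.1.2 evaluated at θ_λ = θ_0/(1 + γ*).

```python
import numpy as np

eps, lam, sigma, norm_theta0 = 1.0, 1.0, 1.0, 1.0   # illustrative values

def A_of(gamma):
    """A(gamma) from (5.19)."""
    return np.sqrt(gamma**2 + (1 + gamma) ** 2 * sigma**2 / norm_theta0**2)

# Fixed-point iteration for (5.18): gamma = (eps^2 + eps*A) / (1 + lam + eps/A)
gamma = 0.0
for _ in range(5000):
    A = A_of(gamma)
    gamma = (eps**2 + eps * A) / (1 + lam + eps / A)
A = A_of(gamma)

SR = A**2 * norm_theta0**2 / (1 + gamma) ** 2          # (5.20)
AR = (A + eps) ** 2 * norm_theta0**2 / (1 + gamma) ** 2  # (5.21)

# Cross-check with Proposition 5.1.2: for theta = theta0/(1+gamma) and
# identity covariance, E[(y - x^T theta)^2] = sigma^2 + (gamma/(1+gamma))^2 ||theta0||^2
theta_norm = norm_theta0 / (1 + gamma)
SR_direct = sigma**2 + (gamma / (1 + gamma)) ** 2 * norm_theta0**2
AR_direct = (np.sqrt(SR_direct) + eps * theta_norm) ** 2
print(abs(SR - SR_direct) < 1e-6, abs(AR - AR_direct) < 1e-6, AR >= SR)
```

The agreement between (5.20)–(5.21) and (5.13) is an algebraic identity, so it holds for any value of γ on the iteration path; the iteration itself is only needed to land on the γ* that corresponds to the chosen weight λ.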
Also, looking at the right-most point of the Pareto curves, corresponding to λ = 0, we see that as ρ increases from small to moderate values this point moves upward and to the right, indicating that both the standard and adversarial risks increase; but beyond some value of ρ we start to see the reverse behavior, where the standard and adversarial risks begin to decrease.

Finally, in Figure 5.1c we observe the effect of the adversary's budget ε on the Pareto optimal curve. Here, d = 10, ρ = 0, and θ_0 ∼ (1/√d) N(0, I). Clearly, as ε grows there is a wider range of Pareto-optimal estimators, and the two measures of risk deviate further from each other. When ε becomes smaller, the standard and adversarial risks get closer to each other, and so the Pareto-optimal curve becomes shorter.
Then, for a linear classifier x7→sgn(x T θ ), the standard risk is given by SR(θ )=Φ( − a θ ), 152 whereΦ( z) = 1 √ 2π R z −∞ e − t 2 2 dt denotes the c.d.f of a standard Gaussian distribution. In addition, define the functionF(θ,γ ):R d+1 7→R ≥ 0 as follows: F(θ,γ )= γ b θ ε 2 +Φ r 2 γ − a θ + γ 2 a θ + r 2 γ φ a θ − r 2 γ − a θ φ(a θ ) +(a 2 θ +1) h Φ a θ − r 2 γ − Φ( a θ ) i , (5.23) withb θ := ∥Σ 1/2 θ ∥ 2 ℓ 2 ∥θ ∥ 2 ℓq ,ℓ q denoting the dual norm ofℓ r (i.e., 1 r + 1 q = 1) andφ(t) := 1 √ 2π e − t 2 2 standing for the p.d.f of a standard Gaussian distribution. Then, the Wasserstein adversarial risk withp = 2 and metric d(·,·) given by (5.22) is characterized as AR(θ )= inf γ ≥ 0 F(θ,γ ). (5.24) Note that as an implication of Proposition 5.1.5, the standard riskSR(θ ) and the adversarial riskAR(θ ) depend on the estimatorθ only through the componentsa θ andb θ . We next characterize the Pareto optimal front for the region{(SR(θ ),AR(θ )):θ ∈R d }. Since the 0-1 lossI(yx T θ ≤ 0) is convex inθ , both the standard risk and the adversarial risks are convex functions ofθ (by a similar argument given prior to Theorem 5.1.3.) Theorem5.1.6. Assume the setting of Proposition 5.1.5 and consider the following minimization problem (θ λ ∗ ,γ λ ∗ ):=arg min γ ≥ 0,θ λ Φ( − a θ )+F(θ,γ ). (5.25) 153 0.02 0.04 0.06 0.08 0.1 0.12 0.14 Standard risk 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Adversarial risk (a) Pareto optimal curve for severalℓr norms on feature space withd=10, ε=0.5 andρ =0. 0 0.05 0.1 0.15 Standard risk 0.45 0.5 0.55 0.6 0.65 0.7 Adversarial risk (b) Pareto optimal curve for several feature dependency values (ρ ) with d=10,ε=0.3, andr =∞. 0.02 0.04 0.06 0.08 0.1 0.12 0.14 Standard risk 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Adversarial risk (c) Pareto optimal curve for several adversary’s budget ε with d = 10, r =∞, andρ =0. 
Figure 5.2: The effect of the ℓ_r norm on the feature space, the dependency across features (ρ), and the adversary's power ε on the Pareto optimal tradeoff between adversarial and standard risks in binary classification under the Gaussian mixture model.

The Pareto optimal curve is given by {(Φ(−a_{θ_λ*}), F(θ_λ*, γ_λ*)) : λ ≥ 0}. Theorem 5.1.6 follows from the fact that the Pareto front of a convex set is characterized by the intersection points of the set with its supporting hyperplanes.

Remark 5.1.7. For r = q = 2 and Σ = I, we have b_θ = 1. In this case the objective of (5.25) is decreasing in a_θ, and since |a_θ| ≤ ‖µ‖_{ℓ_2}, it is minimized at a_θ = ‖µ‖_{ℓ_2}. In addition, SR(θ) is decreasing in a_θ and is minimized at the same value a_θ = ‖µ‖_{ℓ_2}. Therefore, introducing c := ‖µ‖_{ℓ_2}, the Pareto-optimal curve shrinks to a single point given by

  SR = Φ(−c),   (5.26)
  AR = inf_{γ≥0} { γε² + Φ(√(2/γ) − c) + (γ/2) [ (c + √(2/γ)) φ(c − √(2/γ)) − c φ(c) + (c² + 1)(Φ(c − √(2/γ)) − Φ(c)) ] }.

In other words, the tradeoff between standard and adversarial risks achieved by linear classifiers vanishes in this case, and estimators in the direction of the class average µ are optimal with respect to both the standard risk and the Wasserstein adversarial risk. We refer to Section D.1.5 for the proof of Remark 5.1.7.

Figure 5.2 showcases the effect of different factors on the Pareto-optimal tradeoff between standard and adversarial risks in the binary classification setting. Here the features x are drawn from N(yµ, Σ), with Σ_{ij} = ρ^{|i−j|}. The class average µ has i.i.d. entries from N(0, 1/d), with d = 10. In Figure 5.2a, we investigate the role of the norm r used in the Wasserstein adversary model, cf. Equation (5.22). As discussed in Remark 5.1.7, when r = 2 the tradeoff between standard and adversarial risks vanishes, and estimators in the direction of the class average µ are optimal with respect to both the standard risk and the Wasserstein adversarial risk.
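The single Pareto point of Remark 5.1.7 can be evaluated numerically by minimizing the one-dimensional dual objective over γ. The sketch below uses illustrative values c = 1 and ε = 0.5, and checks the basic ordering SR ≤ AR ≤ 1.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

c, eps = 1.0, 0.5   # illustrative: c = ||mu||_2 and adversary's power

def F(gamma):
    """Dual objective in Remark 5.1.7 (r = q = 2, Sigma = I)."""
    s = np.sqrt(2.0 / gamma)
    return (gamma * eps**2
            + norm.cdf(s - c)
            + 0.5 * gamma * ((c + s) * norm.pdf(c - s)
                             - c * norm.pdf(c)
                             + (c**2 + 1) * (norm.cdf(c - s) - norm.cdf(c))))

SR = norm.cdf(-c)                                    # (5.26)
AR = minimize_scalar(F, bounds=(1e-6, 50.0), method="bounded").fun
print(SR, AR)                                        # SR <= AR <= 1
```

As γ → 0 the objective tends to 1 (the trivial bound on a 0-1 risk), so the infimum is attained at an interior γ whenever the adversary is nontrivial.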
Figure 5.2b illustrates the effect of dependency among features (ρ) on the optimal tradeoff between standard and adversarial risks. In this setting r = ∞ and ε = 0.3. From the result of Theorem 5.1.6, we see that these risks depend heavily on the interaction between the class average µ and the feature covariance Σ, and so, for fixed µ, the curves shift in a highly nontrivial way depending on the value of ρ. The role of the adversary's budget ε is depicted in Figure 5.2c, in which r = ∞ and ρ = 0. Similar to the linear regression setting, when ε is small the two measures of risk are close to each other and we have a small range of Pareto-optimal models. As ε grows, the standard and adversarial risks differ significantly and we get a wide range of Pareto-optimal models.

5.1.3 Learning nonlinear functions

We next investigate the tradeoff between standard and adversarial risks for the problem of learning an unknown function over the d-dimensional sphere S^{d−1}. More precisely, we consider the following data generative model:

  y = f_d(x) + w,   (5.27)

with x ∼ Unif(S^{d−1}(√d)), the d-dimensional sphere of radius √d, and w ∼ N(0, σ²) independent of x. We consider fitting a random features model to data generated according to (5.27). The class of random features models is given by

  F_RF(θ, U) = { f(x; θ, U) := Σ_{i=1}^N θ_i σ(u_i^T x), with θ_i ∈ ℝ, i = 1, ..., N },   (5.28)

where U ∈ ℝ^{N×d} is a matrix whose i-th row is the vector u_i, drawn uniformly from S^{d−1}(1), independently of the data. The random features model can be equivalently represented by a two-layer neural network with the first-layer weights U chosen randomly and θ = (θ_i)_{1≤i≤N} corresponding to the second-layer weights. The random features model was introduced by [147] for scaling kernel methods to large datasets. There is indeed a substantial literature drawing connections between random features models, kernel methods, and fully trained neural networks [39, 38, 80, 113].
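The random features predictor (5.28) is simple to implement directly; the following sketch uses a ReLU activation as an illustrative choice (the analysis above does not fix σ), with first-layer rows drawn uniformly from the unit sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 200                              # feature dim and width (illustrative)

# Rows u_i uniform on the unit sphere S^{d-1}(1): normalize Gaussian vectors
U = rng.normal(size=(N, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
theta = rng.normal(size=N) / np.sqrt(N)     # second-layer weights (illustrative)

def f_rf(x, sigma=lambda t: np.maximum(t, 0.0)):
    """Random features predictor (5.28): f(x) = sum_i theta_i sigma(u_i^T x)."""
    return theta @ sigma(U @ x)

# Draw x uniform on the sphere of radius sqrt(d), as in the data model (5.27)
x = rng.normal(size=d)
x *= np.sqrt(d) / np.linalg.norm(x)
print(float(f_rf(x)))
```

Equivalently, this is a two-layer network whose first layer U is frozen at initialization and only θ is trained, which is what makes the risk objectives in θ convex.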
In [123], the generalization error (standard risk) of the random features model was precisely characterized for the problem of learning a function f_d(·) over S^{d−1}(√d) in the regime where the network width N, the sample size n, and the feature dimension d grow in proportion. The nonlinear model considered in [123] is of the form

  f_d(x) = β_{d,0} + x^T β_{d,1} + f_d^NL(x),   (5.29)

where the nonlinear component f_d^NL(x) is a centered isotropic Gaussian process indexed by x. We follow the same model and consider the following random quadratic function:

  f_d(x) = β_{d,0} + x^T β_{d,1} + (F*/d) [x^T G x − Trace(G)],   (5.30)

for some fixed F* ∈ ℝ and G ∈ ℝ^{d×d} a random matrix with i.i.d. entries from N(0, 1). Our goal is to study the Pareto-optimal tradeoff between standard and adversarial risks for this learning setting, achieved by the class of random features models (5.28). The standard risk in this setting is given by

  SR(θ) = E_{x,y}[(y − θ^T σ(Ux))²] = E_x[(f_d(x) − θ^T σ(Ux))²] + σ².   (5.31)

For the Wasserstein adversarial risk we use the following corollary, which is obtained by specializing Proposition 5.0.3 to the random features model.

Corollary 5.1.8. Consider the class of random features models given by (5.28). In this case, the 2-Wasserstein adversarial risk with distance d(·,·) as in (5.12) admits the following first-order approximation:

  AR(θ) = SR(θ) + 2ε ( E_x[ ((f_d(x) − θ^T σ(Ux))² + σ²) ‖U^T diag(σ′(Ux)) θ‖²_{ℓ_2} ] )^{1/2} + O(ε²),   (5.32)

with σ′(·) denoting the derivative of the activation σ(·) and SR(θ) given by (5.31). The proof of Corollary 5.1.8 is given in Appendix D.1.6.

The standard risk is quadratic, and hence convex, in θ. The adversarial risk is also convex in θ (this follows from the fact that pointwise maximization preserves convexity). Therefore, for small values of ε (a weak adversary), the first-order approximation of AR(θ) is also convex in θ. As such, (almost) all Pareto optimal points are given by minimizing a weighted combination of the two risk measures as the weight λ varies in [0, ∞).
Namely,

  θ_λ := argmin_θ λ SR(θ) + AR(θ),   (5.33)

with SR(θ) given by (5.31) and AR(θ) given by (5.32). We use the above characterization to derive the Pareto-optimal tradeoff curves between standard and adversarial risks for learning the function f_d(x) given by (5.30), with F* = 1, β_{d,0} = 0, and β_{d,1} ∈ ℝ^d with i.i.d. entries ∼ N(0, 1/d). The data are generated according to (5.27) with σ = 2, d = 10, and N ∈ {250, 500, 750, 1000}. To compute θ_λ we use the empirical loss with n = 500K samples of x ∼ Unif(S^{d−1}(√d)). For each value of λ and N we generate 15 realizations of the weights U and compute θ_λ for each realization using gradient descent on the loss function (5.33).

[Figure 5.3: Pareto-optimal tradeoff curves for learning random quadratic functions using the random features model. Data is generated according to (5.27) with σ = 2 and f_d(x) given by (5.30). Here, d = 10, and N ∈ {250, 500, 750, 1000} is the number of random features (width of the neural network).]

The Pareto optimal points {(SR(θ_λ), AR(θ_λ)) : λ ≥ 0} are plotted in Figure 5.3 for ε = 0.2. As we see, for each value of N the tradeoff curves concentrate as N grows, implying that the estimator θ_λ becomes independent of the specific realization of the weights U. We also observe that the tradeoff between standard and adversarial risks persists even for large values of N. Interestingly, as the network width N grows, both the standard risk and the adversarial risk decrease, but the tradeoff between them clearly remains (the length of the Pareto front does not shrink).

Chapter 6: Adversarial Robustness for Latent Models

In the binary classification problem, we are given a set of labeled data points {(x_i, y_i)}_{i=1:n} which are drawn i.i.d. from a common law P, where x ∈ X ⊂ ℝ^d is the feature vector and y ∈ {+1, −1} is the label associated with the feature x.
The goal is to predict the label of a new test data point with a feature vector drawn from the same population. To this end, the learner tries to fit a binary classification model to the training set, which results in an estimated model ĥ : X → {−1, +1}. The conventional metric to measure the accuracy of a classifier h is its average error probability on an unseen data point (x, y) ∼ P. This is often referred to as the standard risk of the classifier, a.k.a. the generalization error. Concretely, the standard risk of a classifier h is defined as

  SR(h) := P(h(x) y ≤ 0).   (6.1)

Despite the remarkable success in deriving classifiers with high accuracy (low standard risk) during the past decades, it has been observed that even state-of-the-art classifiers are vulnerable to minute but adversarially chosen perturbations of test data points.

The adversarial setting is often formulated as a game between the learner and the adversary. Given access to unperturbed training data, the learner fits a model h : X → {−1, +1}. After observing the model h and each test data point (x, y) generated from the distribution P, the adversary perturbs the data point arbitrarily, as long as it stays within its budget. A common and widely used adversarial model is that of norm-bounded perturbations, where for each test data point (x, y) the adversary chooses an arbitrary perturbation δ from the ℓ_p ball of radius ε_p and replaces x by x + δ. Here, ε_p is a parameter of the setting which quantifies the adversary's power.* The adversarial risk of the classifier h is defined as

  AR(h) = E_{(x,y)∼P}[ sup_{‖δ‖_{ℓ_p} ≤ ε_p} ℓ(h(x + δ), y) ],   (6.2)

for some loss function ℓ. For the 0-1 loss ℓ(s, t) = I(st ≤ 0), this measure amounts to

  AR(h) = P( inf_{‖x′−x‖_{ℓ_p} ≤ ε_p} h(x′) y ≤ 0 ).   (6.3)

Remark 6.0.1. A couple of points are worth noting regarding the adversarial setting:

• The adversary chooses the perturbation "after" observing the test data point. The perturbation δ can in general depend on x, i.e.,
different data points can be perturbed differently. Therefore, in the definition (6.2), the supremum is taken inside the expectation.

• In the above setting, the perturbations are added at test time, while the learner is given access to unperturbed training data. Other adversarial setups are also studied in the literature; see e.g. [], where an attacker can observe and modify all training data samples adversarially so as to maximize the estimation error caused by his attack.

• Another popular adversarial model is the so-called distribution shift. In this model, in contrast to the norm-bounded perturbations discussed above, the adversary can shift the test data distribution. The adversary's power is measured in terms of the Wasserstein distance between the test and the training distributions; see [168, 48, 144, 122] for a non-exhaustive list of references. That said, our focus in this chapter is on norm-bounded perturbations.

* We will drop the index p and write ε for the adversary's power when it is clear from the context.

From the definitions of standard risk and adversarial risk given by (6.1) and (6.3), it can be seen that the adversarial risk is always at least as large as the standard risk. We refer to the nonnegative difference between the adversarial risk and the standard risk as the boundary risk, given by

  BR(h) := AR(h) − SR(h)
         = P( h(x) y ≥ 0,  inf_{‖x′−x‖_{ℓ_p} ≤ ε_p} h(x) h(x′) ≤ 0 ).   (6.4)

The boundary risk can be viewed as the average vulnerability of the classifier to small perturbations of successfully labeled data points. In other words, it measures the likelihood that the classifier correctly labels a data point but fails to label another test input very close to it. In the main results section, we study the boundary risk of optimal classifiers (those with the lowest standard risk) in scenarios where the feature vectors lie on a low-dimensional manifold. We next discuss the data generative models.
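For a linear classifier h(x) = sign(θ^T x) under ℓ_2 norm-bounded perturbations, the three risks (6.1), (6.3), and (6.4) have a simple form: a point is adversarially flipped exactly when its signed margin y θ^T x / ‖θ‖_{ℓ_2} is at most ε. The Monte Carlo sketch below illustrates this under an assumed Gaussian mixture data model (all parameter values illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 200_000, 5, 0.5                  # illustrative sizes and budget
mu = np.ones(d) / np.sqrt(d)                 # class mean with ||mu|| = 1
theta = mu                                   # linear classifier sign(theta^T x)

y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + rng.normal(size=(n, d))   # Gaussian mixture features

# Signed margin; under l2 perturbations of size eps, a correctly labeled point
# near the boundary (margin in (0, eps]) can be flipped by the adversary.
margin = y * (x @ theta) / np.linalg.norm(theta)
SR = np.mean(margin <= 0)                    # standard risk (6.1)
AR = np.mean(margin <= eps)                  # adversarial risk (6.3)
BR = AR - SR                                 # boundary risk (6.4)
print(SR, AR, BR)
```

Here the margin is distributed as N(1, 1), so SR ≈ Φ(−1) ≈ 0.159 and AR ≈ Φ(ε − 1) ≈ 0.309, with the gap BR being exactly the probability mass within distance ε of the decision boundary on the correct side.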
6.0.1 Data generative model

Latent low-dimensional manifold models. We focus on the binary classification problem with high-dimensional features generated from a low-dimensional latent manifold. Specifically, we assume that for the feature vector x ∈ ℝ^d and the binary label y ∈ {+1, −1}, there exists an inherent low-dimensional link z ∈ ℝ^k such that x ⫫ y | z. This structure can be perceived as a transformed binary classification model, where the low-dimensional features z ∈ ℝ^k of a hidden classification problem with labels y ∈ {+1, −1} are embedded in a high-dimensional space by a mapping G : ℝ^k → ℝ^d. The learner observes the embedded high-dimensional features x_i = G(z_i) and the primary binary labels y_i, while being oblivious to the low-dimensional latent vectors z_i. Throughout this chapter, we consider a special case of this model where G(z) = φ(Wz), with W ∈ ℝ^{d×k} a tall full-rank weight matrix and φ acting entry-wise on vector inputs with derivative dφ/dt ≥ c for some positive constant c > 0.

Classification settings. The focus of this chapter is on two widely used binary classification settings: (i) Gaussian mixture models and (ii) generalized linear models, which we briefly explain below.

• Gaussian mixture models. In the Gaussian mixture model, the binary response y takes the positive label with probability π and the negative label with probability 1 − π. In this setting, labels are assigned independently of the feature vector z, while the feature vectors are generated from a multivariate Gaussian distribution with mean vector yµ and a certain covariance matrix. Concretely, the data generating law for the Gaussian mixture problem with features coming from a low-dimensional manifold can be written as follows:

  y ∼ Bern(π, {+1, −1}),   x = φ(Wz),   z ∼ N(yµ, I_k).   (6.5)

In this model, we consider low-dimensional isotropic Gaussian features; in other words, the manifold features z are drawn from a Gaussian distribution with identity covariance matrix.
• Generalized linear models. In binary classification under a generalized linear model, there is an increasing function f : ℝ → [0, 1], a.k.a. the link function, along with a linear predictor β ∈ ℝ^k, where the score f(z^T β) denotes the likelihood of the feature vector z taking the positive label. Formally, the data generating law for this classification problem under the low-dimensional manifold model can be written as follows:

  y = +1 w.p. f(z^T β),   y = −1 w.p. 1 − f(z^T β),   x = φ(Wz),   z ∼ N(0, I_k).   (6.6)

Popular choices of the link function f are the logistic model f(t) = 1/(1 + exp(−t)) and the probit model f(t) = Φ(t), with Φ(t) the standard normal cumulative distribution function.

6.0.2 Background on optimal classifiers

For each classification setup described in the previous section, we want to identify the classifiers that are optimal with respect to the standard risk. To this end, we provide a summary of the Bayes-optimal classifiers. For a data point (x, y) ∼ P, consider the conditional distribution function η(x) := P(y = +1 | X = x). This function can be perceived as the likelihood of assigning the positive label to a data point with feature vector x. The Bayes-optimal classifier simply assigns the label y = +1 to the feature vector x if this feature has a higher likelihood of taking the label +1 than −1; in other words, h_Bayes(x) = sign(η(x) − 1/2). We formalize this in the next proposition, which is a standard result (see e.g. [45, Theorem 2.1]).

Proposition 6.0.2 ([45, Theorem 2.1]). Among all classifiers h : ℝ^d → {+1, −1} such that h is a Borel function, the Bayes-optimal classifier h_Bayes(x) = sign(η(x) − 1/2) has the lowest standard risk.

The next corollary uses Proposition 6.0.2 to characterize the Bayes-optimal classifier under each of the binary classification settings described earlier in Section 6.0.1.

Corollary 6.0.3.
Under the Gaussian mixture model (6.5), the Bayes-optimal classifier can be written as

  h*(x) = sign( φ^{−1}(x)^T (WW^T)^† W µ − q/2 ),

with q = log((1 − π)/π). Moreover, under the generalized linear model (6.6), the Bayes-optimal classifier is given by

  h*(x) = sign( f(β^T (W^T W)^{−1} W^T φ^{−1}(x)) − 1/2 ).

It is worth noting that in the manifold latent model described in Section 6.0.1, the weight matrix W is tall and full-rank and φ is strictly increasing; hence both W^T W and φ are invertible.

6.1 Main results

We focus on the binary classification settings described in Section 6.0.1. In each setting, we characterize the asymptotic behavior of the boundary risk of the Bayes-optimal classifier as the ambient dimension d grows to infinity. We aim to study the role of the low-dimensional latent structure of the data in obtaining a vanishing boundary risk for the Bayes-optimal classifiers. In that case, the Bayes-optimal classifiers are optimal with respect to both standard accuracy and robust accuracy.

6.1.1 Gaussian mixture model

Consider the Gaussian mixture model with features lying on a low-dimensional manifold, cf. (6.5). Recall that the learner only observes the ambient d-dimensional features x and is oblivious to the original k-dimensional manifold features z. The next result states that under this setup, the boundary risk of the Bayes-optimal classifier converges to zero when the minimum nonzero singular value of the weight matrix W grows at a sufficient rate, which depends on the adversary's power ε_p and the choice of perturbation norm ℓ_p.

Theorem 6.1.1. Consider the binary classification problem under the Gaussian mixture model (6.5) in the presence of an adversary with ℓ_p-norm bounded power ε_p, for p ≥ 2. Letting the ambient dimension d grow to infinity, under the condition that the weight matrix W satisfies

  ε_p d^{1/2 − 1/p} / σ_min(W) = o_d(1),   (6.7)

the boundary risk of the Bayes-optimal classifier converges to zero.
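Condition (6.7) is easy to probe numerically. The sketch below (illustrative: ℓ_2 adversary, latent dimension k = 10, W with i.i.d. N(0, 1/k) entries) computes the ratio ε_p d^{1/2−1/p} / σ_min(W) for growing ambient dimension d, showing it shrinks as d/k grows, in line with the example discussed next.

```python
import numpy as np

rng = np.random.default_rng(3)
eps, p, k = 1.0, 2, 10                      # illustrative: l2 adversary, k = 10

def ratio(d):
    """eps * d^(1/2 - 1/p) / sigma_min(W) for W with iid N(0, 1/k) entries."""
    W = rng.normal(size=(d, k)) / np.sqrt(k)
    sigma_min = np.linalg.svd(W, compute_uv=False).min()
    return eps * d ** (0.5 - 1.0 / p) / sigma_min

vals = [ratio(d) for d in (20, 200, 2000)]
print(vals)                                 # decreasing as d/k grows
```

For p = 2 the factor d^{1/2−1/p} is 1, and σ_min(W) concentrates around √(d/k)(1 − √(k/d)) for this Gaussian ensemble, so the ratio behaves like ε √(k/d) and condition (6.7) holds whenever ε = O(1) and d/k → ∞.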
The proof of Theorem 6.1.1 is given in Section E.1.2. We proceed by discussing condition (6.7). As ε_p gets larger, the condition becomes stricter, which is expected: a larger ε_p indicates a stronger adversary, which makes the boundary risk larger. In addition, σ_min(W) measures, in a sense, the extent of low-dimensional structure in the data; a small σ_min(W) indicates that there are directions in the low-dimensional space along which the signal energy is not scaled up sufficiently when transformed into the ambient space. The adversary can therefore perturb the feature x along those directions, where the signal is weak. In particular, if the means of the mixture components are close to the space of small singular values of W, they are collapsed in the embedding from the latent space to the ambient space, making the Bayes-optimal classifier less robust. Finally, since ‖δ‖_{ℓp} ≤ ‖δ‖_{ℓ2} for p ≥ 2, an adversary with power ε in ℓ_p norm is stronger than an adversary with power ε in ℓ_2 norm. This is consistent with the fact that d^{1/2 − 1/p} is increasing in p, so the condition becomes stronger for larger p.

Example. Consider the case of φ(·) being the identity function and p = 2. We observe that for a feature x with label y, x_i ∼ N(y w_i^T µ, ‖w_i‖²_{ℓ2}). To be definite, we fix ‖w_i‖_{ℓ2} = 1, which implies in particular ‖W‖²_F = d. To simplify further, we assume all the nonzero singular values of W to be equal, whence W^T W = (d/k) I. In this case, condition (6.7) reduces to ε_2 = o(√(d/k)). In particular, if ε_2 = O(1) and the dimension ratio d/k → ∞, the boundary risk converges to zero.

Figure 6.1 validates the result of Theorem 6.1.1 under the Gaussian mixture model (6.5) with π = 1/2 and µ ∼ N(0, I_k/k), in the presence of ℓ_2-bounded adversarial attacks of power ε_2. In this example, we fix the ambient feature dimension d = 300 and vary the dimension ratio d/k from 1 to 300. Further, we consider the identity function φ(t) = t, and let the weight matrix W have independent Gaussian entries N(0, 1/k). Figure 6.1a shows the effect of the dimension ratio d/k on the standard risk, adversarial risk, and boundary risk of the Bayes-optimal classifier. For each fixed pair (k, d), we generate M = 100 independent realizations and compute the risks. The shaded area around each curve denotes one standard deviation (computed over the M realizations) above and below the average curve. As can be seen, the boundary risk eventually converges to zero. Finally, in Figure 6.1b, we consider several values of the adversary's power, and observe that for every power ε_2 the boundary risk decays to zero as the dimension ratio d/k grows. The standard risk and the robust risk in Figure 6.1a are calculated using the following proposition.

Figure 6.1: Effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the Bayes-optimal classifier with ℓ_2 perturbations under the Gaussian mixture model (6.5), where features lie on a low-dimensional manifold. Panel (a): standard and adversarial risks for adversary's power ε_2 = 1. Panel (b): boundary risk for several values of the adversary's power ε_2. Solid curves represent average values, and the shaded area around each curve represents one standard deviation above and below the computed average over the M = 100 realizations.

Proposition 6.1.2. Consider the Gaussian mixture classification setup (6.5) with the identity mapping φ(t) = t and balanced classes (i.e. π = 1/2). For a linear classifier h(x) = sign(x^T a) under the ℓ_p-bounded adversarial setup, the adversarial and standard risks are given by

AR(h) = Φ( (ε_p ‖a‖_q − a^T Wµ) / ‖W^T a‖_2 ),   SR(h) = Φ( −a^T Wµ / ‖W^T a‖_2 ),

where ‖·‖_q is the dual norm of ‖·‖_p, i.e. 1/p + 1/q = 1.

We refer to Section E.1.4 for the proof of Proposition 6.1.2. In Theorem 6.1.1 we showed that when features lie on a low-dimensional manifold, the Bayes-optimal classifier is also optimal with respect to the adversarial risk. In other words, the adversarial risk is always at least as large as the standard risk for any classifier, and the gap between the "minima" of these two risks converges to zero. This result raises the following natural question:

"Does the boundary risk of any classifier vanish under the low-dimensional latent structure?"

In the next proposition, we provide a simple example showing that such behavior (vanishing boundary risk) does not necessarily occur for all classifiers.

Proposition 6.1.3. Consider the Gaussian mixture model (6.5) with µ having i.i.d. N(0, 1/k) entries and class probability π = 1/2, in the presence of an adversary with ℓ_p-bounded perturbations of size ε_p. In addition, suppose that the rows of the weight matrix W are sampled from the k-dimensional unit sphere, and let φ be the identity function. Then the boundary risk of the classifier h(x) = sign(e_1^T x), with e_1 = (1, 0, 0, ..., 0), is lower bounded by a constant c_{ε_p} that depends only on ε_p (independent of the dimensions k, d) and is strictly positive for positive values of ε_p.

We refer to Section E.1 for the proof of this proposition.

6.1.2 Binary classification under generalized linear models

Consider a binary classification problem under a generalized linear model with features enjoying a low-dimensional latent structure, cf. (6.6).
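As an aside, the closed-form risks of Proposition 6.1.2 above translate directly into code. The sketch below evaluates them for a hypothetical Gaussian-mixture instance (illustrative dimensions, identity φ, ℓ_2 attacks so the dual norm is again ℓ_2); the Bayes-optimal direction is taken from Corollary 6.0.3.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
d, k, eps = 50, 5, 1.0                      # illustrative dimensions and attack budget
W = rng.normal(0, 1 / np.sqrt(k), (d, k))
mu = rng.normal(0, 1 / np.sqrt(k), k)

def risks(a, eps_p, p=2.0):
    """Standard and adversarial risks of h(x)=sign(x^T a), per Proposition 6.1.2."""
    q = p / (p - 1)                         # dual exponent, 1/p + 1/q = 1
    signal = a @ W @ mu
    noise = np.linalg.norm(W.T @ a, 2)
    sr = norm.cdf(-signal / noise)
    ar = norm.cdf((eps_p * np.linalg.norm(a, q) - signal) / noise)
    return sr, ar

# Bayes-optimal direction (Corollary 6.0.3, identity phi, balanced classes)
a_star = np.linalg.pinv(W @ W.T) @ W @ mu
sr, ar = risks(a_star, eps)
print(sr, ar)   # the adversarial risk is never below the standard risk
```

Since ε_p ‖a‖_q ≥ 0 and Φ is increasing, AR(h) ≥ SR(h) holds for any direction a, matching the definition of the (nonnegative) boundary risk.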
The next result states that under certain conditions on the weight matrix, the boundary risk of the Bayes-optimal classifier converges to zero as the ambient dimension grows to infinity.

Theorem 6.1.4. Consider the binary classification problem under the generalized linear model (6.6) in the presence of an adversary with ℓ_p-bounded perturbations of power ε_p for some p ≥ 2. Assume that, as the ambient dimension d grows to infinity, the weight matrix W satisfies the following condition:

ε_p d^{1/2 − 1/p} / σ_min(W) = o_d(1).   (6.8)

Then the boundary risk of the Bayes-optimal classifier converges to zero.

The proof of Theorem 6.1.4 is given in Section E.1. Note that condition (6.8) is similar to condition (6.7) for the Gaussian mixture model. Figure 6.2 validates the result of Theorem 6.1.4 for binary classification under the generalized linear model (6.6) with the identity mapping φ and ℓ_2 perturbations. In this example, the ambient dimension d is fixed at 300, and the manifold dimension k varies from 1 to 300. In addition, the linear predictor β and the weight matrix W have i.i.d. N(0, 1/k) entries. For each fixed pair (k, d), we generate M = 100 independent realizations and compute the average and standard deviation of the M obtained values. In each figure, the shaded areas are obtained by moving the average values one standard deviation up and down. Figure 6.2a shows the behavior of the standard risk, adversarial risk, and boundary risk of the Bayes-optimal classifier as the dimension ratio d/k grows. Further, Figure 6.2b exhibits similar behavior for several values of the adversary's power ε, where it can be observed that the boundary risk decays to zero.

Figure 6.2: Effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the Bayes-optimal classifier of the generalized linear model (6.6) with ℓ_2 perturbations, in which features come from a low-dimensional manifold. Panel (a): standard and adversarial risks for adversary's power ε_2 = 1. Panel (b): boundary risk for several values of the adversary's power ε_2. Solid curves represent average values, and the shaded areas represent one standard deviation above and below the corresponding curves over the M = 100 realizations.

The standard risk and the robust risk in Figure 6.2a are calculated using the following proposition.

Proposition 6.1.5. Consider the generalized linear model (6.6) with the identity mapping φ(t) = t. Under this setup, the adversarial and standard risks of the linear classifier h(x) = sign(f(θ^T x) − 1/2) are given by

AR(h) = E_{(u_1, u_2) ∼ N(0, Σ_u)} [ f(u_1) I(u_2 ≤ c + ε_p ‖θ‖_q) + (1 − f(u_1)) I(u_2 ≥ c − ε_p ‖θ‖_q) ],
SR(h) = E_{(u_1, u_2) ∼ N(0, Σ_u)} [ f(u_1) I(u_2 ≤ c) + (1 − f(u_1)) I(u_2 ≥ c) ],

where the covariance matrix is

Σ_u = [ ‖β‖²_{ℓ2}   β^T W^T θ ;  β^T W^T θ   ‖W^T θ‖²_{ℓ2} ],

c = f^{−1}(1/2), and ‖·‖_q is the dual norm of ‖·‖_p. In addition, for the Bayes-optimal classifier h*(x) = sign(f(x^T θ*) − 1/2) with θ* = W(W^T W)^{−1} β (see Corollary 6.0.3), we have

AR(h*) = E_{u ∼ N(0, ‖β‖²_{ℓ2})} [ f(u) I(u ≤ c + ε_p ‖θ*‖_q) + (1 − f(u)) I(u ≥ c − ε_p ‖θ*‖_q) ],
SR(h*) = E_{u ∼ N(0, ‖β‖²_{ℓ2})} [ f(u) I(u ≤ c) + (1 − f(u)) I(u ≥ c) ].

We refer to Section E.1.5 for the proof of this proposition.

6.2 Discussion

6.2.1 Is it necessary to learn the latent structure to obtain a vanishing boundary risk? A simple case

In the previous sections, for the two binary classification settings, we showed that when the features inherently have a low-dimensional structure, the boundary risk of the Bayes-optimal classifiers converges to zero as the ambient dimension grows to infinity.
A closer look at the Bayes-optimal classifier of each setting (see Corollary 6.0.3) reveals that these classifiers directly use knowledge of the nonlinear mapping from the low-dimensional manifold to the ambient space; in other words, they explicitly draw upon the generative components φ and W. In this section, we investigate the existence of classifiers that are agnostic to the mapping between the low-dimensional and high-dimensional spaces, yet have asymptotically vanishing boundary risk.

For this purpose, consider binary classification under the Gaussian mixture model (6.5), and assume a training set {(x_i, y_i)}_{i=1}^n sampled from (6.5). We focus on the class of linear classifiers h_θ(x) = sign(x^T θ) with θ ∈ R^d and ℓ_2 perturbations (p = 2). We consider the logistic loss ℓ(t) = log(1 + exp(−t)), and assume that the adversary's power is bounded by ε. We adopt the minimax approach of [118] to adversarially train a model θ by solving the following robust empirical risk minimization (ERM) problem:

θ̂_ε = argmin_{θ ∈ R^d} (1/n) Σ_{i=1}^n max_{u ∈ B_ε(x_i)} ℓ(y_i u^T θ).

This is a convex optimization problem, as the objective is a point-wise maximum of the convex functions ℓ(y_i u^T θ). Further, when the perturbations are from the ℓ_2 ball, the inner maximization can be solved explicitly (see e.g. [94]), which leads to the following equivalent problem:

θ̂_ε = argmin_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i x_i^T θ − ε ‖θ‖_{ℓ2}).   (6.9)

Figure 6.3 demonstrates the effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the classifier h_{θ̂_ε} for four choices of the feature mapping φ: (i) φ_1(t) = t, (ii) φ_2(t) = 3t/4 + sign(t)t/4, (iii) φ_3(t) = t + sign(t)t², and (iv) φ_4(t) = tanh(t). In this example, we consider the ambient dimension d = 100 and the number of samples n = 300. In addition, k varies from 1 to 100, and µ, W have i.i.d. entries N(0, 1/k).
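In these experiments, θ̂_ε is obtained by minimizing the penalized objective (6.9). A minimal sketch of that optimization on synthetic data is given below; the sizes are illustrative and plain subgradient descent stands in for whichever convex solver was actually used to produce the figures.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(2)
n, d, k, eps = 200, 20, 4, 1.0   # illustrative sizes and attack budget

# synthetic data from the Gaussian mixture model (6.5) with phi = identity
W = rng.normal(0, 1 / np.sqrt(k), (d, k))
mu = rng.normal(0, 1 / np.sqrt(k), k)
y = rng.choice([-1.0, 1.0], size=n)
Z = y[:, None] * mu + rng.normal(size=(n, k))   # z_i ~ N(y_i * mu, I_k)
X = Z @ W.T                                     # rows are x_i = W z_i

def robust_obj(theta):
    """Objective of (6.9): logistic loss with margins shrunk by eps * ||theta||_2."""
    m = y * (X @ theta) - eps * np.linalg.norm(theta)
    return np.mean(np.logaddexp(0.0, -m))       # stable log(1 + exp(-m))

# plain subgradient descent; any convex solver would do
theta = np.zeros(d)
for _ in range(500):
    nrm = np.linalg.norm(theta)
    m = y * (X @ theta) - eps * nrm
    s = -expit(-m)                               # d/dm of log(1 + exp(-m))
    dir_pen = theta / nrm if nrm > 0 else np.zeros(d)
    grad = (X * y[:, None]).T @ s / n - np.mean(s) * eps * dir_pen
    theta -= 0.05 * grad
```

The ε‖θ‖_2 term shrinks every margin uniformly, which is exactly the explicit solution of the inner ℓ_2 maximization; no per-sample attack needs to be computed during training.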
Further, we consider balanced classes (each label ±1 occurs with probability π = 1/2). The plots in Figure 6.3 exhibit the behavior of the standard, adversarial, and boundary risks of the classifier h_{θ̂_ε} for each of these mappings and for the adversary's power ε = 1. For each fixed pair (k, d), we run M = 20 trials of the setup. The solid curves denote the average values over these M trials, and the shaded areas are obtained by plotting one standard deviation above and below the main curves. The plots in Figure 6.4 showcase the boundary risk for different choices of ε. As we observe, the boundary risk decreases to zero when the dimension ratio d/k grows to infinity. Our next theorem proves this behavior for the special case of φ(t) = t.

Figure 6.3: Effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the linear classifier h_θ(x) = sign(x^T θ), with θ the robust empirical risk minimizer (B.26). Samples are generated from the Gaussian mixture model (6.5) with balanced classes (π = 1/2) and with the four feature mappings (a) φ(t) = t, (b) φ(t) = 3t/4 + sign(t)t/4, (c) φ(t) = t + sign(t)t², and (d) φ(t) = tanh(t), each with adversary's power ε = 1. In these experiments, the ambient dimension d is fixed at 100, the manifold dimension k varies from 1 to 100, the sample size is n = 300, and the class mean µ and the weight matrix W have i.i.d. entries from N(0, 1/k). For each fixed pair (k, d), we consider M = 20 trials of the setup. Solid curves represent the average results across these trials, and the shaded areas represent one standard deviation above and below the corresponding curves.

Figure 6.4: Effect of the dimension ratio d/k on the boundary risk of the linear classifier h_θ(x) = sign(x^T θ), with θ the robust empirical risk minimizer (B.26), in the same setup as Figure 6.3 and for several levels of the adversary's power ε ∈ {1, 2, 4}.

Theorem 6.2.1. Consider binary classification under the Gaussian mixture model (6.5) with the identity mapping φ(t) = t, in the presence of an adversary with ℓ_p-norm bounded perturbations of size ε_p for some p ≥ 2. In addition, let h_θ(x) = sign(x^T θ) be a linear classifier with θ ∈ R^d, and assume that as the ambient dimension d grows to infinity, the following condition on the weight matrix W and the decision parameter θ holds:

(ε_p d^{1/2 − 1/p} / σ_min(W)) · (1 − ‖P_{Ker(W^T)}(θ)‖²_{ℓ2} / ‖θ‖²_{ℓ2})^{−1/2} = o_d(1),   (6.10)

where P_{Ker(W^T)}(θ) stands for the ℓ_2-projection of the vector θ onto the kernel of the matrix W^T. Then the boundary risk of the classifier h_θ converges to zero.

In particular, assume that θ̂_ε is the solution of the following adversarial empirical risk minimization (ERM) problem:

θ̂_ε = argmin_{θ ∈ R^d} (1/n) Σ_{i=1}^n sup_{u ∈ B_ε(x_i)} ℓ(y_i u^T θ),

with ℓ: R → R_{≥0} a strictly decreasing loss function. In this case, with the weight matrix W satisfying ε_p d^{1/2 − 1/p} / σ_min(W) = o_d(1), the boundary risk of h_{θ̂_ε} converges to zero.

We refer to Section E.1.7 for the proof of Theorem 6.2.1.

6.2.2 Effect of perturbation in the lifting matrix W

The Bayes-optimal classifiers in the previous sections were computed using the true lifting matrix W. In practice, this matrix (or, in general, the embedding from the latent to the ambient space) has to be learned and therefore deviates from the true W. Our goal in this section is to show, through numerical experiments, that our result about the boundary risk decreasing as d/k grows remains valid even if a perturbed lifting matrix Ŵ is used in the Bayes-optimal classifier.

Consider the perturbed matrix Ŵ = W + N, where N has i.i.d. Gaussian entries N(0, σ²_N). We start with the Gaussian mixture model. Using Corollary 6.0.3 with the mapping φ(t) = t and balanced class probabilities (π = 1/2), the Bayes-optimal classifier reads

ĥ(x) = sign( x^T (Ŵ Ŵ^T)^† Ŵ µ ).   (6.11)

We first characterize the adversarial and standard risks of the classifier ĥ using Proposition 6.1.2, and then compute the boundary risk BR(ĥ) = AR(ĥ) − SR(ĥ). We generate W and µ as in the previous experimental setup (results in Figure 6.1), with d = 300.
In Figure 6.5a, we fix σ_N = 1 and plot the adversarial, standard, and boundary risks, averaged over 200 independent instances of the problem. The shaded areas around each curve represent one standard deviation (computed over the 200 experiments) above and below the mean curve. As can be observed, the boundary risk vanishes as the dimension ratio d/k grows to infinity. In Figure 6.5b, we plot the boundary risk for σ_N ∈ {1, 2, 3, 4, 5}. As can be seen, even for large perturbation levels σ_N, the boundary risk converges to zero, albeit at a slower rate.

We next conduct similar experiments for the generalized linear model. In this setting, the Bayes-optimal classifier is given by

ĥ(x) = sign( f( β^T (Ŵ^T Ŵ)^{−1} Ŵ^T x ) − 1/2 ).   (6.12)

By an application of Proposition 6.1.5 with θ̂ = Ŵ (Ŵ^T Ŵ)^{−1} β, we compute the adversarial and standard risks of ĥ. For our numerical experiment, we consider the logistic model f(t) = 1/(1 + exp(−t)).

Figure 6.5: Effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the Bayes-optimal classifier (6.11) obtained from a noisy weight matrix Ŵ = W + N, with ℓ_2 perturbations under the Gaussian mixture model (6.5), where features lie on a low-dimensional manifold. Panel (a): standard and adversarial risks with N having i.i.d. entries N(0, 1), for adversary's power ε_2 = 1. Panel (b): boundary risk with N having i.i.d. entries N(0, σ²_N), for ε_2 = 1 and σ_N ∈ {1, 2, 3, 4, 5}. Solid curves represent average values, and the shaded area around each curve represents one standard deviation above and below the computed average over the M = 200 realizations.

In Figure 6.6a, we fix ε_2 = 1 (under the ℓ_2 adversarial setup), σ_N = 1, and d = 300, and vary k. The reported results are averaged over 50 independent experiments, with shaded curves showing one standard deviation above and below the averaged values (computed over the 50 realizations). As can be seen, even though the Bayes-optimal classifier is computed using the perturbed matrix Ŵ, the boundary risk converges to zero as d/k grows to infinity. In Figure 6.6b, we repeat similar experiments for different values of σ_N ∈ {1, 2, 3, 4, 5}. Again, even for large perturbation levels σ_N, the boundary risk converges to zero, albeit at a slower rate.

Figure 6.6: Effect of the dimension ratio d/k on the standard, adversarial, and boundary risks of the Bayes-optimal classifier (6.12) under the generalized linear model (6.6), obtained from a noisy weight matrix Ŵ = W + N with ℓ_2 perturbations. Panel (a): standard and adversarial risks with N having i.i.d. entries N(0, 1), for adversary's power ε_2 = 1. Panel (b): boundary risk with N having i.i.d. entries N(0, σ²_N), for ε_2 = 1 and σ_N ∈ {1, 2, 3, 4, 5}. Solid curves represent average values, and the shaded area around each curve represents one standard deviation above and below the computed average over the M = 50 realizations.

The above experiments demonstrate that for the Bayes-optimal classifier constructed using a perturbed lifting matrix Ŵ, the robust-standard risk tradeoff is attenuated as the low-dimensional structure of the data becomes stronger.
6.3 Boundary risk of Bayes-optimal image classifiers

We next provide several numerical experiments on the MNIST image data to corroborate our findings regarding the role of the low-dimensional structure of data in the boundary risk of Bayes-optimal classifiers. Of course, evaluating this finding on image data is challenging, since learning the particular structure of the underlying image distribution is a notoriously difficult problem. A few well-established techniques for this task exist, which we briefly discuss below.

The Generative Adversarial Net (GAN) [65] is among the most successful methods for modeling the statistical structure of image data. Despite the remarkable success of GANs in generating realistic high-resolution images, it has been observed that they may fail to capture the full data distribution, a phenomenon referred to as mode collapse. In addition, computing the likelihood of image data with GANs requires complex computations. As a direct implication of these observations, it is neither statistically accurate nor efficient to deploy GANs to formulate the Bayes-optimal classifiers [149, 150].

Fitting elementary statistical models can mitigate the statistical inaccuracies of GANs. [149] learns the statistical structure of image data using the class of Gaussian Mixture Models (GMM). This choice is motivated by the statistical power of GMMs: they are universal approximators of probability densities [64]. On the other hand, working with general Gaussian covariance matrices can make the estimation problem extremely prohibitive, both in computational cost and in memory storage. [149] deployed the Mixture of Factor Analyzers (MFA) [63] to avoid storing and computing with such high-dimensional matrices. This deployment is aligned with the earlier intuition that the space of meaningful images is indeed a small portion of the entire high-dimensional space. In addition, they show that a GMM with a moderate number of components can produce adequately realistic images, further reducing the computational burden.

We adopt the MFA procedure introduced in [149] to generate realistic image data for our numerical experiments. The main reasons for this adoption are: (i) this framework is flexible for generating realistic images from a low-dimensional subspace, and (ii) it enables us to accurately and efficiently calculate the log-likelihood of images, which can later be used to formulate the Bayes-optimal classifier. It is worth noting that using a class of less complex models, in this case GMMs rather than GANs, will output images with lower resolution, which is not a major concern for the main purpose of this numerical study. In the next sections, we first provide a brief overview of the GMM estimation steps and then review some standard frameworks for producing adversarially crafted examples.

6.3.1 Learning image data with Gaussian Mixture Models (GMM)

A general setup for fitting a GMM to image data {x_i}_{1:n} ⊂ R^d is based on the following model:

x ∼ Σ_{k=1}^K α_k N(µ_k, Σ_k),   Σ_k ∈ R^{d×d},   µ_k ∈ R^d,

where K denotes the number of components in the GMM and the α_k are mixing weights. Without imposing any extra structure on the image data, this problem involves learning O(K d²) parameters, which can be extremely difficult for high-dimensional images. [149] deployed a mixture of factor analyzers (MFA), using tall matrices A ∈ R^{d×ℓ} to embed a low-dimensional subspace in the full data space. In this case, the following model is considered:

x ∼ Σ_{k=1}^K α_k N(µ_k, A_k A_k^T + D_k),   A_k ∈ R^{d×ℓ},   D_k ∈ R^{d×d},   µ_k ∈ R^d,

where D_k is a diagonal matrix giving the variance of each individual pixel. This model ameliorates the previous high storage and computational cost, as it has K d(ℓ + 2) learnable parameters, which scales linearly with the image dimension d.
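The log-density of a single MFA component can be evaluated without ever forming the d×d covariance, by exploiting the Woodbury identity and the matrix determinant lemma. The sketch below checks the efficient O(dℓ²) evaluation against a naive dense computation; the dimensions are illustrative, and this mirrors, but is not taken from, the implementation of [149].

```python
import numpy as np

rng = np.random.default_rng(3)
d, ell = 200, 5                        # ambient (pixel) dim and latent dim
A = rng.normal(size=(d, ell))
mu = rng.normal(size=d)
dvec = rng.uniform(0.1, 1.0, size=d)   # diagonal of D
x = rng.normal(size=d)

def mfa_logpdf(x, mu, A, dvec):
    """log N(x; mu, A A^T + D) in O(d * ell^2), never forming the d x d covariance."""
    u = x - mu
    L = np.eye(len(A[0])) + A.T @ (A / dvec[:, None])   # L = I + A^T D^{-1} A
    t = np.linalg.solve(L, A.T @ (u / dvec))
    quad = u @ (u / dvec) - (u / dvec) @ (A @ t)        # u^T (D + A A^T)^{-1} u (Woodbury)
    logdet = np.linalg.slogdet(L)[1] + np.sum(np.log(dvec))  # determinant lemma
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

# check against the naive dense evaluation
Sigma = A @ A.T + np.diag(dvec)
u = x - mu
naive = -0.5 * (d * np.log(2 * np.pi)
                + np.linalg.slogdet(Sigma)[1]
                + u @ np.linalg.solve(Sigma, u))
print(np.isclose(mfa_logpdf(x, mu, A, dvec), naive))    # True
```

The ℓ×ℓ system replaces the d×d one, which is what makes per-pixel likelihoods tractable at image resolution.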
This model is intimately related to the special case of the low-dimensional manifold models on features in (6.5) and (6.6) with the identity mapping φ(t) = t. The only difference is the entrywise independent Gaussian noise coming from the diagonal matrix D_k; however, in the numerical experiments we observed that the estimated entries of D_k are extremely small, which makes this difference negligible. We use the maximum-likelihood estimator to compute the model parameters. For this purpose, we need to compute the log-probabilities. For a single component of the GMM, the log-likelihood is given by

log p(x | A, D, µ) = −(1/2) [ d log(2π) + logdet(AA^T + D) + (x − µ)^T (D + AA^T)^{−1} (x − µ) ].   (6.13)

[149] used the following algebraic identities, with u = x − µ and L = I_ℓ + A^T D^{−1} A ∈ R^{ℓ×ℓ}, to avoid storing and multiplying large matrices:

u^T Σ^{−1} u = u^T [ D^{−1} u − D^{−1} A L^{−1} (A^T D^{−1} u) ],
logdet(AA^T + D) = logdet L + Σ_{j=1}^d log(d_j),

where d_j denotes the j-th entry on the diagonal of the matrix D. In addition, [149] employed a differentiable programming framework to efficiently solve the corresponding maximum-likelihood optimization problem on a GPU. We use their publicly available code at https://github.com/eitanrich/gans-n-gmms to fit GMMs on full image data sets. As a simple example, we consider models with K = 10 components and manifold dimensions ℓ = 1, 10, 100. We fit these three models to the training samples of the MNIST data set [42] that are labeled "6". Figure 6.7 exhibits sample images generated from the learned GMM models.

Figure 6.7: Sample images generated from GMM models fit on the MNIST training-set images with label six. Three GMM models with K = 10 components and manifold dimensions ℓ = 1, 10, 100 (from left to right) are considered.

6.3.2 PGD and FGM adversarial attacks

Recall that adversarial examples are meant to be close to the original samples, yet able to degrade the classifier's performance. For a loss function L, consider the adversarial optimization problem

max_{‖x − x′‖_{ℓ2} ≤ ε} L(h(x′), y).

Using the 0-1 loss L(h(x′), y) = I(y ≠ h(x′)) yields the inner optimization problem (6.3). A large body of proposed methods for producing adversarial examples consider the first-order linear approximation of the loss function around the original sample. More precisely, x′ is written as x + δ, and then single or multiple gradient steps that increase the loss are taken. In this framework, a powerful predictive model, e.g. a neural network, is first fit to the training samples and used as a surrogate for the learner's model h. For an ℓ_2-bounded adversary's budget ε, the Fast Gradient Method (FGM) performs a single normalized gradient step, which yields

x′ = x + ε ∇_x L(h(x), y) / ‖∇_x L(h(x), y)‖_{ℓ2}.

This method was first introduced in [66] for ℓ_∞-bounded attacks under the name Fast Gradient Sign Method (FGSM). Variants for other ℓ_p-bounded adversarial attacks were introduced in [179], generally called FGM (removing the "sign" from FGSM). A more general scheme for producing adversarial examples is a multi-step implementation of the above procedure with projected gradient descent (PGD). This attack was introduced in [118], with iterative updates given by

x_{t+1} = Π_{B_ε(x)} ( x_t + ε ∇_x L(h(x_t), y) / ‖∇_x L(h(x_t), y)‖_{ℓ2} ),

where Π_{B_ε(x)} stands for the projection operator onto the ℓ_2-ball of radius ε centered at x. For our image classification experiments, we use the FGM and PGD attacks to produce adversarial examples. We follow the implementation of the PGD and FGM adversarial attacks provided in the CleverHans library v4.0.0 [136]; the code for this implementation can be accessed at https://github.com/cleverhans-lab/cleverhans. In our implementation, the original image values are normalized to lie in the interval [0, 1], and we clip perturbed pixel values to the same interval.
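Both attacks reduce to a few lines of array code. The sketch below is a simplified stand-in for the CleverHans routines, using a fixed linear logistic surrogate in place of a trained network (the step size equals ε, as in the update rule above); names and sizes are illustrative.

```python
import numpy as np

def fgm_l2(x, grad, eps):
    """Single normalized gradient step: x' = x + eps * g / ||g||_2."""
    g = grad(x)
    return x + eps * g / np.linalg.norm(g)

def pgd_l2(x, grad, eps, steps=10, clip=(0.0, 1.0)):
    """Iterated normalized ascent steps, each followed by projection onto
    the l2-ball B_eps(x) and clipping to the valid pixel range."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad(x_adv)
        x_adv = x_adv + eps * g / np.linalg.norm(g)
        delta = x_adv - x
        nrm = np.linalg.norm(delta)
        if nrm > eps:                 # projection onto B_eps(x)
            delta *= eps / nrm
        x_adv = np.clip(x + delta, *clip)
    return x_adv

# toy surrogate: logistic loss of a fixed linear model, label y in {-1, +1}
rng = np.random.default_rng(4)
w = rng.normal(size=784)
x0, y = rng.uniform(0, 1, 784), 1.0
grad = lambda x: -y * w / (1.0 + np.exp(y * (x @ w)))   # d loss / d x

x_adv = pgd_l2(x0, grad, eps=3.0)
print(np.linalg.norm(x_adv - x0) <= 3.0 + 1e-9)         # True: stays in the eps-ball
```

Since the original pixels lie in [0, 1], clipping never increases any coordinate of the perturbation, so the final example also remains inside the ε-ball.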
We next present the key findings from our numerical experiments.

6.3.3 Main experiments and key findings

In this section, we connect the previously described parts. Put together, these are the three main steps of our experiments:

1. For several choices of K (number of components) and latent dimension ℓ, we first fit two GMM models, one to the zeros and one to the sixes of the MNIST training set. Deploying the learned models, we generate 5000 new images with uniform probability over the labels 6 and 0; i.e., before generating each image, we decide with equal chance which of the two models to use. In addition, for the resulting binary classification problem (0 vs. 6), we deploy (6.13) to obtain two likelihood models p_0(x), p_1(x), which can be used to formulate the Bayes-optimal classifier I(p_1(x) > p_0(x)).

2. In this step, we adversarially attack the generated images. To this end, the data set is split into 80%-20% training-test samples. The training set is used to train a neural network for the PGD and FGM attacks. The obtained model is later used to craft adversarial examples for the 1000 test images.

3. Finally, the performance of the Bayes-optimal classifier is evaluated on the 1000 adversarially perturbed test images.

In our first experiment, we consider a fixed number of components K = 10 along with three different latent dimensions ℓ = 1, 10, and 100. For each pair (K, ℓ), we randomly select one sample with label 6 among the original 1000 test images. For PGD adversarial attacks, we start from ε = 0 and incrementally increase the adversary's ℓ_2-bounded power ε up to the point where the Bayes-optimal classifier fails to correctly label the sample. Figure 6.8 displays the original image, the adversarial attack, and the perturbed image for each value of ℓ. The Bayes-optimal classifier fails at adversarial power ε = 22.6, 11, 7 for ℓ = 1, 10, 100, respectively. The result conforms to the fact that samples coming from a higher dimension ratio d/ℓ (here d = 784) indeed require stronger adversarial attacks (larger ε).

In the second experiment, we consider the two choices K = 1 and K = 10 and vary the latent dimension ℓ. For each pair (K, ℓ), we repeat the above three-step procedure for the adversary's ℓ_2-bounded power ε = 12 and compute the boundary risk of the Bayes-optimal classifier on the PGD- and FGM-perturbed images. The plots are included in Figure 6.9. As observed, as the dimension ratio d/ℓ increases, the boundary risk of the Bayes-optimal classifier decreases to zero.

Figure 6.8: Sample attacks and perturbed images for three latent dimensions ℓ = 1, 10, 100, from top to bottom. In each row, from left to right: the original sample, the adversarially crafted perturbation, and the perturbed image. The original images are generated from a GMM with K = 10 components. For each image, under PGD adversarial perturbation, we start with ℓ_2-bounded adversarial power ε = 0 and incrementally increase it to the point where the Bayes-optimal classifier fails to infer the correct label. In this experiment, misclassification of the sampled images with ℓ = 1, 10, 100 occurs at ℓ_2-bounded adversarial power ε = 22.6, 11, 7, respectively. It can be seen that samples coming from a smaller latent dimension ℓ require a stronger perturbation to be misclassified by the Bayes-optimal classifier.

Figure 6.9: Boundary risk of the Bayes-optimal classifier on 1000 test images generated from GMM models with number of components K = 1 and K = 10 (from left to right).
In each experiment, the adversary's ℓ2-bounded perturbation power is fixed at ε = 12, and both the PGD and FGM adversarial attacks are considered. It can be observed that the boundary risk of the Bayes-optimal classifier converges to zero as the dimension ratio d/ℓ grows.

Chapter 7

Conclusions

In Chapter 2, we devised the ‘online debiasing’ approach for high-dimensional regression and showed that it asymptotically admits an unbiased Gaussian distribution, even when the samples are collected adaptively. In addition, through numerical studies we demonstrated that the (offline) debiased estimator suffers from the bias induced by the correlation in the samples and cannot be used for valid statistical inference in these settings (unless the precision matrix is sufficiently sparse).

The (offline) debiasing approach has been used as a tool to address a variety of problems, from estimating average treatment effects and causal inference in high dimensions to precision matrix estimation, distributed multitask learning, and hierarchical testing. It has also been used for different statistical aims such as controlling FDR in high dimensions [84], estimation of the prediction risk [92], and inference on predictions [27] and on general functions of model parameters [85]. We anticipate that the online debiasing approach and analysis can be used to tackle similar problems under adaptive data collection. We leave this for future work.

In Chapter 3, we introduced the PCR test procedure to examine the conditional independence of two variables in the presence of a high-dimensional confounding variable, in a model-X setup where distributional information on the covariate population is available. The proposal of the PCR test was inspired by some of the alternative distributions for which the CRT (and its variants) are powerless.
The PCR test is generally more flexible in capturing conditional dependency, and under some alternatives it can result in much higher statistical power compared to the CRT. We also provided a power analysis of the PCR test in terms of the so-called conditional dependency power of the joint law L(X,Z,Y), the sample size n, and the number of labels L used in constructing the PCR test statistic. In addition, the proposed PCR test makes a novel contribution to the conditional independence testing problem by leveraging the i.i.d. property of the samples to obtain high-resolution p-values with a very small number of conditional randomizations. This can significantly reduce the computational cost in high-dimensional variable selection problems.

We also proposed two extensions of the PCR test: (i) the parameter-free PCR test, which consists of multiple runs of the PCR test with different choices of the number of labels L, combining the obtained p-values via Bonferroni's method; and (ii) the robust PCR test, which improves the robustness of the test against errors in estimating the conditional distribution P_{X|Z}. Both of these extensions have important practical implications.

Finally, the score function in the proposed PCR test can be borrowed from the many score functions developed to improve the robustness and computational complexity of the CRT, such as the dCRT, HRT, and CPT, which further improves the general performance of the PCR test.

In Chapter 4, we considered the fundamental problem of assessing the goodness-of-fit of a general binary classifier. Our framework does not make any parametric assumption on the conditional law L(Y|X), and treats it as a black-box oracle model which can be accessed only through queries. We formulated the goodness-of-fit assessment problem as a tolerance hypothesis test, and proposed a novel test, called GRASP, for testing this hypothesis; it works in finite-sample settings regardless of the feature distribution (distribution-free).
We also proposed model-X GRASP, designed for model-X settings where the joint distribution of the feature vector is known. Model-X GRASP uses this distributional information to achieve higher statistical power.

Linear regression and binary classification are two simple yet foundational settings in machine learning, and the full effect of adversarial training is still not known in these settings. In Chapter 5, we focused on a distribution-perturbing adversary and provided a framework for thinking about the tradeoff between the standard risk (SR) and the adversarial risk (AR): its existence and its quantitative behavior with respect to the data distribution and the hypothesis class. Note that these are non-trivial questions, and previously there were only specific “examples” hinting at such a tradeoff. A tantalizing question is whether one can remove this tradeoff (or improve SR and AR simultaneously) by considering a more complex class of hypotheses. Our discussion in Section 5.1.3 is a first attempt to answer this question for the random features model.

In Chapter 6, we studied the role of the data distribution (in particular, latent low-dimensional manifold structures of data) in the tradeoff between robustness (against adversarial perturbations of the input at test time) and generalization (performance on test data drawn from the same distribution as the training data). We developed a theory for two widely used classification setups (the Gaussian-mixture model and the generalized linear model), showing that as the ratio of the ambient dimension to the manifold dimension grows, one can obtain models which are both robust and generalize well. This highlights the role of exploiting underlying data structures in improving robustness and also in mitigating the tradeoff between generalization and robustness.
Through numerical experiments, we demonstrated that the low-dimensional manifold structure of data, even if not exploited by the training method, can still weaken the robustness-generalization tradeoff.

Bibliography

[1] Measuring Calibration in Deep Learning, volume 2, 2019.
[2] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
[3] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, International Conference on Machine Learning, 2018.
[4] M. Azadkia and S. Chatterjee. A simple measure of conditional dependence. arXiv preprint arXiv:1910.12327, 2019.
[5] S. Balakrishnan, L. Wasserman, et al. Hypothesis testing for densities and high-dimensional multinomials: Sharp local minimax rates. Annals of Statistics, 47(4):1893–1927, 2019.
[6] D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12(4):387–415, 1975.
[7] R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
[8] R. F. Barber, E. J. Candès, and R. J. Samworth. Robust inference with knockoffs. The Annals of Statistics, 48(3):1409–1431, 2020.
[9] A. D. Barbour and L. H. Y. Chen. An Introduction to Stein's Method, volume 4. World Scientific, 2005.
[10] D. Bartl, S. Drapeau, J. Obloj, and J. Wiesel. Robust uncertainty sensitivity analysis. arXiv preprint arXiv:2006.12022, 2020.
[11] S. Basu, S. Das, G. Michailidis, and A. K. Purnanandam. A system-wide approach to measure connectivity in the financial sector. Available at SSRN 2816137, 2017.
[12] S. Basu and G. Michailidis. Regularized estimation in sparse high-dimensional time series models. Annals of Statistics, 43(4):1535–1567, 2015.
[13] S. Bates, E. Candès, L.
Janson, and W. Wang. Metropolized knockoff sampling. Journal of the American Statistical Association, pages 1–15, 2020.
[14] S. Bates, E. Candès, L. Janson, and W. Wang. Metropolized knockoff sampling. Journal of the American Statistical Association, 116(535):1413–1427, 2021.
[15] S. Bates, T. Hastie, and R. Tibshirani. Cross-validation: what does it estimate and how well does it do it? arXiv preprint arXiv:2104.00673, 2021.
[16] A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.
[17] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[18] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.
[19] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
[20] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pages 1165–1188, 2001.
[21] T. B. Berrett, Y. Wang, R. F. Barber, and R. J. Samworth. The conditional permutation test for independence while controlling for confounders. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1):175–197, 2020.
[22] A. Bhaskar, A. Javanmard, T. A. Courtade, and D. Tse. Novel probabilistic models of spatial genetic ancestry with applications to stratification correction in genome-wide association studies. Bioinformatics, 33(6):879–885, 2017.
[23] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Srndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
[24] L. Birgé. An alternative point of view on Lepski's method. Lecture Notes-Monograph Series, pages 113–133, 2001.
[25] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[26] P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.
[27] T. T. Cai and Z. Guo. Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. Annals of Statistics, 45(2):615–646, 2017.
[28] C. D. Campbell, E. L. Ogburn, K. L. Lunetta, H. N. Lyon, M. L. Freedman, L. C. Groop, D. Altshuler, K. G. Ardlie, and J. N. Hirschhorn. Demonstrating stratification in a European American population. Nature Genetics, 37(8):868–872, 2005.
[29] E. Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.
[30] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing conditional independence of discrete distributions. In 2018 Information Theory and Applications Workshop (ITA), pages 1–57. IEEE, 2018.
[31] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou. Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pages 513–530, 2016.
[32] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14, 2017.
[33] L. Chen, Y. Ye, and T. Bourlai. Adversarial machine learning in malware detection: Arms race between evasion attack and defense.
In 2017 European Intelligence and Security Informatics Conference (EISIC), pages 99–106. IEEE, 2017.
[34] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, International Conference on Machine Learning, 2017.
[35] L. Cong, F. A. Ran, D. Cox, S. Lin, R. Barretto, N. Habib, P. D. Hsu, X. Wu, W. Jiang, L. A. Marraffini, et al. Multiplex genome engineering using CRISPR/Cas systems. Science, 339(6121):819–823, 2013.
[36] J. A. Costa and A. O. Hero. Learning intrinsic dimension and intrinsic entropy of high-dimensional datasets. In 2004 12th European Signal Processing Conference, pages 369–372. IEEE, 2004.
[37] L. Crawford, K. C. Wood, X. Zhou, and S. Mukherjee. Bayesian approximate kernel regression with variable selection. Journal of the American Statistical Association, 113(524):1710–1721, 2018.
[38] A. Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
[39] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.
[40] A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society: Series B (Methodological), 41(1):1–15, 1979.
[41] N. Deb and B. Sen. Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association, (just accepted):1–45, 2021.
[42] L. Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[43] Y. Deshpande, A. Javanmard, and M. Mehrabi. Online debiasing for adaptively collected high-dimensional data with applications to time series analysis.
Journal of the American Statistical Association, 118(542):1126–1139, 2023.
[44] Y. Deshpande, L. Mackey, V. Syrgkanis, and M. Taddy. Accurate inference for adaptive linear models. In International Conference on Machine Learning, pages 1202–1211, 2018.
[45] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[46] A. Dobra, C. Hans, B. Jones, J. R. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196–212, 2004.
[47] E. Dobriban, H. Hassani, D. Hong, and A. Robey. Provable tradeoffs in adversarially robust classification. arXiv preprint arXiv:2006.05161, 2020.
[48] Y. Dong, Z. Deng, T. Pang, H. Su, and J. Zhu. Adversarial distributional training for robust deep learning. arXiv preprint arXiv:2002.05999, 2020.
[49] M. Drton, F. Han, and H. Shi. High dimensional consistent independence testing with maxima of rank correlations. arXiv preprint arXiv:1812.06189, 2018.
[50] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.
[51] A. Dvoretzky. Asymptotic normality for sums of dependent random variables. In Proc. 6th Berkeley Symp. Math. Statist. Probab., volume 2, pages 513–535, 1972.
[52] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Guilt-free data reuse. Communications of the ACM, 60(4):86–93, 2017.
[53] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
[54] B. Efron. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394):461–470, 1986.
[55] B. Efron.
The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632, 2004.
[56] C. Farrington. On assessing goodness of fit of generalized linear models to sparse data. Journal of the Royal Statistical Society: Series B (Methodological), 58(2):349–360, 1996.
[57] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
[58] N. Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303(5659):799–805, 2004.
[59] A. Fujita, J. R. Sato, H. M. Garay-Malpartida, R. Yamaguchi, S. Miyano, M. C. Sogayar, and C. E. Ferreira. Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Systems Biology, 1(1):39, 2007.
[60] R. Gao, X. Chen, and A. J. Kleywegt. Wasserstein distributionally robust optimization and variation regularization. arXiv preprint arXiv:1712.06050v3, 2020.
[61] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[62] S. Gelper, I. Wilms, and C. Croux. Identifying demand effects in a large network of product categories. Journal of Retailing, 92(1):25–39, 2016.
[63] Z. Ghahramani, G. E. Hinton, et al. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, 1996.
[64] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[65] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[66] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, International Conference on Learning Representations, 2015.
[67] I. J. Goodfellow, J. Shlens, and C. Szegedy.
Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[68] D. GUIDANCE. Adaptive designs for clinical trials of drugs and biologics. Center for Biologics Evaluation and Research (CBER), 2018.
[69] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
[70] H. Gzyl and J. L. Palacios. The Weierstrass approximation theorem and large deviations. The American Mathematical Monthly, 104(7):650–653, 1997.
[71] P. Hall and C. C. Heyde. Martingale Limit Theory and Its Application. Academic Press, 2014.
[72] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer, 2009.
[73] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[74] D. Holtz-Eakin, W. Newey, and H. S. Rosen. Estimating vector autoregressions with panel data. Econometrica: Journal of the Econometric Society, pages 1371–1395, 1988.
[75] D. W. Hosmer and N. L. Hjort. Goodness-of-fit processes for logistic regression: simulation results. Statistics in Medicine, 21(18):2723–2738, 2002.
[76] D. W. Hosmer and S. Lemesbow. Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10):1043–1069, 1980.
[77] F. Hsieh, B. W. Turnbull, et al. Nonparametric and semiparametric estimation of the receiver operating characteristic curve. Annals of Statistics, 24(1):25–40, 1996.
[78] Y. Ingster, J. I. Ingster, and I. Suslina. Nonparametric Goodness-of-Fit Testing under Gaussian Models, volume 169. Springer Science & Business Media, 2003.
[79] Y. I. Ingster and T. Sapatinas.
Minimax goodness-of-fit testing in multivariate nonparametric regression. Mathematical Methods of Statistics, 18:241–269, 2009.
[80] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[81] A. Jalal, A. Ilyas, C. Daskalakis, and A. G. Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.
[82] J. Janková, R. D. Shah, P. Bühlmann, and R. J. Samworth. Goodness-of-fit testing in high dimensional generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):773–795, 2020.
[83] A. Javanmard. Inference and estimation in high-dimensional data analysis. PhD thesis, Stanford University, 2014.
[84] A. Javanmard and H. Javadi. False discovery rate control via debiased lasso. Electronic Journal of Statistics, 13(1):1212–1253, 2019.
[85] A. Javanmard and J. D. Lee. A flexible framework for hypothesis testing in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):685–718, 2020.
[86] A. Javanmard and M. Mehrabi. Adversarial robustness for latent models: Revisiting the robust-standard accuracies tradeoff. arXiv preprint arXiv:2110.11950, 2021.
[87] A. Javanmard and M. Mehrabi. Pearson chi-squared conditional randomization test. arXiv preprint arXiv:2111.00027, 2021.
[88] A. Javanmard and M. Mehrabi. GRASP: A goodness-of-fit test for classification learning. arXiv preprint arXiv:2209.02064, 2022.
[89] A. Javanmard and A. Montanari. Nearly optimal sample size in hypothesis testing for high-dimensional regression. In 2013 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1427–1434. IEEE, 2013.
[90] A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regression.
Journal of Machine Learning Research, 15(1):2869–2909, 2014.
[91] A. Javanmard and A. Montanari. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.
[92] A. Javanmard and A. Montanari. Debiasing the lasso: Optimal sample size for Gaussian designs. Annals of Statistics, 46(6A):2593–2622, 2018.
[93] A. Javanmard and A. Montanari. Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526–554, 2018.
[94] A. Javanmard and M. Soltanolkotabi. Precise statistical analysis of classification accuracies for adversarial training. arXiv preprint arXiv:2010.11213, 2020.
[95] A. Javanmard, M. Soltanolkotabi, and H. Hassani. Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory (COLT), volume 125 of Proceedings of Machine Learning Research, pages 2034–2078. PMLR, 09–12 Jul 2020.
[96] J. Josse and S. Holmes. Measures of dependence between random vectors and tests of independence. Literature review. arXiv preprint arXiv:1307.7383, 2013.
[97] E. Katsevich and A. Ramdas. On the power of conditional independence testing under model-X. arXiv preprint arXiv:2005.05506, 2020.
[98] E. S. Kim, R. S. Herbst, I. I. Wistuba, J. J. Lee, G. R. Blumenschein, A. Tsao, D. J. Stewart, M. E. Hicks, J. Erasmus, S. Gupta, et al. The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery, 1(1):44–53, 2011.
[99] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[100] A. Kumar, P. S. Liang, and T. Ma. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32, 2019.
[101] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[102] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6(1):4–22, 1985.
[103] T. L. Lai and C. Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. Annals of Statistics, pages 154–166, 1982.
[104] E. L. Lehmann. Testing Statistical Hypotheses. Wiley, New York, 1959.
[105] S. Le Cessie and J. Van Houwelingen. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics, pages 1267–1282, 1991.
[106] D. Lee, X. Huang, H. Hassani, and E. Dobriban. T-Cal: An optimal test for the calibration of predictive models. arXiv preprint arXiv:2203.01850, 2022.
[107] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.
[108] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
[109] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 71–96, 2014.
[110] L. Lei and W. Fithian. AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649–679, 2018.
[111] O. V. Lepski and V. G. Spokoiny. Minimax nonparametric hypothesis testing: the case of an inhomogeneous alternative. Bernoulli, pages 333–358, 1999.
[112] S. Li and E. J. Candès. Deploying the conditional randomization test in high multiplicity problems. arXiv preprint arXiv:2110.02422, 2021.
[113] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
[114] F. Liang, Q. Li, and L. Zhou. Bayesian neural networks for selection of drug sensitive genes.
Journal of the American Statistical Association, 113(523):955–972, 2018.
[115] D. Lin, L. Wei, and Z. Ying. Model-checking techniques based on cumulative residuals. Biometrics, 58(1):1–12, 2002.
[116] W.-A. Lin, C. P. Lau, A. Levine, R. Chellappa, and S. Feizi. Dual manifold adversarial robustness: Defense against lp and non-lp adversarial attacks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 3487–3498. Curran Associates, Inc., 2020.
[117] M. Liu, E. Katsevich, L. Janson, and A. Ramdas. Fast and powerful conditional randomization testing via distillation. arXiv preprint arXiv:2006.03980, 2020.
[118] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[119] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, International Conference on Learning Representations, 2018.
[120] C. L. Mallows. Some comments on Cp. Technometrics, 42(1):87–94, 2000.
[121] P. J. McMurdie and S. Holmes. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4):e1003531, 2014.
[122] M. Mehrabi, A. Javanmard, R. A. Rossi, A. Rao, and T. Mai. Fundamental tradeoffs in distributionally adversarial training. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 7544–7554. PMLR, 2021.
[123] S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
[124] Y. Min, L. Chen, and A. Karbasi. The curious case of adversarially robust models: More data can help, double descend, or hurt generalization.
arXiv preprint arXiv:2002.11080, 2020.
[125] A. Montanari. Graphical models concepts in compressed sensing. Compressed Sensing: Theory and Applications, page 394, 2012.
[126] R. Mukherjee and S. Sen. Optimal adaptive inference in random design binary regression. 2018.
[127] M. P. Naeini, G. Cooper, and M. Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[128] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. Advances in Neural Information Processing Systems, 29, 2016.
[129] M. Neykov, S. Balakrishnan, and L. Wasserman. Minimax optimal conditional independence testing. arXiv preprint arXiv:2001.03039, 2020.
[130] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[131] X. Nie, T. Xiaoying, J. Taylor, and J. Zou. Why adaptively collected data have negative bias and how to correct for it. 2017.
[132] Y. Ning and H. Liu. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195, 2017.
[133] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.
[134] G. Osius and D. Rojek. Normal goodness-of-fit tests for multinomial models with large degrees of freedom. Journal of the American Statistical Association, 87(420):1145–1152, 1992.
[135] L. Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
[136] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, et al. Technical report on the CleverHans v2.
1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2016.
[137] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519, 2017.
[138] N. Papernot, P. McDaniel, A. Swami, and R. Harang. Crafting adversarial input sequences for recurrent neural networks. In MILCOM 2016 - 2016 IEEE Military Communications Conference, pages 49–54. IEEE, 2016.
[139] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
[140] J. Pearl et al. Models, reasoning and inference. Cambridge, UK: Cambridge University Press, 19, 2000.
[141] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.
[142] J. M. Peters, A. Colavin, H. Shi, T. L. Czarny, M. H. Larson, S. Wong, J. S. Hawkins, C. H. Lu, B.-M. Koo, E. Marta, et al. A comprehensive, CRISPR-based functional analysis of essential genes in bacteria. Cell, 165(6):1493–1506, 2016.
[143] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters. Kernel-based tests for joint independence. Journal of the Royal Statistical Society Series B, 80(1):5–31, 2018.
[144] M. S. Pydi and V. Jog. Adversarial risk via optimal transport and optimal couplings. In International Conference on Machine Learning, pages 7814–7823. PMLR, 2020.
[145] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
[146] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019.
[147] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20:1177–1184, 2007.
[148] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. Detecting novel associations in large data sets. Science, 334(6062):1518–1524, 2011.
[149] E. Richardson and Y. Weiss. On GANs and GMMs. Advances in Neural Information Processing Systems, 31, 2018.
[150] E. Richardson and Y. Weiss. A Bayes-optimal view on adversarial examples. Journal of Machine Learning Research, 22(221):1–28, 2021.
[151] Y. Romano, E. Patterson, and E. Candes. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32:3543–3553, 2019.
[152] S. Rosset, E. Aharoni, and H. Neuvirth. Novel statistical tools for management of public databases facilitate community-wide replicability and control of false discovery. Genetic Epidemiology, 38(5):477–481, 2014.
[153] S. Rosset and R. J. Tibshirani. From fixed-X to random-X regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, 2019.
[154] A. Rozza, G. Lombardi, C. Ceruti, E. Casiraghi, and P. Campadelli. Novel high intrinsic dimensionality estimators. Machine Learning, 89(1):37–65, 2012.
[155] A. Saltelli, M. Ratto, T. Andres, F. Campolongo, J. Cariboni, D. Gatelli, M. Saisana, and S. Tarantola. Global Sensitivity Analysis: The Primer. John Wiley & Sons, 2008.
[156] M. Sesia, C. Sabatti, and E. J. Candès. Gene hunting with knockoffs for hidden Markov models. arXiv preprint arXiv:1706.04677, 2017.
[157] A. K. Seth, A. B. Barrett, and L. Barnett. Granger causality analysis in neuroscience and neuroimaging. Journal of Neuroscience, 35(8):3293–3297, 2015.
[158] R. D. Shah and P. Bühlmann. Goodness-of-fit tests for high dimensional linear models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):113–135, 2018.
[159] R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3):1514–1538, 2020.
[160] A. Shapiro. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4):2258–2275, 2017.
[161] J.-H. Shih and T. Emura. On the copula correlation ratio and its generalization. Journal of Multivariate Analysis, 182:104708, 2021.
[162] J. Shin, A. Ramdas, and A. Rinaldo. On the bias, risk and consistency of sample means in multi-armed bandits. arXiv preprint arXiv:1902.00746, 2019.
[163] R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer Science & Business Media, 2006.
[164] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
[165] Y. Song, R. Shu, N. Kushman, and S. Ermon. Constructing unrestricted adversarial examples with generative models. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[166] S. Spigler, M. Geiger, and M. Wyart. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124001, 2020.
[167] S. Srinivasan, K. Pauwels, D. M. Hanssens, and M. G. Dekimpe. Do promotions benefit manufacturers, retailers, or both? Management Science, 50(5):617–629, 2004.
[168] M. Staib and S. Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS Workshop on Machine Learning and Computer Security, 2017.
[169] J. H. Stock and M. W. Watson. Vector autoregressions.
Journal of Economic Perspectives, 15(4):101–115, 2001.
[170] D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.
[171] J. Q. Su and L. Wei. A lack-of-fit test for the mean function in a generalized linear model. Journal of the American Statistical Association, 86(414):420–426, 1991.
[172] T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
[173] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. International Conference on Learning Representations (ICLR), 2014.
[174] H. Taheri, R. Pedarsani, and C. Thrampoulidis. Asymptotic behavior of adversarial training in binary classification. arXiv preprint arXiv:2010.13275, 2020.
[175] W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 1, 2018.
[176] W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 31(1):151–162, 2022.
[177] O. Thas. Comparing Distributions. Springer, 2010.
[178] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[179] F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
[180] A. A. Tsiatis. A note on a goodness-of-fit test for the logistic regression model. Biometrika, 67(1):250–251, 1980.
[181] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
[182] J. Vaicenavicius, D. Widmann, C.
Andersson, F. Lindsten, J. Roll, and T. Schön. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3459–3467. PMLR, 2019.
[183] T. Vaidya, Y. Zhang, M. Sherr, and C. Shields. Cocaine noodles: exploiting the gap between human and machine speech recognition. In 9th USENIX Workshop on Offensive Technologies (WOOT 15), 2015.
[184] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. SIAM Journal on Computing, 46(1):429–455, 2017.
[185] S. Van de Geer, P. Bühlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202, 2014.
[186] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, pages 210–268. Cambridge University Press, 2012.
[187] S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
[188] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
[189] S. Wager. Cross-validation, risk estimation, and model selection: Comment on a paper by Rosset and Tibshirani. Journal of the American Statistical Association, 115(529):157–160, 2020.
[190] W. Wang and L. Janson. A power analysis of the conditional randomization test and knockoffs. arXiv preprint arXiv:2010.02304, 2020.
[191] L. Weihs, M. Drton, and N. Meinshausen. Symmetric rank covariances: a generalized framework for nonparametric measures of dependence. Biometrika, 105(3):547–562, 2018.
[192] X. Wen and M. Stephens. Using linear predictors to impute allele frequencies from summary or pooled genotype data. The Annals of Applied Statistics, 4(3):1158, 2010.
[193] D. Widmann, F. Lindsten, and D. Zachariah. Calibration tests in multi-class classification: A unifying framework. Advances in Neural Information Processing Systems, 32, 2019.
[194] I. Wilms, S. Basu, J. Bien, and D. S. Matteson. Interpretable vector autoregressions with exogenous time series. arXiv preprint arXiv:1711.03623, 2017.
[195] E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5283–5292, 2018.
[196] J. Wu, B. Devlin, S. Ringquist, M. Trucco, and K. Roeder. Screen and clean: a tool for identifying interactions in genome-wide association studies. Genetic Epidemiology, 34(3):275–285, 2010.
[197] Y. Xing, R. Zhang, and G. Cheng. Adversarially robust estimate and risk analysis in linear regression. In International Conference on Artificial Intelligence and Statistics, pages 514–522. PMLR, 2021.
[198] M. Xu, T. Qin, and T.-Y. Liu. Estimation bias in multi-armed bandit algorithms for search advertising. In Advances in Neural Information Processing Systems, pages 2400–2408, 2013.
[199] Y.-Y. Yang, C. Rashtchian, H. Zhang, R. R. Salakhutdinov, and K. Chaudhuri. A closer look at accuracy vs. robustness. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 8588–8601. Curran Associates, Inc., 2020.
[200] C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.
[201] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu. DolphinAttack: Inaudible voice commands.
In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 103–117, 2017.
[202] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 7472–7482, 2019.
[203] J. Zhang, J. Ding, and Y. Yang. Is a classification procedure good enough? A goodness-of-fit assessment tool for classification learning. Journal of the American Statistical Association, pages 1–11, 2021.
[204] K. Zhang. BET on independence. Journal of the American Statistical Association, 114(528):1620–1637, 2019.
[205] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.
[206] L. Zhang and L. Janson. Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283, 2020.
[207] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113–130, 2018.
[208] L. Zheng, G. Raskutti, et al. Testing for high-dimensional network parameters in auto-regressive models. Electronic Journal of Statistics, 13(2):4977–5043, 2019.
[209] X. Zhou, S. Liu, E. S. Kim, R. S. Herbst, and J. J. Lee. Bayesian adaptive design for targeted therapy development in lung cancer—a step toward personalized medicine. Clinical Trials, 5(3):181–193, 2008.

Appendix A: Supplement to Chapter 2

A.1 Sparse inverse covariance

In Section 2.1.1 (Figure 1.2) we provided a numerical example in which offline debiasing does not admit an asymptotically normal distribution. As the heat map in Figure 2.2b shows, the precision matrix Ω has ∼20% non-negligible entries per row.
The goal of this section is to show that when Ω is sufficiently sparse, the offline debiased estimator has an asymptotically normal distribution and can be used for valid inference on the model parameters. The idea is to show that the decorrelating matrix M is sufficiently close to the precision matrix Ω. Since Ω is deterministic, this helps with controlling the statistical dependence between M and ε. Formally, starting from the decomposition (1.4), we write

$$\hat\theta^{\rm off} = \theta_0 + (I - M\hat\Sigma)(\hat\theta^{\rm L} - \theta_0) + \frac{1}{n} M X^T \varepsilon = \theta_0 + (I - M\hat\Sigma)(\hat\theta^{\rm L} - \theta_0) + \frac{1}{n}(M - \Omega) X^T \varepsilon + \frac{1}{n} \Omega X^T \varepsilon, \qquad (A.1)$$

where we recall that $\hat\Sigma$ is the empirical covariance of all the covariate vectors (episodes $E_0, \dots, E_{K-1}$). Therefore, we can write

$$\sqrt{n}\,(\hat\theta^{\rm off} - \theta_0) = \Delta_1 + \Delta_2 + \frac{1}{\sqrt{n}} \Omega X^T \varepsilon, \qquad \Delta_1 = \sqrt{n}\,(I - M\hat\Sigma)(\hat\theta^{\rm L} - \theta_0), \qquad \Delta_2 = \frac{1}{\sqrt{n}}(M - \Omega) X^T \varepsilon. \qquad (A.2)$$

The term $\Omega X^T \varepsilon / \sqrt{n}$ is Gaussian with $O(1)$ variance at each coordinate. For the bias term $\Delta_1$, we show that $\Delta_1 = O(s_0 (\log p)/\sqrt{n})$ by controlling $|I - M\hat\Sigma|_\infty$. To bound the bias term $\Delta_2$, we write

$$\|\Delta_2\|_\infty \le \frac{1}{\sqrt{n}}\, \|M - \Omega\|_1\, \|X^T \varepsilon\|_\infty, \qquad (A.3)$$

where $\|M - \Omega\|_1$ denotes the $\ell_1$-to-$\ell_1$ operator norm of $M - \Omega$ (the maximum $\ell_1$ norm of its columns). By [12, Proposition 3.2], we have $\|X^T \varepsilon\|_\infty / \sqrt{n} = O_P\big(\sqrt{\log(dp)}\big)$. Therefore, to bound $\Delta_2$ we need to control $\|M - \Omega\|_1$. We provide such a bound in our next lemma, under a sparsity assumption on the rows of Ω. Define

$$s_\Omega \equiv \max_{i \in [dp]} \big|\{\, j \in [dp] : \Omega_{i,j} \neq 0 \,\}\big|,$$

the maximum sparsity of the rows of Ω. In addition, let the (offline) decorrelating vectors $m_a$ be defined as follows, for $a \in [dp]$:

$$m_a \in \arg\min_{m \in \mathbb{R}^{dp}}\ \frac{1}{2} m^T \hat\Sigma m - \langle m, e_a \rangle + \mu \|m\|_1. \qquad (A.4)$$

Lemma A.1.1. Consider the decorrelating vectors $m_a$, $a \in [dp]$, given by optimization (A.4) with $\mu = 2\tau \sqrt{\log(dp)/n}$. Then, for some proper constant $c > 0$ and under the sample size condition $n \ge 32\alpha(\omega^2 \vee 1) s_\Omega \log(dp)$, the following holds with probability at least $1 - \exp(-c\log(dp^2)) - \exp(-cn(1 \wedge \omega^{-2}))$:

$$\max_{a \in [dp]} \|m_a - \Omega e_a\|_1 \le \frac{192\,\tau}{\alpha}\, s_\Omega \sqrt{\frac{\log(dp)}{n}},$$

where α and ω are defined in Proposition A.6.4.
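Decomposition (A.1) is an exact algebraic identity: writing the offline debiased estimator in the standard form $\hat\theta^{\rm off} = \hat\theta^{\rm L} + \frac{1}{n} M X^T (y - X\hat\theta^{\rm L})$ and substituting $y = X\theta_0 + \varepsilon$ yields (A.1) term by term. A quick numerical check of this identity (a sketch on synthetic data; the ridge-type pilot estimate and the choice of M below are arbitrary stand-ins for illustration, not the Lasso and the program (A.4)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
theta0 = np.zeros(p)
theta0[:2] = 1.0                      # a sparse true parameter
eps = rng.normal(size=n)
y = X @ theta0 + eps

Sigma_hat = X.T @ X / n
# Any pilot estimate works for the identity; here a simple ridge estimate.
theta_L = np.linalg.solve(Sigma_hat + 0.5 * np.eye(p), X.T @ y / n)
M = np.linalg.inv(Sigma_hat + 0.1 * np.eye(p))   # any decorrelating matrix

# Debiased estimator and the right-hand side of (A.1):
theta_off = theta_L + M @ X.T @ (y - X @ theta_L) / n
rhs = theta0 + (np.eye(p) - M @ Sigma_hat) @ (theta_L - theta0) + M @ X.T @ eps / n
assert np.allclose(theta_off, rhs)   # (A.1) holds exactly, for any M
```

The point of the check is that (A.1) requires no assumptions on M; sparsity of Ω only enters when bounding the resulting bias terms.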
The proof of Lemma A.1.1 is deferred to Section A.7.2. By employing this lemma, if Ω is sufficiently sparse, that is $s_\Omega = o(\sqrt{n}/\log(dp))$, then the bias term $\|\Delta_2\|_\infty$ also vanishes asymptotically and the (offline) debiased estimator $\hat\theta^{\rm off}$ admits an unbiased normal distribution. We formalize this distributional characterization in the next theorem.

Theorem A.1.2. Consider the VAR(d) model (1.5) for time series and let $\hat\theta^{\rm off}$ be the (offline) debiased estimator (1.3), with the decorrelating matrix $M = (m_1, \dots, m_{dp})^T \in \mathbb{R}^{dp \times dp}$ constructed as in (A.4), with $\mu = 2\tau\sqrt{\log(dp)/n}$. Also, let $\lambda = \lambda_0 \sqrt{\log(dp)/n}$ be the regularization parameter in the Lasso estimator $\hat\theta^{\rm L}$, with $\tau, \lambda_0$ large enough constants. Suppose that $s_0 = o(\sqrt{n}/\log(dp))$ and $s_\Omega = o(\sqrt{n}/\log(dp))$. Then the following holds true for any fixed sequence of integers $a(n) \in [dp]$: for all $x \in \mathbb{R}$, we have

$$\lim_{n \to \infty}\ \sup_{\|\theta_0\|_0 \le s_0}\ \left| \mathbb{P}\!\left\{ \frac{\sqrt{n}\,(\hat\theta^{\rm off}_a - \theta_{0,a})}{\sqrt{V_{n,a}}} \le x \right\} - \Phi(x) \right| = 0, \qquad (A.5)$$

where $V_{n,a} \equiv \sigma^2 (M \hat\Sigma M^T)_{a,a}$. We refer to Section A.7.3 for the proof of Theorem A.1.2.

[Figure A.1 appears here: (a) heat map of magnitudes of entries of $\Omega = E(x_i x_i^T)^{-1}$; (b) histogram of rescaled residuals; (c) QQ plot of rescaled residuals; (d) PP plot of rescaled residuals.]

Figure A.1: A simple example of a VAR(d) process with parameters p = 25, d = 3, T = 70, and noise covariance matrix $\Sigma_\zeta$ such that $\Sigma_\zeta(i,j) = \rho^{|i-j|}$ with ρ = 0.1. The matrices $A_i$ have independent entries drawn from $b \cdot \text{Bern}(q) \cdot \text{Unif}(\{+1,-1\})$ with b = 0.15, q = 0.05. Normality of the rescaled residuals (Figures A.1b, A.1c, and A.1d) validates the performance of the offline debiased estimator under sparsity of the precision matrix Ω (Figure A.1a), as discussed in Theorem A.1.2.

Numerical example.
Consider a VAR(d) model with parameters p = 25, d = 3, T = 70, and Gaussian noise terms with covariance matrix $\Sigma_\zeta$ satisfying $\Sigma_\zeta(i,j) = \rho^{|i-j|}$ for ρ = 0.1. Let the $A_i$ matrices have entries generated independently from $b \cdot \text{Bern}(q) \cdot \text{Unif}(\{+1,-1\})$ with parameters b = 0.15, q = 0.05. Figure A.1a shows the magnitudes of the entries of the precision matrix $\Omega = E(x_i x_i^T)^{-1}$; as we see, Ω is sparse. Figures A.1b, A.1c, and A.1d demonstrate normality of the rescaled residuals of the offline debiased estimator built with the decorrelating matrix M whose rows come from the optimization described in (A.4).

After this paper was posted, we learned of simultaneous work (an updated version of [11]) that also studies the performance of the (offline) debiased estimator for time series with a sparse precision matrix. We would like to highlight some of the differences between our discussion in Section A.1 and that paper: 1) [11] considers a decorrelating matrix M constructed by an optimization of the form (2.5), using the entire sample covariance $\hat\Sigma^{(K)}$, while we work with the Lagrangian equivalent (A.4). 2) [11] considers the VAR(1) model, while we work with VAR(d) models. 3) [11] assumes a stronger notion of sparsity, viz. the sparsity of the entire precision matrix as well as the transition matrix scaling as $o(\sqrt{n}/\log p)$. Our results only require the row-wise sparsity of the precision matrix to scale as $o(\sqrt{n}/\log p)$, cf. Theorem A.1.2.

We would also like to discuss the related work [208], which studies the problem of statistical inference on the coefficients of autoregressive models. This work follows the same idea of debiasing, but uses offline debiasing, in our terminology. Specifically, they adopt the framework of [132] and propose a high-dimensional test statistic based on the score function, called the decorrelated score function.
It is shown there that, under proper sparsity assumptions on the model coefficients and the precision matrix, their bias-corrected estimator achieves asymptotic Gaussianity. The work [208] considers simultaneous (group) inference on a fixed number of coefficients, and in the univariate case (testing an individual coefficient) their sparsity assumption becomes equivalent to the assumptions of Theorem A.1.2 on $s_0$ and $s_\Omega$. However, the decorrelated score matrix in [208] is constructed by the Lasso or the Dantzig selector, which differs from our proposal in (A.4). Let us reiterate that the method of (A.4) is an offline debiasing approach and hence, as discussed in Section 2.1.1 (Figure 1.2), may in general fail to provide valid statistical tests for time series. Apart from this point, in order to further delineate the differences between online debiasing and offline bias-correction methods such as [208], we consider a set of numerical examples with a sufficiently sparse precision matrix, where the test of [208] is also guaranteed to have valid statistical performance. We use the software package of [208] for an implementation of their method, and consider configurations inspired by the package's built-in example. We consider the VAR(1) model where time-series samples are generated from $X_t = A X_{t-1} + \varepsilon_t$, with $\varepsilon_t \sim \text{Unif}[0,1]$. The generative matrix A is of the form $A = \text{diag}(A_0, A_0, \dots, A_0)$ with $A_0$ a symmetric $2 \times 2$ matrix:

$$A_0 = \begin{pmatrix} q_2 & q_1 \\ q_1 & q_2 \end{pmatrix}. \qquad (A.6)$$

Each configuration is determined by the number of samples T, the primary generative matrix coefficients $q_1, q_2$, and the matrix dimension p. For each configuration, we test all coordinates of A and report the two measures true positive rate (TPR) and false positive rate (FPR), along with the running time of the algorithms. Note that the method of [208] outputs two test statistics U, R, which are constructed in almost identical ways (except for the last step) and therefore have the same running time.
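For concreteness, the data-generating process of this experiment can be sketched as follows (a minimal sketch in Python for illustration, although the paper's released implementation is in R; the function names are ours, the initial state is taken to be zero, and p is assumed even):

```python
import numpy as np

def make_generative_matrix(p, q1, q2):
    """A = diag(A0, ..., A0) with symmetric 2x2 blocks
    A0 = [[q2, q1], [q1, q2]], as in (A.6)."""
    A0 = np.array([[q2, q1], [q1, q2]])
    A = np.zeros((p, p))
    for k in range(p // 2):
        A[2 * k:2 * k + 2, 2 * k:2 * k + 2] = A0
    return A

def simulate_var1(A, T, rng):
    """Generate T samples from X_t = A X_{t-1} + eps_t with Unif[0,1] noise."""
    p = A.shape[0]
    X = np.zeros((T, p))                      # X_0 = 0 as the initial state
    for t in range(1, T):
        X[t] = A @ X[t - 1] + rng.uniform(0.0, 1.0, size=p)
    return X

rng = np.random.default_rng(0)
A = make_generative_matrix(p=6, q1=1/15, q2=1/15)   # first configuration of Table A.1
X = simulate_var1(A, T=400, rng=rng)
```

With $q_1 + q_2 < 1$ the spectral radius of each block is below one, so the simulated series is stable.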
Table A.1 demonstrates the statistical performance of our online debiasing and the bias-correction method of [208]. The reported values are averaged over 10 independent experiments, and the running times are in seconds. The first interesting observation concerns statistical power: online debiasing and [208] have comparable performance. On the one hand, the approach of [208] uses the Lasso or the Dantzig selector to construct the decorrelated score and hence searches over the space of sparse matrices, whereas online debiasing searches over the larger space of approximately sparse matrices (cf. optimization (2.5)). This factor works in favor of online debiasing potentially having higher power. On the other hand, the online debiasing framework uses samples gradually, and the decorrelating matrices $M^{(\ell)}$ are constructed from subsets of samples. This factor works in favor of offline debiasing methods, which use the entire sample in constructing the decorrelated score matrix. The interplay between these two factors leads the two methods to have comparable statistical power.

Configuration          Online Debiasing         U test                    R test
(p, T, q1, q2)         TPR    FPR    Time       TPR    FPR    Time       TPR    FPR    Time
(6, 400, 1/15, 1/15)   0.35   0.046  12.52      0.325  0.067  241.90     0.325  0.067  241.90
(6, 600, 1/15, 1/15)   0.375  0.07   14.53      0.36   0.05   376.27     0.358  0.05   376.27
(6, 1000, 1/15, 1/15)  0.65   0.062  16.54      0.63   0.064  660.32     0.63   0.064  660.32
(8, 300, 1/2, 1/4)     0.993  0.02   5.63       1      0.037  348.71     1      0.031  348.71
(8, 300, 1/4, 1/8)     0.762  0.025  17.60      0.793  0.031  335        0.793  0.031  335
(8, 300, 1/8, 1/16)    0.35   0.035  5.62       0.493  0.043  361.49     0.493  0.043  361.49

Table A.1: Overall performance of online debiasing (test (2.34)) and the U-test and R-test proposed by [208] for testing the entries of the generative matrix A for a VAR(1) model. We consider $A = \text{diag}(A_0, \dots, A_0)$ with $A_0$ given by (A.6).
For each configuration, we report the true positive rate (TPR), false positive rate (FPR), and the running time of each test. The reported numbers are averaged over 10 independent realizations of each configuration. The running times are in seconds.

The other interesting observation concerns the running time of the two methods: online debiasing enjoys a significantly lower running time. In fact, the online debiasing method with T samples has log T episodes, and for each episode a $dp \times dp$ debiasing matrix is constructed by solving dp optimization problems, because each matrix is built row by row. A delicate point we would like to make is that in online debiasing, at each round we focus on one row, say i, of the generative matrices $A^{(1)}, \dots, A^{(d)}$ (stacked together as in equation (2.2)), and construct the decorrelating matrices $M^{(\ell)}$. However, these decorrelating matrices only depend on the covariate matrix (X in (2.2)) and so do not change across rounds. That said, one needs to compute them only once for all rows $i \in [p]$. The approach of [208], on the other hand, constructs a separate score vector for each entry of the generative matrices $A^{(1)}, \dots, A^{(d)}$, which is computationally much more demanding.

A.2 Estimating noise variance for VAR model

Define $\tilde V_{n,a} \equiv \frac{1}{n} \sum_{\ell=1}^{K-1} \sum_{t \in E_\ell} \langle m_a^\ell, x_t \rangle^2$. Note that $V_{n,a} = \sigma^2 \tilde V_{n,a}$, and calculating $\tilde V_{n,a}$ does not require knowledge of $\sigma^2$. We define

$$z_a \equiv \sqrt{\frac{n}{\tilde V_{n,a}}}\; \hat\theta^{\rm on}_a.$$

Using the distributional characterization of the online debiased estimator $\hat\theta^{\rm on}$, and by an argument very similar to Theorem 2.1.8, we know that for $a \notin \text{supp}(\theta_0)$ we have $\theta_{0,a} = 0$, and so $z_a$ is asymptotically zero-mean Gaussian with variance $\sigma^2$. This suggests using the following MAD (median absolute deviation) estimator for σ. Let $|z|$ be the vector of absolute values of z, i.e., $|z| = (|z_1|, |z_2|, \dots, |z_{dp}|)$. Denote by $|z|_{(a)}$ its a-th smallest entry, i.e., $|z|_{(1)} \le |z|_{(2)} \le \dots \le |z|_{(dp)}$.
We then estimate σ using the MAD estimator

$$\hat\sigma = \frac{|z|_{(dp/2)}}{\Phi^{-1}(3/4)}. \qquad (A.7)$$

A similar variance estimator was proposed by [125] in the context of approximate message passing, and in [89] for the (offline) debiased estimator. The main idea here is that the MAD estimator is robust to outliers, and hence including the entries $z_a$ with $a \in \text{supp}(\theta_0)$ has a negligible asymptotic contribution to the estimate $\hat\sigma$, given that $s_0 = o(p)$.

A.3 Robustness to the episode growth rate

We follow the guideline in Section 2.1.1 to choose the batch sizes so that the lengths of the episodes grow exponentially; namely $r_\ell = \beta^\ell$, for a constant β > 1 and ℓ ≥ 1. As proved in Theorem 2.1.4, for any constant β > 1, the online debiased estimator is asymptotically unbiased assuming $s_0 \log(dp)/\sqrt{n} = o(1)$, and results in valid statistical inference (controlling type I error in the context of hypothesis testing and producing confidence intervals with the target coverage). In this section, we investigate the robustness of these outputs (p-values and confidence intervals) with respect to the choice of the tuning parameter β. To this end, we consider the VAR(1) time-series setup (1.5) with problem dimension p = 20 and noise covariance $\Sigma_\zeta(i,j) = 0.1^{|i-j|}$. The entries of the time-series generative matrix A are chosen i.i.d. from a Bernoulli distribution with success probability 0.01, and then multiplied by a number chosen uniformly from the set {−2, +2}, i.e., $A_{ij} \sim \text{Bern}(0.01) \cdot \text{Unif}(\{-2, +2\})$. Fixing the matrix A, we generate T = 200 samples $X_{1:200}$ and run online debiasing with the tuning parameter β picked from a grid of 11 equidistant points over the interval [2, 4], i.e., β ∈ {2, 2.2, 2.4, ..., 3.8, 4}. We compute the average length of confidence intervals and the average p-values across 100 experiments (independent realizations of $X_{1:200}$). For each coordinate, we end up with 11 × 2 numbers.
Plots A.2a and A.2b show boxplots of the average p-values and of the average lengths of the 80% confidence intervals, respectively, for six selected entries of A, three of which are truly zero and three of which are nonzero. As we see, the output of the online debiasing approach is relatively robust to the choice of the episode growth rate β. Recalling the coefficient of variation (CV), a measure of variability defined as the ratio of the standard deviation to the mean, Table A.2 gives the coefficient of variation of the p-values and of the lengths of the 80% confidence intervals for the six coordinates across the 11 choices of β. The small CVs indicate the robustness of the p-values and confidence intervals to the specific choice of β.

[Figure A.2 appears here: (a) boxplot of p-values; (b) boxplot of confidence interval lengths.]

Figure A.2: Boxplots of p-values and CI lengths for 3 zero and 3 nonzero coordinates of the autoregressive matrix A, for 11 choices of the episode growth rate β ∈ {2, 2.2, ..., 3.8, 4}. The variation in each box is small, which indicates the robustness of the online debiasing framework with respect to the choice of the episode growth rate β.

                  zero 1   zero 2   zero 3   nonzero 1   nonzero 2   nonzero 3
CV of p-value     0.0406   0.0248   0.0319   0.928       0.545       0.610
CV of CI length   0.0741   0.0366   0.107    0.047       0.036       0.049

Table A.2: Coefficient of variation (CV) of the p-values and the confidence interval lengths for the six selected entries of the matrix A.

A.4 Implementation and extensions

A.4.1 Iterative schemes to implement online debiasing

The online debiased estimator (2.6) involves the decorrelating matrices $M^{(\ell)}$, whose rows $(m_a^\ell)_{a \in [dp]}$ are constructed by the optimization (2.5).
For the sake of computational efficiency, it is useful to work with a Lagrangian equivalent of this optimization. Consider the following problem:

$$\min_{\|m\|_1 \le L}\ \frac{1}{2} m^T \hat\Sigma^{(\ell)} m - \langle m, e_a \rangle + \mu_\ell \|m\|_1, \qquad (A.8)$$

with $\mu_\ell$ and L taking the same values as in optimization (2.5). The next result, from [83, Chapter 5], concerns the connection between the solutions of problem (A.8) and problem (2.5). For the reader's convenience, the proof is also given in Appendix A.7.1.

Lemma A.4.1. A solution of optimization (A.8) is also a solution of the optimization problem (2.5). Also, if problem (2.5) is feasible, then problem (A.8) has a bounded solution.

Using the above lemma, we can instead work with the Lagrangian version (A.8) to construct the decorrelating vector $m_a^\ell$. We propose to solve optimization problem (A.8) using iterative methods. Note that the objective function evolves only slightly at each episode, and hence we expect the solutions $m_a^\ell$ and $m_a^{\ell+1}$ to be close to each other. An appealing property of iterative methods is that we can leverage this observation by setting $m_a^\ell$ as the initialization for the iterations that compute $m_a^{\ell+1}$, yielding shorter convergence time. In the sequel we discuss two such iterative schemes.

A.4.1.1 Coordinate descent algorithms

In this method, at each iteration we update one coordinate of m, say $m_j$, while fixing the other coordinates. We write the objective function of (A.8) by separating $m_j$ from the other coordinates:

$$\frac{1}{2}\hat\Sigma^{(\ell)}_{j,j} m_j^2 + m_j\, \hat\Sigma^{(\ell)}_{j,\sim j} m_{\sim j} + \frac{1}{2}\sum_{r,s \neq j} \hat\Sigma^{(\ell)}_{r,s} m_r m_s - m_a + \mu_\ell \|m_{\sim j}\|_1 + \mu_\ell |m_j|, \qquad (A.9)$$

where $\hat\Sigma^{(\ell)}_{j,\sim j}$ denotes the j-th row (column) of $\hat\Sigma^{(\ell)}$ with $\hat\Sigma^{(\ell)}_{j,j}$ removed. Likewise, $m_{\sim j}$ represents the restriction of m to the coordinates other than j. Minimizing (A.9) with respect to $m_j$ gives

$$m_j + \frac{1}{\hat\Sigma^{(\ell)}_{j,j}}\left( \hat\Sigma^{(\ell)}_{j,\sim j} m_{\sim j} - I(a = j) + \mu_\ell\, \text{sign}(m_j) \right) = 0.$$
It is easy to verify that the solution of the above is given by

$$m_j = \frac{1}{\hat\Sigma^{(\ell)}_{j,j}}\, \eta\!\left( -\hat\Sigma^{(\ell)}_{j,\sim j} m_{\sim j} + I(a = j);\ \mu_\ell \right), \qquad (A.10)$$

with $\eta(\cdot;\cdot): \mathbb{R} \times \mathbb{R}_+ \to \mathbb{R}$ denoting the soft-thresholding function defined as

$$\eta(z, \mu) = \begin{cases} z - \mu & \text{if } z > \mu, \\ 0 & \text{if } -\mu \le z \le \mu, \\ z + \mu & \text{otherwise.} \end{cases} \qquad (A.11)$$

For a vector u, $\eta(u;\mu)$ is applied entry-wise.

This brings us to the following update rule to compute $m_a^\ell \in \mathbb{R}^{dp}$ (the solution of (A.8)). The notation $\Pi_L$, in line 5 below, denotes the Euclidean projection onto the $\ell_1$ ball of radius L, and can be computed in O(dp) time using the procedure of [50].

1: (initialization): $m(0) \leftarrow m_a^{(\ell-1)}$
2: for iteration h = 1, ..., H do
3:    for j = 1, 2, ..., dp do
4:       $m_j(h) \leftarrow \frac{1}{\hat\Sigma^{(\ell)}_{j,j}}\, \eta\big( -\hat\Sigma^{(\ell)}_{j,\sim j} m_{\sim j}(h-1) + I(a = j);\ \mu_\ell \big)$
5:    $m(h) \leftarrow \Pi_L(m(h))$
6: return $m_a^\ell \leftarrow m(H)$

In our experiments, we implemented the coordinate descent iterations explained above to solve for the decorrelating vectors $m_a^\ell$.

A.4.1.2 Gradient descent algorithms

Letting $\mathcal{L}(m) = (1/2)\, m^T \hat\Sigma^{(\ell)} m - \langle m, e_a \rangle$, we can write the objective of (A.8) as $\mathcal{L}(m) + \mu_\ell \|m\|_1$. Projected gradient descent, applied to this constrained objective, results in a sequence of iterates m(h), with h = 0, 1, 2, ... the iteration number, as follows:

$$m(h+1) = \arg\min_{\|m\|_1 \le L} \Big\{ \mathcal{L}(m(h)) + \langle \nabla\mathcal{L}(m(h)),\, m - m(h) \rangle + \frac{\eta}{2}\|m - m(h)\|_2^2 + \mu_\ell \|m\|_1 \Big\}. \qquad (A.12)$$

In words, the next iterate m(h+1) is obtained by constrained minimization of a first-order approximation to $\mathcal{L}(m)$, combined with a smoothing term that keeps the next iterate close to the current one. Since the objective function is convex ($\hat\Sigma^{(\ell)} \succeq 0$), the iterates (A.12) are guaranteed to converge to the global minimum of (A.8). Plugging in $\mathcal{L}(m)$ and dropping the constant term $\mathcal{L}(m(h))$, update (A.12) reads as

$$m(h+1) = \arg\min_{\|m\|_1 \le L} \Big\{ \langle \hat\Sigma^{(\ell)} m(h) - e_a,\, m - m(h) \rangle + \frac{\eta}{2}\|m - m(h)\|_2^2 + \mu_\ell \|m\|_1 \Big\} = \arg\min_{\|m\|_1 \le L} \Big\{ \frac{\eta}{2}\Big\| m - m(h) + \frac{1}{\eta}\big(\hat\Sigma^{(\ell)} m(h) - e_a\big) \Big\|_2^2 + \mu_\ell \|m\|_1 \Big\}.
\qquad (A.13)$$

To compute the update (A.13), we first solve the unconstrained problem, which has the closed-form solution

$$\eta\!\left( m(h) - \frac{1}{\eta}\big(\hat\Sigma^{(\ell)} m(h) - e_a\big);\ \frac{\mu_\ell}{\eta} \right),$$

with η the soft-thresholding function given by (A.11). The solution is then projected onto the $\ell_1$ ball of radius L. In the following box, we summarize the projected gradient descent update rule for constructing the decorrelating vectors $m_a^\ell$.

1: (initialization): $m(0) \leftarrow m_a^{(\ell-1)}$
2: for iteration h = 1, ..., H do
3:    $m(h) \leftarrow \eta\big( m(h-1) - \frac{1}{\eta}(\hat\Sigma^{(\ell)} m(h-1) - e_a);\ \frac{\mu_\ell}{\eta} \big)$
4:    $m(h) \leftarrow \Pi_L(m(h))$
5: return $m_a^\ell \leftarrow m(H)$

A.5 Numerical experiments

In this section, we evaluate the performance of the online debiasing framework on synthetic data. In the interest of reproducibility, an R implementation of our algorithm is publicly available at http://faculty.marshall.usc.edu/Adel-Javanmard/OnlineDebiasing.

Consider the VAR(d) time-series model (1.5). In the first setting, we let p = 20, d = 3, T = 50 and construct the covariance matrix of the noise terms $\Sigma_\zeta$ by putting 1 on its diagonal and ρ = 0.3 on its off-diagonal. To bring the setting closer to practice, instead of considering sparse coefficient matrices we work with approximately sparse matrices. Specifically, the entries of $A^{(i)}$ are generated independently from a Bernoulli distribution with success probability q = 0.1, multiplied by $b \cdot \text{Unif}(\{+1,-1\})$ with b = 0.1, and then added to a Gaussian matrix with mean 0 and standard error 1/p. In formula, each entry is generated independently as

$$b \cdot \text{Bern}(q) \cdot \text{Unif}(\{+1,-1\}) + N(0, 1/p^2).$$

We used $r_0 = 6$ (the length of the first episode $E_0$) and β = 1.3 for the lengths of the other episodes, $|E_\ell| \sim \beta^\ell$. For each $i \in [p]$ we do the following. Let $\theta_0 = (A^{(1)}_i, A^{(2)}_i, \dots, A^{(d)}_i)^T \in \mathbb{R}^{dp}$ encode the i-th rows of the matrices $A^{(\ell)}$, and compute the noise component of $\hat\theta^{\rm on}$ as

$$W_n \equiv \frac{1}{\sqrt{n}} \sum_{\ell=0}^{K-1} M^{(\ell)} \sum_{t \in E_\ell} x_t \varepsilon_t, \qquad (A.14)$$

and the rescaled residuals $T_n \in \mathbb{R}^{dp}$ with $T_{n,a} = \sqrt{\frac{n}{V_{n,a}}}\,(\hat\theta^{\rm on}_a - \theta_{0,a})$, with $V_{n,a}$ given by Equation (2.14) and σ = 1.
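The approximately sparse coefficient matrices in this setting can be generated as in the following sketch (Python for illustration, although the released implementation is in R; the function name is ours):

```python
import numpy as np

def approx_sparse_matrix(p, q=0.1, b=0.1, rng=None):
    """Entries drawn as b * Bern(q) * Unif({+1,-1}) + N(0, 1/p^2):
    a sparse signed pattern plus a small dense Gaussian perturbation."""
    rng = rng or np.random.default_rng()
    pattern = rng.binomial(1, q, size=(p, p))        # Bern(q) support mask
    signs = rng.choice([-1.0, 1.0], size=(p, p))     # Unif({+1, -1})
    noise = rng.normal(0.0, 1.0 / p, size=(p, p))    # N(0, 1/p^2), std 1/p
    return b * pattern * signs + noise

rng = np.random.default_rng(0)
A = approx_sparse_matrix(p=20, rng=rng)   # one of the d coefficient matrices
```

The Gaussian perturbation makes every entry nonzero, so A is only approximately sparse, which is the point of this setting.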
The left and right plots of Figure A.3 show the QQ-plots, PP-plots, and histograms of the noise terms $W_n$ and of the rescaled residuals $T_n$, respectively, with all coordinates (across all $i \in [p]$ and $a \in [dp]$) stacked together.

True and false positive rates. Consider the linear time-series model (1.5) with the $A^{(i)}$ matrices having entries drawn independently from the distribution $b \cdot \text{Bern}(q) \cdot \text{Unif}(\{+1,-1\})$ and Gaussian noise terms with covariance matrix $\Sigma_\zeta$. In this example, we evaluate the performance of our proposed online debiasing method for constructing confidence intervals and hypothesis testing, as discussed in Section 2.3.

[Figure A.3 appears here: (a) histogram of noise terms $W_n$; (b) histogram of residuals $T_n$; (c) QQ plot of noise terms $W_n$; (d) QQ plot of residuals $T_n$; (e) PP plot of noise terms $W_n$; (f) PP plot of residuals $T_n$.]

Figure A.3: A simple example of an online debiased VAR(3) process with dimension p = 20 and T = 50 sample data points. Plots A.3a, A.3c, and A.3e show, respectively, the histogram, QQ-plot, and PP-plot of the noise values for all $dp^2 = 1200$ entries of the $A_i$ matrices in the linear time-series model (1.5). Plots A.3b, A.3d, and A.3f are the histogram, QQ-plot, and PP-plot of the rescaled residuals of all coordinates. The alignment of the data points with the corresponding standard normal N(0,1) reference corroborates our theoretical results on the asymptotic normal behavior of the noise terms and rescaled residuals, discussed in Corollary 2.1.7 and Proposition 2.1.8, respectively.
We consider four metrics: True Positive Rate (TPR), False Positive Rate (FPR), average length of confidence intervals (Avg CI length), and coverage rate of confidence intervals. Tables A.3 and A.4 summarize the results for various configurations of the VAR(d) processes, at significance level α = 0.05. Table A.3 corresponds to the cases where the noise covariance has the structure Σ_ζ(i,j) = 0.1^|i−j|, and Table A.4 corresponds to the case of Σ_ζ(i,j) = 0.1^I(i≠j). The reported measures for each configuration (each row of the table) are averaged over 20 realizations of the VAR(d) model.

        p    T    q      b     FPR     TPR     Avg CI length  Coverage rate
  d=1   40   30   0.01   2     0.0276  1       3.56           0.9725
        35   30   0.01   2     0.0354  0.9166  3.7090         0.9648
        60   55   0.01   0.9   0.0314  0.7058  2.5933         0.9686
  d=2   55   100  0.01   0.8   0.0424  0.8000  1.9822         0.9572
        40   75   0.01   0.9   0.0343  0.9166  2.5166         0.9656
        50   95   0.01   0.7   0.0368  0.6182  2.4694         0.963
  d=3   45   130  0.005  0.9   0.0370  0.6858  2.070          0.9632
        40   110  0.01   0.7   0.0374  0.6512  2.1481         0.9623
        50   145  0.005  0.85  0.0369  0.6327  2.2028         0.9631

Table A.3: Evaluation of the online debiasing approach for statistical inference on the coefficients of a VAR(d) model under different configurations. Here the noise terms ζ_i are Gaussian with covariance matrix Σ_ζ(i,j) = 0.1^|i−j|.
The results are reported in terms of four metrics: FPR (False Positive Rate), TPR (True Positive Rate), coverage rate and average length of confidence intervals (Avg CI length), at significance level α = 0.05.

        p    T    q      b     FPR     TPR     Avg CI length  Coverage rate
  d=1   40   30   0.01   2     0.0402  1       3.5835         0.96
        40   35   0.02   1.2   0.0414  0.8125  2.6081         0.9575
        50   40   0.015  0.9   0.0365  0.7435  2.0404         0.9632
  d=2   35   65   0.01   0.9   0.0420  0.8077  2.4386         0.9580
        45   85   0.01   0.9   0.0336  0.7298  2.5358         0.9655
        50   70   0.01   0.95  0.0220  0.8333  2.4504         0.9775
  d=3   40   115  0.01   0.9   0.0395  0.7906  1.6978         0.9598
        45   130  0.005  0.95  0.0359  0.7714  2.1548         0.9641
        50   145  0.005  0.85  0.0371  0.5918  2.1303         0.9624

Table A.4: Evaluation of the online debiasing approach for statistical inference on the coefficients of a VAR(d) model under different configurations. Here the noise terms ζ_i are Gaussian with covariance matrix Σ_ζ(i,j) = 0.1^I(i≠j). The results are reported in terms of four metrics: FPR (False Positive Rate), TPR (True Positive Rate), coverage rate and average length of confidence intervals (Avg CI length), at significance level α = 0.05.

A.5.1 Real data experiments: a marketing application

Retailers often offer sales across various categories of products, and for effective management of the business they need to understand the cross-category effects of products on each other, e.g., how the price, promotion or sales of category A will affect the sales of category B after some time. We used data on sales, prices and promotions from the Chicago-area grocery store chain Dominick's, which is publicly available at https://research.chicagobooth.edu/kilts/marketing-databases/dominicks. The same data set has been used in [62], where a sparse VAR model is fit to the data, and in [194], where a VARX model is employed to estimate the demand effects (VARX models incorporate the effect of unmodeled exogenous variables (X) into the VAR). In this experiment, we use the proposed online debiasing approach to provide p-values for the category effects.
We consider 11 categories of products* over 71 weeks, so for each week t we have information x_t ∈ R^33 on the sales, prices and promotions of the 11 categories. For a thorough explanation of how sales, prices and promotions are calculated, we refer to [167] and [62]. We posit a VAR(2) model as the generating process for the covariates x_t and then apply our proposed online debiasing method to calculate two-sided p-values for null hypotheses of the form H_0: θ_{0,a} = 0, with θ_{0,a} an entry in the VAR model, as discussed earlier in Section 2.3 (see Eq. (2.33)). By running the Benjamini–Yekutieli procedure [20] (with the log-factor correction to account for dependence among the p-values), we obtain the following statistically significant cross-category associations at level 0.05: sales of canned tuna on sales of front-end-candies after one week, with p-value 5.8e-05, and price of crackers on sales of canned tuna after one week, with p-value 5.5e-04. In [62], sparse VAR models are used to construct networks of interlinked product categories, but those networks are not accompanied by statistical measures such as p-values. Our online debiasing method here provides p-values for individual possible cross-category associations. In the rest of this section we report the p-values obtained by online debiasing for the cross-category effects. Figures A.4, A.5 and A.6 provide the p-values corresponding to the effects of prices, sales, and promotions of different categories on the other categories, after one week (d = 1) and two weeks (d = 2). The darker cells indicate smaller p-values and hence higher statistical significance.
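The Benjamini–Yekutieli step can be sketched as follows. This is a textbook implementation of the BY procedure, with the harmonic-sum ("log factor") correction that makes FDR control valid under arbitrary dependence among the p-values; it is our own illustration, not the exact code used for these experiments.

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """Benjamini-Yekutieli FDR control under arbitrary dependence.
    Rejects the hypotheses with the k smallest p-values, where k is the largest
    index with p_(k) <= k * alpha / (m * c(m)) and c(m) = sum_{i=1}^m 1/i."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    c_m = np.sum(1.0 / np.arange(1, m + 1))      # harmonic-sum correction
    order = np.argsort(p)
    thresh = np.arange(1, m + 1) * alpha / (m * c_m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest k passing its threshold
        reject[order[: k + 1]] = True
    return reject
```

In practice one would collect the two-sided p-values for all candidate cross-category effects into a single array and pass it to this routine with alpha = 0.05.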
∗ Bottled Juices, Cereals, Cheeses, Cookies, Crackers, Canned Soup, Front-end-Candies, Frozen Juices, Soft Drinks, Snack Crackers and Canned Tuna

[Figure A.4 appears here: two 11 × 11 heatmaps of p-values over the product categories, with darker cells indicating smaller p-values. Panel (a): 1-week effect of sales of x-axis categories on sales of y-axis categories. Panel (b): 1-week effect of prices of x-axis categories on sales of y-axis categories.]

Figure A.4: Figures A.4a and A.4b respectively show the p-values for cross-category effects of sales and prices of x-axis categories on sales of y-axis categories after one week.

[Figure A.5 appears here: two 11 × 11 heatmaps of p-values over the product categories. Panel (a): 1-week effect of promotions of x-axis categories on sales of y-axis categories. Panel (b): 2-week effect of promotions of x-axis categories on sales of y-axis categories.]

Figure A.5: Figures A.5a and A.5b show the p-values for cross-category effects of promotions of x-axis categories on sales of y-axis categories, after one week and two weeks.
[Figure A.6 appears here: two 11 × 11 heatmaps of p-values over the product categories. Panel (a): 2-week effect of sales of x-axis categories on sales of y-axis categories. Panel (b): 2-week effect of prices of x-axis categories on sales of y-axis categories.]

Figure A.6: Figures A.6a and A.6b respectively show the p-values for cross-category effects of sales and prices of x-axis categories on sales of y-axis categories after two weeks.

A.6 Proofs of Section 3

A.6.1 Technical preliminaries

Recall the definition of the regression design from Eq. (2.2) in the time series case:

θ_0 = (A^(1)_i, A^(2)_i, ..., A^(d)_i)^T,

X =
⎡ z_d^T      z_{d−1}^T   ⋯   z_1^T     ⎤
⎢ z_{d+1}^T  z_d^T       ⋯   z_2^T     ⎥
⎢ ⋮          ⋮           ⋱   ⋮         ⎥
⎣ z_{T−1}^T  z_{T−2}^T   ⋯   z_{T−d}^T ⎦,

y = (z_{d+1,i}, z_{d+2,i}, ..., z_{T,i}),  ε = (ζ_{d+1,i}, ζ_{d+2,i}, ..., ζ_{T,i}).

We first establish some preliminary results for stable time series. For the stationary process x_t = (z_{t+d−1}^T, ..., z_t^T)^T (the rows of X), let Γ_x(s) = Cov(x_t, x_{t+s}) for t, s ∈ Z, and define the spectral density f_x(r) ≡ (1/2π) Σ_{ℓ=−∞}^{∞} Γ_x(ℓ) e^{−jℓr} for r ∈ [−π, π]. The measure of stability of the process is defined as the maximum eigenvalue of the spectral density,

M(f_x) ≡ sup_{r∈[−π,π]} σ_max(f_x(r)).  (A.15)

Likewise, the minimum eigenvalue of the spectrum is defined as m(f_x) ≡ inf_{r∈[−π,π]} σ_min(f_x(r)), which captures the dependence among the covariates. (Note that for the case of i.i.d. samples, M(f_x) and m(f_x) reduce to the maximum and minimum eigenvalues of the population covariance.)

The p-dimensional VAR(d) model (1.5) can be represented as a dp-dimensional VAR(1) model. Recall our notation x_t = (z_{t+d−1}^T, ..., z_t^T)^T (rows of X in (2.2)). Then (1.5) can be written as

x_t = Ã x_{t−1} + ζ̃_t,  (A.16)

with

Ã =
⎡ A_1         A_2   ⋯   A_{d−1}   A_d ⎤
⎣ I_{(d−1)p}                       0  ⎦,
ζ̃_t = ( ζ_{t+d−1}^T, 0 )^T.  (A.17)

The reverse characteristic polynomial for this VAR(1) model reads Ã(z) = I − Ãz.
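The VAR(1) companion form (A.16)-(A.17) is straightforward to construct numerically. The following sketch builds the companion matrix Ã and checks stability through its spectral radius (a stable VAR has all eigenvalues of Ã strictly inside the unit circle); the function names are ours.

```python
import numpy as np

def companion_matrix(A_list):
    """Stack the VAR(d) coefficient matrices A_1,...,A_d (each p x p) into the
    dp x dp companion matrix of the equivalent VAR(1) representation:
        [ A_1  A_2 ... A_{d-1}  A_d ]
        [ I_{(d-1)p}             0  ]"""
    d = len(A_list)
    p = A_list[0].shape[0]
    top = np.hstack(A_list)                                # block row A_1 ... A_d
    bottom = np.hstack([np.eye((d - 1) * p),
                        np.zeros(((d - 1) * p, p))])       # shifted identity block
    return np.vstack([top, bottom])

def is_stable(A_list):
    """Stability of the VAR(d) process: spectral radius of companion matrix < 1."""
    eigvals = np.linalg.eigvals(companion_matrix(A_list))
    return np.max(np.abs(eigvals)) < 1.0
```

For instance, A_1 = 0.5 I and A_2 = 0.1 I give a stable VAR(2), while a VAR(1) with coefficient 1.1 I is unstable.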
The following lemma controls M(f_x) and m(f_x) in terms of the spectral properties of the noise covariance Σ_ζ and the characteristic polynomials A, Ã.

Lemma A.6.1 ([12]). We have:

(1/2π) λ_max(Σ) ≤ M(f_x) ≤ λ_max(Σ_ζ)/µ_min(Ã),  λ_min(Σ) ≥ λ_min(Σ_ζ)/µ_max(A).  (A.18)

We also use the following bound on M(f_x) in terms of the characteristic polynomial A of the time series z_t.

Lemma A.6.2. The following holds:

(1/2π) λ_max(Σ) ≤ M(f_x) ≤ d M(f_z) ≤ d λ_max(Σ_ζ)/µ_min(A).

Proof. Let Γ_x(ℓ) = E[x_t x_{t+ℓ}^T] denote the autocovariance of the dp-dimensional process x_t; in particular, Σ = Γ_x(0). Likewise, the autocovariance Γ_z(ℓ) is defined for the p-dimensional process z_t. We represent Γ_x(ℓ) in terms of d² blocks, each of which is a p × p matrix; the block in position (r, s) is Γ_z(ℓ + r − s). Now, for a vector v ∈ R^{dp} with unit ℓ₂ norm, decompose it into d blocks of p-dimensional vectors, v = (v_1^T, v_2^T, ..., v_d^T)^T, by which we have

v^T Γ_x(ℓ) v = Σ_{1≤r,s≤d} v_r^T Γ_z(ℓ + r − s) v_s.  (A.19)

Since the spectral density f_x(θ) is the Fourier transform of the autocovariance function, we have by Equation (A.19),

⟨v, f_x(θ)v⟩ = (1/2π) Σ_{ℓ=−∞}^{∞} ⟨v, Γ_x(ℓ) e^{−jℓθ} v⟩
= (1/2π) Σ_{ℓ=−∞}^{∞} Σ_{1≤r,s≤d} ⟨v_r, Γ_z(ℓ + r − s) e^{−jℓθ} v_s⟩
= Σ_{1≤r,s≤d} ⟨v_r, ( (1/2π) Σ_{ℓ=−∞}^{∞} Γ_z(ℓ + r − s) e^{−j(ℓ+r−s)θ} ) e^{j(r−s)θ} v_s⟩
= Σ_{1≤r,s≤d} ⟨v_r, f_z(θ) e^{j(r−s)θ} v_s⟩
= V(θ)* f_z(θ) V(θ),

with V(θ) = Σ_{r=1}^{d} e^{−jrθ} v_r. Now, we have:

∥V(θ)∥₂ ≤ Σ_{r=1}^{d} ∥v_r∥₂ ≤ √d ( Σ_{r=1}^{d} ∥v_r∥₂² )^{1/2} = √d.

Combining this with the Rayleigh quotient calculation above yields M(f_x) ≤ d M(f_z). Now, by using [12, Equation (4.1)] for the process z_t, with reverse characteristic polynomial A, we obtain

λ_max(Σ) ≤ 2π M(f_x) ≤ 2πd M(f_z) ≤ d λ_max(Σ_ζ)/µ_min(A).  (A.20)

The following proposition is a straightforward consequence of the spectral bounds above and [12, Proposition 2.4].

Proposition A.6.3.
There exists a constant c > 0 such that for any vectors u, v ∈ R^{dp} with ∥u∥ ≤ 1, ∥v∥ ≤ 1, and any η ≥ 0,

P( |u^T (Σ̂^(ℓ) − Σ) v| > (d λ_max(Σ_ζ)/µ_min(A)) η ) ≤ 6 exp( −c n_ℓ min{η², η} ).  (A.21)

A.6.2 Remarks on the proof of Theorem 2.1.2

The key step in establishing Theorem 2.1.2 is to prove an appropriate "restricted eigenvalue" condition, as follows:

Proposition A.6.4. Let {z_1, ..., z_T} be generated according to the (stable) VAR(d) process (1.5) and let n = T − d. Then there exist constants c ∈ (0,1) and C > 1 such that for all n ≥ Cω² log(dp), with probability at least 1 − exp(−cn/ω²), the sample covariance satisfies

⟨v, (X^T X/n) v⟩ ≥ α∥v∥₂² − ατ∥v∥₁²  for all v ∈ R^{dp}.

Here, α, ω and τ are given by:

ω = d λ_max(Σ_ζ) µ_max(A) / ( λ_min(Σ_ζ) µ_min(A) ),  α = λ_min(Σ_ζ)/(2µ_max(A)),  τ = ω² √(log(dp)/n).  (A.22)

Given Proposition A.6.4, the estimation result of Theorem 2.1.2 is standard (see [26]). Proposition A.6.4 can be proved analogously to [12, Proposition 4.2], with the following considerations and minor modifications:

1. [12] writes the VAR(d) model as a VAR(1) model and then vectorizes the obtained equation to get a linear regression form (cf. Section 4.1 of [12]). In this way, they prove that I ⊗ (X^T X/n) satisfies a restricted eigenvalue property. Towards this, the first step in their proof is to show that X^T X/n satisfies a restricted eigenvalue property, i.e., Proposition A.6.4.

2. [12, Proposition 4.2] assumes n ≥ Ck max{ω², 1} log(dp), with k = Σ_{ℓ=1}^{d} ∥vec(A^(ℓ))∥₀ the total number of nonzero entries of the matrices A^(ℓ), which is later used to obtain τ ≤ 1/(Ck). However, as the restricted eigenvalue condition is independent of the sparsity of the matrices A^(ℓ), we can use their result with k = 1.

3. The proof involves upper bounding M(f_x), for which we use Lemma A.6.2 in lieu of Lemma A.6.1.

A.6.3 Proof of Lemma 2.1.3

The idea is to use Proposition A.6.3 along with the union bound. Fix i, j ∈ [dp] and let u = Ωe_i/∥Ωe_i∥ and v = e_j.
Then:

|(ΩΣ̂^(ℓ) − I)_{ij}| = |⟨Ωe_i, (Σ̂^(ℓ) − Σ) e_j⟩|
= ∥Ωe_i∥ |⟨u, (Σ̂^(ℓ) − Σ) v⟩|
≤ λ_max(Ω) |⟨u, (Σ̂^(ℓ) − Σ) v⟩|
≤ ( µ_max(A)/λ_min(Σ_ζ) ) |⟨u, (Σ̂^(ℓ) − Σ) v⟩|,

where the last line uses Lemma A.6.1 to bound λ_min(Σ) from below. Combining this with Proposition A.6.3, for η ≤ 1:

P{ |(ΩΣ̂^(ℓ) − I)_{ij}| ≥ ωη } ≤ P{ |⟨u, (Σ̂^(ℓ) − Σ) v⟩| ≥ (d λ_max(Σ_ζ)/µ_min(A)) η } ≤ 6 exp(−c n_ℓ η²).

Setting η = C√(log(dp)/n_ℓ) for a large enough constant C, the probability bound above is smaller than (dp)^{−8}. With a union bound over i, j ∈ [dp]:

P{ ∥ΩΣ̂^(ℓ) − I∥_∞ ≥ Cω√(log(dp)/n_ℓ) } ≤ (dp)² sup_{i,j} P{ |(ΩΣ̂^(ℓ) − I)_{ij}| ≥ Cω√(log(dp)/n_ℓ) } ≤ (dp)^{−6}.

This completes the proof.

A.6.4 Proof of Theorem 2.1.4

Starting from the decomposition (2.11), we have √n(θ̂^on − θ_0) = ∆_n + W_n, with ∆_n = B_n(θ̂^L − θ_0). As explained below (2.11), W_n is a martingale with respect to the filtration F_j = σ(ε_1, ..., ε_j), j ∈ N, and hence E(W_n) = 0. We also note that ∥∆_n∥_∞ ≤ ∥B_n∥_∞ ∥θ̂^L − θ_0∥₁. Our next lemma bounds ∥B_n∥_∞.

Lemma A.6.5. Suppose that the decorrelating matrices M^(ℓ) are computed according to Eq. (2.5), with µ_ℓ = Cω√(log(dp)/n_ℓ) and L ≥ ∥Ω∥₁. Let ω and γ be:

ω = d µ_max(A) λ_max(Σ_ζ) / ( µ_min(A) λ_min(Σ_ζ) ),  γ = d λ_max(Σ_ζ)/µ_min(A).

Then, for B_n given by (2.9), the following bound holds with probability at least 1 − (dp)^{−8}:

∥B_n∥_∞ ≤ r_0/√n + Σ_{ℓ=1}^{K−1} [ r_ℓ µ_ℓ/√n + CLγ √(log(dp)/n) ( r_ℓ/√n_ℓ + √r_ℓ ) ]  (A.23)
≤ r_0/√n + C(ω + Lγ) √(log(dp)/n) Σ_{ℓ=1}^{K−1} ( r_ℓ/√n_ℓ + √r_ℓ ).  (A.24)

The bound (A.24) holds for general batch sizes r_0, ..., r_{K−1}. A natural approach to choosing the values r_ℓ is to minimize this upper bound; however, the bound is not a convex function of the r_ℓ. Focusing just on the last term in the bound, we have Σ_{ℓ=1}^{K−1} √r_ℓ ≥ ( Σ_{ℓ=1}^{K−1} r_ℓ )^{1/2} = √n. Therefore, the provided bound on ∥B_n∥_∞ is at least of order √log(dp). We next propose a choice of batch sizes r_ℓ for which the bound (A.24) achieves this order.
Let r_0 = √n and r_ℓ = β^ℓ for some β > 1 and ℓ = 1, ..., K−2. Finally, we choose r_{K−1} so that the lengths of the batches add up to n (that is, r_0 + r_1 + ... + r_{K−1} = n). With this choice, bound (A.24) simplifies to:

∥B_n∥_∞ ≤ C_β (ω + γL) √log(dp),  (A.25)

for some constant C_β > 0 that depends on the constant β. Next, by combining Theorem 2.1.2 and Lemma A.6.5, we obtain that, with probability at least 1 − 2(dp)^{−6},

∥∆_n∥_∞ ≤ C_β (ω + Lγ) √log(dp) · ( s_0 λ_n/α ) ≤ ( C_β λ_0 (ω + Lγ)/α ) · s_0 log(dp)/√n.  (A.26)

This implies the claim by selecting a β bounded away from 1, say β = 1.3.

It remains to prove the claim on the bias E{θ̂^on − θ_0}. For this, define G to be the event on which ∆_n satisfies the upper bound in Eq. (A.26). Therefore:

∥E{θ̂^on − θ_0}∥_∞ = ∥E{∆_n}∥_∞/√n ≤ ∥E{∆_n I(G)}∥_∞/√n + E{ ∥θ̂^L − θ_0∥₁ I(G^c) }.

For the first term we use the bound of Eq. (A.26). For the second, we use Lemma A.9.7:

∥E{θ̂^on − θ_0}∥_∞ ≤ ( Cλ_0(ω + Lγ)/α ) · s_0 log p/n + E{ ∥ε∥² I(G^c) }/(nλ_n) + 2∥θ_0∥₁ P(G^c).

It suffices, therefore, to show that the final two terms are at most C∥θ_0∥₁/(dp)⁶. By Hölder's inequality and P(G^c) ≤ 2(dp)^{−6}:

E{ ∥ε∥² I(G^c) }/(nλ_n) + 2∥θ_0∥₁ P(G^c) ≤ E{∥ε∥⁴}^{1/2} P(G^c)^{1/2}/(nλ_n) + 2∥θ_0∥₁ P(G^c)
≤ C λ_max(Σ_ζ)² / ( (dp)³ λ_0 √(n log(dp)) ) + C ∥θ_0∥₁/(dp)⁶.

In the high-dimensional regime, the first term is negligible in comparison to s_0 log(dp)/n, which yields, after adjusting C appropriately:

∥E{θ̂^on − θ_0}∥_∞ ≤ ( C_1 λ_0 (ω + Lγ)/α ) · s_0 log p/n + C_2 ∥θ_0∥₁/(dp)⁶,

as required. It remains to prove Lemma A.6.5:

Proof of Lemma A.6.5. For each episode ℓ, let

R^(ℓ) := (1/r_ℓ) Σ_{t∈E_ℓ} x_t x_t^T

be the sample covariance in episode ℓ. Fix a ∈ [dp] and define B_{n,a} ≡ √n e_a − (1/√n) Σ_{ℓ=1}^{K−1} r_ℓ R^(ℓ) m_a^ℓ. We then have

B_{n,a} = √n e_a − (1/√n) Σ_{ℓ=1}^{K−1} r_ℓ R^(ℓ) m_a^ℓ = (r_0/√n) e_a + Σ_{ℓ=1}^{K−1} (r_ℓ/√n) ( e_a − R^(ℓ) m_a^ℓ ),  (A.27)

where we used that Σ_{ℓ=0}^{K−1} r_ℓ = n.
By the triangle inequality, followed by Hölder's inequality:

∥B_{n,a}∥_∞ ≤ r_0/√n + (1/√n) Σ_{ℓ=1}^{K−1} r_ℓ ∥e_a − R^(ℓ) m_a^ℓ∥_∞
≤ r_0/√n + Σ_{ℓ=1}^{K−1} (r_ℓ/√n) ( ∥e_a − Σ̂^(ℓ) m_a^ℓ∥_∞ + ∥(Σ̂^(ℓ) − Σ) m_a^ℓ∥_∞ + ∥(Σ − R^(ℓ)) m_a^ℓ∥_∞ )
≤ r_0/√n + Σ_{ℓ=1}^{K−1} (r_ℓ/√n) ( ∥e_a − Σ̂^(ℓ) m_a^ℓ∥_∞ + ∥Σ̂^(ℓ) − Σ∥_∞ ∥m_a^ℓ∥₁ + ∥Σ − R^(ℓ)∥_∞ ∥m_a^ℓ∥₁ ).

We now bound each of the three terms appearing in the sum above:

1. By the construction of the decorrelating vectors m_a^ℓ in optimization (2.5), we have

∥Σ̂^(ℓ) m_a^ℓ − e_a∥_∞ ≤ µ_ℓ,  ℓ = 0, ..., K−1.  (A.28)

2. Also by construction, ∥m_a^ℓ∥₁ ≤ L. By an argument similar to that of Lemma 2.1.3, ∥Σ̂^(ℓ) − Σ∥_∞ ≤ Cγ√(log(dp)/n_ℓ) with probability at least 1 − K(dp)^{−9}, where γ = d λ_max(Σ_ζ)/µ_min(A). Therefore, with the same probability, the second term is at most CLγ√(log(dp)/n_ℓ).

3. Again by construction, ∥m_a^ℓ∥₁ ≤ L. Similarly to Lemma 2.1.3, ∥R^(ℓ) − Σ∥_∞ is at most Cγ√(log(dp)/r_ℓ) with probability at least 1 − K(dp)^{−9}.

Combining these bounds with the choice µ_ℓ = Cω√(log(dp)/n_ℓ), we have that, with probability at least 1 − 2K(dp)^{−9},

∥B_{n,a}∥_∞ ≤ r_0/√n + (C/√n) Σ_{ℓ=1}^{K−1} r_ℓ ( ω√(log(dp)/n_ℓ) + Lγ√(log(dp)/n_ℓ) + Lγ√(log(dp)/r_ℓ) )
≤ r_0/√n + C(ω + Lγ) √(log(dp)/n) Σ_{ℓ=1}^{K−1} ( r_ℓ/√n_ℓ + √r_ℓ ).

This bound holds uniformly over a ∈ [dp], and since ∥B_n∥_∞ = sup_a ∥B_{n,a}∥_∞, the same bound holds for ∥B_n∥_∞. This completes the proof.

A.6.5 Proof of Lemma 2.1.6

We start by proving Claim (2.14). Let m_a = Ωe_a be the a-th column of the inverse (stationary) covariance. Using the fact that E{x_t x_t^T} = Σ, we have ⟨m_a, E{x_t x_t^T} m_a⟩ = Ω_{a,a}, which will be the dominant term in the conditional variance V_{n,a}. Recall the shorthand σ² ≡ (Σ_ζ)_{i,i}, with i ∈ [p] the fixed coordinate in (2.2).
Therefore, we decompose the difference as follows:

V_{n,a} − σ²Ω_{a,a}
= (σ²/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} [ ⟨m_a^ℓ, x_t⟩² − Ω_{a,a} ] − (r_0 σ²/n) Ω_{a,a}
= (σ²/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} [ ⟨m_a^ℓ, x_t⟩² − ⟨m_a, E{x_t x_t^T} m_a⟩ ] − (r_0 σ²/n) Ω_{a,a}
= (σ²/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} [ ⟨m_a^ℓ, x_t⟩² − ⟨m_a, x_t⟩² ] + (1/n) Σ_{t=0}^{n−1} ⟨m_a, (x_t x_t^T − E{x_t x_t^T}) m_a⟩ − (r_0 σ²/n) Ω_{a,a}.  (A.29)

We treat each of these three terms separately. Write

(1/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} [ ⟨m_a^ℓ, x_t⟩² − ⟨m_a, x_t⟩² ]
= (1/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} ⟨m_a^ℓ − m_a, x_t⟩ ⟨m_a^ℓ + m_a, x_t⟩
≤ (1/n) Σ_{ℓ=1}^{K−1} ∥ Σ_{t∈E_ℓ} ⟨m_a^ℓ − m_a, x_t⟩ x_t ∥_∞ ∥m_a^ℓ + m_a∥₁
≤ (2L/n) Σ_{ℓ=1}^{K−1} ∥ Σ_{t∈E_ℓ} ⟨m_a^ℓ − m_a, x_t⟩ x_t ∥_∞.  (A.30)

To bound the last quantity, note that

(1/n) ∥ Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} ⟨m_a^ℓ − m_a, x_t⟩ x_t ∥_∞
≤ ∥ e_a − (1/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} ⟨m_a^ℓ, x_t⟩ x_t ∥_∞ + ∥ e_a − (1/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} ⟨m_a, x_t⟩ x_t ∥_∞
= ∥ e_a − (1/n) Σ_{ℓ=1}^{K−1} r_ℓ R^(ℓ) m_a^ℓ ∥_∞ + ∥ e_a − Σ̂^(K) m_a ∥_∞
= (1/√n) ∥B_{n,a}∥_∞ + ∥ e_a − Σ̂^(K) m_a ∥_∞
≤ CLγ √(log(dp)/n) + Cω √(log(dp)/n)
≤ C(Lγ + ω) √(log(dp)/n),  (A.31)

for some constant C; the last inequalities hold on the high-probability events of Lemma A.6.5 and Lemma 2.1.3. Combining Equations (A.30) and (A.31), we obtain

(1/n) Σ_{ℓ=1}^{K−1} Σ_{t∈E_ℓ} [ ⟨m_a^ℓ, x_t⟩² − ⟨m_a, x_t⟩² ] ≤ CL(ω + Lγ) √(log(dp)/n).  (A.32)

For the second term in (A.29), we can use Proposition A.6.3 with v = u = m_a/∥m_a∥ and η = C√(log(dp)/n) to obtain

(1/n) Σ_{t=0}^{n−1} ⟨m_a, (x_t x_t^T − E{x_t x_t^T}) m_a⟩ = ⟨m_a, (Σ̂^(K−1) − Σ) m_a⟩
≤ C ( d λ_max(Σ_ζ)/µ_min(A) ) ∥m_a∥² √(log(dp)/n)
≤ C ( d λ_max(Σ_ζ)/( µ_min(A) λ_min(Σ)² ) ) √(log(dp)/n)  (A.33)
≤ (Cω/α) √(log(dp)/n),  (A.34)

where we used that ∥m_a∥ = ∥Ωe_a∥ ≤ λ_max(Ω) = λ_min(Σ)^{−1} ≤ 1/α. For the third term, we have r_0 = √n and Ω_{a,a} ≤ λ_max(Ω) ≤ 1/α; therefore, this term is O(1/(α√n)). Combining this bound with (A.32) and (A.34) in Equation (A.29), we get Claim (2.14).

We next prove Claim (2.15).
Note that |ε_t| = |ζ_{t+d,i}| is bounded by σ√(2 log n), with high probability, for all t ∈ [n], by the tail bound for Gaussian variables. In addition, max_ℓ |⟨m_a^ℓ, x_t⟩| ≤ ∥m_a^ℓ∥₁ ∥x_t∥_∞ ≤ L∥x_t∥_∞ ≤ L∥X∥_∞. Note that the variance of each entry x_{t,i} is bounded by Σ_{ii} ≤ λ_max(Σ). Hence, by the tail bound for Gaussian variables and a union bound, we have

P( ∥X∥_∞ < √( 2λ_max(Σ) log(dpn) ) ) ≥ 1 − (pdn)^{−2}.  (A.35)

Putting these bounds together, we get

max{ (1/√n) |⟨m_a^ℓ, x_t⟩ ε_t| : ℓ ∈ [K−2], t ∈ [n] }
≤ (1/√n) L √( 2λ_max(Σ) log(dpn) ) · σ√(2 log n)
≤ 2Lσ √λ_max(Σ) · log(dpn)/√n
≤ 2L_0 σ ∥Ω∥₁ ( 2πd λ_max(Σ_ζ)/µ_min(A) )^{1/2} · log(dpn)/√n = o(1),

where in the last inequality we used Lemma A.6.2 to upper bound λ_max(Σ). The conclusion that the final expression is o(1) follows from Assumption 2.1.5.

A.6.6 Proof of Proposition 2.1.8

We prove that for all x ∈ R,

lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ √n(θ̂^on_a − θ_{0,a})/√V_{n,a} ≤ x } ≤ Φ(x).  (A.36)

A matching lower bound can be obtained by a similar argument, which implies the result. Invoking the decomposition (2.12), we have

√n(θ̂^on_a − θ_{0,a})/√V_{n,a} = W_n/√V_{n,a} + ∆_n/√V_{n,a}.

By Corollary 2.1.7, we have that W̃_n ≡ W_n/√V_{n,a} → N(0,1) in distribution. Fix an arbitrary ε > 0 and write

P{ √n(θ̂^on_a − θ_{0,a})/√V_{n,a} ≤ x } = P{ W̃_n + ∆_n/√V_{n,a} ≤ x } ≤ P{ W̃_n ≤ x + ε } + P{ |∆_a|/√V_{n,a} ≥ ε }.

By taking the limit and using Equation (2.12), we get

lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ √n(θ̂^on_a − θ_{0,a})/√V_{n,a} ≤ x } ≤ Φ(x + ε) + lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ |∆_a|/√V_{n,a} ≥ ε }.  (A.37)

We show that the limit on the right-hand side vanishes for any ε > 0. By virtue of Lemma 2.1.6 (Equation (2.14)), we have

lim_{n→∞} P{ |∆_a|/√V_{n,a} ≥ ε } ≤ lim_{n→∞} P{ |∆_a|/(σ√Ω_{a,a}) ≥ ε } ≤ lim_{n→∞} P{ |∆_a| ≥ εσ√Ω_{a,a} } ≤ lim_{n→∞} (dp)^{−4} = 0.  (A.38)

Here, in the last inequality we used that s_0(Lγ + ω) = o(√n/log(dp)), and therefore, for large enough n, εσ√Ω_{a,a} exceeds the bound (2.13) of Theorem 2.1.4.
Using (A.38) in bound (A.37) and then taking the limit ε → 0, we obtain (A.36).

A.7 Proofs of Section A.4

A.7.1 Proof of Lemma A.4.1

Rewrite the optimization problem (2.5) as follows:

minimize  m^T Σ̂^(ℓ) m
subject to  ⟨z, Σ̂^(ℓ) m − e_a⟩ ≤ µ_ℓ for all z with ∥z∥₁ = 1,  ∥m∥₁ ≤ L.  (A.39)

The Lagrangian is given by

L(m, z, λ) = m^T Σ̂^(ℓ) m + λ( ⟨z, Σ̂^(ℓ) m − e_a⟩ − µ_ℓ ),  ∥z∥₁ = 1, ∥m∥₁ ≤ L.  (A.40)

If λ ≤ 2L, minimizing the Lagrangian over m is equivalent to setting ∂L/∂m = 0, and we get m* = −λz*/2. The dual problem is then given by

maximize  −(λ²/4) z^T Σ̂^(ℓ) z − λ⟨z, e_a⟩ − λµ_ℓ
subject to  λ/2 ≤ L,  ∥z∥₁ = 1.  (A.41)

As ∥z∥₁ = 1, introducing β = −(λ/2) z gives ∥β∥₁ = λ/2. Rewriting the dual optimization problem in terms of β, we get

minimize  (1/2) β^T Σ̂^(ℓ) β − ⟨β, e_a⟩ + µ_ℓ ∥β∥₁
subject to  ∥β∥₁ ≤ L.  (A.42)

Given β*, the minimizer of the above optimization problem, from the relation between β and z we see that m* = β*. Also note that, since optimization (A.42) is the dual of problem (A.39), if (A.39) is feasible then problem (A.42) is bounded.

A.7.2 Proof of Lemma A.1.1

By virtue of Proposition A.6.4, the sample covariance Σ̂ satisfies the RE condition, Σ̂ ∼ RE(α, τ), where

α = λ_min(Σ_ζ)/(2µ_max(A)),  τ = Cω² √(log(dp)/n),  (A.43)

and by the sample size condition we have s_Ω < 1/(32τ).

Hereafter, we use the shorthand m*_a = Ωe_a and let L(m) be the objective function of the optimization (A.4). By the optimality of m_a, we have L(m_a) ≤ L(m*_a). Defining the error vector ν ≡ m_a − m*_a, after some simple algebraic manipulation we obtain the equivalent inequality

(1/2) ν^T Σ̂ ν ≤ ⟨ν, e_a − Σ̂ m*_a⟩ + µ_n( ∥m*_a∥₁ − ∥m*_a + ν∥₁ ).  (A.44)

In the following we first upper bound the right-hand side. By Lemma 2.1.3 (for ℓ = K and n_K = n), we have with high probability

⟨ν, e_a − Σ̂ m*_a⟩ ≤ ∥ν∥₁ ∥e_a − Σ̂ m*_a∥_∞ ≤ ( ∥ν_S∥₁ + ∥ν_{S^c}∥₁ ) µ_n/2,

where S = supp(Ωe_a) and hence |S| ≤ s_Ω.
On the other hand,

∥m*_a + ν∥₁ − ∥m*_a∥₁ ≥ ( ∥m*_{a,S}∥₁ − ∥ν_S∥₁ ) + ∥ν_{S^c}∥₁ − ∥m*_a∥₁ = ∥ν_{S^c}∥₁ − ∥ν_S∥₁.

Combining these pieces, we get that the right-hand side of (A.44) is upper bounded by

( ∥ν_S∥₁ + ∥ν_{S^c}∥₁ ) µ_n/2 + µ_n( ∥ν_S∥₁ − ∥ν_{S^c}∥₁ ) = (3/2) µ_n ∥ν_S∥₁ − (1/2) µ_n ∥ν_{S^c}∥₁.  (A.45)

Given that Σ̂ ⪰ 0, the left-hand side of (A.44) is non-negative, which implies ∥ν_{S^c}∥₁ ≤ 3∥ν_S∥₁ and hence

∥ν∥₁ ≤ 4∥ν_S∥₁ ≤ 4√s_Ω ∥ν_S∥₂ ≤ 4√s_Ω ∥ν∥₂.  (A.46)

Next, by using the restricted eigenvalue condition for Σ̂, we write

ν^T Σ̂ ν ≥ α∥ν∥₂² − ατ∥ν∥₁² ≥ α(1 − 16 s_Ω τ)∥ν∥₂² ≥ (α/2)∥ν∥₂²,  (A.47)

where we used τ ≤ 1/(32 s_Ω) in the final step. Putting (A.44), (A.45) and (A.47) together, we obtain

(α/4)∥ν∥₂² ≤ (3/2) µ_n ∥ν_S∥₁ ≤ 6√s_Ω µ_n ∥ν∥₂.

Simplifying the bound and using Equation (A.46), we get

∥ν∥₂ ≤ (24/α) √s_Ω µ_n,  ∥ν∥₁ ≤ (96/α) s_Ω µ_n,

which completes the proof.

A.7.3 Proof of Theorem A.1.2

Continuing from the decomposition (A.2), we have

√n(θ̂^off − θ_0) = ∆_1 + ∆_2 + Z,  (A.48)

with Z = ΩX^Tε/√n. By using Lemma 2.1.3 (for ℓ = K) and recalling the choice µ = τ√(log(dp)/n), we have that the following optimization is feasible with high probability:

minimize  m^T Σ̂ m
subject to  ∥Σ̂ m − e_a∥_∞ ≤ µ.

Therefore, optimization (A.4) (which is shown to be its dual in Lemma A.4.1) has a bounded solution. Hence, its solution satisfies the KKT condition, which reads

Σ̂ m_a − e_a + µ sign(m_a) = 0,  (A.49)

which implies ∥Σ̂ m_a − e_a∥_∞ ≤ µ. Invoking the estimation error bound of the Lasso for time series (Proposition 2.1.2), we bound ∆_1 as

∥∆_1∥_∞ ≤ C√n µ s_0 √(log p/n) = O_P( s_0 log(dp)/√n ).  (A.50)

We next bound the bias term ∆_2. By virtue of [12, Proposition 3.2] we have the deviation bound ∥X^Tε∥_∞/√n = O_P(√log(dp)), which in combination with Lemma A.1.1 gives us the following bound:

∥∆_2∥_∞ ≤ max_{i∈[dp]} ∥(M − Ω)e_i∥₁ · ∥X^Tε∥_∞/√n = O_P( s_Ω log(dp)/√n ).
(A.51)

Therefore, letting ∆ = ∆_1 + ∆_2, we have ∥∆∥_∞ = o_P(1), by recalling our assumptions s_0 = o(√n/log(dp)) and s_Ω = o(√n/log(dp)).

Our next lemma is the analogue of Lemma 2.1.6 for the variance of the noise component of the offline debiased estimator; its proof is deferred to Section A.7.3.1.

Lemma A.7.1. Assume that s_Ω = o(√n/log(dp)) and λ_min(Σ_ζ)/µ_max(A) > c_min > 0 for some constant c_min > 0. For µ = τ√(log(dp)/n) and the decorrelating vectors m_a constructed by (A.4), the following holds. For any fixed sequence of integers a(n) ∈ [dp], we have

m_a^T Σ̂ m_a = Ω_{a,a} + o_P( 1/√log(dp) ).  (A.52)

We are now ready to prove the theorem statement. We show that

lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ √n(θ̂^off_a − θ_{0,a})/√V_{n,a} ≤ u } ≤ Φ(u).  (A.53)

A similar lower bound can be proved analogously. By the decomposition (A.48) we have

√n(θ̂^off_a − θ_{0,a})/√V_{n,a} = ∆_a/√V_{n,a} + Z_a/√V_{n,a}.

Define

Z̃_a ≡ Z_a/(σ√Ω_{a,a}) = ( 1/(σ√(nΩ_{a,a})) ) (ΩX^Tε)_a = ( 1/(σ√(nΩ_{a,a})) ) Σ_{i=1}^{n} e_a^T Ω x_i ε_i.

Since ε_i is independent of x_i, the sum Σ_{i=1}^{n} e_a^T Ω x_i ε_i is a martingale. Furthermore, E[(e_a^T Ω x_i ε_i)²] = σ² Ω_{a,a}. Hence, by a martingale central limit theorem [71, Corollary 3.2], we have that Z̃_a → N(0,1) in distribution. In other words,

lim_{n→∞} P{ Z̃_a ≤ u } = Φ(u).  (A.54)

Next, fix δ ∈ (0,1) and write

P{ √n(θ̂^off_a − θ_{0,a})/√V_{n,a} ≤ u } = P{ ( σ√Ω_{a,a}/√V_{n,a} ) Z̃_a + ∆_a/√V_{n,a} ≤ u }
≤ P{ ( σ√Ω_{a,a}/√V_{n,a} ) Z̃_a ≤ u + δ } + P{ |∆_a|/√V_{n,a} ≥ δ }
≤ P{ Z̃_a ≤ u + 2δ + δ|u| } + P{ | σ√Ω_{a,a}/√V_{n,a} − 1 | ≥ δ } + P{ |∆_a|/√V_{n,a} ≥ δ }.

Now, by taking the limit of both sides and using (A.54) and Lemma A.7.1, we obtain

limsup_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ √n(θ̂^off_a − θ_{0,a})/√V_{n,a} ≤ u } ≤ Φ(u + 2δ + δ|u|) + limsup_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ |∆_a|/√V_{n,a} ≥ δ }.  (A.55)

Since δ ∈ (0,1) was chosen arbitrarily, it suffices to show that the limit on the right-hand side vanishes.
To do that, we use Lemma A.7.1 again to write

lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ |∆_a|/√V_{n,a} ≥ δ } ≤ lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ |∆_a|/(σ√Ω_{a,a}) ≥ δ } ≤ lim_{n→∞} sup_{∥θ_0∥₀≤s_0} P{ |∆_a| ≥ δσ√Ω_{a,a} } = 0,

where the last step follows since we showed ∥∆∥_∞ = o_P(1). The proof is complete.

A.7.3.1 Proof of Lemma A.7.1

By invoking the bound (A.18) on the minimum eigenvalue of the population covariance, we have

λ_min(Σ) ≥ λ_min(Σ_ζ)/µ_max(A),  (A.56)

which is bounded away from 0 by our assumption. Therefore, λ_max(Ω) = λ_min(Σ)^{−1} is bounded away from ∞. Since Ω ⪰ 0, we have |Ω_{a,b}| ≤ √(Ω_{a,a} Ω_{b,b}) for any two indices a, b ∈ [dp]. Hence, |Ω|_∞ ≤ 1/λ_min(Σ). This implies that ∥Ωe_a∥₁ ≤ s_Ω/λ_min(Σ). Using this observation along with the bound established in Lemma A.1.1, we obtain

∥m_a∥₁ ≤ ∥Ωe_a∥₁ + ∥m_a − Ωe_a∥₁ ≤ s_Ω/λ_min(Σ) + (192τ/α) s_Ω √(log(dp)/n) = O(s_Ω).  (A.57)

We also have

∥m_a − Ωe_a∥_∞ ≤ ∥m_a − Ωe_a∥₁ = O( s_Ω √(log(dp)/n) ).  (A.58)

In addition, by the KKT condition (A.49) we have

∥Σ̂ m_a − e_a∥_∞ ≤ µ.  (A.59)

Combining the bounds (A.57), (A.58) and (A.59), we have

|m_a^T Σ̂ m_a − Ω_{a,a}| ≤ |(m_a^T Σ̂ − e_a^T) m_a| + |e_a^T m_a − Ω_{a,a}|
≤ ∥m_a^T Σ̂ − e_a^T∥_∞ ∥m_a∥₁ + ∥m_a − Ωe_a∥_∞
= O( s_Ω √(log(dp)/n) ) = o( 1/√log(dp) ),

which completes the proof.

A.8 Proofs of Section 2.2

A.8.1 Consistency results for LASSO under adaptively collected samples

Theorem 2.2.2 shows that, under an appropriate compatibility condition, the LASSO estimate admits ℓ₁ error at a rate of s_0 √(log p/n). Importantly, despite the adaptivity introduced by the sampling of the data, the error of the LASSO estimate has the same asymptotic rate as would be expected without adaptivity. With slightly stronger restricted-eigenvalue conditions on the covariances E{xx^T} and E{xx^T | ⟨x, θ̂_1⟩ ≥ ς}, it is also possible to extend Theorem 2.2.2 to show ℓ₂ error of order s_0 log p/n, analogous to the non-adaptive setting.
However, since theℓ 2 error rate will not be used for our analysis of online debiasing, we do not pursue this direction here. A.8.1.1 ProofofTheorem2.2.2 The important technical step is to prove that, under the conditions specified in Theorem 2.2.2, the sample covariance b Σ=(1 /n) P i x i x T i is(ϕ 0 /4,supp(θ 0 )) compatible. PropositionA.8.1. Withprobabilityexceeding1− p − 4 thesamplecovariance b Σ is(ϕ 0 /4,supp(θ 0 ))com- patible whenn 1 ∨n 2 ≥ C(κ 4 /ϕ 2 0 )s 2 0 logp, for an absolute constantC >0. Let b Σ (1) and b Σ (2) denote the sample covariances of each batch, i.e. b Σ (1) = (1/n 1 ) P i≤ n 1 x i x T i and similarly b Σ (2) = (1/n 2 ) P i>n 1 x i x T i . We also let Σ (2) be the conditional covariance Σ (2) = Σ (2) ( b θ 1 ) = E{xx T |⟨x, b θ 1 ⟩ ≥ ς}. We first prove that at least one of the sample covariances b Σ (1) and b Σ (2) closely approximate their population counterparts, and that this implies they are(ϕ 0 /2,supp(θ 0 ))-compatible. 250 LemmaA.8.2. With probability at least1− p − 4 ∥ b Σ (1) − Σ ∥ ∞ ∧∥ b Σ (2) − Σ (2) ∥ ∞ ≤ 12κ 2 r logp n , Proof. Sincen = n 1 +n 2 ≤ 2max(n 1 ,n 2 ), at least one ofn 1 andn 2 exceedsn/2. We assume thatn 2 ≥ n/2, and prove that∥ b Σ (2) − Σ (2) ∥ ∞ satisfies the bound in the claim. The case n 1 ≥ n/2 is similar. Since we are proving the casen 2 ≥ n/2, for notational convenience, we assume probabilities and expectations in the rest of the proof are conditional on the first batch (y 1 ,x 1 ),...(y n 1 ,x n 1 ), and omit this in the notation. For a fixed pair (a,b)∈[p]× [p]: b Σ (2) a,b − Σ (2) a,b = 1 n 2 X i>n 1 x i,a x i,b − E{x i,a x i,b } Using Lemma A.9.4 we have that∥x i,a x i,b ∥ ψ 1 ≤ 2∥x i ∥ 2 ψ 2 ≤ 2κ 2 almost surely. 
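The entrywise √(log p / n) rate asserted in Lemma A.8.2 can be sanity-checked numerically. The following sketch is illustrative only: it assumes a standard Gaussian design (so the population covariance is the identity and the subgaussian constant κ is of order one), and the dimensions n = 2000, p = 20 are chosen purely for speed.

```python
import math
import random

random.seed(0)

def max_cov_deviation(n, p):
    # n i.i.d. rows x_i ~ N(0, I_p); the population covariance is the identity
    X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    dev = 0.0
    for a in range(p):
        for b in range(p):
            entry = sum(X[i][a] * X[i][b] for i in range(n)) / n
            target = 1.0 if a == b else 0.0
            dev = max(dev, abs(entry - target))
    return dev

n, p = 2000, 20
dev = max_cov_deviation(n, p)
rate = math.sqrt(math.log(p) / n)
# the maximum entrywise deviation should sit well below 12 * sqrt(log p / n),
# matching the constant appearing in Lemma A.8.2 for kappa of order one
print(dev, 12 * rate)
```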
Then using the tail in- equality Lemma A.9.5 we have for anyε≤ 2eκ 2 P n | b Σ (2) a,b − Σ (2) a,b |≥ ε o ≤ 2exp n − n 2 ε 2 6eκ 4 o Withε=ε(p,n 2 ,κ )=12κ 2 p logp/n 2 ≤ 20κ 2 p logp/n we have thatP{| b Σ (2) a,b − Σ (2) a,b |≥ ε(p,n 2 ,κ )}≤ p − 8 , whence the claim follows by union bound over pairs(a,b). Lemma A.8.3 ([26, Corollary 6.8]). Suppose that Σ is (ϕ 0 ,S)-compatible. Then any matrix Σ ′ such that ∥Σ ′ − Σ ∥ ∞ ≤ ϕ 0 /(32|S|) is(ϕ 0 /2,S)-compatible. We can now prove Proposition A.8.1. 251 Proof of Proposition A.8.1. Combining Lemmas A.8.2 and A.8.3 yields that, with probability1− p − 4 , at least one of b Σ (1) and b Σ (2) are(ϕ 0 /2,supp(θ 0 ))-compatible provided 12κ 2 r logp n ≤ ϕ 0 32s 0 , which is implied byn≥ 400κ 2 s 0 ϕ 0 p logp 2 . Since b Σ=( n 1 /n) b Σ (1) +(n 2 /n) b Σ (2) and at least one ofn 1 /n andn 2 /n exceed1/2, this implies that b Σ is (ϕ 0 /4,supp(θ 0 ))-compatible with probability exceeding1− p − 4 . The following lemma shows thatX T ε is small entrywise. LemmaA.8.4. For anyλ n ≥ 40κσ p (logp)/n, with probability at least1− p − 4 ,∥X T ε∥ ∞ ≤ nλ n /2. Proof. The a θ coordinate of the vector X T ε is P i x ia ε i . As the rows of X are uniformly κ -subgaussian and∥ε i ∥ ψ 2 = σ , Lemma A.9.4 implies that the sequence (x ia ε i ) 1≤ i≤ n is uniformly 2κσ -subexponential. Applying the Bernstein-type martingale tail bound Lemma A.9.6, forε≤ 12eκσ : P n X i x ia ε i ≥ εn o ≤ 2exp n − nε 2 24eκ 2 σ 2 o 252 Set ε = ε(p,n,κ,σ ) = 20κσ p (logp)/n, the exponent on the right hand side above is at least 5logp, which implies after union bound overa that P{∥X T ε∥ ∞ ≥ εn}=P n max a X i x ia ε i ≥ εn o ≤ X a P n X i x ia ε i ≥ εn o ≤ 2p − 6 . This implies the claim forp large enough. The rest of the proof is standard, cf. [73] and is given below for the reader’s convenience. Proof of Theorem 2.2.2. Throughout we condition on the intersection of good events in Proposition A.8.1 and Lemma A.8.4, which happens with probability at least 1− 2p − 4 . 
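The bound of Lemma A.8.4 can likewise be checked by simulation. The sketch below assumes Gaussian design and noise with κ = σ = 1 (illustrative values, not part of the formal argument) and compares max_a |Σ_i x_{ia} ε_i| against the threshold 20κσ√(n log p) = nλ_n/2 from the lemma.

```python
import math
import random

random.seed(1)
n, p = 1000, 50
kappa = sigma = 1.0  # Gaussian design and noise: subgaussian constants of order one

eps = [random.gauss(0.0, sigma) for _ in range(n)]
max_corr = 0.0
for a in range(p):
    col = [random.gauss(0.0, 1.0) for _ in range(n)]
    max_corr = max(max_corr, abs(sum(x * e for x, e in zip(col, eps))))

# nn * lambda_n / 2 with lambda_n = 40 * kappa * sigma * sqrt(log p / n)
bound = 20 * kappa * sigma * math.sqrt(n * math.log(p))
print(max_corr, bound)  # the realized maximum is far below the bound
```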
On this good event, the sample covariance b Σ is(ϕ 0 /4,supp(θ 0 ))-compatible and∥X T ε∥ ∞ ≤ 20κσ √ nlogp≤ nλ n /2. By optimality of b θ L : 1 2 ∥y− X b θ L ∥ 2 +λ n ∥ b θ L ∥ 1 ≤ 1 2 ∥y− Xθ 0 ∥ 2 +λ n ∥θ 0 ∥ 1 . Usingy =Xθ 0 +ε, the shorthandν = b θ L − θ 0 and expanding the squares leads to 1 2 ⟨ν, b Σ ν ⟩≤ 1 n ⟨X T ε,ν ⟩+λ n (∥θ 0 ∥ 1 −∥ b θ L ∥ 1 ) ≤ 1 n ∥ν ∥ 1 ∥X T ε∥ ∞ +λ n (∥θ 0 ∥ 1 −∥ b θ L ∥ 1 ) ≤ λ n n 1 2 ∥ν ∥ 1 +∥θ 0 ∥ 1 −∥ b θ L ∥ 1 o . (A.60) 253 First we show that the error vector ν satisfies ∥ν S c 0 ∥ 1 ≤ 3∥ν S 0 ∥ 1 , where S 0 ≡ supp(θ 0 ). Note that ∥ b θ L ∥ 1 =∥θ 0 +ν ∥ 1 =∥θ 0 +ν S 0 ∥ 1 +∥ν S c 0 ∥ 1 . By triangle inequality, therefore: ∥θ 0 ∥ 1 −∥ b θ L ∥ 1 =∥θ 0 ∥ 1 −∥ θ 0 +ν S 0 ∥ 1 −∥ ν S c 0 ∥ 1 ≤∥ ν S 0 ∥ 1 −∥ ν S c 0 ∥ 1 . Combining this with the basic lasso inequality Eq.(A.60) we obtain 1 2 ⟨ν, b Σ ν ⟩≤ λ n n 1 2 ∥ν ∥ 1 +∥ν S 0 ∥ 1 −∥ ν S c 0 ∥ 1 o = λ n 2 n 3∥ν S 0 ∥ 1 −∥ ν S c 0 ∥. o As b Σ is positive-semidefinite, the LHS above is non-negative, which implies ∥ν S c 0 ∥ 1 ≤ 3∥ν S 0 ∥ 1 . Now, we can use the fact that b Σ is(ϕ 0 /4,S 0 )-compatible to lower bound the LHS by∥ν ∥ 2 1 ϕ 0 /2s 0 . This leads to ϕ 0 ∥ν ∥ 2 1 2s 0 ≤ 3λ n ∥ν S 0 ∥ 1 2 ≤ 3λ n ∥ν ∥ 1 2 . Simplifying this results in∥ν ∥ 1 =∥ b θ L − θ 0 ∥ 1 ≤ 3s 0 λ n /ϕ 0 as required. 254 A.8.2 Biascontrol: ProofofTheorem2.2.7 Recall the decomposition (2.21) from which we obtain: ∆ n =B n ( b θ L − θ 0 ), B n = √ n I p − n 1 n M (1) b Σ (1) − n 2 n M (2) b Σ (2) , W n = 1 √ n X i≤ n 1 M (1) x i ε i + 1 √ n X n 1 n 1 I(a=b)−⟨ e a ,Ω (2) x i ⟩x ib . Now,E{⟨e a ,Ω (2) x i ⟩x i,b ⟩} = I(a = b) and⟨e a ,Ω (2) x i ⟩ is (∥Ω (2) ∥ 2 κ )-subgaussian. Since Σ (2) ≽ Λ 0 I p , we have that∥Ω (2) ∥ 2 ≤ Λ 0 − 1 . This observation, coupled with Lemma A.9.4, yields⟨e a ,Ω (2) x i ⟩x i,b is 2κ 2 /Λ 0 -subexponential. Then we may apply Lemma A.9.5 forε≤ 12κ 2 /Λ 0 as below: P{(I p − Ω (2) b Σ (2) ) a,b ≥ ε}≤ exp − n 2 ε 2 36κ 2 Λ 0 − 1 . 
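The cone inequality ∥ν_{S₀ᶜ}∥₁ ≤ 3∥ν_{S₀}∥₁ derived in the proof of Theorem 2.2.2 can be illustrated numerically. The sketch below is a minimal illustration, not the adaptive two-batch setting of the theorem: it assumes a plain i.i.d. Gaussian design, solves the lasso by pure-Python coordinate descent, and uses the oracle regularization level λ_n = 2∥X^Tε∥_∞/n (computed from the realized noise purely so that the condition under which the cone inequality holds is satisfied by construction). All dimensions are illustrative.

```python
import math
import random

random.seed(2)
n, p, s0, sigma = 200, 50, 3, 0.5
theta0 = [1.0] * s0 + [0.0] * (p - s0)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
eps = [random.gauss(0.0, sigma) for _ in range(n)]
y = [sum(X[i][j] for j in range(s0)) + eps[i] for i in range(n)]

# oracle choice satisfying lambda_n >= 2 ||X^T eps||_inf / n, the condition
# under which the cone inequality is derived in the proof
xe_max = max(abs(sum(X[i][j] * eps[i] for i in range(n))) / n for j in range(p))
lam = 2.0 * xe_max

def soft(z, t):
    return math.copysign(max(abs(z) - t, 0.0), z)

# coordinate descent on (1/2n)||y - X theta||^2 + lam * ||theta||_1
theta = [0.0] * p
resid = y[:]
col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
for _ in range(100):
    for j in range(p):
        rho = sum(X[i][j] * (resid[i] + X[i][j] * theta[j]) for i in range(n)) / n
        new = soft(rho, lam) / col_sq[j]
        if new != theta[j]:
            delta = new - theta[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            theta[j] = new

nu = [theta[j] - theta0[j] for j in range(p)]
err_S = sum(abs(nu[j]) for j in range(s0))      # error on the support
err_Sc = sum(abs(nu[j]) for j in range(s0, p))  # error off the support
print(err_Sc, 3.0 * err_S)  # cone condition: off-support error is dominated
```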
Keepingε=ε(p,n 2 ,κ, Λ 0 )=15κ Λ 0 − 1/2 p (logp)/n 2 we obtain: P n (I p − Ω (2) b Σ (2) ) a,b ≥ 15κ Λ 0 − 1/2 r logp n 2 o ≤ p − 6 . Union bounding over the pairs(a,b) yields the claim. The requirementn 2 ≥ 2(Λ 0 /κ 2 )logp ensures that the choiceε above satisfies ε≤ 12κ 2 /Λ 0 . 258 A.8.3 Centrallimitasymptotics: proofsofProposition2.2.9andTheorem2.2.10 Our approach is to apply a martingale central limit theorem to show thatW n,a is approximately normal. An important first step is to show that the conditional covariance V n,a is stable, or approximately constant. Recall thatV n,a is defined as V n,a =σ 2 n 1 n ⟨m (1) a , b Σ (1) m (1) a ⟩+ n 2 n ⟨m (2) a , b Σ (2) m (2) a ⟩ . We define its deterministic equivalent as follows. Consider the function f :S n →R by: f(Σ)= {min⟨m,Σ m⟩:∥Σ m− e a ∥ ∞ ≤ µ, ∥m∥ 1 ≤ L}. We begin with two lemmas about the stability of the optimization program used to obtain the online debiasing matrices. LemmaA.8.6. On its domain (and uniformly inµ,e a ),f isL 2 -Lipschitz with respect to the∥·∥ ∞ norm. Proof. For two matricesΣ ,Σ ′ in the domain, letm,m ′ be the respective optimizers (which exist by com- pactness of the set{m:∥Σ m− v∥ ∞ ≤ µ, ∥m∥ 1 ≤ L}. We prove that|f(Σ) − f(Σ ′ )|≤ L 2 ∥Σ − Σ ′ ∥ ∞ . f(Σ) − f(Σ ′ )=⟨Σ ,mm T ⟩−⟨ Σ ′ ,m ′ (m ′ ) T ⟩ ≤⟨ Σ ,m ′ (m ′ ) T ⟩−⟨ Σ ′ ,m ′ (m ′ ) T ⟩ =⟨(Σ − Σ ′ )m ′ ,m ′ ⟩ ≤∥ (Σ − Σ ′ )m ′ ∥ ∞ ∥m ′ ∥ 1 ≤∥ Σ − Σ ′ ∥ ∞ ∥m ′ ∥ 2 1 ≤ L 2 ∥Σ − Σ ′ ∥ ∞ . 259 Here the first inequality follows from optimality of m and the last two inequalities are Hölder inequality. The reverse inequalityf(Σ) − f(Σ ′ )≥− L 2 ∥Σ − Σ ′ ∥ ∞ is proved in the same way. LemmaA.8.7. We have the following lower bound on the optimization value reached to computef(Σ) : (1− µ ) 2 λ max (Σ) ≤ f(Σ) ≤ 1 λ min (Σ) . Proof. We first prove the lower bound for f(Σ) . Supposem is an optimizer for the program. Then ∥Σ m∥ 2 ≥∥ Σ m∥ ∞ ≥∥ e a ∥ ∞ − µ =1− µ. 
On the other hand, the value is given by ⟨m,Σ m⟩=⟨Σ m,Σ − 1 (Σ m)⟩≥ λ min (Σ − 1 )∥Σ m∥ 2 2 =∥Σ m∥ 2 2 λ max (Σ) − 1 . Combining these gives the lower bound. For the upper bound, it suffices to consider any feasible point; we choose m=Σ − 1 e a , which is feasible since∥Σ − 1 ∥ 1 ≤ L. The value is then⟨e a ,Σ − 1 e a ⟩≤ λ max (Σ − 1 ) which gives the upper bound. Lemma A.8.8. (Stability ofW n,a ) Define Σ (2) (θ ) =E{xx T |⟨x 1 ,θ ⟩ ≥ ς}. Then, under Assumptions 2.2.5 and 2.2.8 lim n→∞ V n,a − σ 2 n 1 f(Σ) n + n 2 f(Σ 2 (θ 0 )) n =0, in probability. 260 Proof. Using Lemma A.8.6: V n,a − σ 2 n 1 n f(Σ)+ n 2 n f(Σ( θ 0 ) = σ 2 n 1 n (f( b Σ (1) )− f(Σ))+ σ 2 n 2 n (f( b Σ (2) − f(Σ( θ 0 )))) ≤ L 2 σ 2 n 1 n ∥Σ − b Σ (1) ∥ ∞ +L 2 σ 2 n 2 n ∥Σ (2) (θ 0 )− b Σ (2) ∥ ∞ ≤ L 2 σ 2 n 1 n ∥Σ − b Σ (1) ∥ ∞ +L 2 σ 2 n 2 n ∥Σ (2) (θ 0 )− Σ (2) ( b θ 1 )∥ ∞ +∥Σ (2) ( b θ 1 )− b Σ (2) ∥ ∞ ≤ σ 2 L 2 ∥Σ − b Σ (1) ∥ ∞ +σ 2 L 2 K∥ b θ 1 − θ 0 ∥ 1 +∥Σ (2) ( b θ 1 )− b Σ (2) ∥ ∞ . Using Lemma A.8.2 the first and third term vanish in probability. It is straightforward to apply Theorem 2.2.2 to the intermediate estimate b θ 1 ; indeed Assumption 2.2.8 guarantees that n 1 ≥ cn for a universal c. Therefore the intermediate estimate has an error∥ b θ 1 − θ 0 ∥ 1 of orderκσϕ − 1 0 p (s 2 0 logp)/n with prob- ability converging to one. In particular, the second term is, with probability converging to one, of order KL 2 σ 3 κϕ − 1 0 p s 2 0 (logp)/n=o(1) by Assumption 2.2.8. LemmaA.8.9. Under Assumptions 2.2.5 and 2.2.8, with probability at least1− p − 2 max i |⟨m a ,x i ⟩|≤ 10Lκ p logp, In particularlim n→∞ max i |⟨m a ,x i ⟩|=0 in probability. 261 Proof. By Hölder inequality, max i ⟨|⟨m a ,x i ⟩| ≤ max i ∥m a ∥ 1 ∥x i ∥ ∞ ≤ Lmax i ∥x i ∥ ∞ . Therefore, it suf- fices to prove that, with the required probability max i,a |x i,a |≤ 10κ √ logp. Letu = 10κ √ logp. 
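The sandwich of Lemma A.8.7 can be verified exactly in the 2×2 case with µ = 0, where m = Σ⁻¹e_a is feasible and attains the value ⟨m, Σm⟩ = e_a^T ΩΣΩ e_a = Ω_{aa}. The matrix below is an arbitrary illustrative choice.

```python
Sigma = [[2.0, 0.5], [0.5, 1.0]]
det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
Omega = [[ Sigma[1][1] / det, -Sigma[0][1] / det],
         [-Sigma[1][0] / det,  Sigma[0][0] / det]]

# eigenvalues of the symmetric 2x2 matrix Sigma
tr = Sigma[0][0] + Sigma[1][1]
disc = ((Sigma[0][0] - Sigma[1][1]) ** 2 + 4 * Sigma[0][1] ** 2) ** 0.5
lam_max, lam_min = (tr + disc) / 2, (tr - disc) / 2

# the feasible point m = Sigma^{-1} e_a attains <m, Sigma m> = Omega_{aa}
a = 0
m = [Omega[0][a], Omega[1][a]]
Sm = [Sigma[0][0] * m[0] + Sigma[0][1] * m[1],
      Sigma[1][0] * m[0] + Sigma[1][1] * m[1]]
value = m[0] * Sm[0] + m[1] * Sm[1]
# with mu = 0 the sandwich reads 1/lam_max <= value <= 1/lam_min
print(1 / lam_max, value, 1 / lam_min)
```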
Sincex i are uniformlyκ -subgaussian, we obtain forq >0: P{|x i,a |≥ u}≤ u − q E{|x i,a | q }≤ ( √ qκ/u ) q =exp − q 2 log u 2 κ 2 q ≤ exp − u 2 2κ 2 ≤ p − 5 , where the last line follows by choosingq =u 2 /eκ 2 . By union bound overi∈[n],a∈[p], we obtain: P{max i,a |x i,a |≥ u}≤ X i,a P{|x i,a |≥ u}≤ p − 3 , which implies the claim (note thatp≥ n as we are focusing on the high-dimensional regime). With these in hand we can prove Proposition 2.2.9 and Theorem 2.2.10. Proof of Proposition 2.2.9. Consider the minimal filtration F i so that 1. Fori<n 1 ,y 1 ,...,y i ,x 1 ,...x n 1 andε 1 ,...,ε i are measurable with respect toF i . 2. Fori≥ n 1 y 1 ,...,y i ,x 1 ,...,x n andε 1 ,...ε i are measurable with respect toF i . The martingale W n (and therefore, its a θ coordinate W n,a ) is adapted to the filtration F i . We can now apply the martingale central limit theorem [71, Corollary 3.1] toW n,a to obtain the result. From Lemmas A.8.7 and A.8.8 we know thatV n,a is bounded away from0, asymptotically. The stability and conditional Lindeberg conditions of [71, Corollary 3.1] are verified by Lemmas A.8.8 and A.8.9. 262 Proof of Theorem 2.2.10. This is a straightforward corollary of the bias bound of 2.2.7 and Proposition 2.2.9. We will show that: lim n→∞ P n r n V n,a ( b θ on a − θ 0,a )≤ x o ≤ Φ( x). The reverse inequality follows using the same argument. Fix aδ > 0. We decompose the difference above as: r n V n,a ( b θ on a − θ 0,a )= W n,a p V n,a + ∆ n,a p V n,a . Therefore, P n r n V n,a ( b θ on a − θ 0,a )≤ x o ≤ P n W n,a p V n,a ≤ x+δ o +P{|∆ n,a |≥ p V n,a δ }. By Proposition 2.2.9 the first term converges to Φ( x+δ ). To see that the second term vanishes, observe first that Lemma A.8.7 and Lemma A.8.8, imply that V n,a is bounded away from 0 in probability. 
Using this: lim n→∞ P{|∆ n,a |≥ p V n,a δ }≤ lim n→∞ P{∥∆ n ∥ ∞ ≥ p V n,a δ } ≤ lim n→∞ P n ∥∆ n ∥ ∞ ≥ 4000 κ 2 σ √ Λ 0 ϕ 0 s 0 logp √ n o =0 by applying Theorem 2.2.7 and that forn large enough, p V n,a δ exceeds the bound on∥∆ n ∥ ∞ used. Since δ is arbitrary, the claim follows. 263 A.8.4 ProofsforGaussiandesigns In this Section we prove that Gaussian designs of Example 2.2.6 satisfy the requirements of Theorem 2.2.2 and Theorem 2.2.7. The following distributional identity will be important. LemmaA.8.10. Consider the parametrizationς = ¯ς⟨ b θ, Σ b θ ⟩ 1/2 . Then x| ⟨x, b θ ⟩≥ ς d = Σ b θ ⟨ b θ, Σ b θ ⟩ 1/2 ξ 1 + Σ − Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ 1/2 ξ 2 , whereξ 1 ,ξ 2 are independent,ξ 2 ∼ N(0,I p ) andξ 1 has the density: dP ξ 1 du (u)= 1 √ 2π Φ( − ¯ς) exp(− u 2 /2)I(u≥ ¯ς). Proof. This follows from the distribution ofx|⟨x, b θ ⟩ beingN(µ ′ ,Σ ′ ) with µ ′ = Σ b θ ⟨ b θ, Σ b θ ⟩ ⟨x, b θ ⟩, Σ ′ =Σ − Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ . The following lemma shows that they satisfy compatibility. LemmaA.8.11. LetP x = N(0,Σ) forapositivedefinitecovariance Σ . Then,foranyvector b θ andsubsetS⊆ [p], the second momentsE{xx T } andE{xx T |⟨x, b θ ⟩≥ ς} are(ϕ 0 ,S)-compatible withϕ 0 =λ min (Σ) /16. 264 Proof. Fix anS ⊆ [p]. We prove thatΣ = E{x 1 x T 1 } is(ϕ 0 ,S)-compatible withϕ 0 = λ min (Σ) /16. Note that, for anyv satisfying∥v S c∥ 1 ≤ 3∥v S ∥, itsℓ 1 norm satisfies ∥v∥ 1 ≤ 4∥v S ∥ 1 . FurtherΣ ≽ λ min (Σ) I p implies: |S|⟨v,Σ v⟩ ∥v∥ 2 1 ≥ λ min (Σ) |S|∥v∥ 2 ∥v∥ 2 1 ≥ λ min (Σ) |S|∥v S ∥ 2 16∥v S ∥ 2 1 ≥ λ min (Σ) 16 . ForE{xx T |⟨x, b θ ⟩≥ ς}, we use Lemma A.8.10 to obtain E{xx T |⟨x, b θ ⟩≥ ς}=Σ+( E{ξ 2 1 }− 1) Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ , whereξ 1 is as in Lemma A.8.10. SinceE{ξ 2 1 }=1+¯ςφ(¯ς)/Φ( − ¯ς)≥ 1+¯ς 2 whenever ¯ς ≥ 0: E{xx T |⟨x, b θ ⟩≥ ς}≥ Σ+¯ ς 2 Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ ≽λ min (Σ) I p . The rest of the proof is as forΣ . Lemma A.8.12. LetP x = N(0,Σ) for a positive definite covariance Σ . 
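The identity E{ξ₁²} = 1 + ς̄φ(ς̄)/Φ(−ς̄) used in the proof of Lemma A.8.11, for ξ₁ distributed as a standard normal truncated to [ς̄, ∞), can be checked against direct quadrature of its density. The truncation points and quadrature parameters below are illustrative.

```python
import math

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def trunc_second_moment(sbar, upper=12.0, steps=60000):
    # E[xi_1^2] for the standard normal truncated to [sbar, infinity),
    # by trapezoidal quadrature of u^2 phi(u) / Phi(-sbar)
    h = (upper - sbar) / steps
    total = 0.0
    for k in range(steps + 1):
        u = sbar + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * u * u * phi(u)
    return total * h / Phi(-sbar)

# numeric value versus the closed form 1 + sbar * phi(sbar) / Phi(-sbar)
checks = {s: (trunc_second_moment(s), 1 + s * phi(s) / Phi(-s))
          for s in (0.0, 0.5, 1.5)}
print(checks)
```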
Then, for any vector b θ and subset S ⊆ [p], the random vectors x and x| ⟨x, b θ ⟩≥ ς are κ -subgaussian with κ = 3λ max (Σ) 1/2 (¯ς ∨ ¯ς − 1 ), where ¯ς =ς/⟨ b θ, Σ b θ ⟩ 1/2 . Proof. By definition, ⟨x,v⟩∼ N(0,v T Σ v) is √ v T Σ v-subGaussian. Optimizing over all unit vectorsv,x is λ 1/2 max (Σ) -subgaussian. 265 Forx| ⟨x, b θ ⟩≥ ς , we use the decomposition of Lemma A.8.10: x| ⟨x, b θ ⟩≥ ς d = Σ b θ ⟨ b θ, Σ b θ ⟩ 1/2 ξ 1 + Σ − Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ 1/2 ξ 2 . Clearly,ξ 2 is 1-subgaussian, which means the second term isλ 1/2 max (Σ) -subgaussian. For the first term, we claim thatξ 1 is1-subgaussian and therefore the first term is λ 1/2 max (Σ) -subgaussian. To show this, we start with the moment generating function ofξ 1 . Recall that ¯ς =ς/⟨ b θ, Σ b θ ⟩ 1/2 : E{e λξ 1 }= Z ∞ ¯ς e λu e − u 2 /2 du √ 2π Φ( − ¯ς) =e λ 2 /2 Φ( λ − ¯ς) Φ( − ¯ς) . Hereφ andΦ are the density and c.d.f. of the standard normal distribution. It follows that: d 2 dλ 2 logE{e λξ 1 }= 1 2 + (λ − ¯ς)φ(λ − ¯ς) Φ( λ − ¯ς) − φ(λ − ¯ς) 2 Φ( λ − ¯ς) 2 ≤ 1 2 +sup λ ≥ ¯ς (λ − ¯ς)φ(λ − ¯ς) Φ( λ − ¯ς) ≤ 1 2 +sup λ ≥ 0 λφ (λ ) Φ( λ ) <1. Now, consider the centered versionξ ′ 1 =ξ 1 − E{ξ 1 }. The above bound also holds ford 2 /dλ 2 (logE{e λξ ′ 1 }). Therefore, by integration,dlogE{e λξ ′ 1 }/dλ ≤ λ +C, for some constantC independent ofλ . Now dlogE{e λξ ′ 1 } dλ λ =0 =E{ξ ′ 1 }=0. Therefore, we can take the constantC to be 0. Repeating this integration argument, we obtainlogE{e λξ ′ 1 }≤ λ 2 /2, which implies thatξ ′ 1 =ξ 1 − E{ξ 1 } is 1-subgaussian. 266 It follows, by triangle inequality, thatξ 1 is(1+E{ξ 1 })-subgaussian. It only remains to boundE{ξ 1 } as below: E{ξ 1 }= φ(¯ς) Φ( − ¯ς) ≤ 1+¯ς 2 ¯ς ≤ 2(¯ς∨¯ς − 1 ). Therefore, the subgaussian constant ofx| ⟨x, b θ ⟩≥ ς is at mostλ max (Σ) 1/2 (2¯ς∨¯ς − 1 +1)≤ 3λ max (Σ) 1/2 (¯ς∨ ¯ς − 1 ). For Example 2.2.6, it remains only to show the constraint on the approximate sparsity of the inverse covariance. 
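The moment generating function identity E{e^{λξ₁}} = e^{λ²/2} Φ(λ − ς̄)/Φ(−ς̄) computed in the proof of Lemma A.8.12 can also be verified by quadrature. The grid of (λ, ς̄) pairs below is an arbitrary illustrative choice.

```python
import math

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def Phi(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def mgf_numeric(lmbda, sbar, upper=14.0, steps=50000):
    # E[exp(lambda * xi_1)] for the standard normal truncated to [sbar, infinity)
    h = (upper - sbar) / steps
    total = 0.0
    for k in range(steps + 1):
        u = sbar + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.exp(lmbda * u) * phi(u)
    return total * h / Phi(-sbar)

mgf_checks = {(l, s): (mgf_numeric(l, s),
                       math.exp(l * l / 2) * Phi(l - s) / Phi(-s))
              for l in (-1.0, 0.5, 1.0) for s in (0.0, 1.0)}
print(mgf_checks)  # numeric quadrature versus the closed form
```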
We show this in the following Lemma A.8.13. LetP x = N(0,Σ) and b θ be any vector such that ∥ b θ ∥ 1 ∥ b θ ∥ ∞ ≤ Lλ min (Σ) ∥ b θ ∥ 2 /2 and ∥Σ − 1 ∥ 1 ≤ L/2. Then, withΩ= E{xx T } − 1 andΩ (2) ( b θ )=E{xx T |⟨x, b θ ⟩≥ ς} − 1 : ∥Ω ∥ 1 ∨∥Ω (2) ∥ 1 ≤ L. Proof. By assumption∥Ω ∥ 1 ≤ L/2, so we only require to prove the claim forΩ (2) =E{xx T |⟨x, b θ ⟩≥ ς} − 1 . Using Lemma A.8.10, we can compute the precision matrix: Ω (2) =E{xx T |⟨x, b θ ⟩≥ ς} − 1 = Σ+( E{ξ 2 1 }− 1) Σ b θ b θ T Σ ⟨ b θ, Σ b θ ⟩ − 1 =Ω+( E{ξ 2 1 } − 1 − 1) b θ b θ T ⟨ b θ, Σ b θ ⟩ , 267 where the last step follows by an application of Sherman–Morrison formula. SinceE{ξ 2 1 }=1+¯ςφ(¯ς)/Φ( − ¯ς), where ¯ς =ς/⟨ b θ, Σ b θ ⟩ 1/2 this yields: Ω (2) =Ω − ¯ςφ(¯ς) Φ( − ¯ς)+¯ςφ(¯ς) b θ b θ T ⟨ b θ, Σ b θ ⟩ . By triangle inequality, for any ¯ς ≥ 0: ∥Ω (2) ∥ 1 ≤∥ Ω ∥ 1 + ∥ b θ b θ T ∥ 1 ⟨ b θ, Σ b θ ⟩ ≤ L 2 + ∥ b θ ∥ 1 ∥ b θ ∥ ∞ λ min (Σ) ∥ b θ ∥ 2 ≤ L. Next we show that the conditional covariance ofx is appropriately Lipschitz. Lemma A.8.14. Suppose ς = ¯ς⟨θ, Σ θ ⟩ 1/2 for a constant ¯ς ≥ 0. Then The conditional covariance function Σ (2) (θ )=E{xx T |⟨x,θ ⟩≥ ς} satisfies: ∥Σ (2) (θ ′ )− Σ (2) (θ )∥ ∞ ≤ K∥θ ′ − θ ∥, whereK = √ 8(1+¯ς 2 )λ max (Σ) 3/2 /λ min (Σ) 1/2 . Proof. Using Lemma A.8.10, Σ (2) (θ )=Σ+( E{ξ 2 1 }− 1) Σ θθ T Σ ⟨θ, Σ θ ⟩ . 268 Letv =Σ 1/2 θ/ ∥Σ 1/2 θ ∥ andv ′ =Σ 1/2 θ ′ /∥Σ 1/2 θ ′ ∥. With this, ∥Σ (2) (θ ′ )− Σ (2) (θ )∥ ∞ =(E{ξ 2 1 }− 1)∥Σ 1/2 (vv T − v ′ v ′T )Σ 1/2 ∥ ∞ ≤ (E{ξ 2 1 }− 1)λ max (Σ) ∥vv T − v ′ v ′T ∥ 2 ≤ (E{ξ 2 1 }− 1)λ max (Σ) ∥vv T − v ′ v ′T ∥ F (a) ≤ √ 2(E{ξ 2 1 }− 1)λ max (Σ) ∥v− v ′ ∥ (b) ≤ √ 8λ max (Σ) 3/2 λ min (Σ) 1/2 (E{ξ 2 1 }− 1)∥θ − θ ′ ∥ (c) ≤ √ 8λ max (Σ) 3/2 λ min (Σ) 1/2 (¯ς 2 +1)∥θ − θ ′ ∥. Here,(a) follows by noting that for two unit vectorsv,v ′ , we have ∥vv T − v ′ v ′T ∥ 2 F =2− 2(v T v ′ ) 2 =2(1− v T v ′ )(1+v T v ′ )≤ 2∥v− v ′ ∥ 2 . 
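The Sherman–Morrison step in the proof of Lemma A.8.13 can be verified exactly in a 2×2 example: with a = E{ξ₁²} − 1 ≥ 0, the inverse of Σ + a Σθθ^TΣ/⟨θ, Σθ⟩ equals Ω − (a/(1+a)) θθ^T/⟨θ, Σθ⟩. The matrix, vector, and value of a below are arbitrary illustrative choices.

```python
def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

Sigma = [[2.0, 0.5], [0.5, 1.0]]
theta = [1.0, -0.5]
a = 0.8  # plays the role of E[xi_1^2] - 1 >= 0

St = matvec(Sigma, theta)                # Sigma theta
q = theta[0] * St[0] + theta[1] * St[1]  # <theta, Sigma theta>
Sig2 = [[Sigma[i][j] + a * St[i] * St[j] / q for j in range(2)] for i in range(2)]

lhs = inv2(Sig2)
Omega = inv2(Sigma)
rhs = [[Omega[i][j] - (a / (1 + a)) * theta[i] * theta[j] / q for j in range(2)]
       for i in range(2)]
err = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(err)  # the two expressions agree up to floating-point error
```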
Also,(b) holds using the following chain of triangle inequalities ∥v− v ′ ∥= Σ 1/2 θ ∥Σ 1/2 θ ∥ − Σ 1/2 θ ′ ∥Σ 1/2 θ ′ ∥ ≤ ∥Σ 1/2 (θ − θ ′ )∥ ∥Σ 1/2 θ ∥ +∥Σ 1/2 θ ′ ∥ 1 ∥Σ 1/2 θ ∥ − 1 ∥Σ 1/2 θ ′ ∥ ≤ 2 ∥Σ 1/2 (θ − θ ′ )∥ ∥Σ 1/2 θ ∥ ≤ 2 s λ max (Σ) λ min (Σ) ∥θ − θ ′ ∥ Finally(c) holds since E{ξ 1 1 }− 1= ¯ςφ(¯ς)/Φ( − ¯ς)≤ ¯ς 2 +1, using standard tail boundφ(¯ς) ¯ς ¯ς 2 +1 ≤ Φ( − ¯ς). 269 A.9 Technicalpreliminaries DefinitionA.9.1. (Subgaussian norm) The subgaussian norm of a random variableX, denoted by∥X∥ ψ 2 , is defined as ∥X∥ ψ 2 ≡ sup q≥ 1 q − 1/2 E{|X| q } 1/q . For a random vectorX the subgaussian norm is defined as ∥X∥ ψ 2 ≡ sup ∥v∥=1 ∥⟨X,v⟩∥ ψ 2 . DefinitionA.9.2. (Subexponential norm) The subexponential norm of a random variableX is defined as ∥X∥ ψ 1 ≡ sup q≥ 1 q − 1 E{|X| q } 1/q . For a random vectorX the subexponential norm is defined by ∥X∥ ψ 1 ≡ sup ∥v∥=1 ∥⟨X,v⟩∥ ψ 1 . DefinitionA.9.3. (Uniformlysubgaussian/subexponentialsequences)Wesayasequenceofrandomvariables {X i } i≥ 1 adapted to a filtration {F i } i≥ 0 is uniformlyK-subgaussian if, almost surely: sup i≥ 1 sup q≥ 1 q − 1/2 E{|X i | q |F i− 1 } 1/q ≤ K. 270 A sequence of random vectors{X i } i≥ 1 is uniformlyK-subgaussian if, almost surely, sup i≥ 1 sup ∥v∥=1 sup q≥ 1 E{|⟨X i ,v⟩| q |F i− 1 } 1/q ≤ K. Subexponential sequences are defined analogously, replacing the factor q − 1/2 withq − 1 above. LemmaA.9.4. For a pair of random variablesX,Y,∥XY∥ ψ 1 ≤ 2∥X∥ ψ 2 ∥Y∥ ψ 2 . Proof. By Cauchy Schwarz: ∥XY∥ ψ 1 =sup q≥ 1 q − 1 E{|XY| q } 1/q ≤ sup q≥ 1 q − 1 E{|X| 2q } 1/2q E{|Y| 2q } 1/2q ≤ 2 sup q≥ 2 (2q) − 1/2 E{|X| 2q } 1/2q · sup q≥ 2 (2q) − 1/2 E{|Y| 2q } 1/2q ≤ 2∥X∥ ψ 2 ∥Y∥ ψ 2 . The following lemma from [186] is a Bernstein-type tail inequality for sub-exponential random vari- ables. LemmaA.9.5 ([186, Proposition 5.16]). LetX 1 ,X 2 ,...,X n beasequenceofindependentrandomvariables withmax i ∥X i ∥ ψ 1 ≤ K. 
Then for anyε≥ 0: P n 1 n n X i=1 X i − E{X i } ≥ ε o ≤ 2exp n − nε 6eK min ε eK ,1 o (A.62) We also use a martingale generalization of [186, Proposition 5.16], whose proof is we omit. 271 Lemma A.9.6. Suppose(F i ) i≥ 0 is a filtration, X 1 ,X 2 ,...,X n is a uniformlyK-subexponential sequence of random variables adapted to(F i ) i≥ 0 such that almost surelyE{X i |F i− 1 }=0. Then for anyε≥ 0: P n 1 n n X i=1 X i ≥ ε o ≤ 2exp n − nε 6eK min ε eK ,1 o (A.63) The following is a rough bound on the LASSO error. LemmaA.9.7 (Rough bound on LASSO error). ForLASSOestimate b θ L withregularizationλ n thefollowing bound holds: ∥ b θ L − θ 0 ∥ 1 ≤ ∥ε∥ 2 2nλ n +2∥θ 0 ∥ 1 . Proof of Lemma A.9.7. We first bound the size of b θ L . By optimality of b θ L : λ n ∥ b θ L ∥ 1 ≤ 1 2n ∥ε∥ 2 2 +λ n ∥θ 0 ∥ 1 − 1 2n ∥y− X b θ L ∥ 2 2 ≤ 1 2n ∥ε∥ 2 2 +λ n ∥θ 0 ∥ 1 . We now use triangle inequality and the bound above to get the claim: ∥ b θ L − θ 0 ∥ 1 ≤∥ b θ L ∥ 1 +∥θ 0 ∥ 1 ≤ 1 2nλ n ∥ε∥ 2 +2∥θ 0 ∥ 1 . 272 AppendixB SupplementtoChapter3 B.1 Proofoftheoremsandtechnicallemmas B.1.1 Technicalpreliminaries LemmaB.1.1 ( Pearson’sχ 2 − test size and power). ConsideramultinomialmodelwithLlabels{1,2,...,L}, andnnumberofsamples. Fors∈[L],letW s denotethenumberofsampleswithlabelsandp s betheoccur- rence probability of labels in one realization of the multinomial model . Consider the following uniformity hypothesis, at the significance level α ∈(0,1): H 0 :p s = 1 L , for1≤ s≤ L, (B.1) with the following decision ruleΨ n,L , which is based on the Pearson’s Chi-squared statisticU n,L : Ψ n,L =I U n,L := L n L X s=1 W s − n L 2 ≥ L+ r 2L α ! . The following statements hold: 273 1. Under the null hypothesis (B.1),U n,L d ⇒χ 2 L− 1 , asn→∞. 2. Under the null hypothesis (B.1), the size of this test is controlled at levelα : P(Ψ n,L =1)≤ α. 3. If for someβ > 0, we have the following: L X s=1 p s − 1 L ≥ 32L 1/4 √ n 1 √ α ∨ 1 β 1/2 , then the type II error does not exceedβ : P(Ψ n,L =0)≤ β. 
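Part 2 of Lemma B.1.1 (size control of the Pearson χ² uniformity test at the threshold L + √(2L/α)) can be sanity-checked by simulation. The sketch below draws multinomial samples under the uniform null and estimates the rejection rate; L, n, α, and the number of repetitions are illustrative.

```python
import math
import random

random.seed(3)
L, n, alpha = 5, 500, 0.05
threshold = L + math.sqrt(2 * L / alpha)  # decision rule of Lemma B.1.1

def chi2_stat():
    # one draw of U_{n,L} = (L/n) * sum_s (W_s - n/L)^2 under the uniform null
    counts = [0] * L
    for _ in range(n):
        counts[random.randrange(L)] += 1
    return (L / n) * sum((w - n / L) ** 2 for w in counts)

reps = 2000
rejections = sum(chi2_stat() >= threshold for _ in range(reps))
rate = rejections / reps
print(rate, alpha)  # the empirical size stays below alpha
```

The Chebyshev-type threshold is conservative relative to the exact χ²_{L−1} quantile, so the empirical size typically falls well below α.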
Regarding the proof of Lemma B.1.1, note that the first part is a classic result on the asymptotic null distribution of the Pearson’s Chi-squared test (See e.g. [107], Theorem 14.3.1.) For the proof of parts 2 and 3, we refer to [5]. More specifically, [5] proves similar claims for the ‘truncated’ χ 2 -test statistic and for more general hypotheses regarding the nominal probabilities of the labels under multinomial models. For the special case of the uniformity testing problem (B.1), the truncated Chi-squared statistic reduces to the classic Pearson’s Chi-squared test statistic. The next lemma is the Berry-Esseen theorem for non-identical independent random variables and its statement is borrowed from [9, Section 5]. 274 LemmaB.1.2. ([9,Section5])Forzero-meanindependentrandomvariablesξ 1 ,...,ξ n with n P i=1 E[ξ 2 i ]=1,let W = n P i=1 ξ i . If n P i=1 E[|ξ 3 i |]≤ γ , then we have sup −∞≤ z≤∞ |P(W ≤ z)− Φ( z)|≤ γ. B.1.2 ProofofTheorem3.6.1 We start by establishing a concentration bound on the normalized rank given by (1.11). Proposition B.1.3. Consider an even functiong, and a dataset(X,Y) ofn i.i.d pairs{(X i ,Y i )} n i=1 , with X i ,Y i ∈R, generated from the following regression model: X ∼ N(0,1), Y =g(X)+ε, ε∼ N(0,1). (B.2) For the marginal covariance score functionT(X,Y) = n − 1 X T Y, and counterfeit datasets e X j ∼ N(0,I n ), recall the CRTp statistic: p (M) n = 1+ M P j=1 I{T(X,Y)≥ T( e X j ,Y)} M +1 . (B.3) Then, the statisticp (M) n concentrates around1/2. In particular, for anyδ > 0 andM >1/δ , we have lim n→∞ P p (M) n − 1/2 ≥ δ ≤ 1 (δ − 1/M) 2 1 4M + M− 1 M E Z∼ N(0,1) [Φ 2 (ηZ )]− 1 4 , withη = 1+E[X 2 g(X) 2 ] 1+E[g(X) 2 ] 1/2 . 275 The proof of this proposition is given in Section B.1.3. We next show that the deviation of thep-statistic from 1/2 can be controlled by the choice of η . Note that for the normal distribution function Φ and the normal densityϕ we have 0≤ Φ( ηz )=Φ(0)+ Z ηz 0 ϕ (t)dt≤ 1 2 + η |z| √ 2π . 
Consequently forZ∼ N(0,1), E Φ 2 (ηZ ) ≤ E h 1 2 + η |Z| √ 2π 2 i = 1 4 + η 2 2π + η π . Therefore, E Φ 2 (ηZ ) − 1 4 ≤ η 2 +2η 2π , which along with the result of Proposition B.1.3 implies lim M→∞ lim n→∞ P p (M) n − 1 2 ≥ δ ≤ η 2 +2η 2πδ 2 . The proof of part (a) for two-sided CRT, follows by settingδ =(1− α )/2 and using thatδ > 1/4. Proof of part (b) follows along the same lines. The only modification is that time we set δ =1/2− α ≥ γ , which brings us to lim M→∞ lim n→∞ P p (M) n ≥ 1− α ≤ α 2 , lim M→∞ lim n→∞ P p (M) n ≤ α ≤ α 2 . 276 This completes the proof of part (b) for one-sided CRT. B.1.3 ProofofPropositionB.1.3 Consider M counterfeits e X 1 , e X 2 ,..., e X M sampled independently from N(0,I n ). For j ∈ [M], let I j = I{T(X,Y)≥ T( e X j ,Y)}. LetT(X,Y)=n − 1 X T Y and µ n =E h I{T(X,Y)≥ T( e X 1 ,Y)} i , σ 2 n =Var h I{T(X,Y)≥ T( e X 1 ,Y)} i . It is easy to see thatσ 2 n =µ n (1− µ n ). Before proceeding further we establish a lemma which will be used in proving the result. LemmaB.1.4. The followings hold: lim n→∞ µ n = lim n→∞ P(T( e X,Y)≤ T(X,Y))=1/2, lim n→∞ E P {T(X,Y)≥ T( e X,Y)}|X,Y 2 =E Z∼ N(0,1) [Φ 2 (ηZ )], whereη = 3+2E[X 2 g(X) 2 ] 3+2E[g(X) 2 ] . 277 By applying Chebyshev’s inequality we get P M X j=1 I j M − 1/2 ≥ δ ≤ P M X j=1 I j M − µ n ≥ δ −| µ n − 1/2| ≤ 1 (δ −| µ n − 1/2|) 2 · E h M X j=1 I j M − µ n 2 i , = 1 (δ −| µ n − 1/2|) 2 · E h 1 M 2 M X j=1 (I j − µ n ) 2 + 1 M 2 X i̸=j (I i − µ n )(I j − µ n )) i = 1 (δ −| µ n − 1/2|) 2 · σ 2 n M + M− 1 M E[(I 1 − µ n )(I 2 − µ n )] = 1 (δ −| µ n − 1/2|) 2 · 1 M µ n (1− µ n )+ M− 1 M E[I 1 I 2 − µ 2 n ] . (B.4) We next computeE[I 1 I 2 − µ 2 n ]. E[I 1 I 2 ]=P {T(X,Y)≥ T( e X 1 ,Y)}∩{T(X,Y)≥ T( e X 2 ,Y)} =E h P {T(X,Y)≥ T( e X 1 ,Y)}∩{T(X,Y)≥ T( e X 2 ,Y)}|X,Y i =E h P {T(X,Y)≥ T( e X 1 ,Y)}|X,Y P {T(X,Y)≥ T( e X 2 ,Y)}|X,Y i =E P {T(X,Y)≥ T( e X,Y)}|X,Y 2 , where we used the fact that conditioned onX andY, score valuesT( e X 1 ,Y),T( e X 2 ,Y) are independent. 
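The bound E[Φ²(ηZ)] − 1/4 ≤ (η² + 2η)/(2π) derived above can be checked by quadrature. At η = 1 an exact reference value is available: Φ(Z) is uniform on [0,1] by the probability integral transform, so E[Φ(Z)²] = 1/3. The η grid and quadrature parameters below are illustrative.

```python
import math

def Phi(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def e_phi_sq(eta, lim=10.0, steps=50000):
    # E[Phi(eta Z)^2] for Z ~ N(0,1), by trapezoidal quadrature
    h = 2 * lim / steps
    total = 0.0
    for k in range(steps + 1):
        z = -lim + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * Phi(eta * z) ** 2 * phi(z)
    return total * h

vals = {eta: e_phi_sq(eta) for eta in (0.5, 1.0, 2.0)}
for eta, v in vals.items():
    print(eta, v - 0.25, (eta * eta + 2 * eta) / (2 * math.pi))
```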
Therefore, by using Lemma B.1.4, we write lim n→∞ E[(I 1 I 2 − µ 2 n )]= lim n→∞ E P {T(X,Y)≥ T( e X,Y)}|X,Y 2 − µ 2 n = lim n→∞ E P {T(X,Y)≥ T( e X,Y)}|X,Y 2 − 1/4 =E Z∼ N(0,1) [Φ 2 (ηZ )]− 1/4. (B.5) 278 To summarize, we letS M = M P j=1 I{T(X,Y)≥ T( e X j ,Y)} and use (B.5) in (B.4) along withlim n→∞ µ n = 1/2 per Lemma B.1.4 to obtain lim n→∞ P S M M − 1 2 ≥ δ ≤ 1 4Mδ 2 + M− 1 Mδ 2 · (E Z∼ N(0,1) [Φ 2 (ηZ )]− 1/4), ∀δ > 0. (B.6) Recalling thep statistic (B.3), we havep (M) n = 1+S M M+1 . AsS M ≤ M, we have S M M ≤ S M +1 M +1 ≤ S M M + 1 M , which implies that p (M) n − 1 2 ≤ S M M − 1 2 + 1 M . Using this relation along with triangle inequality in (B.6) to arrive at the following: P p (M) n − 1 2 ≥ δ ≤ P S M M − 1 2 ≥ δ − 1 M ≤ 1 (ε− 1/M) 2 1 4M + M− 1 M · (E Z∼ N(0,1) [Φ 2 (ηZ )]− 1/4) . This completes the proof of Proposition B.1.3. B.1.3.1 ProofofLemmaB.1.4 We start by establishing a lemma which characterizes the conditional probability that the original data score exceeds a counterfeit score. LemmaB.1.5. The following holds P(T( e X,Y)≤ T(X,Y)|X,Y)− Φ nT(X,Y) ∥Y∥ 2 ≤ C 1 ∥Y∥ 3 3 ∥Y∥ 3 2 . (B.7) 279 with whereC 1 is an absolute constant. We next show thatE ∥Y∥ 3 3 ∥Y∥ 3 2 →0 asn→∞. E ∥Y∥ 3 3 ∥Y∥ 3 2 = 1 √ n E " n − 1 P n i=1 |Y i | 3 n − 1 P n i=1 |Y i | 2 3/2 # . By recalling the strong law of large numbers, quantities n − 1 P n i=1 |Y i | 3 and n − 1 P n i=1 |Y i | 2 will almost surely converge toE[|g(x)+ε| 3 ], andE[|g(x)+ε| 2 ], respectively. This implies the almost sure convergence of ∥Y∥ 3 3 ∥Y∥ 3 2 to0 asn grows to infinity. In the next step, by using ||Y|| 3 /||Y|| 2 ≤ 1 along with the dominant convergence theorem, we arrive at lim n→∞ E ∥Y∥ 3 3 ∥Y∥ 3 2 =0. (B.8) In the next lemma, we characterize the distribution of the other quantity in (B.7). LemmaB.1.6. We have nT(X,Y) ∥Y∥ 2 d → N(0,η 2 ), asn→∞. withη = 1+E[X 2 g(X) 2 ] 1+E[g(X) 2 ] 1/2 . 
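The convergence ∥Y∥₃³/∥Y∥₂³ → 0 established in (B.8) occurs at the n^{−1/2} rate, since the ratio equals n^{−1/2} times a quantity converging almost surely to E|Y|³/(E|Y|²)^{3/2}. The sketch below, assuming the even function g(x) = x² purely for illustration, compares the ratio at two sample sizes.

```python
import random

random.seed(4)

def norm_ratio(n, g=lambda x: x * x):
    # draw Y_i = g(X_i) + eps_i and return ||Y||_3^3 / ||Y||_2^3
    ys = [g(random.gauss(0, 1)) + random.gauss(0, 1) for _ in range(n)]
    l3_cubed = sum(abs(y) ** 3 for y in ys)
    l2 = sum(y * y for y in ys) ** 0.5
    return l3_cubed / l2 ** 3

r_small, r_large = norm_ratio(100), norm_ratio(10000)
print(r_small, r_large)  # the ratio shrinks roughly like n^{-1/2}
```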
Using the result of Lemma B.1.6 and by an application of the Portmanteau theorem for the bounded continuous functionΦ we get lim n→∞ E Φ nT(X,Y) ∥Y∥ 2 =E Z∼ N(0,1) [Φ( ηZ )]. (B.9) 280 Combining (B.8) and (B.9) with (B.7) we arrive at lim n→∞ P(T( e X,Y)≤ T(X,Y))=E Z∼ N(0,1) [Φ( ηZ )] . In the next lemma we show thatE Z∼ N(0,1) [Φ( ηZ )]=1/2, which completes the proof of the first part. Lemma B.1.7. Let Φ( ·) denote the distribution of standard normal variable. Then, for any constant η we have E Z∼ N(0,1) [Φ( ηZ )]=1/2. The second part of the lemma follows by a similar argument. From Lemma B.1.5 we have P(T( e X,Y)≤ T(X,Y)|X,Y) 2 − Φ 2 nT(X,Y) ∥Y∥ 2 ≤ 2C 1 ∥Y∥ 3 3 ∥Y∥ 3 2 . Using (B.8) yields lim n→∞ E h P(T( e X,Y)≤ T(X,Y)|X,Y) 2 i = lim n→∞ E Φ 2 nT(X,Y) ∥Y∥ 2 (B.10) Also, by using Lemma B.1.6 and an application of the Portmanteau theorem for the bounded continuous functionΦ 2 we obtain lim n→∞ E Φ 2 nT(X,Y) ∥Y∥ 2 =E Z∼ N(0,1) [Φ 2 (ηZ )], which invoking (B.10) completes the proof of Lemma B.1.4 second part. 281 B.1.3.2 ProofofLemmaB.1.5 We focus on the distribution ofT( e X,Y)|X,Y and treatX,Y as deterministic values, so the only source of randomness is e X. To lighten the notation, we introduce ξ i = e X i Y i ∥Y∥ 2 , fori∈[n]. By simple algebraic computations, we get thatE[ξ i |X,Y] = 0 and n P i=1 E[ξ 2 i |X,Y] = 1. Also, condi- tioned onX,Y, random variables ξ i are independent. We next use the Berry-Essen theorem to charac- terize the distribution of P n i=1 ξ i . For the reader’s convenience, the version of the Berry-Esseen theorem for non-identical random variables is provided in Lemma B.1.2. First, we need to bound the sum of third moments: n X i=1 E[|ξ 3 i |]= n X i=1 E[| e X i | 3 ]|Y i | 3 ∥Y∥ 3 2 =C 1 ∥Y∥ 3 3 ∥Y∥ 3 2 , where the coefficients C 1 is a universal constant that can be precisely computed by using the third moment of the half-normal distribution. Note that here the expectation is with respect to e X i . 
Now, we employ the Berry–Esseen theorem B.1.2 to get: sup z P( n X i=1 ξ i ≤ z|X,Y)− Φ( z) ≤ C 1 ∥Y∥ 3 3 ∥Y∥ 3 2 . (B.11) 282 From the definition of ξ i and recalling the definition of score T( e X,Y) we have T( e X,Y)= 1 n e X T Y = 1 n ∥Y∥ 2 n X i=1 ξ i . Using the above relation and choosingz = nT(X,Y) ∥Y∥ 2 in (B.11) (note thatz is a measurable function of X,Y), we get P(T( e X,Y)≤ T(X,Y)|X,Y)− Φ nT(X,Y) ∥Y∥ 2 ≤ C 1 ∥Y∥ 3 3 ∥Y∥ 3 2 . B.1.3.3 ProofofLemmaB.1.6 Substituting forT(X,Y) we get nT(X,Y) ∥Y∥ 2 = 1 √ n P n i=1 X i Y i 1 n n P i=1 Y 2 i 1/2 (B.12) By an application of the central limit theorem, the numerator converges in distribution to a normal random variable. More precisely, E[XY]=E[X(g(X)+ε)] =E[Xg(X)]+E[Xε]=0, 283 where in the last relation we used the property thatg is an even function andX ∼ N(0,1). In addition, Var[(XY)]=1+E[X 2 g(X) 2 ] by simple calculation. Therefore, by CLT we have n − 1/2 X T Y d → N(0,1+E[X 2 g(X) 2 ]) (B.13) On the other hand, from the weak law of large numbers we have that the denominator in (B.12) con- verges in probability to1+E[g(X) 2 ]. The proof is completed by using the Slutsky’s theorem. B.1.3.4 ProofofLemmaB.1.7 Let Φ( ·) and ϕ (·) respectively denote the distribution and the density function of the standard normal random variable. We write ∂ ∂η E[Φ( ηZ )]=E[Zϕ (ηZ )] (a) = E[ηϕ ′ (ηZ )] (b) =E[− η 2 Zϕ (ηZ )]=− η 2 ∂ ∂η E[Φ( ηZ )], (B.14) where we used Stein’s lemma in step(a), and the identityϕ ′ (t)=− tϕ (t) in step (b). Therefore, ∂ ∂η E[Φ( ηZ )]= 0 and soE[Φ( ηZ )]=E[Φ(0)]=1 /2. B.1.4 ProofofTheorem3.1.2 Because of Assumption 3.1.1 (continuity of probability laws), with probability one all the score values are distinct and so there is no ambiguity (tie) in labeling the data points. Recall W ℓ as the number of data points with label ℓ. By construction, the joint distribution of (n g W 1 ,n g W 2 ,...,n g W L ) is a multinomial distribution withL distinct values (number of labels). 
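Lemma B.1.7 can be confirmed numerically; it also follows directly from symmetry, since E[Φ(ηZ)] = P(Z′ ≤ ηZ) for an independent standard normal Z′, and Z′ − ηZ is symmetric about zero. The quadrature sketch below checks the identity on an illustrative grid of η values.

```python
import math

def Phi(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def phi(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

def e_phi(eta, lim=10.0, steps=40000):
    # E[Phi(eta Z)] for Z ~ N(0,1), by trapezoidal quadrature
    h = 2 * lim / steps
    total = 0.0
    for k in range(steps + 1):
        z = -lim + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * Phi(eta * z) * phi(z)
    return total * h

half_vals = [e_phi(eta) for eta in (-3.0, 0.7, 1.0, 5.0)]
print(half_vals)  # each entry equals 1/2, regardless of eta
```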
Denote byp ℓ the probability of getting labelℓ. Then, 284 the statistic U ng,L , given in Algorithm 1. , is the standard Pearson’s χ 2 test statistic for testing the null hypothesis H ′ 0 :p ℓ = 1 L , forℓ∈[L]. (B.15) Now by using Lemma B.1.1 (Part 1), we getU ng,L d ⇒ χ 2 L− 1 , asn g → ∞. The claim aboutθ finite L,α follows from Part 2 of Lemma B.1.1. B.1.5 ProofofRemark3.2.5 Define g(u,z,y):= ∂ ∂u F T|ZY F − 1 T|Z u;Z,Y ;Z,Y . Then by Assumption 3.2.3 we have R 1 0 E (Z,Y)∼L (Z,Y) [|g(u,Z,Y)|] <∞ and as an application of the Fubini’s theorem, we can change the order of integration and the expectation and get: Z u 0 E (Z,Y)∼L (Z,Y) [g(v,Z,Y)]dv =E (Z,Y)∼L (Z,Y) h Z u 0 g(v,Z,Y)dv i =E (Z,Y)∼L (Z,Y) h F T|ZY F − 1 T|Z u;Z,Y ;Z,Y − F T|ZY F − 1 T|Z 0;Z,Y ;Z,Y i =E (Z,Y)∼L (Z,Y) h F T|ZY F − 1 T|Z u;Z,Y ;Z,Y Now taking derivative of both sides with respect tou, we arrive at E (Z,Y)∼L (Z,Y) [g(v,Z,Y)]= r T (u). (B.16) We next prove part (b) of the remark. Part(a) follows readily from part (a) since under the null hypoth- esisT(X,Z,Y) andT( e X,Z,Y) have the same distribution. 285 By definition of the conditional dependency power ∆ T (L(X,Z,Y)), cf. Definition 3.2.4 we have ∆ T (L(X,Z,Y))= Z 1 0 |r T (u)− 1|du (a) = 1 Z 0 E (Z,Y)∼L (Z,Y) ∂ ∂u F T|ZY F − 1 T|Z u;Z,Y ;Z,Y − 1 du (b) ≤ 1 Z 0 E (Z,Y)∼L (Z,Y) ∂ ∂u F T|ZY F − 1 T|Z u;Z,Y ;Z,Y − 1 du (c) =E (Z,Y)∼L (Z,Y) 1 Z 0 ∂ ∂u F T|ZY F − 1 T|Z u;Z,Y ;Z,Y − 1 du =E (Z,Y)∼L (Z,Y) 1 Z 0 f T|ZY F − 1 T|Z u;Z,Y ;Z,Y f T|Z F − 1 T|Z u;Z,Y ;Z,Y − 1 du =E (Z,Y)∼L (Z,Y) ∞ Z −∞ f T|Z,Y (t;Z,Y) f T|Z (t;Z,Y) − 1 f T|Z (t;Z,Y)dt =E (Z,Y)∼L (Z,Y) h 2d TV (T( e X,Z,Y)|Z,Y),(T(X,Z,Y)|Z,Y) i , withX∼L (X|Z,Y), e X ∼L (X|Z), andf T|Z,Y andf T|Z representing the density functions correspond- ing to cdfsF T|Z,Y andF T|Z . Note that in (a) we used (B.16); (b) is a direct result of Jenson’s inequality, and (c) follows from Assumption 3.2.3 in conjunction with the Fubini’s theorem. 
B.1.6 Proof of Proposition 3.2.6

Based on the construction of the Pearson $\chi^2$-CI statistic $U_{n_g,L}$ described in Algorithm 1, for a group $G = (X,Z,Y)$ and its $M$ constructed counterfeits $\{\widetilde{G}_i = (\widetilde{X}_i, Z, Y)\}_{i=1:M}$, we have the rank value
\[
R = 1 + \sum_{j=1}^M \mathbb{I}\{ T(G) \ge T(\widetilde{G}_j) \}.
\]
This allows us to compute the probability of $G$ getting label $t \in [L]$:
\begin{align}
\mathbb{P}(G \text{ has label } t) &= \mathbb{P}\big( (t-1)K + 1 \le R \le tK \big) \\
&= \sum_{j=K(t-1)+1}^{Kt} \mathbb{P}(R = j) \\
&= \sum_{j=K(t-1)+1}^{Kt} \mathbb{E}_{ZY}\big[ \mathbb{P}(R = j \,|\, Y = y, Z = z) \big]. \qquad (B.17)
\end{align}
Note that by conditioning on $(Z,Y) = (z,y)$, the random variables $T(G)$ and $T(\widetilde{G}_i)$ are independent, as $G = (X,Z,Y)$ and $\widetilde{G}_i = (\widetilde{X}_i,Z,Y)$. To lighten the notation, we use the shorthands $T := T(G)$ and $\widetilde{T}_i := T(\widetilde{G}_i)$, and proceed as follows:
\begin{align}
\mathbb{P}(R = j \,|\, Y = y, Z = z) &= \mathbb{P}\big( T \text{ is larger than exactly } j - 1 \text{ of the } \widetilde{T}_i \,\big|\, Z = z, Y = y \big) \\
&\stackrel{(a)}{=} \int \mathbb{P}\big( t \text{ is larger than exactly } j-1 \text{ of the } \widetilde{T}_i \,\big|\, Z = z, Y = y \big) \, dF_{T|ZY}(t; z, y) \\
&\stackrel{(b)}{=} \binom{M}{j-1} \int F_{T|Z}(t; z, y)^{j-1} \big( 1 - F_{T|Z}(t; z, y) \big)^{M-j+1} \, dF_{T|ZY}(t; z, y) \\
&= \binom{M}{j-1} \int_0^1 u^{j-1} (1-u)^{M-j+1} \, dF_{T|ZY}\big( F^{-1}_{T|Z}(u; z, y); z, y \big), \qquad (B.18)
\end{align}
where (a) comes from the fact that $T|Z,Y$ is distributed according to the cdf $F_{T|ZY}(\cdot)$, and (b) holds since each $\widetilde{T}_i|Z,Y$ is distributed according to $F_{T|Z}(\cdot)$, independently of $T$. Note that for a function $f(x)$, the notation $df(x) = f'(x)dx$ denotes the differential of $f(x)$. We next plug equation (B.18) into (B.17) to get
\begin{align}
\mathbb{P}(G \text{ has label } t) &= \sum_{j=K(t-1)+1}^{Kt} \binom{M}{j-1} \mathbb{E}_{ZY}\Big[ \int_0^1 u^{j-1}(1-u)^{M-j+1} \, dF_{T|ZY}\big(F^{-1}_{T|Z}(u;Z,Y);Z,Y\big) \Big] \\
&\stackrel{(a)}{=} \sum_{j=K(t-1)+1}^{Kt} \binom{M}{j-1} \int_0^1 u^{j-1}(1-u)^{M-j+1} \, \mathbb{E}_{ZY}\Big[ dF_{T|ZY}\big(F^{-1}_{T|Z}(u;Z,Y);Z,Y\big) \Big] \\
&\stackrel{(b)}{=} \sum_{j=K(t-1)+1}^{Kt} \binom{M}{j-1} \int_0^1 u^{j-1}(1-u)^{M-j+1} \, d\,\mathbb{E}_{ZY}\Big[ F_{T|ZY}\big(F^{-1}_{T|Z}(u;Z,Y);Z,Y\big) \Big] \\
&= \sum_{j=K(t-1)+1}^{Kt} \binom{M}{j-1} \int_0^1 u^{j-1}(1-u)^{M-j+1} \, dR_T(u) \\
&= \sum_{j=K(t-1)}^{Kt-1} \binom{M}{j} \int_0^1 u^{j}(1-u)^{M-j} \, r_T(u) \, du, \qquad (B.19)
\end{align}
where in (a) we used Fubini's theorem along with Assumption 3.2.3 and the fact that $|u^j(1-u)^{M-j}| \le 1$ for every $0 \le u \le 1$; (b) is a direct result of Assumption 3.2.3 and the dominated convergence theorem. This completes the proof of claim (3.7).
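The rank formula above drives the labeling. The following small script (an illustration with $K, L$ chosen here; not code from the thesis) verifies by exact enumeration that when the real score and the $M = KL - 1$ counterfeit scores are exchangeable — the null situation — every label is equally likely:

```python
from fractions import Fraction
from itertools import permutations

# Toy check: under exchangeability every ordering of the M+1 scores is equally
# likely; enumerate all orderings and accumulate the label probabilities.
K, L = 2, 3
M = K * L - 1
label_prob = [Fraction(0)] * L
orderings = list(permutations(range(M + 1)))    # position 0 plays the real score
for perm in orderings:
    rank = 1 + sum(perm[0] > perm[j] for j in range(1, M + 1))  # R = 1 + #{T >= T_j}
    label = (rank - 1) // K + 1                 # label t iff (t-1)K + 1 <= R <= tK
    label_prob[label - 1] += Fraction(1, len(orderings))
assert label_prob == [Fraction(1, L)] * L
```

Since the scores are almost surely distinct (Assumption 3.1.1), strict comparison matches the $\ge$ in the rank definition.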
It is worth noting that when $X \perp\!\!\!\perp Y | Z$, we have $P_{X|ZY} = P_{X|Z}$, which implies $R_T(u) = u$, so the conditional relative density function $r_T(u)$ always attains the constant value $1$. In this case, we have
\begin{align}
p_t &= \sum_{j=K(t-1)}^{Kt-1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \, du \\
&= \sum_{j=K(t-1)}^{Kt-1} \binom{M}{j} B(j+1, M-j+1) \\
&= \sum_{j=K(t-1)}^{Kt-1} \binom{M}{j} \frac{\Gamma(j+1)\Gamma(M-j+1)}{\Gamma(M+2)} \\
&= \sum_{j=K(t-1)}^{Kt-1} \binom{M}{j} \frac{j!\,(M-j)!}{(M+1)!} \\
&= \sum_{j=K(t-1)}^{Kt-1} \frac{1}{M+1} = \frac{K}{M+1} = \frac{1}{L}, \qquad (B.20)
\end{align}
where $B(a,b)$ is the Beta function and $\Gamma(a)$ is the Gamma function.

Now, we are ready to prove Part (i). First note that deriving a more explicit characterization of $p_t$ from (B.19) is in general intractable, due to the relative density term $r_T(u)$ in the inner integral expression. However, it is useful to note that if $r_T(u)$ is a polynomial in $u$, then this probability can be easily computed by absorbing the polynomial into the integral formulation of the Beta function and then leveraging the connection between the Gamma and Beta functions at integer values. Inspired by this observation, our strategy is to approximate $r_T(u)$ with polynomials. To this end, note that by Assumption 3.2.2, $r_T(u)$ is a continuous function over the $[0,1]$ interval, which allows us to use the Weierstrass theorem to uniformly approximate $r_T(u)$ as closely as desired by polynomials. Formally, for any $\varepsilon > 0$ there exists a polynomial $\widetilde{r}(u)$ with real coefficients such that
\[
\sup_{u \in [0,1]} |\widetilde{r}(u) - r_T(u)| < \varepsilon. \qquad (B.21)
\]
In addition, from (B.19), for every $\ell \in [L]$ we have
\begin{align}
\sum_{t=1}^{\ell} p_t &= \sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} r_T(u) \, du \\
&\ge \sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \widetilde{r}(u) \, du - \varepsilon \sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \, du \\
&= \sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \widetilde{r}(u) \, du - \frac{\ell \varepsilon}{L}, \qquad (B.22)
\end{align}
where in the last equality we used the result in (B.20) that when $R_T(u) = u$ we have $p_t = 1/L$. We are left with lower bounding the summation on the right-hand side of (B.22). Let $\widetilde{r}(u)$ be a polynomial of degree $N$ with coefficients $a_i$, i.e. $\widetilde{r}(u) = \sum_{i=0}^N a_i u^i$.
We have
\begin{align}
\sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \widetilde{r}(u) \, du &= \sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \sum_{i=0}^N a_i u^i \, du \\
&= \sum_{j=0}^{\ell K - 1} \sum_{i=0}^N a_i \binom{M}{j} \int_0^1 u^{j+i} (1-u)^{M-j} \, du \\
&= \sum_{j=0}^{\ell K - 1} \sum_{i=0}^N a_i \binom{M}{j} B(j+i+1, M-j+1) \\
&= \sum_{j=0}^{\ell K - 1} \sum_{i=0}^N a_i \binom{M}{j} \frac{(j+i)!\,(M-j)!}{(M+i+1)!} \\
&= \sum_{i=0}^N a_i \frac{M!\, i!}{(M+i+1)!} \sum_{j=0}^{\ell K - 1} \binom{j+i}{i} \\
&= \sum_{i=0}^N a_i \frac{M!\, i!}{(M+i+1)!} \binom{\ell K + i}{i+1} \\
&= \sum_{i=0}^N \frac{a_i}{i+1} \prod_{h=0}^{i} \frac{\ell K + h}{M + 1 + h}, \qquad (B.23)
\end{align}
where in the penultimate equation we used the hockey-stick identity. Next, use the following simple inequality in (B.23):
\[
\frac{\ell K + h}{M + 1 + h} \ge \frac{\ell K}{M+1} = \frac{\ell}{L},
\]
to arrive at
\[
\sum_{j=0}^{\ell K - 1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j} \widetilde{r}(u) \, du \ge \sum_{i=0}^N \frac{a_i}{i+1} \Big( \frac{\ell}{L} \Big)^{i+1} = \int_0^{\ell/L} \widetilde{r}(u) \, du.
\]
Next we plug the above lower bound into (B.22) to get
\[
\sum_{t=1}^{\ell} p_t \ge \int_0^{\ell/L} \widetilde{r}(u) \, du - \frac{\ell \varepsilon}{L},
\]
which along with (B.21) implies that
\[
\sum_{t=1}^{\ell} p_t \ge \int_0^{\ell/L} r_T(u) \, du - \frac{2\ell\varepsilon}{L} = R_T\Big(\frac{\ell}{L}\Big) - \frac{2\ell\varepsilon}{L}.
\]
Finally, since $\varepsilon > 0$ can be chosen arbitrarily small, by letting $\varepsilon \to 0$ we get the desired claim of (3.8).

We next proceed to Part (ii). In Part (i), we used a general form of the Weierstrass approximation theorem to uniformly approximate $r_T$ as closely as desired, while the rate of convergence (in terms of the polynomial degree) was not needed. For establishing an upper bound on the sum of label probabilities $\sum_{s=1}^{\ell} p_s$, we need to upper bound the polynomial-approximation error, and knowing the convergence rate becomes important. For this reason, we use a more refined version of the Weierstrass approximation theorem. For the reader's convenience, we state this version in the following lemma, borrowed from [70]:

Lemma B.1.8 ([70], Theorem 1). Let $f$ be a $B$-bounded and $C$-Lipschitz continuous function on $[0,1]$. Then, for every positive integer $N$, there exists a polynomial $\widetilde{f}_N$ of degree $N$ such that
\[
\sup_{u \in [0,1]} |f(u) - \widetilde{f}_N(u)| \le (C/2 + 2B) \sqrt{\frac{\log N}{N}}.
\]
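The Beta-function and hockey-stick computation in (B.23) can be sanity-checked exactly in rational arithmetic for small parameter values. The sketch below (illustrative $K, L$ of my choosing) checks the identity for the monomials $\widetilde{r}(u) = u^i$:

```python
from fractions import Fraction
from math import comb, factorial

def beta(a, b):
    """Beta function at integer arguments: B(a,b) = (a-1)!(b-1)!/(a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

K, L = 3, 4            # illustrative sizes; M + 1 = K * L
M = K * L - 1
for ell in range(1, L + 1):
    for i in range(0, 6):
        # left side of (B.23) for r(u) = u^i
        lhs = sum(comb(M, j) * beta(j + i + 1, M - j + 1) for j in range(ell * K))
        # right side: (1/(i+1)) * prod_{h=0}^{i} (ell*K + h)/(M + 1 + h)
        rhs = Fraction(1, i + 1)
        for h in range(i + 1):
            rhs *= Fraction(ell * K + h, M + 1 + h)
        assert lhs == rhs, (ell, i, lhs, rhs)
```

The case $i = 0$ recovers the block sums of (B.20): each block of $K$ Beta integrals equals $1/L$, so the first $\ell$ blocks sum to $\ell/L$.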
Recall that by Assumption 3.2.2, $r_T(u)$ is $B$-bounded and $C$-Lipschitz, and therefore, by an application of Lemma B.1.8, there exists a polynomial $\widetilde{r}_N$ of degree $N$ such that, for $D = C/2 + 2B$, we have
\[
\| r_T - \widetilde{r}_N \|_\infty \le D \sqrt{\frac{\log N}{N}}. \qquad (B.24)
\]
Let $\widetilde{r}_N(u) = \sum_{i=0}^N a_i u^i$. By a similar argument to the one used in deriving (B.22) and (B.23), we get
\[
\sum_{t=1}^{\ell} p_t \le \sum_{i=0}^N \frac{a_i}{i+1} \prod_{h=0}^{i} \frac{\ell K + h}{M+1+h} + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}}. \qquad (B.25)
\]
To further simplify the right-hand side, we use the following simple algebraic manipulations. Since $h \le i \le N$ and $M + 1 = LK \ge \ell K$, we have $(M + 1 - \ell K)(N - h) \ge 0$, from which we get
\[
\frac{\ell K + h}{M+1+h} \le \frac{\ell K + N}{M+1+N} = \frac{\ell K + N}{LK + N} = \frac{\ell}{L} \left( \frac{K + \frac{N}{\ell}}{K + \frac{N}{L}} \right) \le \frac{\ell}{L} \Big( 1 + \frac{N}{K} \Big).
\]
Using this bound in (B.25), we arrive at
\begin{align}
\sum_{t=1}^{\ell} p_t &\le \Big( 1 + \frac{N}{K} \Big)^N \sum_{i=0}^N \frac{a_i}{i+1} \Big( \frac{\ell}{L} \Big)^{i+1} + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}} \\
&= \Big( 1 + \frac{N}{K} \Big)^N \int_0^{\ell/L} \widetilde{r}_N(u) \, du + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}} \\
&\le e^{N^2/K} \int_0^{\ell/L} \widetilde{r}_N(u) \, du + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}}.
\end{align}
By using (B.24) again, we obtain
\[
\sum_{t=1}^{\ell} p_t \le e^{N^2/K} \int_0^{\ell/L} r_T(u) \, du + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}} \big( 1 + e^{N^2/K} \big) = e^{N^2/K} R_T\Big( \frac{\ell}{L} \Big) + \frac{\ell D}{L} \sqrt{\frac{\log N}{N}} \big( 1 + e^{N^2/K} \big).
\]
Set $N = \sqrt{K \log(1+\delta)}$ for a fixed $0 < \delta < 1$ and rewrite the above bound as
\[
\sum_{t=1}^{\ell} p_t \le (1+\delta)\, R_T\Big( \frac{\ell}{L} \Big) + \frac{3\ell D}{L} \left( \frac{\log(K \log(1+\delta))}{2\sqrt{K \log(1+\delta)}} \right)^{1/2}.
\]
By using the relations $\ell \le L$, $\delta < 1$, $R_T(u) \le 1$, and $\log(1+\delta) \ge \delta/2$ for $\delta \in [0,1]$, we obtain
\[
\sum_{t=1}^{\ell} p_t \le R_T\Big( \frac{\ell}{L} \Big) + \delta + 3D \Big( \frac{\log K}{\sqrt{K}\,\delta} \Big)^{1/2}.
\]
Minimizing the right-hand side over $\delta$, we get $\delta = \big( \frac{9D^2 \log K}{\sqrt{K}} \big)^{2/5}$, which is smaller than one for $K$ sufficiently large. Plugging in this value of $\delta$, we obtain
\[
\sum_{t=1}^{\ell} p_t \le R_T\Big( \frac{\ell}{L} \Big) + \nu_K, \quad \text{with } \nu_K = 2 \Big( \frac{9D^2 \log K}{\sqrt{K}} \Big)^{2/5}.
\]
We next proceed to prove Part (iii). For $1 \le t \le L$, let
\[
q_t := R_T\Big( \frac{t}{L} \Big) - R_T\Big( \frac{t-1}{L} \Big). \qquad (B.26)
\]
By employing the results of Parts (i) and (ii), both $\sum_{j=1}^t p_j - R_T(t/L)$ and $\sum_{j=1}^{t-1} p_j - R_T(\frac{t-1}{L})$ lie in $[0, \nu_K]$, and since $p_t - q_t$ is their difference, we have
\[
|p_t - q_t| = \Big| \Big( \sum_{j=1}^{t} p_j - R_T\Big(\frac{t}{L}\Big) \Big) - \Big( \sum_{j=1}^{t-1} p_j - R_T\Big(\frac{t-1}{L}\Big) \Big) \Big| \le \nu_K.
\]
Therefore,
\[
\sum_{t=1}^L \Big| p_t - \frac{1}{L} \Big| \ge \sum_{t=1}^L \Big| q_t - \frac{1}{L} \Big| - \sum_{t=1}^L |p_t - q_t| \ge -L\nu_K + \sum_{t=1}^L \Big| q_t - \frac{1}{L} \Big|.
\]
(B.27)
Next, by applying the mean value theorem to the definition of $q_t$ in (B.26), for every $t \in [L]$ there exists $\xi_t \in \big( \frac{t-1}{L}, \frac{t}{L} \big)$ such that $q_t = r_T(\xi_t)/L$. Therefore,
\begin{align}
\sum_{t=1}^L \Big| q_t - \frac{1}{L} \Big| &= \frac{1}{L} \sum_{t=1}^L |r_T(\xi_t) - 1| \\
&= \sum_{t=1}^L \int_{(t-1)/L}^{t/L} |r_T(\xi_t) - 1| \, du \\
&\ge \sum_{t=1}^L \int_{(t-1)/L}^{t/L} |r_T(u) - 1| \, du - \sum_{t=1}^L \int_{(t-1)/L}^{t/L} |r_T(u) - r_T(\xi_t)| \, du \\
&\ge \int_0^1 |r_T(u) - 1| \, du - \sum_{t=1}^L \int_{(t-1)/L}^{t/L} C |u - \xi_t| \, du \\
&\ge \int_0^1 |r_T(u) - 1| \, du - \sum_{t=1}^L \frac{C}{L^2} = \int_0^1 |r_T(u) - 1| \, du - \frac{C}{L}.
\end{align}
Using the above lower bound in (B.27) gives
\[
\sum_{t=1}^L \Big| p_t - \frac{1}{L} \Big| \ge \int_0^1 |r_T(u) - 1| \, du - L\nu_K - \frac{C}{L}.
\]

B.1.7 Proof of Theorem 3.2.7

The primary arguments here are similar to the initial reasoning in the proof of Theorem 3.1.2, where we arrived at the point that the joint distribution of $(n_g W_1, n_g W_2, \ldots, n_g W_L)$ is a multinomial distribution with $L$ categories, such that category $\ell \in [L]$ occurs with probability $p_\ell$. Next, recall Lemma B.1.1, Part 3, which implies that if, for some $\beta > 0$, the following holds:
\[
\sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big| \ge \frac{32 L^{1/4}}{\sqrt{n_g}} \Big( \frac{1}{\sqrt{\alpha}} \vee \frac{1}{\beta} \Big)^{1/2}, \qquad (B.28)
\]
then the type II error is bounded by $\beta$. On the other hand, from Proposition 3.2.6 we have
\[
\sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big| \ge \int_0^1 |r_T(u) - 1| \, du - L\nu_K - \frac{C}{L}. \qquad (B.29)
\]
Combining equations (B.28) and (B.29), in conjunction with the definition of the conditional dependency power in Definition 3.2.4, completes the proof.

B.1.8 Proof of Theorem 3.2.8

The first step is similar to the initial arguments of the proof of Theorem 3.1.2, where we showed that the joint distribution of $(n_g W_1, n_g W_2, \ldots, n_g W_L)$ is a multinomial distribution with $L$ categories, such that category $\ell \in [L]$ occurs with probability $p_\ell$. We then use the following asymptotic result on Pearson's $\chi^2$ test statistic for multinomial models (see e.g. [107, Theorem 14.3.1]):
\[
U_{n_g,L} \stackrel{(d)}{\Rightarrow} \chi^2_{\lambda, L-1}, \qquad (B.30)
\]
where $\chi^2_{\lambda, L-1}$ stands for the $\chi^2$ distribution with $L-1$ degrees of freedom and non-centrality parameter
\[
\lambda = \sum_{\ell=1}^L n_g L \Big( p_\ell - \frac{1}{L} \Big)^2.
\]
This gives us
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) = \lim_{n_g \to \infty} \mathbb{P}_{Q \sim \chi^2_{\lambda, L-1}}\big( Q \ge \theta^{\rm asym}_{L,\alpha} \big). \qquad (B.31)
\]
Using the Cauchy–Schwarz inequality for $\lambda = \sum_{\ell=1}^L n_g L (p_\ell - 1/L)^2$ yields
\[
\sqrt{\lambda} \ge \sqrt{n_g} \sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big|.
\]
We then recall the result from Proposition 3.2.6 to get
\[
\sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big| \ge \int_0^1 |r_T(u) - 1| \, du - L\nu_K - \frac{C}{L}. \qquad (B.32)
\]
In the next step, combine (B.32) and the lower bound on the conditional dependency power in Theorem 3.2.7 to get $\lambda \ge \widetilde{\lambda}$ with $\widetilde{\lambda} = A^2 L^{1/2}$, where $A$ is given by:
\[
A = \max\left\{ \sqrt{3\log\frac{1}{\beta}} + \left( 3\log\frac{1}{\beta} + 2\sqrt{\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha} \right)^{1/2}, \; 1 \right\}. \qquad (B.33)
\]
Using $\lambda \ge \widetilde{\lambda}$ in (B.31) yields
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge \lim_{n_g \to \infty} \mathbb{P}_{Q \sim \chi^2_{\widetilde{\lambda}, L-1}}\big( Q \ge \theta^{\rm asym}_{L,\alpha} \big). \qquad (B.34)
\]
We then provide the following inequality, borrowed from [24], on the tails of non-central $\chi^2$ random variables.

Lemma B.1.9 ([24], Lemma 8.1). Suppose that the random variable $X$ has a $\chi^2$ distribution with $m$ degrees of freedom and non-centrality parameter $\lambda$. Then for every $t \ge 0$ we have
\begin{align}
\mathbb{P}\big( X \le m + \lambda - 2\sqrt{(m+2\lambda)t} \big) &\le \exp(-t), \\
\mathbb{P}\big( X \ge m + \lambda + 2\sqrt{(m+2\lambda)t} + 2t \big) &\le \exp(-t).
\end{align}

As an immediate consequence of Lemma B.1.9, we can obtain the following upper bound on the $(1-\alpha)$-th quantile of the central $\chi^2$ distribution with $m$ degrees of freedom:
\[
\chi^2_m(1-\alpha) \le m + 2\sqrt{m \log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha}. \qquad (B.35)
\]
By exploiting (B.35) in $\theta^{\rm asym}_{L,\alpha} = \chi^2_{L-1}(1-\alpha)$, we arrive at
\[
\theta^{\rm asym}_{L,\alpha} \le L - 1 + 2\sqrt{(L-1)\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha}. \qquad (B.36)
\]
Using (B.36) in (B.34) brings us to
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge \mathbb{P}_{Q \sim \chi^2_{\widetilde{\lambda}, L-1}}\left( Q \ge L - 1 + 2\sqrt{(L-1)\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha} \right). \qquad (B.37)
\]
We next claim that
\[
2\sqrt{(L-1)\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha} \le A^2 L^{1/2} - 2\sqrt{\big( L - 1 + 2A^2 L^{1/2} \big)\log\frac{1}{\beta}}. \qquad (B.38)
\]
Deploying (B.38) in (B.37) (we provide the proof of claim (B.38) later) yields
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge \mathbb{P}_{Q \sim \chi^2_{\widetilde{\lambda}, L-1}}\left( Q \ge L - 1 + A^2 L^{1/2} - 2\sqrt{\big( L - 1 + 2A^2 L^{1/2} \big)\log\frac{1}{\beta}} \right). \qquad (B.39)
\]
Next, by using the first tail bound of Lemma B.1.9 (for the values $m = L-1$, $\widetilde{\lambda} = A^2 L^{1/2}$, and $t = \log\frac{1}{\beta}$) in (B.39), we obtain
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge 1 - \beta.
\]
This completes the proof. Finally, we are left to prove the claim (B.38). As $L \ge 1$, we have
\[
\widetilde{\theta} := A^2 L^{1/2} - 2\sqrt{\big( L - 1 + 2A^2 L^{1/2} \big)\log\frac{1}{\beta}} \ge \sqrt{L}\left( A^2 - 2\sqrt{(1 + 2A^2)\log\frac{1}{\beta}} \right).
\]
In the next step, by using $A \ge 1$, we get
\begin{align}
\widetilde{\theta} &\ge \sqrt{L}\left( A^2 - 2A\sqrt{3\log\frac{1}{\beta}} \right) \\
&\ge \sqrt{L}\left( A - \sqrt{3\log\frac{1}{\beta}} \right)^2 - 3\sqrt{L}\log\frac{1}{\beta} \\
&\ge \sqrt{L}\left( 2\sqrt{\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha} \right), \qquad (B.40)
\end{align}
where the last inequality follows from the definition of $A$ in (B.33). We then use $L \ge 1$ in (B.40) to arrive at
\[
\widetilde{\theta} \ge 2\sqrt{(L-1)\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha}.
\]
This proves (B.38).

B.1.9 Proof of Theorem 3.6.5

We have $n_g$ groups, each of size $n$. Suppose that each group $G \sim \mathcal{L}(X,Z,Y)$ admits label $\ell$ with probability $p_\ell$. By construction of the PCR test, the rank of a group $G$ is given by
\[
R = 1 + \sum_{j=1}^M \mathbb{I}\{ T(G) \ge T(\widetilde{G}_j) \}.
\]
In particular, the probability of admitting label $1$ is $p_1 = \mathbb{P}(R \le K)$, which by using $KL = M + 1$ can be written as
\[
p_1 = \mathbb{P}\Big( \frac{R}{M+1} \le \frac{1}{L} \Big).
\]
Therefore, with $L = 1/\alpha$ and recalling condition (3.22), we get $p_1 \ge \alpha + \delta$, which implies that
\[
\sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big| \ge \delta. \qquad (B.41)
\]
We now focus on proving the asymptotic result. From the classic Pearson chi-squared test in multinomials we know that
\[
U_{n_g,L} \stackrel{(d)}{\Rightarrow} \chi^2_{L-1}(\lambda),
\]
where $\lambda$ is the non-centrality parameter, given by $\lambda = n_g L \sum_{\ell=1}^L (p_\ell - 1/L)^2$. By an application of the Cauchy–Schwarz inequality, we have
\[
\sqrt{\lambda} \ge \sqrt{n_g} \sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big|.
\]
Combining the above relation with (B.41) yields $\sqrt{\lambda} \ge \sqrt{n_g}\,\delta$. By invoking the lower bound for $\delta$ in (3.23), we arrive at $\lambda \ge \widetilde{\lambda}$ with $\widetilde{\lambda} = A^2 L^{1/2}$, where $A$ is given by
\[
A = \max\left\{ \sqrt{3\log\frac{1}{\beta}} + \left( 3\log\frac{1}{\beta} + 2\sqrt{\log\frac{1}{\alpha}} + 2\log\frac{1}{\alpha} \right)^{1/2}, \; 1 \right\}.
\]
This implies that
\[
\liminf_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \ge \liminf_{n_g \to \infty} \mathbb{P}_{Q \sim \chi^2_{\widetilde{\lambda}, L-1}}\big( Q \ge \theta^{\rm asym}_{L,\alpha} \big).
\]
The rest of the proof follows a similar argument as per (B.34) in Section B.1.8. We then proceed to prove the result for the finite-sample threshold. By recalling the third part of Lemma B.1.1, we have that if
\[
\sum_{\ell=1}^L \Big| p_\ell - \frac{1}{L} \Big| \ge \frac{32 L^{1/4}}{\sqrt{n_g}} \Big( \frac{1}{\sqrt{\alpha}} \vee \frac{1}{\beta} \Big)^{1/2},
\]
then the type II error with the finite-sample threshold is bounded by $\beta$.
Finally, combining (B.41) and (3.24) completes the proof.

B.1.10 Proof of Theorem 3.4.2

Consider a group $G = (X,Z,Y)$ and its $M = KL - 1$ counterfeits $G_i = (\widetilde{X}^{(i)}, Z, Y)$, where $\widetilde{X}^{(i)}$ is sampled from $\widehat{P}_{X|Z}(\cdot|Z)$, for $i \in [M]$. Assume $\widehat{X}$ is also drawn from $\widehat{P}_{X|Z}(\cdot|Z)$, independently of $\widetilde{X}^{(1:M)}$, $X$, and $Y$. From Algorithm 1 we know that the test statistic $U_{n_g,L}$ is a function of $X, Y, Z$ and the $M$ sampled counterfeits $\widetilde{X}^{(1:M)}$. For $t \ge 0$ and for fixed values of $Z, Y$, we let
\[
A_t = \Big\{ (x, \widetilde{x}^{(1)}, \ldots, \widetilde{x}^{(M)}) \in \mathbb{R}^{n \times (M+1)} : U_{n_g,L}(Z, Y, x, \widetilde{x}^{(1)}, \ldots, \widetilde{x}^{(M)}) \ge t \Big\}.
\]
We have
\begin{align}
&\mathbb{P}\big( U_{n_g,L}(Z,Y,X,\widetilde{X}^{(1:M)}) \ge t \,\big|\, Z,Y \big) - \mathbb{P}\big( U_{n_g,L}(Z,Y,\widehat{X},\widetilde{X}^{(1:M)}) \ge t \,\big|\, Z,Y \big) \\
&\stackrel{(a)}{=} \mathbb{P}\big( (X, \widetilde{X}^{(1:M)}) \in A_t \,\big|\, Z,Y \big) - \mathbb{P}\big( (\widehat{X}, \widetilde{X}^{(1:M)}) \in A_t \,\big|\, Z,Y \big) \\
&\stackrel{(b)}{\le} d_{\rm TV}\big( \mathcal{L}(X, \widetilde{X}^{(1:M)}|Z,Y),\, \mathcal{L}(\widehat{X}, \widetilde{X}^{(1:M)}|Z,Y) \big) \\
&\stackrel{(c)}{=} d_{\rm TV}\big( \mathcal{L}(X|Z,Y),\, \mathcal{L}(\widehat{X}|Z,Y) \big) \\
&\stackrel{(d)}{=} d_{\rm TV}\big( (X|Z),\, (\widehat{X}|Z) \big) = d_{\rm TV}\big( P^n_{X|Z}(\cdot|Z),\, \widehat{P}^n_{X|Z}(\cdot|Z) \big), \qquad (B.42)
\end{align}
where (a) comes from the definition of the set $A_t$; (b) is a direct result of the definition of total variation distance; in (c) we used the property that, conditioned on $(Z,Y)$, the random variables $X, \widehat{X}, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}$ are independent; (d) comes from the fact that under the null hypothesis $X \perp\!\!\!\perp Y | Z$, and also $\widehat{X} \perp\!\!\!\perp Y | Z$ by construction of $\widehat{X}$.

If we denote the test statistic constructed via $(X, \widetilde{X}^{(1:M)})$ by $U_{n_g,L}$ and the variant that uses $(\widehat{X}, \widetilde{X}^{(1:M)})$ by $\widetilde{U}_{n_g,L}$, then the above relation implies that
\[
\sup_{t \ge 0} \Big| \mathbb{P}(U_{n_g,L} \ge t \,|\, Z,Y) - \mathbb{P}(\widetilde{U}_{n_g,L} \ge t \,|\, Z,Y) \Big| \le d_{\rm TV}\big( P^n_{X|Z}(\cdot|Z),\, \widehat{P}^n_{X|Z}(\cdot|Z) \big).
\]
Next, by marginalizing over $Z, Y$ and an application of Jensen's inequality (namely $|\mathbb{E}[V]| \le \mathbb{E}[|V|]$ for a random variable $V$), we arrive at
\[
\sup_{t \ge 0} \Big| \mathbb{P}(U_{n_g,L} \ge t) - \mathbb{P}(\widetilde{U}_{n_g,L} \ge t) \Big| \le \mathbb{E}\Big[ d_{\rm TV}\big( P^n_{X|Z}(\cdot|Z),\, \widehat{P}^n_{X|Z}(\cdot|Z) \big) \Big]. \qquad (B.43)
\]
Since $\widetilde{U}_{n_g,L}$ is constructed from $\widehat{X}, \widetilde{X}^{(1:M)}$, which are drawn i.i.d. from $\widehat{P}_{X|Z}$, by using Theorem 3.1.2 we have
\[
\mathbb{P}\big( \widetilde{U}_{n_g,L} \ge \theta^{\rm finite}_{L,\alpha} \big) \le \alpha, \qquad \limsup_{n \to \infty} \mathbb{P}\big( \widetilde{U}_{n_g,L} \ge \theta^{\rm asym}_{L,\alpha} \big) \le \alpha.
\]
The above bounds together with (B.43) complete the proof of the claim.
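Step (b) of (B.42) rests on the variational characterization of total variation, $\sup_A |P(A) - Q(A)| = d_{\rm TV}(P, Q)$. On a small finite space this can be checked exactly by enumerating all events (the two distributions below are chosen here purely for illustration):

```python
from fractions import Fraction
from itertools import chain, combinations

# sup over all events A of |P(A) - Q(A)| should equal (1/2) * sum_x |P(x) - Q(x)|.
space = [0, 1, 2, 3]
P = dict(zip(space, [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]))
Q = dict(zip(space, [Fraction(1, 4), Fraction(1, 4), Fraction(1, 4), Fraction(1, 4)]))

events = chain.from_iterable(combinations(space, r) for r in range(len(space) + 1))
sup_gap = max(abs(sum(P[x] for x in A) - sum(Q[x] for x in A)) for A in events)
tv = Fraction(1, 2) * sum(abs(P[x] - Q[x]) for x in space)
assert sup_gap == tv == Fraction(1, 4)
```

The supremum is attained by the event on which $P$ dominates $Q$ (here $A = \{0\}$), mirroring the way the set $A_t$ is handled in the proof.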
B.1.11 Proof of Theorem 3.4.1

Consider a group $G = (X,Z,Y)$ and its $M = KL-1$ counterfeits $G_i = (\widetilde{X}^{(i)}, Z, Y)$, where $\widetilde{X}^{(j)}$ is sampled from $\widehat{P}_{X|Z}(\cdot|Z)$, for $j \in [M]$. Assume $\widehat{X}$ is also drawn from $\widehat{P}_{X|Z}(\cdot|Z)$, independently of $\widetilde{X}^{(1:M)}$, $X$, and $Y$. We fix the values of $Z, Y$, and for $\ell \in [L]$ define
\[
A_\ell = \Big\{ (x, \widetilde{x}^{(1)}, \ldots, \widetilde{x}^{(M)}) : (\ell - 1)K \le \sum_{j=1}^M \mathbb{I}\big\{ T((x,Z,Y)) \ge T((\widetilde{x}^{(j)},Z,Y)) \big\} \le \ell K - 1 \Big\}.
\]
We have
\begin{align}
\mathbb{P}(G \text{ has label } \ell \,|\, Z,Y) - \frac{1}{L} &\stackrel{(a)}{=} \mathbb{P}\big( (X, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}) \in A_\ell \,\big|\, Z,Y \big) - \frac{1}{L} \\
&\stackrel{(b)}{=} \mathbb{P}\big( (X, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}) \in A_\ell \,\big|\, Z,Y \big) - \mathbb{P}\big( (\widehat{X}, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}) \in A_\ell \,\big|\, Z,Y \big) \\
&\stackrel{(c)}{\le} d_{\rm TV}\big( ((X, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)})|Z,Y),\, ((\widehat{X}, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)})|Z,Y) \big) \\
&\stackrel{(d)}{=} d_{\rm TV}\big( (X|Z,Y),\, (\widehat{X}|Z,Y) \big) \\
&\stackrel{(e)}{=} d_{\rm TV}\big( (X|Z),\, (\widehat{X}|Z) \big) = d_{\rm TV}\big( P_{X|Z}(\cdot|Z),\, \widehat{P}_{X|Z}(\cdot|Z) \big), \qquad (B.44)
\end{align}
where (a) comes from the process of labeling the data points; in (b) we used the fact that, conditioned on $Z, Y$, the random variables $\widehat{X}, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}$ are i.i.d., hence $\sum_{j=1}^M \mathbb{I}\{ T((\widehat{X},Z,Y)) \ge T((\widetilde{X}^{(j)},Z,Y)) \}$ takes the values $\{0, 1, \ldots, M\}$ uniformly at random; (c) is a direct result of the total variation definition; in (d) we used the property that, conditioned on $(Z,Y)$, the random variables $X, \widehat{X}, \widetilde{X}^{(1)}, \ldots, \widetilde{X}^{(M)}$ are independent; (e) comes from the fact that under the null hypothesis $X \perp\!\!\!\perp Y|Z$, and also $\widehat{X} \perp\!\!\!\perp Y|Z$ by construction of $\widehat{X}$.

In the current scenario, where the counterfeits are drawn from the approximate law $\widehat{P}_{X|Z}(\cdot|Z)$, define $q_\ell$ to be the probability that, under the null hypothesis, a regular group $G = (X,Z,Y)$ has label $\ell$. Then, by marginalizing out $Z$, we can upper bound the deviation of $q_\ell$ from $1/L$:
\begin{align}
\Big| q_\ell - \frac{1}{L} \Big| &= \Big| \mathbb{P}(G \text{ has label } \ell) - \frac{1}{L} \Big| \\
&= \Big| \int \mathbb{P}(G \text{ has label } \ell \,|\, Z,Y) \, d\mathbb{P}_{ZY} - \frac{1}{L} \Big| \\
&= \Big| \int \Big( \mathbb{P}(G \text{ has label } \ell \,|\, Z,Y) - \frac{1}{L} \Big) d\mathbb{P}_{ZY} \Big| \\
&\le \int \Big| \mathbb{P}(G \text{ has label } \ell \,|\, Z,Y) - \frac{1}{L} \Big| \, d\mathbb{P}_{ZY} \\
&\stackrel{(a)}{\le} \int d_{\rm TV}\big( P_{X|Z}(\cdot|Z),\, \widehat{P}_{X|Z}(\cdot|Z) \big) \, d\mathbb{P}_{ZY} = \mathbb{E}_Z\Big[ d_{\rm TV}\big( P_{X|Z}(\cdot|Z),\, \widehat{P}_{X|Z}(\cdot|Z) \big) \Big] \le \delta,
\end{align}
where (a) comes from (B.44). In summary, we get
\[
\Big| q_\ell - \frac{1}{L} \Big| \le \delta, \qquad \text{for } \ell = 1, 2, \ldots, L. \qquad (B.45)
\]
Recall $W_\ell$ as the number of data points with label $\ell$. Clearly, $(W_1, \ldots, W_L) \sim \text{multi}(n_g; q_1, \ldots, q_L)$.
We next use a result on the size of the truncated $\chi^2$ test from [5, Theorem 3.2], which implies the first inequality in the chain of inequalities below:
\begin{align}
\alpha &\ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2 - W_\ell}{\max\{q_\ell, \frac{1}{L}\}} \ge n_g \sqrt{\frac{2}{\alpha} \sum_{\ell=1}^L \Big( \frac{q_\ell}{\max\{q_\ell, 1/L\}} \Big)^2 } \right) \\
&\ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2 - W_\ell}{\max\{q_\ell, \frac{1}{L}\}} \ge n_g \sqrt{\frac{2}{\alpha} L} \right) \\
&= \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge \sum_{\ell=1}^L \frac{W_\ell}{\max\{q_\ell, \frac{1}{L}\}} + n_g \sqrt{\frac{2}{\alpha} L} \right) \\
&\ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge L \sum_{\ell=1}^L W_\ell + n_g \sqrt{\frac{2}{\alpha} L} \right) \\
&= \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge n_g L + n_g \sqrt{\frac{2}{\alpha} L} \right) \\
&\stackrel{(a)}{\ge} \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{\frac{1}{L} + \delta} \ge n_g L + n_g \sqrt{\frac{2}{\alpha} L} \right) \\
&\ge \mathbb{P}\left( \frac{L}{n_g(1 + L\delta)} \sum_{\ell=1}^L (W_\ell - n_g q_\ell)^2 \ge L + \sqrt{\frac{2}{\alpha} L} \right) \\
&\ge \mathbb{P}\left( U_{n_g,L}(\delta) \ge L + \sqrt{\frac{2}{\alpha} L} \right),
\end{align}
where (a) comes from (B.45) and the last inequality follows from the definition of $U_{n_g,L}(\delta)$. This concludes the proof of claim (3.14).

For the claim (3.15), we use the following asymptotic result on Pearson's $\chi^2$ test statistic for multinomial models (see e.g. [107, Theorem 14.3.1]):
\[
\lim_{n_g \to \infty} \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{n_g q_\ell} \ge \theta^{\rm asym}_{L,\alpha} \right) \le \alpha, \qquad (B.46)
\]
where $\theta^{\rm asym}_{L,\alpha}$ is the $\alpha$-th upper quantile of a chi-squared distribution with $L-1$ degrees of freedom. By definition of $U_{n_g,L}(\delta)$, we have
\[
\mathbb{P}\big( U_{n_g,L}(\delta) \ge \theta^{\rm asym}_{L,\alpha} \big) \le \mathbb{P}\left( \frac{L}{n_g(1+L\delta)} \sum_{\ell=1}^L (W_\ell - n_g q_\ell)^2 \ge \theta^{\rm asym}_{L,\alpha} \right) \le \mathbb{P}\left( \sum_{\ell=1}^L \frac{(W_\ell - n_g q_\ell)^2}{n_g q_\ell} \ge \theta^{\rm asym}_{L,\alpha} \right),
\]
where in the last inequality we used (B.45). Finally, plug the above relation into (B.46) to get
\[
\lim_{n_g \to \infty} \mathbb{P}\big( U_{n_g,L}(\delta) \ge \theta^{\rm asym}_{L,\alpha} \big) \le \alpha.
\]
This concludes the proof.

Appendix C

Supplement to Chapter 4

C.1 Proofs of theorems and lemmas

C.1.1 Proof of Lemma 4.1.2

Using (4.1) for $f(t) = \frac{1}{2}|t-1|$, we get
\begin{align}
\mathbb{E}\big[ D_{\rm TV}(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big] &= \mathbb{E}\left[ \frac{\widehat{\eta}(X)}{2} \Big| \frac{\eta(X)}{\widehat{\eta}(X)} - 1 \Big| + \frac{1 - \widehat{\eta}(X)}{2} \Big| \frac{1-\eta(X)}{1-\widehat{\eta}(X)} - 1 \Big| \right] \\
&= \mathbb{E}\big[ |\widehat{\eta}(X) - \eta(X)| \big].
\end{align}
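The identity just derived holds pointwise in $X$: for any pair $(\eta, \widehat{\eta})$, the $f$-divergence formula with $f(t) = |t-1|/2$ collapses to $|\widehat{\eta} - \eta|$. A small exact check over a rational grid (the grid is my choice, for illustration):

```python
from fractions import Fraction

# D_f(Bern(a) || Bern(b)) with f(t) = |t-1|/2 should equal |a - b| for all a, b.
def f(t):
    return abs(t - 1) / 2

for a in (Fraction(k, 10) for k in range(1, 10)):
    for b in (Fraction(k, 10) for k in range(1, 10)):
        df = b * f(a / b) + (1 - b) * f((1 - a) / (1 - b))
        assert df == abs(a - b), (a, b, df)
```

Exact rational arithmetic avoids any floating-point tolerance in the comparison.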
By a similar argument for $f(t) = t\log(t)$, we get
\begin{align}
\mathbb{E}\big[ D_{\rm KL}(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big] &= \mathbb{E}\left[ \widehat{\eta}(X) \frac{\eta(X)}{\widehat{\eta}(X)} \log\frac{\eta(X)}{\widehat{\eta}(X)} + (1-\widehat{\eta}(X)) \frac{1-\eta(X)}{1-\widehat{\eta}(X)} \log\frac{1-\eta(X)}{1-\widehat{\eta}(X)} \right] \\
&= \mathbb{E}\left[ \eta(X)\log\frac{\eta(X)}{\widehat{\eta}(X)} + (1-\eta(X))\log\frac{1-\eta(X)}{1-\widehat{\eta}(X)} \right].
\end{align}
We then expand the expressions inside the logarithms to get
\begin{align}
\mathbb{E}\big[ D_{\rm KL}(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big] &= -\mathbb{E}\big[ \eta(X)\log\widehat{\eta}(X) + (1-\eta(X))\log(1-\widehat{\eta}(X)) \big] \\
&\quad + \mathbb{E}\big[ \eta(X)\log\eta(X) + (1-\eta(X))\log(1-\eta(X)) \big] \\
&= \text{CE}(\widehat{\eta}) - \text{CE}(\eta).
\end{align}
Next, we prove the result for the Hellinger distance. By using (4.1) for $f(t) = (\sqrt{t}-1)^2$, we get
\begin{align}
\mathbb{E}\big[ D_{\rm H}(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big] &= \mathbb{E}\left[ \widehat{\eta}(X)\left( \sqrt{\frac{\eta(X)}{\widehat{\eta}(X)}} - 1 \right)^2 + (1-\widehat{\eta}(X))\left( \sqrt{\frac{1-\eta(X)}{1-\widehat{\eta}(X)}} - 1 \right)^2 \right] \\
&= \mathbb{E}\left[ \Big( \sqrt{\eta(X)} - \sqrt{\widehat{\eta}(X)} \Big)^2 + \Big( \sqrt{1-\eta(X)} - \sqrt{1-\widehat{\eta}(X)} \Big)^2 \right].
\end{align}

C.1.2 Proof of Proposition 4.2.1

In order to compute $\mathbb{E}[D_f(\mathcal{L}(W|X) \,\|\, \text{Unif}([0,1]))]$, we first need to characterize the conditional distribution $\mathcal{L}(W|X)$. From the mechanism used to construct $w$, it is easy to obtain
\begin{align}
p_{W|X}(w|x) &= \mathbb{P}(y = +1|x)\, p_{W|Y,X}(w \,|\, y = +1, x) + \mathbb{P}(y = 0|x)\, p_{W|Y,X}(w \,|\, y = 0, x) \\
&= \frac{\eta(x)}{\widehat{\eta}(x)} \mathbb{I}(w \le \widehat{\eta}(x)) + \frac{1-\eta(x)}{1-\widehat{\eta}(x)} \mathbb{I}(w \ge \widehat{\eta}(x)). \qquad (C.1)
\end{align}
Using (C.1) in the $f$-divergence Definition 4.1.1 yields
\begin{align}
\mathbb{E}\big[ D_f(\mathcal{L}(W|X) \,\|\, \text{Unif}([0,1])) \big] &= \mathbb{E}\left[ \int_0^1 f\big( p_{W|X}(w|X) \big) \, dw \right] \\
&= \mathbb{E}\left[ \int_0^{\widehat{\eta}(X)} f\Big( \frac{\eta(X)}{\widehat{\eta}(X)} \Big) dt + \int_{\widehat{\eta}(X)}^1 f\Big( \frac{1-\eta(X)}{1-\widehat{\eta}(X)} \Big) dt \right] \\
&= \mathbb{E}\left[ \widehat{\eta}(X) f\Big( \frac{\eta(X)}{\widehat{\eta}(X)} \Big) + (1-\widehat{\eta}(X)) f\Big( \frac{1-\eta(X)}{1-\widehat{\eta}(X)} \Big) \right] \\
&= \mathbb{E}\big[ D_f(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big].
\end{align}
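The integral computation in the proof above can be spot-checked numerically at a single $x$. The sketch below (values of $\eta, \widehat{\eta}$ chosen here for illustration) verifies that, with $f(t) = t\log t$, the $f$-divergence of $\mathcal{L}(W|X=x)$ from $\text{Unif}[0,1]$ equals $D_{\rm KL}(\text{Bern}(\eta)\,\|\,\text{Bern}(\widehat{\eta}))$:

```python
import math

eta, eta_hat = 0.7, 0.4

def f(t):
    return t * math.log(t)

def p_w(w):
    # conditional density of W given X = x, from (C.1)
    return eta / eta_hat if w <= eta_hat else (1 - eta) / (1 - eta_hat)

# midpoint-rule quadrature of f(p_{W|X}(w)) over [0, 1]
n = 200_000
h = 1.0 / n
lhs = sum(f(p_w((k + 0.5) * h)) * h for k in range(n))
# closed-form Bernoulli KL divergence
rhs = eta * math.log(eta / eta_hat) + (1 - eta) * math.log((1 - eta) / (1 - eta_hat))
assert abs(lhs - rhs) < 1e-6
```

Because the integrand is piecewise constant with its only jump at $w = \widehat{\eta}$, the midpoint rule is essentially exact here.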
Also it is easy to get that forL≥ 60 the following holds (we prove this later): L− 1+2 r (L− 1)log 1 α +2log 1 α ≤ L+ r 2L α . (C.2) 310 Putting all together and by recalling the definition of Φ asym n,L,α,τ we get Φ asym n,L,α,τ =I(U asym τ (V n,L )≥ χ 2 L− 1 (1− α )) ≥ I(U finite τ (V n,L )≥ χ 2 L− 1 (1− α )) ≥ I U finite τ (V n,L )≥ L+ r 2L α ! =Φ asym n,L,α,τ . We only need to prove (C.2). By upper boundingL− 1 withL and multiplying both sides of (C.2) by 0.5 p α/L the relation reduces to showing r α log 1 α + r α L log 1 α ≤ 1 √ 2 . By simple algebraic calculation, it is easy to see thatα log1/α ≤ 1/e and √ α log1/α ≤ 2/e and so r α log 1 α + r α L log 1 α ≤ 1 √ e + 2 e √ L . Next note that forL≥ 55 we have 1 √ e + 2 e √ L ≤ 1 √ 2 , which completes the proof. C.1.4 ProofofProposition4.2.4 It is easy to observe that the joint distribution of(V 1 ,...,V L ) is a multinomial distribution with probabil- ities(p 1 ,...,p L ) wherep ℓ denotes the probability of sample(X,W) admitting labelℓ forℓ∈[L]. As per Algorithm 3, let f W 1 ,..., f W M be iid random variables with uniform distributionUnif[0,1]. To lighten the 311 notation, we use the shorthandsT = T(X,W), andT j = T(X, f W j ) forj ∈ [M]. In this case, the rank value is given by R=1+ M X j=1 I T ≥ T j . We get P((X,W) admits labelℓ)= Kℓ− 1 X j=K(ℓ− 1) E[P(R=j+1|X)] = Kℓ− 1 X j=K(ℓ− 1) E P(T is exactly larger thanj ofT j∈[M] |X) . We know that conditioned onX, random variablesT ,T j∈[M] are independent. This gives us P((X,W) admits labelℓ)= Kℓ− 1 X j=K(ℓ− 1) E Z P( exactlyj ofT j∈[M] are smaller thant|X)dP T|X (t|X) = Kℓ− 1 X j=K(ℓ− 1) E Z M j F c (t;X) j (1− F c (t;X)) M− j ∂F o (t;X) ∂t dt . (C.3) The last relation follows the iid property of random variables T j∈[M] conditioned on X along with the definitions of F c (t;x),F o (t;x) as per Assumption 4.2.3. 
By adopting the change of variable $u = F_c(t;X)$ in the integral (C.3), we arrive at
\[
\mathbb{P}\big( (X,W) \text{ admits label } \ell \big) = \sum_{j=K(\ell-1)}^{K\ell-1} \mathbb{E}\left[ \int_0^1 \binom{M}{j} u^j (1-u)^{M-j} \frac{\partial F_o(F^{-1}_c(u;X);X)}{\partial u} \, du \right].
\]
We next recall the function $\psi(u;X) = \frac{\partial F_o(F^{-1}_c(u;X);X)}{\partial u}$ from Assumption 4.2.3, which allows us to write the above probability as
\[
p_\ell = \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} \mathbb{E}\left[ \int_0^1 u^j (1-u)^{M-j} \psi(u;X) \, du \right], \qquad \forall \ell \in [L]. \qquad (C.4)
\]
For $\ell \in [L]$, we define the polynomial $\beta_\ell : [0,1] \to \mathbb{R}$ as
\[
\beta_\ell(u) = \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} u^j (1-u)^{M-j}.
\]
By using the above in (C.4), we obtain
\[
p_\ell = \mathbb{E}\left[ \int_0^1 \beta_\ell(u) \psi(u;X) \, du \right], \qquad \forall \ell \in [L], \qquad (C.5)
\]
and hence
\[
\sum_{\ell=1}^L \frac{f(p_\ell L)}{L} = \frac{1}{L} \sum_{\ell=1}^L f\left( L\, \mathbb{E}\Big[ \int_0^1 \beta_\ell(u)\psi(u;X) \, du \Big] \right). \qquad (C.6)
\]
We next upper bound the right-hand side of the above equation. Consider the following probability density function over the interval $[0,1]$:
\[
\varphi_\ell(u) = \frac{\beta_\ell(u)}{\int_0^1 \beta_\ell(v) \, dv}, \qquad \forall u \in [0,1]. \qquad (C.7)
\]
By an application of Jensen's inequality and the fact that $f$ is a convex function, we have
\[
f\Big( \mathbb{E}_X\big[ \mathbb{E}_{u \sim \varphi_\ell}[\psi(u;X)] \big] \Big) \le \mathbb{E}_X\Big[ \mathbb{E}_{u \sim \varphi_\ell}\big[ f(\psi(u;X)) \big] \Big]. \qquad (C.8)
\]
Then, by rewriting the expectation over $u$ in terms of the density function $\varphi_\ell$, we get
\[
f\left( \mathbb{E}_X\Big[ \int_0^1 \varphi_\ell(u)\psi(u;X) \, du \Big] \right) \le \mathbb{E}_X\left[ \int_0^1 \varphi_\ell(u) f(\psi(u;X)) \, du \right]. \qquad (C.9)
\]
Plugging (C.7) into (C.9) yields
\[
f\left( \mathbb{E}\left[ \frac{\int_0^1 \beta_\ell(u)\psi(u;X)\,du}{\int_0^1 \beta_\ell(u)\,du} \right] \right) \le \mathbb{E}\left[ \frac{\int_0^1 \beta_\ell(u) f(\psi(u;X))\,du}{\int_0^1 \beta_\ell(u)\,du} \right]. \qquad (C.10)
\]
In addition, from the definition of $\beta_\ell(u)$ we have
\[
\int_0^1 \beta_\ell(u)\,du = \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} \int_0^1 u^j (1-u)^{M-j}\,du = \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} B(j+1, M-j+1) = \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} \frac{j!\,(M-j)!}{(M+1)!} = \frac{1}{L}, \qquad (C.11)
\]
where $B(a,b)$ is the Beta function. Using (C.11) in (C.10) brings us to
\[
f\left( L\,\mathbb{E}\Big[ \int_0^1 \beta_\ell(u)\psi(u;X)\,du \Big] \right) \le L\,\mathbb{E}\left[ \int_0^1 \beta_\ell(u) f(\psi(u;X))\,du \right]. \qquad (C.12)
\]
In the next step, combining (C.12) with (C.6) gives us
\[
\sum_{\ell=1}^L \frac{f(p_\ell L)}{L} \le \sum_{\ell=1}^L \mathbb{E}\left[ \int_0^1 \beta_\ell(u) f(\psi(u;X))\,du \right] = \mathbb{E}\left[ \int_0^1 f(\psi(u;X))\,du \right], \qquad (C.13)
\]
where in the last relation we used the identity $\sum_{\ell=1}^L \beta_\ell(u) = 1$, because
\[
\sum_{\ell=1}^L \beta_\ell(u) = \sum_{\ell=1}^L \sum_{j=K(\ell-1)}^{K\ell-1} \binom{M}{j} u^j(1-u)^{M-j} = \sum_{j=0}^M \binom{M}{j} u^j (1-u)^{M-j} = (u + 1 - u)^M = 1.
\]
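Both facts about the polynomials $\beta_\ell$ established above — the normalization $\int_0^1 \beta_\ell = 1/L$ in (C.11) and the partition of unity $\sum_\ell \beta_\ell(u) = 1$ — can be verified exactly for small $K, L$ (values chosen here for illustration):

```python
from fractions import Fraction
from math import comb, factorial

K, L = 3, 4                 # illustrative sizes; M = K*L - 1
M = K * L - 1

def beta_ell(ell, u):
    """beta_ell(u) = sum_{j=K(ell-1)}^{K*ell-1} C(M,j) u^j (1-u)^(M-j)."""
    return sum(comb(M, j) * u**j * (1 - u)**(M - j)
               for j in range(K * (ell - 1), K * ell))

# partition of unity, checked exactly at rational points
for u in (Fraction(k, 7) for k in range(8)):
    assert sum(beta_ell(ell, u) for ell in range(1, L + 1)) == 1

# the integral identity (C.11): each block integrates to 1/L
for ell in range(1, L + 1):
    integral = sum(Fraction(comb(M, j) * factorial(j) * factorial(M - j),
                            factorial(M + 1))
                   for j in range(K * (ell - 1), K * ell))
    assert integral == Fraction(1, L)
```

Each Beta integral contributes $1/(M+1)$, so a block of $K$ of them yields $K/(M+1) = 1/L$, exactly as in (C.11).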
We next use the definition of $\psi(u;x)$ in (C.13) to obtain
\[
\mathbb{E}\left[ \int_0^1 f(\psi(u;X))\,du \right] = \mathbb{E}\left[ \int_0^1 f\Big( \frac{\partial F_o(F^{-1}_c(u;X);X)}{\partial u} \Big)\,du \right]. \qquad (C.14)
\]
In the next step, by using the density functions $f_o(t;x), f_c(t;x)$ as per Assumption 4.2.3 in (C.14), we arrive at
\begin{align}
\mathbb{E}\left[ \int_0^1 f(\psi(u;X))\,du \right] &= \mathbb{E}\left[ \int_0^1 f\Big( \frac{f_o(F^{-1}_c(u;X);X)}{f_c(F^{-1}_c(u;X);X)} \Big)\,du \right] \\
&= \mathbb{E}\left[ \int f\Big( \frac{f_o(t;X)}{f_c(t;X)} \Big) f_c(t;X)\,dt \right] \\
&= \mathbb{E}\Big[ D_f\big( \mathcal{L}(T(X,W)|X) \,\big\|\, \mathcal{L}(T(X,\widetilde{W})|X) \big) \Big], \qquad (C.15)
\end{align}
where the last relation comes from Assumption 4.2.3 and the definition of the $f$-divergence. In the next step, by combining (C.15) and (C.13) we get
\[
\sum_{\ell=1}^L \frac{f(p_\ell L)}{L} \le \mathbb{E}\Big[ D_f\big( \mathcal{L}(T(X,W)|X) \,\big\|\, \mathcal{L}(T(X,\widetilde{W})|X) \big) \Big]. \qquad (C.16)
\]
Further, by an application of the data processing inequality, we have
\[
D_f\big( \mathcal{L}(T(X,W)|X) \,\big\|\, \mathcal{L}(T(X,\widetilde{W})|X) \big) \le D_f\big( \mathcal{L}(W|X) \,\big\|\, \mathcal{L}(\widetilde{W}|X) \big). \qquad (C.17)
\]
Finally, combining (C.17) and (C.16) gives us
\[
\sum_{\ell=1}^L \frac{f(p_\ell L)}{L} \le \mathbb{E}\Big[ D_f\big( \mathcal{L}(T(X,W)|X) \,\big\|\, \mathcal{L}(T(X,\widetilde{W})|X) \big) \Big] \le \mathbb{E}\Big[ D_f\big( \mathcal{L}(W|X) \,\big\|\, \mathcal{L}(\widetilde{W}|X) \big) \Big].
\]

C.1.5 Proof of Theorem 4.2.6

We know that $V_{n,L}$ has a multinomial distribution. Let $q_\ell$ denote the probability of occurrence of category $\ell \in [L]$. Invoking the result of [5, Theorem 2] for multinomial hypothesis testing with truncated chi-squared test statistics, we get
\[
\alpha \ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2 - V_\ell}{\max\{q_\ell, \frac{1}{L}\}} \ge n\sqrt{ \frac{2}{\alpha} \sum_{\ell=1}^L \Big( \frac{q_\ell}{\max\{q_\ell, 1/L\}} \Big)^2 } \right). \qquad (C.18)
\]
In the next step, we use $\max(q_\ell, 1/L) \ge 1/L$ along with $\sum_{\ell=1}^L q_\ell = 1$ to get
\begin{align}
\alpha &\ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2 - V_\ell}{\max\{q_\ell, \frac{1}{L}\}} \ge n\sqrt{\frac{2}{\alpha}L} \right) \\
&= \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge \sum_{\ell=1}^L \frac{V_\ell}{\max\{q_\ell, \frac{1}{L}\}} + n\sqrt{\frac{2}{\alpha}L} \right) \\
&\ge \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge L\sum_{\ell=1}^L V_\ell + n\sqrt{\frac{2}{\alpha}L} \right) \\
&= \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{\max\{q_\ell, \frac{1}{L}\}} \ge nL + n\sqrt{\frac{2}{\alpha}L} \right) \\
&\ge \mathbb{P}\left( \frac{1}{n}\sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{q_\ell + \frac{1}{L}} \ge L + \sqrt{\frac{2}{\alpha}L} \right). \qquad (C.19)
\end{align}
By deploying Theorem 4.2.5, under the null hypothesis (4.2) we have
\[
\frac{1}{L}\sum_{\ell=1}^L f(Lq_\ell) \le \mathbb{E}\big[ D_f(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big] \le \tau.
\]
(C.20)
This implies that $(q_1, \ldots, q_L)$ is a feasible point for the optimization problem in the definition of $U^{\rm finite}_\tau(V_{n,L})$ in (4.4), and so
\[
\frac{1}{n}\sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{q_\ell + \frac{1}{L}} \ge U^{\rm finite}_\tau(V_{n,L}). \qquad (C.21)
\]
Plugging (C.21) into (C.19) yields
\[
\mathbb{P}\left( U^{\rm finite}_\tau(V_{n,L}) \ge L + \sqrt{\frac{2}{\alpha}L} \right) \le \alpha.
\]
This completes the proof, by the definition of the decision rule $\Phi^{\rm finite}_{n,L,\tau}$.

We next proceed to the proof of the asymptotic result. Following a similar argument shows that $(q_1, \ldots, q_L)$ is also a feasible solution for the optimization problem in the definition of $U^{\rm asym}_\tau(V_{n,L})$, which implies that
\[
\sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{n q_\ell} \ge U^{\rm asym}_\tau(V_{n,L}). \qquad (C.22)
\]
In the next step, we use the following asymptotic result on Pearson's chi-squared test statistic (see e.g. [104, Theorem 14.3.1]):
\[
\sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{nq_\ell} \stackrel{(d)}{\Rightarrow} \chi^2_{L-1}, \quad \text{as } n \to \infty, \qquad (C.23)
\]
where $\chi^2_{L-1}$ denotes the chi-squared distribution with $L-1$ degrees of freedom. As a direct result of (C.23), we have
\[
\limsup_{n\to\infty} \mathbb{P}\left( \sum_{\ell=1}^L \frac{(V_\ell - nq_\ell)^2}{nq_\ell} \ge \chi^2_{L-1}(1-\alpha) \right) = \alpha.
\]
Combining (C.22) with the above relation yields
\[
\limsup_{n\to\infty} \mathbb{P}\big( U^{\rm asym}_\tau(V_{n,L}) \ge \chi^2_{L-1}(1-\alpha) \big) \le \alpha.
\]
The claim then follows simply from the definition of the decision rule $\Phi^{\rm asym}_{n,L,\tau}$.

C.1.6 Proof of Proposition 4.3.1

Let $\widetilde{X} \sim P_X$ and $\widetilde{W} \sim \text{Unif}([0,1])$; then we have
\begin{align}
D_f\big( \mathcal{L}(X,W) \,\|\, P_X \times \text{Unif}([0,1]) \big) &= D_f\big( \mathcal{L}(X,W) \,\|\, \mathcal{L}(\widetilde{X}, \widetilde{W}) \big) \\
&= \int p_{\widetilde{X},\widetilde{W}}(x,w)\, f\left( \frac{p_{X,W}(x,w)}{p_{\widetilde{X},\widetilde{W}}(x,w)} \right) dx\,dw \\
&= \int p_{\widetilde{X}}(x)\, p_{\widetilde{W}|\widetilde{X}}(w|x)\, f\left( \frac{p_X(x)\, p_{W|X}(w|x)}{p_{\widetilde{X}}(x)\, p_{\widetilde{W}|\widetilde{X}}(w|x)} \right) dx\,dw.
\end{align}
Since $p_{\widetilde{X}} = p_X$ and $p_{\widetilde{W}|\widetilde{X}}(w|x) = 1$ (uniform distribution), we get
\[
D_f\big( \mathcal{L}(X,W) \,\|\, \mathcal{L}(\widetilde{X},\widetilde{W}) \big) = \mathbb{E}\left[ \int_0^1 f\big( p_{W|X}(w|X) \big)\,dw \right]. \qquad (C.24)
\]
Also, from the construction of $w$, it is easy to see that
\[
p_{W|X}(w|x) = \mathbb{P}(y=+1|x)\,p_{W|Y,X}(w|y=+1,x) + \mathbb{P}(y=0|x)\,p_{W|Y,X}(w|y=0,x) = \frac{\eta(x)}{\widehat{\eta}(x)}\mathbb{I}(w \le \widehat{\eta}(x)) + \frac{1-\eta(x)}{1-\widehat{\eta}(x)}\mathbb{I}(w \ge \widehat{\eta}(x)).
\]
(C.25)
Plugging (C.25) into (C.24) brings us to
\begin{align}
D_f\big( \mathcal{L}(X,W) \,\|\, \mathcal{L}(\widetilde{X},\widetilde{W}) \big) &= \mathbb{E}\left[ \int_0^{\widehat{\eta}(X)} f\Big( \frac{\eta(X)}{\widehat{\eta}(X)} \Big)dw + \int_{\widehat{\eta}(X)}^1 f\Big( \frac{1-\eta(X)}{1-\widehat{\eta}(X)} \Big)dw \right] \\
&= \mathbb{E}\left[ \widehat{\eta}(X) f\Big( \frac{\eta(X)}{\widehat{\eta}(X)} \Big) + (1-\widehat{\eta}(X)) f\Big( \frac{1-\eta(X)}{1-\widehat{\eta}(X)} \Big) \right] \\
&= \mathbb{E}\big[ D_f(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X))) \big],
\end{align}
which completes the proof.

C.1.7 Proof of Proposition 4.2.7

We adopt the shorthands $\tau_1$ for $\tau^{\rm finite}_{n,L,\alpha}$ and $\tau_0$ for $\mathbb{E}[D_f(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X)))]$. By definition, $\tau_1$ is a function of $V_{n,L}$ and is therefore random. We only prove the result for the finite-sample statistic; the claim for the asymptotic statistic follows from the same argument. First, from the optimization problem used for $U^{\rm finite}_\tau(V_{n,L})$ in (4.4), it is easy to observe that, for fixed values of $V_{n,L}$, the statistic $U^{\rm finite}_\tau(V_{n,L})$ is non-increasing in $\tau$. Therefore,
\begin{align}
\mathbb{P}\big( \mathbb{E}[D_f(\text{Bern}(\eta(X)) \,\|\, \text{Bern}(\widehat{\eta}(X)))] \ge \tau_1 \big) &= \mathbb{P}(\tau_0 \ge \tau_1) \\
&\ge 1 - \mathbb{P}(\tau_1 \ge \tau_0) \\
&\ge 1 - \mathbb{P}\big( U^{\rm finite}_{\tau_0}(V_{n,L}) \ge U^{\rm finite}_{\tau_1}(V_{n,L}) \big),
\end{align}
where the last step uses the monotonicity in $\tau$: on the event $\{\tau_1 \ge \tau_0\}$ we have $U^{\rm finite}_{\tau_0}(V_{n,L}) \ge U^{\rm finite}_{\tau_1}(V_{n,L})$. Second, from the definition of $\tau_1$ we have $U^{\rm finite}_{\tau_1}(V_{n,L}) \ge L + \sqrt{2L/\alpha}$. Plugging this into the above relation yields
\[
\mathbb{P}\big( \mathbb{E}_x[D_f(\text{Bern}(\eta(x)) \,\|\, \text{Bern}(\widehat{\eta}(x)))] \ge \tau_1 \big) \ge 1 - \mathbb{P}\big( U^{\rm finite}_{\tau_0}(V_{n,L}) \ge L + \sqrt{2L/\alpha} \big).
\]
We next invoke the result of Theorem 4.2.6 on the size of the test statistic $U^{\rm finite}_\tau(V_{n,L})$: when $\tau \ge \tau_0$, the null hypothesis holds, and so
\[
\mathbb{P}\big( U^{\rm finite}_\tau(V_{n,L}) \ge L + \sqrt{2L/\alpha} \big) \le \alpha.
\]
This completes the proof.

C.1.8 Proof of Proposition 4.2.8

The first inequality in Proposition 4.2.8 is a direct result of Jensen's inequality applied to the convex function $f$. The last equation is also shown in the proof of Proposition 4.2.4; for the complete proof we refer to the chain of relations starting at (C.14). In this section, we focus on proving the first equation, for which we need to show that
\[
\lim_{L\to\infty} \lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^L f(Lp_\ell) = \int_0^1 f\big( \mathbb{E}[\psi(u;X)] \big)\,du. \qquad (C.26)
\]
Let $b_j(M,u)$ denote the $j$-th Bernstein polynomial of degree $M$, which is given by
\[
b_j(M,u) = \binom{M}{j} u^j (1-u)^{M-j}.
\]
Then the probability $p_\ell$ given in (C.4) can be written as
\[
p_\ell = \sum_{j=K(\ell-1)}^{K\ell-1} \int_0^1 b_j(M,u)\, \mathbb{E}[\psi(u;X)]\,du, \qquad \forall \ell \in [L]. \qquad (C.27)
\]
In the next step, from the first assumption we have, almost surely, $|\psi(u;X)| \le C$, and by an application of the dominated convergence theorem we can change the order of the derivative and the expectation to get
\[
\frac{d}{du}\mathbb{E}[\omega(u;X)] = \mathbb{E}[\psi(u;X)].
\]
Next, by using this in (C.27) and then integrating by parts, we arrive at
\[
p_\ell = -\sum_{j=K(\ell-1)}^{K\ell-1} \int_0^1 \frac{\partial b_j(M,u)}{\partial u}\, \mathbb{E}[\omega(u;X)]\,du, \qquad \forall \ell \in [L]. \qquad (C.28)
\]
On the other hand, by a simple algebraic calculation, it is easy to get the following identity for Bernstein polynomials:
\[
\frac{\partial b_j(M,u)}{\partial u} = M\big( b_{j-1}(M-1,u) - b_j(M-1,u) \big), \qquad (C.29)
\]
where we set the convention $\binom{n}{k} = 0$ for $k > n$ and $k < 0$. Using (C.29) in (C.28) gives us
\[
p_\ell = M\sum_{j=K(\ell-1)}^{K\ell-1} \int_0^1 \mathbb{E}[\omega(u;X)]\big( b_j(M-1,u) - b_{j-1}(M-1,u) \big)\,du, \qquad \forall \ell \in [L].
\]
Moving the summation inside the integral (the sum telescopes), we obtain
\[
p_\ell = M\int_0^1 \mathbb{E}[\omega(u;X)]\big( b_{K\ell-1}(M-1,u) - b_{K(\ell-1)-1}(M-1,u) \big)\,du, \qquad \forall \ell \in [L]. \qquad (C.30)
\]
On the other hand, it is easy to check that
\[
\int_0^1 b_{K\ell-1}(M-1,u)\,du = \int_0^1 b_{K(\ell-1)-1}(M-1,u)\,du = \frac{1}{M}. \qquad (C.31)
\]
Using the identities (C.31) and (C.30), we write
\begin{align}
\mathbb{E}\Big[\omega\Big(\frac{\ell}{L};X\Big)\Big] - \mathbb{E}\Big[\omega\Big(\frac{\ell-1}{L};X\Big)\Big] - p_\ell &= M\int_0^1 \Big( \mathbb{E}\Big[\omega\Big(\frac{\ell}{L};X\Big)\Big] - \mathbb{E}[\omega(u;X)] \Big)\, b_{K\ell-1}(M-1,u)\,du \\
&\quad + M\int_0^1 \Big( \mathbb{E}[\omega(u;X)] - \mathbb{E}\Big[\omega\Big(\frac{\ell-1}{L};X\Big)\Big] \Big)\, b_{K(\ell-1)-1}(M-1,u)\,du.
\end{align}
Since almost surely $|\psi(u;X)| \le C$ for $u \in (0,1)$, and $\psi(u;X) = \frac{\partial}{\partial u}\omega(u;X)$, the function $\omega(u;X)$ is $C$-Lipschitz. Using this along with the triangle inequality, we get
\[
\Big| \mathbb{E}\Big[\omega\Big(\frac{\ell}{L};X\Big)\Big] - \mathbb{E}\Big[\omega\Big(\frac{\ell-1}{L};X\Big)\Big] - p_\ell \Big| \le MC\int_0^1 \Big|\frac{\ell}{L} - u\Big|\, b_{K\ell-1}(M-1,u)\,du + MC\int_0^1 \Big|u - \frac{\ell-1}{L}\Big|\, b_{K(\ell-1)-1}(M-1,u)\,du. \qquad (C.32)
\]
We continue by writing the above expressions in terms of a Beta distribution. Recall that the pdf of the Beta distribution $\text{Beta}(\alpha,\beta)$ with shape parameters $(\alpha,\beta)$ is given by $\frac{u^{\alpha-1}(1-u)^{\beta-1}}{B(\alpha,\beta)}$, where $B(\alpha,\beta) := \int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du$. For integer values $a, b$, $B(a,b)$ has the closed form $B(a,b) = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}$.
Therefore, we have Mb Kℓ− 1 (M− 1,u)=M M− 1 Kℓ− 1 u Kℓ− 1 (1− u) M− Kℓ = M! (Kℓ− 1)!(M− Kℓ)! u Kℓ− 1 (1− u) M− Kℓ = u Kℓ− 1 (1− u) M− Kℓ B(Kℓ,M− Kℓ+1) . 323 Using a similar expression forMb K(ℓ− 1)− 1 (M− 1,u) we write (C.32) as follows: 1 C E ω ℓ L ;X − E ω ℓ− 1 L ;X − p ℓ ≤ E u∼ Beta(Kℓ,M− Kℓ+1) ℓ L − u +E u∼ Beta(K(ℓ− 1),M− K(ℓ− 1)+1) u− ℓ− 1 L . (C.33) Since the mean of Beta(α,β ) is given by α/ (α +β ), the above Beta distributions have mean values of ℓ/L and(ℓ− 1)/L, respectively (recall thatM +1=KL). Therefore the terms on the right-hand side of (C.33) are indeed the mean absolute deviation of two Beta distributions. Further, using Jenson’s inequality we know that for arbitrary random variableS we have E[|S− E[S]|]≤ Var[S] 1/2 . Using this in (C.33) yields 1 C E ω ℓ L ;X − E ω ℓ− 1 L ;X − p ℓ ≤ Var[Beta(Kℓ,M− Kℓ+1)] 1/2 +Var[Beta(K(ℓ− 1),M− K(ℓ− 1)+1)] 1/2 . (C.34) In the next step, note that Var(Beta(α,β ))= αβ (α +β ) 2 (α +β +1) ≤ 1 4(α +β ) , (C.35) where the last inequality follows from(α +β ) 2 ≥ 4αβ . Combining (C.34) and (C.35) yields 1 C E ω ℓ L ;X − E ω ℓ− 1 L ;X − p ℓ ≤ 2 2 √ M +1 = 1 √ KL . (C.36) 324 Now from the mean value theorem, we know that there existss ℓ (X)∈ ℓ− 1 L , ℓ L such that ω ℓ L ;X − ω ℓ− 1 L ;X = 1 L ψ (s ℓ (X);X). (C.37) Combining (C.37) and (C.36) yields lim K→∞ Lp ℓ =E[ψ (s ℓ (X);X)] , wheres ℓ (x)∈ ℓ− 1 L , ℓ L . (C.38) We are now ready to prove the claim of (C.26). Introduce functionσ (u)=E[ψ (u;X)], we have 1 L L X ℓ=1 f(Lp ℓ )− Z 1 0 f(E[ψ (u;X)])du= 1 L L X ℓ=1 " f(Lp ℓ )− L Z ℓ L ℓ− 1 L (f◦ σ )(u)du # . (C.39) Given thatψ (u,X) is almost surely continuous, it is easy to observe thatσ (u) is also continuous. In the next step, by another application of the mean value theorem for the continuous functionf◦ σ , there exists t ℓ ∈ ℓ− 1 L , ℓ L such that L Z ℓ L ℓ− 1 L (f◦ σ )(u)du=(f◦ σ )(t ℓ ). 
(C.40) By combining (C.39) and (C.40) we get \[ \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) - \int_0^1 f(\mathbb{E}[\psi(u;X)])\,\mathrm{d}u = \frac{1}{L}\sum_{\ell=1}^{L} \left[ f(Lp_\ell) - f\circ\sigma(t_\ell) \right], \tag{C.41} \] and using continuity of $f$ along with (C.38) and (C.41) we obtain \[ \lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) - \int_0^1 f(\mathbb{E}[\psi(u;X)])\,\mathrm{d}u = \frac{1}{L}\sum_{\ell=1}^{L} \Big[ f\big(\mathbb{E}[\psi(s_\ell(X);X)]\big) - f\big(\mathbb{E}[\psi(t_\ell;X)]\big) \Big]. \tag{C.42} \] For the rest of the proof, we show that for every $\varepsilon>0$ and sufficiently large $L$, the right-hand side of (C.42) is smaller than $\varepsilon$. To this end, note that since $f$ is continuous, it must be uniformly continuous on the compact set $[0,C]$. This implies that there exists $\delta>0$ such that for every $r_1,r_2\in[0,C]$ with $|r_1-r_2|\le\delta$ we have $|f(r_1)-f(r_2)|\le\varepsilon$. Next, define $r_1^{(\ell)} = \mathbb{E}[\psi(s_\ell(X);X)]$ and $r_2^{(\ell)} = \mathbb{E}[\psi(t_\ell;X)]$. We claim that for $L \ge B/\delta$ we have $|r_1^{(\ell)} - r_2^{(\ell)}| \le \delta$ and $r_1^{(\ell)}, r_2^{(\ell)} \in [0,C]$ (the proof of this claim is provided below). Uniform continuity then yields $|f(r_1^{(\ell)}) - f(r_2^{(\ell)})| \le \varepsilon$; plugging this into (C.42) we arrive at \[ \left| \lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) - \int_0^1 f(\mathbb{E}[\psi(u;X)])\,\mathrm{d}u \right| = \frac{1}{L}\sum_{\ell=1}^{L} \Big| f\big(\mathbb{E}[\psi(s_\ell(X);X)]\big) - f\big(\mathbb{E}[\psi(t_\ell;X)]\big) \Big| = \frac{1}{L}\sum_{\ell=1}^{L} \big| f(r_1^{(\ell)}) - f(r_2^{(\ell)}) \big| \le \varepsilon. \] Since $\varepsilon$ can be chosen arbitrarily small, letting $L\to\infty$ gives \[ \lim_{L\to\infty}\lim_{K\to\infty} \frac{1}{L}\sum_{\ell=1}^{L} f(Lp_\ell) = \int_0^1 f(\mathbb{E}[\psi(u;X)])\,\mathrm{d}u. \tag{C.43} \] We are only left to prove our claim that for every $\ell\in[L]$, when $L\ge B/\delta$, we have $r_1^{(\ell)}, r_2^{(\ell)} \in [0,C]$ and $|r_1^{(\ell)} - r_2^{(\ell)}| \le \delta$. First, since almost surely $\psi(u;X)\le C$ for every $u\in(0,1)$, we have $r_1^{(\ell)}, r_2^{(\ell)} \in [0,C]$. Next, by an application of Jensen's inequality, $|r_1^{(\ell)} - r_2^{(\ell)}| \le \mathbb{E}\big[|\psi(s_\ell(X);X) - \psi(t_\ell;X)|\big]$. Using the second assumption stated in Proposition 4.2.8 then gives $|r_1^{(\ell)} - r_2^{(\ell)}| \le B\,\mathbb{E}[|s_\ell(X) - t_\ell|]$. Finally, since $s_\ell(X)$ and $t_\ell$ both belong to $\big[\tfrac{\ell-1}{L}, \tfrac{\ell}{L}\big]$, we arrive at $|r_1^{(\ell)} - r_2^{(\ell)}| \le B/L$. Using $L \ge B/\delta$ completes the proof.
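The double limit just established can also be observed numerically in the unconditional case $\psi(u;X)\equiv\psi(u)$ (as in Proposition 4.3.3). The sketch below takes $\psi(u)=2u$ (a valid density on $[0,1]$) and the KL generator $f(t)=t\log t$; the bin counts $L$, $K$ are illustrative, and the Bernstein moments $\int_0^1 b_j(M,u)\,2u\,\mathrm{d}u = 2(j+1)/((M+1)(M+2))$ are used in closed form.

```python
import math

f = lambda t: t * math.log(t)   # KL generator: convex with f(1) = 0
psi = lambda u: 2.0 * u         # a valid density on [0, 1] standing in for psi

L, K = 25, 80                   # illustrative bin/counterfeit counts; M + 1 = K * L
M = K * L - 1

# p_ell as a finite sum of Bernstein moments of psi:
# int_0^1 b_j(M, u) * 2u du = 2 (j + 1) / ((M + 1)(M + 2)).
p = [sum(2.0 * (j + 1) / ((M + 1) * (M + 2)) for j in range(K * (l - 1), K * l))
     for l in range(1, L + 1)]

lhs = sum(f(L * pl) for pl in p) / L                      # (1/L) sum f(L p_ell)
grid = 10_000                                             # midpoint rule for the integral
rhs = sum(f(psi((i + 0.5) / grid)) for i in range(grid)) / grid
gap = abs(lhs - rhs)
```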
C.1.9 Proof of Proposition 4.2.9 Proposition 4.2.1 states that when the null hypothesis (4.2) holds (with $\tau=0$), the random variable $w$ is independent of $x$ and follows a uniform distribution over $[0,1]$. Because of the symmetry among $w, w_1, \ldots, w_M$, it is easy to check that the p-value $p$ takes values in $\{\tfrac{1}{M+1}, \ldots, \tfrac{M}{M+1}, 1\}$ uniformly at random. Since the p-value $p$ is discrete, for any $t\in[0,1]$ we have $\mathbb{P}(p\le t) = \lfloor (M+1)t \rfloor / (M+1)$. Therefore $\mathbb{P}(p\le t)\le t$, which implies that under the null hypothesis the p-value $p$ is superuniform. C.1.10 Proof of Proposition 4.3.3 The proof of the first part is very similar to that of Proposition 4.2.4. The minor difference is that the argument should be carried out for the function $\psi(u)$ from Assumption 4.3.2 instead of the conditional functions $\psi(u;x)$ from Assumption 4.2.3, with careful treatment of conditional expectations with respect to the covariates $x$. We provide the complete proof here for the reader's convenience. First, it is easy to observe that the joint distribution of $(V_1,\ldots,V_L)$ is multinomial with probabilities $(p_1,\ldots,p_L)$, where $p_\ell$ denotes the probability of $(X,W)$ admitting label $\ell$, for $\ell\in[L]$. In accordance with Algorithm 5, each sample $(X,W)$ has $M$ counterfeits $(\widetilde{X}_1,\widetilde{W}_1),\ldots,(\widetilde{X}_M,\widetilde{W}_M)$, where the $\widetilde{W}_i$ are i.i.d. random variables distributed as $\mathrm{Unif}[0,1]$ and the $\widetilde{X}_i$ are i.i.d. from $P_X$. To lighten the notation, we use the shorthands $T=T(X,W)$ and $T_j=T(\widetilde{X}_j,\widetilde{W}_j)$ for $j\in[M]$. In this case, the rank value is given by \[ R = 1 + \sum_{j=1}^{M} \mathbb{I}(T\ge T_j). \] We have \[ \mathbb{P}\big((X,W)\ \text{admits label}\ \ell\big) = \sum_{j=K(\ell-1)}^{K\ell-1} \mathbb{P}(R=j+1) = \sum_{j=K(\ell-1)}^{K\ell-1} \mathbb{P}\big(T\ \text{exceeds exactly}\ j\ \text{of}\ \{T_j\}_{j\in[M]}\big). \] The random variables $T, T_1,\ldots,T_M$ are independent, which gives us the following: \[ \mathbb{P}\big((X,W)\ \text{admits label}\ \ell\big) = \sum_{j=K(\ell-1)}^{K\ell-1} \int \mathbb{P}\big(\text{exactly}\ j\ \text{of}\ \{T_j\}_{j\in[M]}\ \text{are smaller than}\ t\big)\, \mathrm{d}P_T(t) = \sum_{j=K(\ell-1)}^{K\ell-1} \int \binom{M}{j} F_c(t)^j \big(1-F_c(t)\big)^{M-j} f_o(t)\, \mathrm{d}t.
(C.44) The last relation follows from the iid property of random variables T j∈[M] along with the definitions of F c (t),F o (t) given in Assumption 4.3.2. By using the dummy variable u = F c (t) in the inner integral of (C.44) we get P((X,W) admits labelℓ)= Kℓ− 1 X j=K(ℓ− 1) Z 1 0 M j u j (1− u) M− j ∂F o (F − 1 c (u)) ∂u du. Next recall the functionψ (u) as density function ofω(u)=F 0 (F − 1 c (u)) defined in Assumption 4.3.2. We write the above probability as p ℓ = Kℓ− 1 X j=K(ℓ− 1) M j Z 1 0 u j (1− u) M− j ψ (u)du. ∀ℓ∈[L]. (C.45) 328 Forℓ∈[L], we define the polynomial β ℓ :[0,1]→R + as follows: β ℓ (u)= Kℓ− 1 X j=K(ℓ− 1) M j u j (1− u) M− j . Using the definition of β ℓ (u) in (C.45) we write p ℓ = Z 1 0 β ℓ (u)ψ (u)du. ∀ℓ∈[L], (C.46) and so 1 L L X ℓ=1 f(Lp ℓ )= 1 L L X ℓ=1 f L Z 1 0 β ℓ (u)ψ (u)du . (C.47) We continue by upper bounding the right-hand side in the above equation. Consider the following prob- ability density function over[0,1]: φ ℓ (u)= β ℓ (u) R 1 0 β ℓ (u)du , ∀u∈[0,1]. (C.48) By an application of Jensen’s inequality and using the convexity off we have f E u∼ φ ℓ [ψ (u)] i ≤ E u∼ φ ℓ h f(ψ (u)) i . (C.49) Equivalently this can be rewritten as f Z 1 0 φ ℓ (u)ψ (u)du ≤ Z 1 0 φ ℓ (u)f(ψ (u))du. (C.50) 329 Plugging (C.48) into (C.50) yields f R 1 0 β ℓ (u)ψ (u)du R 1 0 β ℓ (u)du ! ≤ R 1 0 β ℓ (u)f(ψ (u))du R 1 0 β ℓ (u)du . (C.51) To use the above inequality in (C.47), we establish two properties of functionβ ℓ (u). First from the definition of polynomial β ℓ (u) we have Z 1 0 β ℓ (u)du= Kℓ− 1 X j=K(ℓ− 1) M j Z 1 0 u j (1− u) M− j du = Kℓ− 1 X j=K(ℓ− 1) M j B(j+1,M− j+1) = Kℓ− 1 X j=K(ℓ− 1) M j j!(M− j)! (M +1)! = 1 L , (C.52) whereB(a,b) is the beta function. Second, we have L X ℓ=1 β ℓ (u)= L X ℓ=1 Kℓ− 1 X j=K(ℓ− 1) M j u j (1− u) M− j = M X j=0 M j u j (1− u) M− j =(u+1− u) M =1. 
(C.53) Using (C.52) in (C.51) we have f L Z 1 0 β ℓ (u)ψ (u)du ≤ L Z 1 0 β ℓ (u)f(ψ (u))du, (C.54) 330 which together with (C.47) gives 1 L L X ℓ=1 f(Lp ℓ )≤ L X ℓ=1 Z 1 0 β ℓ (u)f(ψ (u))du= Z 1 0 f(ψ (u))du. (C.55) The last step above follows from (C.53). We next use the definition of ψ (u) in the right hand side of (C.55) to obtain Z 1 0 f(ψ (u))du= Z 1 0 f f o (F − 1 c (u)) f c (F − 1 c (u)) du = Z ∞ −∞ f f o (t) f c (t) f c (t)dt =D f L(T(X,W))∥L(T( e X, f W)) , (C.56) where the last relation comes from the definition of f− divergence and density functionsf o (t) andf c (t). We next proceed to the proof of the second part. By virtue of characterization in (C.56), we need to show that lim L→∞ lim K→∞ 1 L L X ℓ=1 f(Lp ℓ )= Z 1 0 f(ψ (u))du. (C.57) Letb j (M,u) denote thej-th Bernstein polynomial of degreeM, which is given by b j (M,u)= M j u j (1− u) M− j . 331 Then the probabilityp ℓ given in (C.45) can be written as p ℓ = Kℓ− 1 X j=K(ℓ− 1) Z 1 0 b j (M,u)ψ (u)du. ∀ℓ∈[L] Next, partial integration yields p ℓ =− Kℓ− 1 X j=K(ℓ− 1) Z 1 0 ∂b j (M,u) ∂u ω(u)du. ∀ℓ∈[L]. (C.58) On the other hand, by simple algebraic calculation, it is easy to get the following identity for Bernstein polynomials: ∂b j (M,u) ∂u =M(b j− 1 (M− 1,u)− b j (M− 1,u)), (C.59) where we set the convention n k =0 fork >n andk <0. Using (C.59) in (C.58) gives us p ℓ =M Kℓ− 1 X j=K(ℓ− 1) Z 1 0 ω(u)(b j (M− 1,u)− b j− 1 (M− 1,u))du. ∀ℓ∈[L] Moving the summation inside the integration we obtain p ℓ =M Z 1 0 ω(u)(b Kℓ− 1 (M− 1,u)− b K(ℓ− 1)− 1 (M− 1,u))du. ∀ℓ∈[L] (C.60) On the other hand, it is easy to check that Z 1 0 b Kℓ− 1 (M− 1,u)du= Z 1 0 b K(ℓ− 1)− 1 (M− 1,u)du= 1 M . (C.61) 332 Using the identities (C.61) and (C.60), we write ω ℓ L − ω ℓ− 1 L − p ℓ =M Z 1 0 ω ℓ L − ω(u) b Kℓ− 1 (M− 1,u)du +M Z 1 0 ω(u)− ω ℓ− 1 L b K(ℓ− 1)− 1 (M− 1,u)du. Sinceψ (.) is continuous on[0,1], therefore it isB-bounded. This means thatω(.) isB-Lipschitz. 
Using this along with the triangle inequality we get ω ℓ L − ω ℓ− 1 L − p ℓ ≤ MB Z 1 0 ℓ L − u b Kℓ− 1 (M− 1,u)du +MB Z 1 0 u− ℓ− 1 L b K(ℓ− 1)− 1 (M− 1,u)du. (C.62) We continue by writing the above expressions in terms of a Beta distribution. Recall that the pdf of Beta distribution Beta(α,β ) with shape parameters (α,β ) is given by u α − 1 (1− u) β − 1 B(α,β ) , where B(α,β ) := R 1 0 u α − 1 (1− u) β − 1 du. For integer values a,b, B(a,b) has a closed form and is given by B(a,b) = (a− 1)!(b− 1)! (a+b− 1)! . Therefore, we have Mb Kℓ− 1 (M− 1,u)=M M− 1 Kℓ− 1 u Kℓ− 1 (1− u) M− Kℓ = M! (Kℓ− 1)!(M− Kℓ)! u Kℓ− 1 (1− u) M− Kℓ = u Kℓ− 1 (1− u) M− Kℓ B(Kℓ,M− Kℓ+1) . 333 Using a similar expression forMb K(ℓ− 1)− 1 (M− 1,u) we write (C.62) as follows: 1 B ω ℓ L − ω ℓ− 1 L − p ℓ ≤ E u∼ Beta(Kℓ,M− Kℓ+1) ℓ L − u +E u∼ Beta(K(ℓ− 1),M− K(ℓ− 1)+1) u− ℓ− 1 L . (C.63) Since the mean of Beta(α,β ) is given by α/ (α +β ), the above Beta distributions have mean values of ℓ/L and(ℓ− 1)/L, respectively (recall thatM +1=KL). Therefore the terms on the right-hand side of (C.63) are indeed the mean absolute deviation of two Beta distributions. Further, using Jenson’s inequality we know that for arbitrary random variableS we have E[|S− E[S]|]≤ Var[S] 1/2 . Using this in (C.63) yields 1 B ω ℓ L − ω ℓ− 1 L − p ℓ ≤ Var[Beta(Kℓ,M− Kℓ+1)] 1/2 +Var[Beta(K(ℓ− 1),M− K(ℓ− 1)+1)] 1/2 . (C.64) In the next step, note that Var(Beta(α,β ))= αβ (α +β ) 2 (α +β +1) ≤ 1 4(α +β ) , (C.65) where the last inequality follows from(α +β ) 2 ≥ 4αβ . Combining (C.64) and (C.65) yields 1 B ω ℓ L − ω ℓ− 1 L − p ℓ ≤ 2 2 √ M +1 = 1 √ KL . (C.66) 334 Now from the mean value theorem, we know that there existss ℓ ∈ ℓ− 1 L , ℓ L such that ω ℓ L − ω ℓ− 1 L = 1 L ψ (s ℓ ). (C.67) Combining (C.67) and (C.66) yields lim K→∞ Lp ℓ =ψ (s ℓ ), for somes ℓ ∈ ℓ− 1 L , ℓ L . (C.68) We are now ready to prove the claim of (C.57). Write 1 L L X ℓ=1 f(Lp ℓ )− Z 1 0 f(ψ (u))du= 1 L L X ℓ=1 " f(Lp ℓ )− L Z ℓ L ℓ− 1 L (f◦ ψ )(u)du # . 
(C.69) By another application of the mean value theorem for the continuous function f ◦ ψ , there exists t ℓ ∈ ℓ− 1 L , ℓ L such that L Z ℓ L ℓ− 1 L (f◦ ψ )(u)du=(f◦ ψ )(t ℓ ). (C.70) By combining (C.69) and (C.70) we get 1 L L X ℓ=1 f(Lp ℓ )− Z 1 0 f(ψ (u))du= 1 L L X ℓ=1 [f(Lp ℓ )− f◦ ψ (t ℓ )] , (C.71) and using continuity off along with (C.68) and (C.71) we obtain lim K→∞ 1 L L X ℓ=1 f(Lp ℓ )− Z 1 0 f(ψ (u))du= 1 L L X ℓ=1 [f◦ ψ (s ℓ )− f◦ ψ (t ℓ )] . (C.72) 335 Next sincef◦ ψ is continuous on the compact set[0,1], it is uniformly continuous. This implies that for every arbitraryε>0, there existsδ > 0 such that if|u 1 − u 2 |≤ δ , then we have|f◦ ψ (u 1 )− f◦ ψ (u 2 )|≤ ε. ChooseL sufficiently large such that 1/L<δ . Since botht ℓ ,s ℓ belong to the interval[(ℓ− 1)/L,ℓ/L] we get that|s ℓ − t ℓ | < δ , and therefore|f◦ ψ (s ℓ )− f◦ ψ (t ℓ )|≤ ε. Using this observation in (C.72) we see that for every arbitrary smallε>0 and sufficiently large L we have lim K→∞ 1 L L X ℓ=1 f(Lp ℓ )− Z 1 0 f(ψ (u))du ≤ 1 L L X ℓ=1 |f◦ ψ (s ℓ )− f◦ ψ (t ℓ )|≤ ε. By lettingL go to infinity, ε can be chosen arbitrarily small and so lim L→∞ lim K→∞ 1 L L X ℓ=1 f(Lp ℓ )= Z 1 0 f(ψ (u))du. (C.73) This completes the proof of the claim (C.57), and so the proof of the second part of Proposition 4.3.3. C.1.11 ProofofTheorem4.3.4 As we showed in Proposition 4.3.3 (first part), we have: 1 L L X ℓ=1 f(Lp ℓ )≤ D f (L(T(X,W))∥L(T( e X, f W))). Also, by an application of the data processing inequality we have D f L(T(X,W))∥L(T( e X, f W)) ≤ D f L(X,W)∥L( e X, f W) . The result then follows from Proposition 4.3.1. 336 C.1.12 ProofofLemma4.3.6 Using the definition of f-divergence and conjugate function we have D f (p∥q)= Z qf p q dµ ≥ Z qsup φ∈G φp q − f ∗ (φ) dµ ≥ sup φ∈G Z (φp− f ∗ (φ)q)dµ. This completes the proof of the lower bound. Next we show that this bound becomes tight ifσ ∈∂f(p/q), for anyx∈X (Here,p,q andσ are evaluated at anyx∈X ). 
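Before turning to the tightness claim, the variational lower bound just proved can be sanity-checked numerically. The sketch below uses the KL generator $f(t)=t\log t$, whose convex conjugate is $f^*(s)=e^{s-1}$, on a three-point alphabet; all numbers are illustrative. The dual variable $\sigma = f'(p/q) = \log(p/q)+1$ should attain the bound exactly, while any other test function stays weakly below $D_f$.

```python
import math

# Two discrete distributions on a three-letter alphabet (illustrative numbers).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

f = lambda t: t * math.log(t)            # KL generator
f_star = lambda s: math.exp(s - 1.0)     # convex conjugate of t log t

D = sum(qi * f(pi / qi) for pi, qi in zip(p, q))  # D_f(p || q)

# Optimal dual variable sigma = f'(p/q) = log(p/q) + 1 attains the bound ...
sigma = [math.log(pi / qi) + 1.0 for pi, qi in zip(p, q)]
tight = sum(pi * s - qi * f_star(s) for pi, qi, s in zip(p, q, sigma))

# ... while an arbitrary test function gives a value no larger than D_f.
phi = [0.3, -0.1, 0.8]
loose = sum(pi * s - qi * f_star(s) for pi, qi, s in zip(p, q, phi))
```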
The definition of subdifferential implies that for every measurable functionh:X →R we have f(h)− f p q ≥ σh − σp q . This gives us f p q ≤ σp q − (σh − f(h)). Since this holds for every measurable functionh we get f p q ≤ σp q − sup h (σh − f(h)). In the next step, by using the definition of the conjugate dual function we get qf p q ≤ σp − qf ∗ (σ ). 337 This completes the proof. C.1.13 ProofofProposition4.3.7 Since X p ∼ p and X q ∼ q, we get D f (p∥q) = D f (L(X p )∥L(X q )) and by an application of the data processing inequality we obtain D f (p∥q)=D f (L(X p )∥L(X q ))≥ D f (L(σ (X p ))∥L(σ (X q ))). We next show the other direction, i.e., D f (L(σ (X p ))∥L(σ (X q )))≥ D f (p∥q). (C.74) Let e p ande q denote the density functions of σ (X p ) and σ (X q ) respectively. By using Lemma 4.3.6 for density functionse p,e q and a class of measurable functionsG ={g :R→R} we get D f (L(σ (X p ))∥L(σ (X q )))≥ sup g∈G Z (ge p− f ∗ (g)e q)dµ. (C.75) Next, by rewriting (C.75) in terms of expectations we arrive at D f (L(σ (X p ))∥L(σ (X q )))≥ sup g∈G E t∼ e p [g(t)]− E t∼ e q [f ∗ (g(t))] . We then change the expectation measures to get D f (L(σ (X p ))∥L(σ (X q )))≥ sup g∈G {E[g(σ (X p ))]− E[f ∗ (g(σ (X q )))]} . (C.76) 338 We consider the identity functiong(t)=t defined over the real line, with the convention that g(∞)=∞. Evaluating the right-hand side of (C.76) atg we get D f (L(σ (X p ))∥L(σ (X q )))≥ E[e g(σ (X p ))]− E[f ∗ (e g(σ (X q ))) =E[σ (X p )]− E[f ∗ (σ (X q ))]. (C.77) We also know from Lemma 4.3.6 that forσ ∈∂f(p/q) we have E[σ (X p )]− E[f ∗ (σ (X q )))=D f (p∥q). (C.78) Combining (C.77) and (C.78) yields (C.74) and completes the proof. C.2 ProofofProposition4.5.1 The Lagrangian of min q∈Uτ q T x is given by L(q;η,λ )=q T x+λ 1 L L X ℓ=1 f(Lq ℓ )− τ ! +η L X ℓ=1 q ℓ − 1 ! . In this case, the dual objective function is D(λ,η )=min q≥ 0 " q T x+ λ L L X ℓ=1 f(Lq ℓ )− λτ +η ( L X ℓ=1 q ℓ − 1) # . 
339 The constraints are decoupled and minimization can be moved inside, this yields D(λ,η )=− η − λτ + L X ℓ=1 min q ℓ ≥ 0 q ℓ x ℓ + λ L f(Lq ℓ )+ηq ℓ . Using the conjugate dual definition f(s)=sup t≥ 0 [st− f(t)] yields D(λ,η )=− η − λτ − λ L L X ℓ=1 f ∗ − x ℓ +η Lλ . In this formulation the optimalq ℓ is such that − x ℓ +η Lλ ∈∂f(q ℓ ). (C.79) Given that in this problem the strong duality holds (Slater’s condition forτ > 0), usingλ ∗ ,η ∗ as solutions ofargmin λ ≥ 0,η − D(λ,η ) in (C.79) completes the proof. 340 AppendixD SupplementtoChapter5 D.1 Proofoftheoremsandtechnicallemmas D.1.1 ProofofProposition5.1.2 From the definition of robust surrogate in (5.9) for the setting of Proposition 5.1.2 we have ϕ γ (θ ;z 0 ):=sup x n (y 0 − x T θ ) 2 − γ ∥x− x 0 ∥ 2 ℓ 2 o , by introducingg γ (x):=(y 0 − x T θ ) 2 − γ ∥x− x 0 ∥ 2 ℓ 2 , for every scalarc we get g γ (x 0 +cθ )=g γ (x 0 )+2c(x T 0 θ − y 0 )∥θ ∥ 2 ℓ 2 +c 2 ∥θ ∥ 2 ℓ 2 (∥θ ∥ 2 ℓ 2 − γ ), 341 this implies if γ < ∥θ ∥ 2 ℓ 2 , then ϕ γ (θ ;z 0 ) = +∞. Consider γ ≥ ∥ θ ∥ 2 ℓ 2 , then from relation∇ 2 g γ (x) = 2(θθ T − γI ) we realize thatg γ is concave. Writing the first order optimal condition we have (y 0 − x T θ )θ +γ (x− x 0 )=0. (D.1) Multiplying byθ and solving forx T θ , we get x T θ = γx T 0 θ − y 0 ∥θ ∥ 2 γ −∥ θ ∥ 2 . Substituting forx T θ in the stationary condition (D.1) implies x ∗ =x 0 + x T 0 θ − y 0 γ −∥ θ ∥ 2 ℓ 2 θ. Replacingx ∗ ing γ yields ϕ γ (θ ;z)= +∞ ifγ < ∥θ ∥ 2 ℓ 2 , γ (y 0 − x T θ ) 2 (γ −∥ θ ∥ 2 ℓ 2 ) ifγ ≥∥ θ ∥ 2 ℓ 2 . (D.2) Then, we use dual formulation (5.8) to compute the Wasserstein adversarial risk: AR(θ ):= sup Q∈Uε(P Z ) E z∼ Q [ℓ(θ ;z)] = inf γ ≥ 0 {γε 2 +E P Z [ϕ γ (θ ;z)]} = inf γ ≥∥ θ ∥ 2 ℓ 2 {γε 2 +E P Z [ϕ γ (θ ;z)]} = inf γ ≥∥ θ ∥ 2 ℓ 2 {γε 2 + γ E P Z [ℓ(θ ;z)] γ −∥ θ ∥ 2 ℓ 2 }, 342 the infimum is achieved at γ ∗ = 1 ε q E P Z [ℓ(θ ;z)]∥θ ∥ ℓ 2 +∥θ ∥ 2 ℓ 2 . Finally, this gives us AR(θ )= q E P Z [ℓ(θ ;z)]+ε∥θ ∥ ℓ 2 2 . 
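The closed form (D.2) and the maximizer derived from (D.1) can be cross-checked numerically: for $\gamma > \|\theta\|_{\ell_2}^2$ the inner objective $g_\gamma$ is concave, so no sampled point should beat the stationary point. A minimal sketch (dimension, seed, and probe scale are arbitrary):

```python
import random

random.seed(0)
d = 5                                       # illustrative dimension
theta = [random.gauss(0, 1) for _ in range(d)]
x0 = [random.gauss(0, 1) for _ in range(d)]
y0 = random.gauss(0, 1)
n2 = sum(t * t for t in theta)              # ||theta||_2^2
gamma = n2 + 1.5                            # any gamma > ||theta||^2

def g(x):
    # inner objective: (y0 - x' theta)^2 - gamma * ||x - x0||^2
    r = y0 - sum(xi * ti for xi, ti in zip(x, theta))
    pen = sum((xi - zi) ** 2 for xi, zi in zip(x, x0))
    return r * r - gamma * pen

r0 = sum(xi * ti for xi, ti in zip(x0, theta)) - y0
closed = gamma * r0 * r0 / (gamma - n2)     # phi_gamma from (D.2)
x_star = [zi + (r0 / (gamma - n2)) * ti for zi, ti in zip(x0, theta)]  # from (D.1)

# g is concave for gamma > ||theta||^2, so random probes cannot beat x_star.
best_probe = max(g([zi + random.gauss(0, 0.5) for zi in x0]) for _ in range(2000))
```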
D.1.2 ProofofTheorem5.1.3 Define R(θ ):=λ SR(θ )+AR(θ ). Proposition 5.1.2 impliesAR(θ )= SR(θ )+2ε∥θ ∥ ℓ 2 p SR(θ )+ε 2 ∥θ ∥ 2 ℓ 2 , then by expanding adversarial risk relationAR(θ ) inR(θ ) we get R(θ )=(1+λ )SR(θ )+ε 2 ∥θ ∥ 2 ℓ 2 +2ε∥θ ∥ ℓ 2 p SR(θ ). (D.3) It is easy to seeSR(θ )=σ 2 y +θ T Σ θ − 2v T θ . Replace∇ θ SR(θ )=2(Σ θ − v) in (D.3) to get ∇ θ R(θ )=2(1+λ )(Σ θ − v)+2ε 2 θ +2ε θ ∥θ ∥ ℓ 2 p SR(θ )+(Σ θ − v) ∥θ ∥ ℓ 2 p SR(θ ) ! , (D.4) therefore stationary points (solutions of∇ θ R(θ ) = 0) and a critical pointθ = 0 are candidates for global minimizers. From equationSR(θ ) =σ 2 y +θ T Σ θ − 2v T θ and adversarial risk relation in Proposition 5.1.2 it is clear that for θ = 0 we have SR(θ ) = AR(θ ) = σ 2 y . Next, we focus on characterizing stationary 343 minimizers ofR(θ ) and their corresponding standard and adversarial risk values. If θ ∗ is a stationary point, then putting (D.4) to be zero yields 1+λ + ε∥θ ∗ ∥ ℓ 2 p SR(θ ∗ ) ! Σ+ ε 2 + ε p SR(θ ∗ ) ∥θ ∗ ∥ ℓ 2 ! I ! θ ∗ = 1+λ + ε∥θ ∗ ∥ ℓ 2 p SR(θ ∗ ) ! v. (D.5) Introduce A ∗ := √ SR(θ ∗ ) ∥θ ∗ ∥ ℓ 2 and γ ∗ := ε 2 +εA∗ 1+λ + ε A∗ , then (D.5) can be simplified to θ ∗ = (Σ+ γ ∗ I) − 1 v. By replacingθ ∗ =(Σ+ γ ∗ I) − 1 v inA ∗ along with equationSR(θ )=σ 2 y +θ T Σ θ − 2v T θ we get A ∗ = p SR((Σ+ γ ∗ I) − 1 v) ∥(Σ+ γ ∗ I) − 1 v∥ ℓ 2 = 1 ∥(Σ+ γ ∗ I) − 1 v∥ ℓ 2 σ 2 y + Σ 1/2 (Σ+ γ ∗ I) − 1 v 2 ℓ 2 − 2v T (Σ+ γ ∗ I) − 1 v 1/2 , thereforeγ ∗ is a fixed point solution of two equations (5.15) and (5.16). Moreover, definition of A ∗ gives us SR(θ ∗ ) = A 2 ∗ (Σ+ γ ∗ I) − 1 v 2 ℓ 2 . Next, from adversarial risk relation in Proposition D.1.1 we know that AR(θ ∗ )=( p SR(θ ∗ )+ε∥θ ∗ ∥ ℓ 2 ) 2 . This impliesAR(θ ∗ )=(A ∗ +ε) 2 (Σ+ γ ∗ I) − 1 v 2 ℓ 2 . D.1.3 ProofofCorollary5.1.4 For linear data modely =x T θ 0 +w with isotropic featuresE[xx T ]=I d and Gaussian noisew∼ N(0,σ 2 ) we haveE[xy] = θ 0 . In addition, we haveE[y 2 ] = σ 2 +∥θ 0 ∥ 2 ℓ 2 . This gives us σ 2 y = σ 2 +∥θ o ∥ 2 ℓ 2 . 
Use Theorem 5.1.3 withv =θ 0 ,Σ= I, andσ 2 y =σ 2 +∥θ 0 ∥ 2 ℓ 2 to get Corollary 5.1.4. 344 D.1.4 ProofofProposition5.1.5 We start by proving the expression for standard risk. By definition we have SR(θ ):=E[I(y̸= ˆ y)]=P(yx T θ ≤ 0) =P y(yµ +Σ 1/2 u) T θ ≤ 0 =P (µ +Σ 1/2 u) T θ ≤ 0 =P µ T θ + Σ 1/2 θ ℓ 2 ν ≤ 0 =Φ − µ T θ Σ 1/2 θ ℓ 2 ! , (D.6) withu∼ N(0,I d ) andν ∼ N(0,1). To prove the expression for adversarial risk we use the dual form (5.8). Our next lemma characterizes the functionϕ γ given by (5.9) for the binary problem under the Gaussian mixture model. Lemma D.1.1. Consider the binary classification problem under the Gaussian mixture model with 0-1 loss. Then, the robust surrogate for the loss functionϕ γ given by (5.9) with distanced(·,·) (5.12) satisfies E P Z [ϕ γ (θ ;z)]=Φ r 2 b θ γ − a + b θ γ 2 a θ + r 2 b θ γ φ a θ − r 2 b θ γ − a θ φ(a θ )+(a 2 θ +1) h Φ a θ − r 2 b θ γ − Φ( a θ ) i , witha θ = µ T θ ∥Σ 1/2 θ ∥ ℓ 2 andb θ = ∥Σ 1/2 θ ∥ 2 ℓ 2 ∥θ ∥ 2 ℓq . 345 Proof (Lemma D.1.1). By definition of the ϕ γ function, for the setting of Lemma D.1.1 we have ϕ γ (θ ;z 0 )=sup x {I(y 0 x T θ ≤ 0)− γ 2 ∥x− x 0 ∥ 2 ℓr }. We letv 0 :=y 0 x 0 andv =y 0 x. Given thaty 0 ∈{± 1}, the functionϕ γ can be written as ϕ γ (θ ;z 0 )=sup v {I(v T θ ≤ 0)− γ 2 ∥v− v 0 ∥ 2 ℓr }. First observe that by choosing x = x 0 , we obtain ϕ γ (θ,z 0 ) ≥ 0. It is also clear that ϕ γ (θ,z 0 ) ≤ 1. We consider two cases. Case1: (v T 0 θ ≤ 0). By choosingv =v 0 we obtain thatϕ γ (θ ;z 0 )≥ 1 and henceϕ γ (θ ;z 0 )=1. Case2:(v T 0 θ > 0). Letv ∗ be the maximizer in definition of ϕ γ (θ ;z 0 ). Ifv T ∗ θ > 0, then we have ϕ γ (θ ;z 0 )=I(v T ∗ θ ≤ 0)− γ 2 ∥v ∗ − v 0 ∥ 2 ℓr =− γ 2 ∥v ∗ − v 0 ∥ 2 ℓr ≤ 0. Therefore,ϕ γ (θ ;z 0 )=0 in this case. We next focus on the case thatv T ∗ θ ≤ 0. 
It is easy to see that in this case,v ∗ is the solution of the following optimization: min v∈R d ∥v− v 0 ∥ ℓr subject to v T θ ≤ 0 (D.7) 346 Given thatv T 0 θ > 0 by assumption, using the Holder inequality it is straightforward to see that the optimal value is given by∥v− v 0 ∥ ℓr = v T 0 θ ∥θ ∥ ℓq , with 1 r + 1 q =1. The functionϕ γ is then given byϕ γ (θ ;z 0 )=1− γ 2 v T 0 θ ∥θ ∥ ℓq 2 . Putting the two conditionsv T ∗ θ ≤ 0 and v T 0 θ > 0 together, we obtain ϕ γ (θ ;z 0 )=max n 1− γ 2 v T 0 θ ∥θ ∥ ℓq 2 ,0 o , in this case. Combining case 1 and case 2 we arrive at ϕ γ (θ ;z 0 )=I(v T 0 θ ≤ 0) (D.8) +max 1− γ 2 v T 0 θ ∥θ ∥ ℓq 2 ,0 ! I(v T 0 θ > 0). (D.9) For (x 0 ,y 0 ) generated according to the Gaussian mixture model, we have v T 0 θ = y 0 x T 0 θ = µ T θ + Σ 1/2 θ ℓ 2 ν withν ∼ N(0,1). Hence, v T 0 θ ∥θ ∥ ℓq = µ T θ ∥θ ∥ ℓq + Σ 1/2 θ ℓ 2 ∥θ ∥ ℓq ν . 347 Lettinga θ := µ T θ ∥Σ 1/2 θ ∥ ℓ 2 , (D.8) can be written as ϕ γ (θ ;z0)=I(ν ≤− a θ ) +max 1− γ 2 Σ 1/2 θ 2 ℓ 2 ∥θ ∥ 2 ℓq (ν +a θ ) 2 ,0 · I(ν > − a θ ) =I(ν ≤− a θ ) + 1− b θ γ 2 (ν +a θ ) 2 I r 2 b θ γ − a θ >ν > − a θ , (D.10) whereb θ := ∥Σ 1/2 θ ∥ 2 ℓ 2 ∥θ ∥ 2 ℓq . By simple algebraic calculation, we get E P Z [ϕ γ (θ ;z)]=Φ r 2 b θ γ − a θ + b θ γ 2 a θ + r 2 b θ γ φ a θ − r 2 b θ γ − a θ φ(a θ ) +(a 2 θ +1) h Φ a θ − r 2 b θ γ − Φ( a θ ) i . The claim of Proposition 5.1.5 follows readily from Lemma D.1.1 and the fact that strong duality holds for the dual problem (5.8), where we use the change of variableγ 7→ γ b θ . D.1.5 ProofofRemark5.1.7 Recall the objective (5.25) and define R(a):=λ Φ( − a)+γε 2 +Φ r 2 γ − a + γ 2 n (a+ r 2 γ )φ a− r 2 γ − aφ(a) +(a 2 +1) Φ a− r 2 γ − Φ( a) ! o . 348 Then, we get dR(a) da =− λφ (− a) +γ n φ r 2 γ − a − φ(a) +a Φ r 2 γ − a − Φ( a) o . (D.11) Note that ∂ ∂t φ(t− a)− φ(a)+a(Φ( t− a)− Φ( a)) =φ(t− a)(2a− t), (D.12) and therefore the maximum ofφ(t− a)− φ(a)+a(Φ( t− a)− Φ( a)) is achieved att = 2a. 
As a result dR(a) da ≤− λφ (− a) < 0, which implies that the objective (5.25) is decreasing ina. Since|a|≤∥ µ ∥ ℓ 2 , its infimum is achieved at a=∥µ ∥ ℓ 2 . Equations (5.26) follows from (5.24) by substituting fora θ =∥µ ∥ ℓ 2 andb θ =1. D.1.6 ProofofCorollary5.1.8 Recall the distanced(·,·) on the spaceZ ={z = (x,y), x∈R d ,y∈R} given byd(z,˜ z) =||x− e x|| 2 + ∞· I(y− ˜ y). This metric is induced from norm∥z∥=∥x∥ ℓ 2 +∞· I(y =0) with corresponding conjugate norm∥z∥ ∗ =∥x∥ ℓ 2 . We will use Proposition 5.0.3 to find the variation of loss ℓ and derive the first-order 349 approximation for the Wasserstein adversarial risk. Denoting byu j ∈R d be thejth row of matrixU, for j =1,2,...,N, we have ∇ x ℓ(θ ;Z)=∇ x (y− θ T σ (Ux)) 2 =2(θ T σ (Ux)− y) N X j=1 θ j σ ′ (u T j x)u j =2(θ T σ (Ux)− y)U T diag(σ ′ (Ux))θ. (D.13) As we work with Wasserstein of orderp=2, we have conjugate orderq =2. Therefore, Proposition 5.0.3 gives usV P Z ,q (ℓ)= E[||∇ z ℓ(θ ;Z)|| 2 ∗ ] 1/2 . By using (D.13) we get VP Z ,q(ℓ)=2 E (θ T σ (Ux)− y) 2 U T diag(σ ′ (Ux))θ 2 ℓ 2 1/2 . Finally, relationAR(θ )= SR(θ )+εV P Z ,q (ℓ)+O(ε 2 ) from Proposition 5.0.3 completes the proof. We just need to verify that the necessary condition in Proposition 5.0.3 holds for the lossℓ(θ ;z)=(y− θ T σ (Wx)) 2 . By the setting of the problem, we have x ∈ S d− 1 ( √ d) and u j ∈ S d− 1 (1). Therefore∥x∥ ℓ 2 ≤ √ d and ∥U∥ op ≤ p max(N,d). In the following lemma we show that the solutionθ λ to (5.14) is bounded asλ varies in[0,∞). Lemma D.1.2. Under the setting of Corollary 5.1.8, and forθ λ given by (5.14), there exist constantsc 0 and c 1 , independent ofλ , such that with probability at least1− e − c 0 d we have∥θ λ ∥ ℓ 2 ≤ c 1 . Using Lemma D.1.2 we can restrict ourselves to the ball ofℓ 2 radiusc 1 . More specifically, we can define a ‘surrogate’ loss for (5.14) whereθ is constrained to be in ball of radiusc 1 , without changing its solution. 
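As an aside, the gradient formula (D.13) for the two-layer ReLU loss $(y-\theta^T\sigma(Ux))^2$ can be verified against central finite differences; the following sketch uses small illustrative dimensions and keeps the pre-activations away from the ReLU kink so the finite-difference check is valid.

```python
import random

random.seed(1)
N, d = 4, 6   # illustrative widths
U = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
theta = [random.gauss(0, 1) for _ in range(N)]
y = 0.7

relu = lambda t: t if t > 0 else 0.0
drelu = lambda t: 1.0 if t > 0 else 0.0

def loss(xv):
    # ell(theta; z) = (y - theta' sigma(U x))^2 with sigma = ReLU
    s = [relu(sum(U[j][i] * xv[i] for i in range(d))) for j in range(N)]
    return (y - sum(theta[j] * s[j] for j in range(N))) ** 2

# Draw x with pre-activations bounded away from 0 (finite differences need smoothness).
while True:
    x = [random.gauss(0, 1) for _ in range(d)]
    pre = [sum(U[j][i] * x[i] for i in range(d)) for j in range(N)]
    if min(abs(v) for v in pre) > 0.05:
        break

resid = sum(theta[j] * relu(pre[j]) for j in range(N)) - y
# (D.13): grad_x = 2 (theta' sigma(Ux) - y) U' diag(sigma'(Ux)) theta
grad = [2.0 * resid * sum(U[j][i] * drelu(pre[j]) * theta[j] for j in range(N))
        for i in range(d)]

h = 1e-6
fd_err = 0.0
for i in range(d):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    fd = (loss(xp) - loss(xm)) / (2.0 * h)   # central finite difference
    fd_err = max(fd_err, abs(fd - grad[i]))
```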
We can then apply Proposition 5.0.3 to establish a relation between SR and AR. In the following part we show that the conditions of Proposition 5.0.3 are satisfied. 350 We adopt the shorthandsD = diag(σ ′ (Ux)), ˜ D = diag(σ ′ (U˜ x)), s = σ (Ux), and ˜ s = σ (U˜ x), and write 1 2 ∥∇ z ℓ(θ ;z)−∇ z ℓ(θ ;˜ z)∥ ∗ = 1 2 ∥∇ x ℓ(θ ;z)−∇ x ℓ(θ ;˜ z)∥ ℓ 2 (a) = (θ T s− y)U T Dθ − (θ T ˜ s− ˜ y)U T ˜ Dθ ℓ 2 (b) ≤ θ T sU T (D− ˜ D)θ ℓ 2 + θ T (s− ˜ s)U T ˜ Dθ ℓ 2 + yU T (D− ˜ D)θ ℓ 2 + (y− ˜ y)U T ˜ Dθ ℓ 2 (c) ≤ Nc 2 1 + √ Nc 2 1 ∥s− ˜ s∥ ℓ 2 + √ Nc 2 1 + √ Nc 1 |y− ˜ y| (d) ≤ (N + √ N)c 2 1 +Nc 2 1 ∥x− ˜ x∥ ℓ 2 + √ Nc 1 |y− ˜ y| ≤ (N + √ N)c 2 1 +Nc 2 1 ∥x− ˜ x∥ ℓ 2 +∞I {y̸=˜ y} (e) ≤ M +L∥z− ˜ z∥, where(a) comes from (D.13), in(b) we used triangle inequality,(c) is a direct result of Cauchy inequality and the fact thatσ (u)≤ u,(d) comes from Lipschitz continuity ofσ , and in(e) we usedC =(N+ √ N)c 2 1 andL=Nc 2 1 . Therefore the necessary condition in Proposition 5.0.3 is satisfied. 351 D.1.6.1 ProofofLemmaD.1.2 By comparing the objective value (5.14) atθ λ and0 and using the optimality ofθ λ we get (1+λ )SR(θ λ ) ≤ (1+λ )SR(θ λ ) +2εE x h (f d (x)− θ T λ σ (Ux)) 2 +σ 2 × U T diag(σ ′ (Ux))θ λ 2 ℓ 2 i 1/2 ≤ (1+λ )SR(0). Therefore by invoking (5.31) we get E x h (f d (x)− θ T λ σ (Ux)) 2 i ≤ E x f d (x) 2 (D.14) Using the inequality(a− b) 2 ≥ a 2 2 − b 2 , we get E[(θ T λ σ (Ux)) 2 ]≤ 4E x [f d (x) 2 ]<c 2 , (D.15) with probability at least1− e − c 3 d for some constantsc 2 ,c 3 >0. We next lower bound the eigenvalues of E[σ (Ux)σ (Ux) T ] from which we can upper bound∥θ λ ∥ ℓ 2 . Define the dual activation of σ as ˜ σ (ρ )=E (v,w)∼ Nρ [σ (v)σ (w)] 352 where N ρ denotes the two dimensional Gaussian with mean zero and covariance 1 ρ ρ 1 . With this definition, we have E[(σ (Ux)σ (Ux) T ) ij ]= ˜ σ (u T i u j ) fori,j =1,...,N. 
Let{a r } ∞ r=0 denote the Hermite coefficients defined by a r := 1 √ 2π Z ∞ −∞ σ (g)h r (g)e − g 2 2 dg, whereh r (g) is the normalized Hermite polynomial defined by h r (x):= 1 √ r! (− 1) r e x 2 2 d r dx r e − x 2 2 . Using the properties of normalized Hermite polynomials we have ˜ σ (ρ )=E (v,w)∼ Nρ h ( ∞ X r=0 a r h r (v))( ∞ X ˜ r=0 a ˜ r h ˜ r (u)) i = ∞ X r=0 a 2 r ρ r . (D.16) Writing in matrix form we obtain E[(σ (Ux)σ (Ux) T )]= ˜ σ (UU T )= ∞ X r=0 a 2 r (UU T ) ⊙ r , (D.17) where for a matrixA ⊙ r =A⊙ (A ⊙ (r− 1) ) with⊙ denoting the Hadamard product (entrywise product). We next use the identity(AA T )⊙ (BB T )=(A∗ B)(A∗ B) T , with∗ indicating the Khatri-Rao product. By using this identity and applying induction on r it is straightforward to get the following relation for any matrixA: (AA T ) ⊙ r =(A ∗ r )(A ∗ r ) T , (D.18) 353 withA ∗ r =A∗ (A ∗ (r− 1) ). By using the above identity in Equation (D.17) we obtain E[(σ (Ux)σ (Ux) T )]= ∞ X r=0 a 2 r (UU T ) ⊙ r = ∞ X r=0 (a r U ∗ r )(a r U ∗ r ) T ⪰ a 2 r (U ∗ r )(U ∗ r ) T , for anyr≥ 0. Using this bound withr =2 and the fact thata 2 = 1 2 √ π for ReLU activation, we get E[(σ (Ux)σ (Ux) T )]⪰ 1 4π (U∗ U)≥ c 4 , (D.19) where the last step holds with probability at least1− e − c 5 d for some constantsc 4 andc 5 using the result of [164, Corollary 7.5]. Combining Equations (D.15) and (D.19) gives us∥θ λ ∥ ℓ 2 ≤ p c 2 /c 4 , which completes the proof. 354 AppendixE SupplementtoChapter6 E.1 Proofoftheoremsandtechnicallemmas E.1.1 ProofofCorollary6.0.3 In order to characterize the Bayes-optimal classifier of the Gaussian mixture setting (6.5), we first need to compute the conditional density functionP(y =+1|x). This will help us use Proposition 6.0.2 to identify the Bayes-optimal classifiers. For this purpose, in the first step, consider the general Gaussian mixture modele x∼ N(ye µ, Σ) , where the covariance matrixΣ is not necessarily full-rank. 
This means thate x can be a degenerate multivariate Gaussian with the following density function: f e x|y (e x)=(|2π Σ | + ) − 1/2 exp − (e x− ye µ ) T Σ † (e x− ye µ )/2 , (E.1) where|.| + stands for the pseudo-determinant operator. By recalling the Bayes’ theorem we get 355 P(y =+1|e x)= P(y =+1)f e x|y=+1 (e x) P(y =+1)f e x|y (e x)+P(y =− 1)f e x|y=− 1 (e x) = 1 1+((1− π )/π )· f e x|y=− 1 (e x)/f e x|y=+1 (e x) . By using (E.1) in the last equation, we will arrive at the following P(y =+1|e x)= 1+exp(− 2e x T Σ † e µ +q) − 1 , (E.2) whereq =log 1− π π . On the other hand, it is easy to observe that sign((1+exp(− t)) − 1 − 1/2)=sign(t). (E.3) Finally, we can deploy (E.2) in Proposition 6.0.2 in conjunction with the identity (E.3) to derive the Bayes- optimal classifier e h ∗ =sign(e x T Σ † e µ − q/2). We can now focus on the primary setup (6.5). LetΣ:= WW T , e x = φ − 1 (x), ande µ := Wµ . It is easy to check that with these new notations we havee x ∼ N(ye µ, Σ) . Recall the Bayes-optimal classifier of this setting h ∗ (x)=sign(e x T Σ † e µ − q/2). By replacinge x,Σ , ande µ by their respective definitions φ − 1 (x),WW T , andWµ we realize that the Bayes-optimal classifier is given by h ∗ (x)=sign φ − 1 (x) T (WW T ) † Wµ − q/2 . In this part we want to characterize the Bayes-optimal classifier of the generalized linear model (6.6). To this end, note that if the proposed classifier in Corollary 6.0.3 is optimal for the identity map φ(x)=x, then we can simply considere x=φ − 1 (x), and establish its optimality for every functionφ. This means that 356 we only need to focus on the case with the identity functionφ(x)=x. For the purpose of identifying the Bayes-optimal classifiers, we can use Proposition 6.0.2. In the first step, we need to compute the conditional probabilityP(y =1|x). 
By using the Bayes rule we get P(y =+1|X =x)= R P X|Z (x|z)P Y|X,Z (+1|x,z)dP Z (z) R P X|Z (x|z)dP Z (z) = R I(Wz =x)P Y|Z (+1|z)dP Z (z) R I(Wz =x)dP Z (z) , where in the last equation, we used the fact that condition onz, the feature vectorx and the labely are independent. In addition, as W is a full-rank matrix (linearly independent columns), then for a fixed x, equationWz =x has the unique solutionz ∗ =(W T W) − 1 W T x. This gives us P(y =+1|X =x)=P Y|Z (+1|z ∗ ) =f(β T (W T W) − 1 W T x). By recalling Proposition 6.0.2, we realize that the Bayes-optimal classifier of this setting with φ(t) = t is given by h ∗ (x)=sign f(β T (W T W) − 1 W T x)− 1/2 . This completes the proof. 357 E.1.2 ProofofTheorem6.1.1 We first show that if the result holds for ℓ 2 adversaries, then the theorem is also true forℓ p norm bounded adversaries with powerε p . Indeed this result can be seen as an immediate consequence of {x ′ : x ′ − x ℓp ≤ ε p }⊆ B ε (x), (E.4) whereε=ε p d 1 2 − 1 p . More precisely, the boundary risk ofℓ p adversaries with powerε p will be smaller than ℓ 2 adversaries with powerε p d 1 2 − 1 p which completes the proof. In this step, we only need to prove (E.4). For this end, from the Holder’s inequality for everyu,v∈R d andr≥ 1 we have: d X i=1 |u i ||v i |≤ d X i=1 |u i | r ! 1 r d X i=1 |v i | r r− 1 ! 1− 1 r . By usingu i =|x ′ i − x i | 2 ,v i =1, andr =p/2 (note thatp≥ 2) in the above inequality we get∥x ′ − x∥ ℓ 2 ≤ ∥x ′ − x∥ ℓp d 1/2− 1/p , which yields (E.4). We next focus on showing the result for ℓ 2 norm bounded adversaries with power ε. For this end, introduceρ (.) = φ − 1 (.), Σ = WW T ,e x = ρ (x), ande µ = Wµ . From Corollary 6.0.3, we know that the Bayes-optimal classifier is given by h ∗ (x)=sign ρ (x) T (WW T ) † Wµ − q/2 . (E.5) 358 We next focus on computing the boundary risk of the Bayes-optimal classifier h ∗ . By recalling the bound- ary risk definition we get: BR(h ∗ )=P x,y h ∗ (x)y≥ 0, inf u∈Bε(x) h ∗ (x)h ∗ (u)≤ 0 . 
In the next step, we expand the above expression over the two possible values of $y\in\{+1,-1\}$ to get
\[
\mathrm{BR}(h^*) = P_{x,y}\left(h^*(x)\ge 0,\ \inf_{u\in B_\varepsilon(x)} h^*(u)\le 0,\ y=1\right) + P_{x,y}\left(h^*(x)\le 0,\ \sup_{u\in B_\varepsilon(x)} h^*(u)\ge 0,\ y=-1\right).
\]
We then plug (E.5) into the last equation to get
\[
\mathrm{BR}(h^*) \le P_{x,y}\left(\inf_{u\in B_\varepsilon(x)} \tilde{\mu}^T\Sigma^\dagger\rho(u) \le \frac{q}{2} \le \tilde{x}^T\Sigma^\dagger\tilde{\mu},\ y=1\right) + P_{x,y}\left(\tilde{x}^T\Sigma^\dagger\tilde{\mu} \le \frac{q}{2} \le \sup_{u\in B_\varepsilon(x)} \tilde{\mu}^T\Sigma^\dagger\rho(u),\ y=-1\right).
\]
By our assumption, $d\phi/dt \ge c$ for some constant $c>0$. By simple algebraic manipulation it is easy to check that $\rho(B_\varepsilon(x))$ is a subset of $B_{\varepsilon/c}(\rho(x))$. This gives us
\[
\mathrm{BR}(h^*) \le P_{x,y}\left(\inf_{v\in B_{\varepsilon/c}(\tilde{x})} \tilde{\mu}^T\Sigma^\dagger v \le \frac{q}{2} \le \tilde{x}^T\Sigma^\dagger\tilde{\mu},\ y=+1\right) + P_{x,y}\left(\tilde{x}^T\Sigma^\dagger\tilde{\mu} \le \frac{q}{2} \le \sup_{v\in B_{\varepsilon/c}(\tilde{x})} \tilde{\mu}^T\Sigma^\dagger v,\ y=-1\right).
\]
The inner minimization in the above expression can be solved in closed form, by which we obtain
\[
\mathrm{BR}(h^*) \le P_{x,y}\left(\frac{q}{2} \le \tilde{x}^T\Sigma^\dagger\tilde{\mu} \le \frac{q}{2} + \frac{\varepsilon}{c}\left\|\Sigma^\dagger\tilde{\mu}\right\|_{\ell_2},\ y=1\right) + P_{x,y}\left(\frac{q}{2} - \frac{\varepsilon}{c}\left\|\Sigma^\dagger\tilde{\mu}\right\|_{\ell_2} \le \tilde{x}^T\Sigma^\dagger\tilde{\mu} \le \frac{q}{2},\ y=-1\right).
\]
From the Gaussian mixture model (6.5) we know that $\tilde{x}\sim N(y\tilde{\mu},\Sigma)$. For $\tilde{x}_+\sim N(\tilde{\mu},\Sigma)$ and $\tilde{x}_-\sim N(-\tilde{\mu},\Sigma)$, this implies that $\tilde{x}\,|\,y=+1 \sim \tilde{x}_+$ and $\tilde{x}\,|\,y=-1 \sim \tilde{x}_-$. By conditioning the above expression on $y$ we get
\[
\mathrm{BR}(h^*) \le \pi\, P_{\tilde{x}_+}\left(0 \le \tilde{x}_+^T\Sigma^\dagger\tilde{\mu} - \frac{q}{2} \le \frac{\varepsilon}{c}\|\Sigma^\dagger\tilde{\mu}\|\right) + (1-\pi)\, P_{\tilde{x}_-}\left(-\frac{\varepsilon}{c}\|\Sigma^\dagger\tilde{\mu}\| \le \tilde{x}_-^T\Sigma^\dagger\tilde{\mu} - \frac{q}{2} \le 0\right).
\]
Since $\tilde{x}_+^T\Sigma^\dagger\tilde{\mu}\sim N(a,a)$ and $\tilde{x}_-^T\Sigma^\dagger\tilde{\mu}\sim N(-a,a)$ with $a = \tilde{\mu}^T\Sigma^\dagger\tilde{\mu}$, we have
\[
\mathrm{BR}(h^*) \le \pi\, P_{u\sim N(a,a)}\left(0\le u-\frac{q}{2}\le \frac{\varepsilon}{c}\|\Sigma^\dagger\tilde{\mu}\|\right) + (1-\pi)\, P_{u\sim N(-a,a)}\left(-\frac{\varepsilon}{c}\|\Sigma^\dagger\tilde{\mu}\| \le u-\frac{q}{2}\le 0\right).
\]
We bound the above probabilities using the fact that the pdf of a normal $N(\nu,\sigma)$ is bounded by $1/\sqrt{2\pi\sigma}$, for any variance $\sigma>0$ and mean $\nu$. This implies that
\[
\mathrm{BR}(h^*) \le \frac{\varepsilon\,\|\Sigma^\dagger\tilde{\mu}\|}{c\sqrt{2\pi}\,\|\Sigma^{\dagger 1/2}\tilde{\mu}\|}\,. \tag{E.6}
\]
In the next step, using $\|\Sigma^\dagger\tilde{\mu}\| \le \|\Sigma^{\dagger 1/2}\|\,\|\Sigma^{\dagger 1/2}\tilde{\mu}\|$ together with $\|\Sigma^{\dagger 1/2}\| = 1/\sigma_{\min}(W)$, we arrive at
\[
c\sqrt{2\pi}\, \mathrm{BR}(h^*) \le \frac{\varepsilon}{\sigma_{\min}(W)}\,.
\]
This, along with the assumption $\varepsilon/\sigma_{\min}(W) = o_d(1)$ as $d$ grows to infinity, completes the proof.

E.1.3 Proof of Proposition 6.1.3

From the definition of boundary risk in (6.4) we have
\[
\mathrm{BR}(h) = P\left(h(x)y\ge 0,\ \inf_{x'\in B_\varepsilon(x)} h(x')h(x)\le 0\right).
\]
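The density bound used above, namely that a Gaussian with variance $\sigma$ assigns any interval of length $\delta$ probability at most $\delta/\sqrt{2\pi\sigma}$, can be illustrated numerically. The sketch below (parameters are arbitrary; SciPy is assumed, with `sigma` denoting the standard deviation rather than the variance) checks the bound over random intervals.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
# An interval of length delta has probability <= delta * sup pdf
# = delta / (sqrt(2*pi) * sigma), where sigma is the standard deviation.
for _ in range(200):
    nu, sigma = rng.normal(), rng.uniform(0.1, 3.0)
    lo = rng.normal(scale=3)
    delta = rng.uniform(0, 2)
    prob = norm.cdf(lo + delta, nu, sigma) - norm.cdf(lo, nu, sigma)
    assert prob <= delta / (np.sqrt(2 * np.pi) * sigma) + 1e-12
```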
Note that the above probability involves the randomness of $\mu\sim N(0, I_k/k)$ and of the entries of $W$, which are $N(0,1/k)$. In the first step, for the classifier $h(x) = \mathrm{sign}(x^Te_1)$, expand the boundary risk over the possible values of $y\in\{+1,-1\}$ to get
\[
\mathrm{BR}(h) = P\left(x^Te_1\ge 0,\ \inf_{\|x'-x\|_{\ell_p}\le\varepsilon_p} \mathrm{sign}(x'^Te_1)\le 0,\ y=+1\right) + P\left(x^Te_1\le 0,\ \sup_{\|x'-x\|_{\ell_p}\le\varepsilon_p} \mathrm{sign}(x'^Te_1)\ge 0,\ y=-1\right).
\]
In the inner optimization problem, since only the first coordinate of $x'$ plays a role, it is easy to obtain the optimal adversarial perturbation. This gives us
\[
\mathrm{BR}(h) = P\left(0\le x^Te_1\le\varepsilon_p,\ y=+1\right) + P\left(-\varepsilon_p\le x^Te_1\le 0,\ y=-1\right).
\]
Note that in the setting (6.5), $\phi(t)=t$, hence we have $x|y\sim N(yW\mu,\, WW^T)$. By conditioning on $W,\mu$ we get
\[
\mathrm{BR}(h) = \pi\cdot E_{W,\mu}\left[P_{x\sim N(W\mu,\, WW^T)}\left(0\le x^Te_1\le\varepsilon_p\right)\right] + (1-\pi)\cdot E_{W,\mu}\left[P_{x\sim N(-W\mu,\, WW^T)}\left(-\varepsilon_p\le x^Te_1\le 0\right)\right].
\]
We next denote the first column of $W^T$ by $\omega\in\mathbb{R}^k$. It is easy to observe that, conditioned on $\mu$ and $\omega$, the linear term $x^Te_1$ has a Gaussian distribution with mean $\mu^T\omega$ and unit variance (recall that $\omega$ lies on the unit $k$-dimensional sphere). This brings us
\[
\mathrm{BR}(h) = \pi\cdot E_{\mu,\omega}\left[P_{u\sim N(\mu^T\omega,1)}\left(0\le u\le\varepsilon_p\right)\right] + (1-\pi)\cdot E_{\mu,\omega}\left[P_{u\sim N(-\mu^T\omega,1)}\left(-\varepsilon_p\le u\le 0\right)\right].
\]
We then use the standard normal c.d.f. $\Phi$ to rewrite the above probabilities. This gives us the following:
\[
\mathrm{BR}(h) = \pi\cdot E_{\mu,\omega}\left[\Phi(\varepsilon_p-\mu^T\omega) - \Phi(-\mu^T\omega)\right] + (1-\pi)\cdot E_{\mu,\omega}\left[\Phi(\mu^T\omega) - \Phi(-\varepsilon_p+\mu^T\omega)\right] = E_{\mu,\omega}\left[\Phi\left(\varepsilon_p-\mu^T\omega\right) - \Phi\left(-\mu^T\omega\right)\right],
\]
where the last equation comes from $\Phi(-t) = 1-\Phi(t)$, for every real value $t$. We next use $\mu\sim N(0,I_k/k)$ and $\|\omega\|_{\ell_2}^2 = 1$. This implies that, conditioned on $\omega$, the random value $\mu^T\omega$ has the $N(0,1/k)$ distribution. In the next step, by using the law of iterated expectations we get
\[
\mathrm{BR}(h) = E_{\mu,\omega}\left[\Phi\left(\varepsilon_p-\mu^T\omega\right) - \Phi\left(-\mu^T\omega\right)\right] = E_\omega\left[E\left[\Phi\left(\varepsilon_p-\mu^T\omega\right) - \Phi\left(-\mu^T\omega\right)\,\middle|\,\omega\right]\right] = E_{z\sim N(0,1/k)}\left[\Phi(\varepsilon_p+z) - \Phi(z)\right] \ge E_{z\sim N(0,1)}\left[\Phi(\varepsilon_p+z) - \Phi(z)\right] = c_{\varepsilon_p},
\]
where the last inequality follows from Lemma E.1.1.
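The final inequality, $E_{z\sim N(0,1/k)}[\Phi(\varepsilon_p+z)-\Phi(z)] \ge c_{\varepsilon_p}$, can be probed by Monte Carlo. This is an illustrative simulation only (the values of $\varepsilon_p$, $k$, and the sample size are arbitrary choices; SciPy is assumed), with a small slack allowed for sampling error.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
eps_p, n = 0.5, 200_000
# c_eps: the sigma = 1 case, which Lemma E.1.1 says is the smallest
z1 = rng.normal(size=n)
c_eps = np.mean(norm.cdf(eps_p + z1) - norm.cdf(z1))
for k in (2, 10, 50):
    # BR(h) with z ~ N(0, 1/k); smaller variance gives a larger value
    z_k = rng.normal(scale=1 / np.sqrt(k), size=n)
    br = np.mean(norm.cdf(eps_p + z_k) - norm.cdf(z_k))
    assert br >= c_eps - 5e-3   # Monte Carlo slack
```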
It is worth mentioning that $c_{\varepsilon_p}$ is a deterministic value, independent of the dimensions $k,d$. We next present Lemma E.1.1 along with its proof.

Lemma E.1.1. For every nonnegative $\varepsilon$, the following function is nonincreasing in $\sigma>0$:
\[
g(\sigma) := E_{z\sim N(0,1)}\left[\Phi(\varepsilon+\sigma z) - \Phi(\sigma z)\right].
\]

Proof of Lemma E.1.1. Let $\varphi(\cdot)$ denote the standard normal pdf. We will show that $g(\sigma)$ has a nonpositive derivative:
\[
\frac{\partial}{\partial\sigma} g(\sigma) = E_{z\sim N(0,1)}\left[z\varphi(\varepsilon+\sigma z) - z\varphi(\sigma z)\right] = E_{z\sim N(0,1)}\left[z\varphi(\varepsilon+\sigma z)\right],
\]
where the last equality follows from the fact that $z\varphi(\sigma z)$ is an odd function. In the next step, by rewriting the above relation in terms of positive values of $z$, we arrive at
\[
\frac{\partial}{\partial\sigma} g(\sigma) = \frac{1}{\sqrt{2\pi}}\int_0^{+\infty} z\, e^{-\frac{\varepsilon^2+\sigma^2z^2}{2}}\left(e^{-\varepsilon\sigma z} - e^{\varepsilon\sigma z}\right)\varphi(z)\,dz.
\]
By noting that for $z\ge 0$ we have $e^{-\varepsilon\sigma z}\le e^{\varepsilon\sigma z}$, we realize that the derivative of $g$ is nonpositive. This completes the proof. ■

E.1.4 Proof of Proposition 6.1.2

We characterize $\mathrm{AR}(h)$; $\mathrm{SR}(h)$ then follows from the same formulation with $\varepsilon_p = 0$. It is easy to observe that $h(x) = \mathrm{sign}(a^Tx)$, and we have
\[
\mathrm{AR}(h) = P\left(\inf_{\|x-x'\|_p\le\varepsilon_p} y\, a^Tx' \le 0\right) = \frac12\, P\left(\inf_{\|x-x'\|_p\le\varepsilon_p} a^Tx'\le 0\,\middle|\, y=+1\right) + \frac12\, P\left(\sup_{\|x-x'\|_p\le\varepsilon_p} a^Tx'\ge 0\,\middle|\, y=-1\right).
\]
In the next step, using the well-known relation for dual norms, $\sup_{\|v\|_p\le 1} u^Tv = \|u\|_q$, we arrive at
\[
\mathrm{AR}(h) = \frac12\, P\left(a^Tx\le\varepsilon_p\|a\|_q\,\middle|\, y=+1\right) + \frac12\, P\left(a^Tx\ge-\varepsilon_p\|a\|_q\,\middle|\, y=-1\right). \tag{E.7}
\]
Conditioned on $y=+1$, we have $x\sim N(W\mu,\, WW^T)$, therefore $x^Ta\,|\,y=+1\sim N\left(a^TW\mu,\, \|W^Ta\|_2^2\right)$. By a similar argument, we get $x^Ta\,|\,y=-1\sim N\left(-a^TW\mu,\, \|W^Ta\|_2^2\right)$. Using this in (E.7) yields
\[
\mathrm{AR}(h) = \frac12\,\Phi\left(\frac{\varepsilon_p\|a\|_q - a^TW\mu}{\|W^Ta\|_2}\right) + \frac12 - \frac12\,\Phi\left(\frac{-\varepsilon_p\|a\|_q + a^TW\mu}{\|W^Ta\|_2}\right).
\]
Finally, using $\Phi(t)+\Phi(-t)=1$ completes the proof.

E.1.5 Proof of Proposition 6.1.5

Since the standard risk can be computed from the adversarial risk by setting $\varepsilon_p=0$, we only need to prove the result for $\mathrm{AR}(h)$.
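The dual-norm identity $\sup_{\|v\|_p\le 1} u^Tv = \|u\|_q$ invoked above can be checked numerically: random feasible $v$ never exceed $\|u\|_q$, and the standard maximizer $v_i \propto \mathrm{sign}(u_i)|u_i|^{q-1}$ attains it. This is a sketch with arbitrary dimensions and exponents, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 20
for p in (2.0, 3.0, 1.5):
    q = p / (p - 1.0)                      # Hölder conjugate: 1/p + 1/q = 1
    u = rng.normal(size=d)
    uq = np.linalg.norm(u, q)
    # Random points on the lp unit sphere never exceed the dual norm ...
    for _ in range(200):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v, p)
        assert u @ v <= uq + 1e-10
    # ... and the explicit maximizer attains it.
    v_star = np.sign(u) * np.abs(u) ** (q - 1)
    v_star /= np.linalg.norm(v_star, p)
    assert abs(u @ v_star - uq) < 1e-8
```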
From the definition of adversarial risk, and the fact that for increasing $f$ we have $\mathrm{sign}(f(a)-f(b)) = \mathrm{sign}(a-b)$, we get the following with $c = f^{-1}(1/2)$:
\[
\mathrm{AR}(h) = P\left(\inf_{\|x-x'\|_p\le\varepsilon_p} y\,\mathrm{sign}\left(x'^T\theta - c\right) \le 0\right).
\]
By considering the two possible cases for the label $y\in\{+1,-1\}$ we arrive at
\[
\mathrm{AR}(h) = P\left(\left\{\inf_{\|x-x'\|_p\le\varepsilon_p}\left(x'^T\theta-c\right)\le 0\right\}\wedge\{y=1\}\right) + P\left(\left\{\sup_{\|x-x'\|_p\le\varepsilon_p}\left(x'^T\theta-c\right)\ge 0\right\}\wedge\{y=-1\}\right).
\]
In the next step, using the dual norm property $\max_{\|u\|_p\le 1} u^T\theta = \|\theta\|_q$, we get
\[
\mathrm{AR}(h) = P\left(\left\{x^T\theta\le c+\varepsilon_p\|\theta\|_q\right\}\wedge\{y=1\}\right) + P\left(\left\{x^T\theta\ge c-\varepsilon_p\|\theta\|_q\right\}\wedge\{y=-1\}\right).
\]
By conditioning on $z$ and using $P(y=+1|z) = f(z^T\beta)$ and $x=Wz$, we arrive at
\[
\mathrm{AR}(h) = E_z\left[\mathbb{I}\left(z^TW^T\theta\le c+\varepsilon_p\|\theta\|_q\right) f(\beta^Tz) + \mathbb{I}\left(z^TW^T\theta\ge c-\varepsilon_p\|\theta\|_q\right)\left(1-f(\beta^Tz)\right)\right]. \tag{E.8}
\]
We introduce $u_1 = \beta^Tz$ and $u_2 = \theta^TWz$. Since $u_1,u_2$ are linear functions of $z\sim N(0,I_k)$, they are jointly Gaussian. It is easy to compute the covariance matrix and get $[u_1,u_2]\sim N(0,\Sigma_u)$ with
\[
\Sigma_u = \begin{bmatrix} \|\beta\|_2^2 & \beta^TW^T\theta \\ \beta^TW^T\theta & \|W^T\theta\|_2^2 \end{bmatrix}.
\]
Finally, using $u_1,u_2$ in (E.8) completes the proof. For the Bayes-optimal classifier with $\theta^* = W(W^TW)^{-1}\beta$, we have $u_1 = u_2 = \beta^Tz$. We can replace $u_1,u_2$ in the previous formulation with $u\sim N(0,\|\beta\|_2^2)$ to obtain the corresponding relations for the adversarial and standard risks of the Bayes-optimal classifier.

E.1.6 Proof of Theorem 6.1.4

It is easy to observe that an argument similar to the one in the proof of Theorem 6.1.1 can be adopted here to show that if the result holds for $p=2$, it must hold for $p\ge 2$. It is inspired by the fact that for $p\ge 2$ we have $\|x'-x\|_{\ell_2}\le\|x'-x\|_{\ell_p}\, d^{\frac12-\frac1p}$. More details can be seen in Section E.1.2. We now focus on proving the theorem for $\ell_2$-bounded adversaries of power $\varepsilon$. In the GLM setting, from Corollary 6.0.3, the Bayes-optimal classifier is given by
\[
h^*(x) = \mathrm{sign}\left(f\left(\beta^T\widetilde{W}^T\tilde{x}\right) - 1/2\right), \tag{E.9}
\]
where $\widetilde{W} = W(W^TW)^{-1}$ and $\tilde{x} = \phi^{-1}(x)$. We focus on computing the boundary risk of $h^*$.
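The covariance $\Sigma_u$ of $(u_1,u_2)=(\beta^Tz,\,\theta^TWz)$ can be confirmed by simulation. The sketch below is illustrative only (dimensions, sample size, and tolerances are arbitrary choices; NumPy is assumed) and compares the empirical covariance of the two linear statistics with the stated matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
k, d, n = 3, 6, 400_000
W = rng.normal(size=(d, k))
beta, theta = rng.normal(size=k), rng.normal(size=d)
Z = rng.normal(size=(n, k))                # z ~ N(0, I_k), one row per draw
u1, u2 = Z @ beta, Z @ (W.T @ theta)       # u1 = beta^T z, u2 = theta^T W z
emp = np.cov(np.stack([u1, u2]))           # empirical 2x2 covariance
target = np.array([
    [beta @ beta,          beta @ (W.T @ theta)],
    [beta @ (W.T @ theta), (W.T @ theta) @ (W.T @ theta)],
])
assert np.allclose(emp, target, rtol=0.05, atol=0.05)
```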
In the first step, from the boundary risk definition we have
\[
\mathrm{BR}(h^*) = P\left(h^*(x)y\ge 0,\ \inf_{u\in B_\varepsilon(x)} h^*(x)h^*(u)\le 0\right).
\]
By conditioning on the value of $y$ we get
\[
\mathrm{BR}(h^*) = P\left(y=+1,\ h^*(x)\ge 0,\ \inf_{u\in B_\varepsilon(x)} h^*(u)\le 0\right) + P\left(y=-1,\ h^*(x)\le 0,\ \sup_{u\in B_\varepsilon(x)} h^*(u)\ge 0\right).
\]
By removing the conditions $y=+1$ and $y=-1$, we can upper bound the above probabilities. This gives us
\[
\mathrm{BR}(h^*) \le P\left(h^*(x)\ge 0,\ \inf_{u\in B_\varepsilon(x)} h^*(u)\le 0\right) + P\left(h^*(x)\le 0,\ \sup_{u\in B_\varepsilon(x)} h^*(u)\ge 0\right).
\]
In the next step, using (E.9) yields
\[
\mathrm{BR}(h^*) \le P\left(\inf_{u\in B_\varepsilon(x)} f\left(\phi^{-1}(u)^T\widetilde{W}\beta\right)\le 1/2\le f\left(\tilde{x}^T\widetilde{W}\beta\right)\right) + P\left(f\left(\tilde{x}^T\widetilde{W}\beta\right)\le 1/2\le\sup_{u\in B_\varepsilon(x)} f\left(\phi^{-1}(u)^T\widetilde{W}\beta\right)\right). \tag{E.10}
\]
From the manifold models in Section 6.0.1, we know that there exists $c>0$ such that the derivative of $\phi$ satisfies $d\phi/dt\ge c$. This implies that $\phi^{-1}(B_\varepsilon(x))$ is a subset of $B_{\varepsilon/c}\left(\phi^{-1}(x)\right)$. Using this in (E.10) brings us
\[
\mathrm{BR}(h^*) \le P\left(\inf_{v\in B_{\varepsilon/c}(\tilde{x})} f\left(v^T\widetilde{W}\beta\right)\le 1/2\le f\left(\tilde{x}^T\widetilde{W}\beta\right)\right) + P\left(f\left(\tilde{x}^T\widetilde{W}\beta\right)\le 1/2\le\sup_{v\in B_{\varepsilon/c}(\tilde{x})} f\left(v^T\widetilde{W}\beta\right)\right). \tag{E.11}
\]
The function $f$ is increasing, therefore the inner optimization problems can be cast with linear objectives and bounded $\ell_2$-ball constraints. This means that it is feasible to characterize a closed-form solution for the optimization problems. To this end, we first introduce $c_0 = f^{-1}(1/2)$ and rewrite (E.11) as
\[
\mathrm{BR}(h^*) \le P\left(c_0\le\tilde{x}^T\widetilde{W}\beta\le c_0+\varepsilon\left\|\widetilde{W}\beta\right\|_{\ell_2}/c\right) + P\left(c_0-\varepsilon\left\|\widetilde{W}\beta\right\|_{\ell_2}/c\le\tilde{x}^T\widetilde{W}\beta\le c_0\right). \tag{E.12}
\]
Since $\tilde{x} = Wz$, we have $\tilde{x}\sim N(0,WW^T)$. This implies that $\tilde{x}^T\widetilde{W}\beta\sim N\left(0,\|\beta\|_{\ell_2}^2\right)$, and the probabilities in (E.12) can be computed via the standard normal cdf $\Phi$. Letting $\gamma = \widetilde{W}\beta$, we arrive at
\[
\mathrm{BR}(h^*) \le \Phi\left(\frac{c_0+\varepsilon\|\gamma\|_{\ell_2}/c}{\|\beta\|_{\ell_2}}\right) - \Phi\left(\frac{c_0-\varepsilon\|\gamma\|_{\ell_2}/c}{\|\beta\|_{\ell_2}}\right).
\]
By using the fact that $\Phi$ is $1/\sqrt{2\pi}$-Lipschitz continuous, we get
\[
\mathrm{BR}(h^*) \le \frac{2\varepsilon\left\|\widetilde{W}\beta\right\|_{\ell_2}}{c\sqrt{2\pi}\,\|\beta\|_{\ell_2}}\,. \tag{E.13}
\]
In the next step, from $\widetilde{W}^T\widetilde{W} = (W^TW)^{-1}$ we realize that $\left\|\widetilde{W}\beta\right\|_{\ell_2}\le\|\beta\|_{\ell_2}/\sigma_{\min}(W)$. Using this in (E.13) yields
\[
\mathrm{BR}(h^*) \le \frac{2\varepsilon}{c\sqrt{2\pi}\,\sigma_{\min}(W)}\,.
\]
Deploying $\varepsilon/\sigma_{\min}(W) = o_d(1)$ completes the proof.
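The $1/\sqrt{2\pi}$-Lipschitz continuity of $\Phi$, used to pass from the difference of cdf values to the final bound, can be checked on random pairs. This is a minimal sketch assuming SciPy; the sample size and scale are arbitrary.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
# |Phi(a) - Phi(b)| <= |a - b| / sqrt(2*pi), since sup of the standard
# normal pdf is 1/sqrt(2*pi), attained at 0.
a = rng.normal(size=1000, scale=2)
b = rng.normal(size=1000, scale=2)
gap = np.abs(norm.cdf(a) - norm.cdf(b))
assert np.all(gap <= np.abs(a - b) / np.sqrt(2 * np.pi) + 1e-12)
```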
E.1.7 Proof of Theorem 6.2.1

Similar to the proofs of the theorems in the previous sections, an analogous argument can be used here to show that if the result holds for $p=2$, it must hold for $p\ge 2$. In short, it comes from the fact that for $p\ge 2$ we have $\|x'-x\|_{\ell_2}\le\|x'-x\|_{\ell_p}\, d^{\frac12-\frac1p}$. More details can be seen in Section E.1.2. We now focus on proving the theorem for $\ell_2$-bounded adversaries of size $\varepsilon$. By recalling the definition of the boundary risk we get
\[
\mathrm{BR}(h_\theta) = P\left(h_\theta(x)y\ge 0,\ \inf_{x'\in B_\varepsilon(x)} h_\theta(x')h_\theta(x)\le 0\right).
\]
We expand the above probability over the possible values $y\in\{+1,-1\}$. This gives us
\[
\mathrm{BR}(h_\theta) = P\left(y=+1,\ h_\theta(x)\ge 0,\ \inf_{x'\in B_\varepsilon(x)} h_\theta(x')\le 0\right) + P\left(y=-1,\ h_\theta(x)\le 0,\ \sup_{x'\in B_\varepsilon(x)} h_\theta(x')\ge 0\right).
\]
Plugging $h_\theta(x) = \mathrm{sign}(x^T\theta)$ into the above expression yields
\[
\mathrm{BR}(h_\theta) = P\left(y=+1,\ x^T\theta\ge 0,\ \inf_{x'\in B_\varepsilon(x)}\theta^Tx'\le 0\right) + P\left(y=-1,\ x^T\theta\le 0,\ \sup_{x'\in B_\varepsilon(x)}\theta^Tx'\ge 0\right).
\]
By solving the inner optimizations we get
\[
\mathrm{BR}(h_\theta) = P\left(y=+1,\ 0\le x^T\theta\le\varepsilon\|\theta\|_{\ell_2}\right) + P\left(y=-1,\ -\varepsilon\|\theta\|_{\ell_2}\le x^T\theta\le 0\right).
\]
From the Gaussian mixture model (6.5) we know that $x|y\sim N(yW\mu,\, WW^T)$. Using this Gaussian distribution along with conditioning on the values of $y$, we get
\[
\mathrm{BR}(h_\theta) = \pi\cdot P_{x\sim N(W\mu,\, WW^T)}\left(0\le x^T\theta\le\varepsilon\|\theta\|_{\ell_2}\right) + (1-\pi)\cdot P_{x\sim N(-W\mu,\, WW^T)}\left(-\varepsilon\|\theta\|_{\ell_2}\le x^T\theta\le 0\right).
\]
Since $x$ has a multivariate Gaussian distribution, we can rewrite the above probabilities in terms of the standard normal cdf $\Phi$. In addition, by using $\Phi(-t)=1-\Phi(t)$, we get
\[
\mathrm{BR}(h_\theta) = \Phi\left(\frac{\varepsilon\|\theta\|_{\ell_2}-\theta^TW\mu}{\|W^T\theta\|_{\ell_2}}\right) - \Phi\left(\frac{-\theta^TW\mu}{\|W^T\theta\|_{\ell_2}}\right).
\]
In the next step, by using the fact that the normal cdf $\Phi$ is $1/\sqrt{2\pi}$-Lipschitz continuous, we arrive at
\[
\mathrm{BR}(h_\theta) \le \frac{\varepsilon\|\theta\|_{\ell_2}}{\sqrt{2\pi}\,\|W^T\theta\|_{\ell_2}}\,. \tag{E.14}
\]
Projection of the decision parameter $\theta$ onto the kernel of $W^T$ gives the decomposition $\theta = \theta_0 + W\alpha$, for $\alpha\in\mathbb{R}^k$ and $\theta_0\in\mathbb{R}^d$ with $\theta_0 = P_{\ker(W^T)}(\theta)$. Using $\|\theta\|_{\ell_2}^2 = \|\theta_0\|_{\ell_2}^2 + \|W\alpha\|_{\ell_2}^2$ in (E.14) yields
\[
\mathrm{BR}(h_\theta) \le \frac{\varepsilon\|W\alpha\|_{\ell_2}}{\sqrt{2\pi}\,\|W^T\theta\|_{\ell_2}}\left(1-\frac{\|\theta_0\|_{\ell_2}^2}{\|\theta\|_{\ell_2}^2}\right)^{-1/2}.
\]
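The operator-norm step used next, $\|W\alpha\|_{\ell_2}\,\sigma_{\min}(W)\le\|W^TW\alpha\|_{\ell_2}$, can be checked on random instances. This is a sketch with arbitrary dimensions, assuming NumPy; $\sigma_{\min}(W)$ is taken as the smallest singular value of a full-column-rank $W$.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 8, 3
W = rng.normal(size=(d, k))                       # full column rank a.s.
smin = np.linalg.svd(W, compute_uv=False).min()   # sigma_min(W)
# For any alpha, W*alpha lies in the column space of W, on which
# W^T contracts by at most a factor sigma_min(W).
for _ in range(100):
    alpha = rng.normal(size=k)
    lhs = np.linalg.norm(W @ alpha) * smin
    rhs = np.linalg.norm(W.T @ (W @ alpha))
    assert lhs <= rhs + 1e-10
```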
We then use $W^T\theta = W^TW\alpha$ and the operator norm property $\|W\alpha\|_{\ell_2}\,\sigma_{\min}(W) \le \left\|W^TW\alpha\right\|_{\ell_2}$. This brings us the following:
\[
\mathrm{BR}(h_\theta) \le \frac{\varepsilon}{\sqrt{2\pi}\,\sigma_{\min}(W)}\left(1-\frac{\|\theta_0\|_{\ell_2}^2}{\|\theta\|_{\ell_2}^2}\right)^{-1/2}.
\]
Finally, deploying the problem assumption (6.10) completes the proof.

In this part, we focus on the ERM problem. First, note that since the loss function $\ell$ is decreasing, it is easy to observe that the supremum of $\ell(y\,u^T\theta)$ over the adversarial ball $u\in B_\varepsilon(x_i)$ is $\ell\left(y\,x_i^T\theta - \varepsilon\|\theta\|_{\ell_2}\right)$. This implies that the adversarial ERM problem can be written as the following:
\[
\underset{\theta\in\mathbb{R}^d}{\mathrm{argmin}}\ R_n(\theta) := \frac1n\sum_{i=1}^n \ell\left(y_i x_i^T\theta - \varepsilon\|\theta\|_{\ell_2}\right). \tag{E.15}
\]
In order to show that the classifier $h_{\hat\theta_\varepsilon}$ has boundary risk converging to zero, we use the first part of the theorem. For this purpose, we will show that the obtained solution $\hat\theta_\varepsilon$ has no component in the kernel space of the matrix $W^T$. In this case, the general condition (6.10) reduces to $\varepsilon_p d^{\frac12-\frac1p}/\sigma_{\min}(W) = o_d(1)$. This means that we only need to prove that $P_{\ker(W^T)}(\hat\theta_\varepsilon)=0$. To this end, by projecting $\hat\theta_\varepsilon$ onto the kernel of $W^T$ we get $\hat\theta_\varepsilon = \theta_0 + \theta_1$ with $W^T\theta_0 = 0$, and there exists $\alpha\in\mathbb{R}^k$ such that $\theta_1 = W\alpha$. We want to show that $\theta_0 = 0$. Assume that $\|\theta_0\|_{\ell_2}>0$; we show this will contradict the fact that $\hat\theta_\varepsilon$ is a minimizer of $R_n(\theta)$. By plugging $\hat\theta_\varepsilon = \theta_0+\theta_1$ into (E.15) we get
\[
R_n(\hat\theta_\varepsilon) = \frac1n\sum_{i=1}^n \ell\left(y_i x_i^T\hat\theta_\varepsilon - \varepsilon\|\hat\theta_\varepsilon\|_2\right) = \frac1n\sum_{i=1}^n \ell\left(y_i x_i^T\theta_1 - \varepsilon\|\hat\theta_\varepsilon\|_2\right),
\]
where the last equality follows from the fact that $x_i = Wz_i$ with $z_i\sim N(y_i\mu, I_k)$, and $W^T\theta_0 = 0$. In addition, from the orthogonality of $\theta_0$ and $\theta_1$ we get
\[
R_n(\hat\theta_\varepsilon) = \frac1n\sum_{i=1}^n \ell\left(y_i x_i^T\theta_1 - \varepsilon\left(\|\theta_0\|_{\ell_2}^2 + \|\theta_1\|_{\ell_2}^2\right)^{1/2}\right).
\]
Finally, since the loss function $\ell$ is strictly decreasing, for $\|\theta_0\|_{\ell_2}>0$ we get
\[
R_n(\hat\theta_\varepsilon) > \frac1n\sum_{i=1}^n \ell\left(y_i x_i^T\theta_1 - \varepsilon\|\theta_1\|_{\ell_2}\right) = R_n(\theta_1).
\]
This contradicts the initial assumption that $\hat\theta_\varepsilon$ is a minimizer of $R_n(\theta)$. This means that $\theta_0 = 0$, and completes the proof.
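The reduction behind (E.15), that the worst-case loss over the $\ell_2$ ball equals $\ell\left(y\,x^T\theta - \varepsilon\|\theta\|_{\ell_2}\right)$ for a decreasing loss, can be illustrated numerically. The sketch below uses the logistic loss as one example of a strictly decreasing loss (the loss, dimensions, and $\varepsilon$ are arbitrary choices; NumPy is assumed): random perturbations in the ball never beat the closed form, while the perturbation $u^* = x - \varepsilon y\,\theta/\|\theta\|_{\ell_2}$ attains it.

```python
import numpy as np

rng = np.random.default_rng(8)
ell = lambda t: np.log1p(np.exp(-t))      # logistic loss, strictly decreasing
d, eps = 10, 0.3
theta, x = rng.normal(size=d), rng.normal(size=d)
y = 1.0
closed = ell(y * x @ theta - eps * np.linalg.norm(theta))
# Random perturbations in the epsilon-ball never exceed the closed form ...
for _ in range(500):
    delta = rng.normal(size=d)
    delta *= eps * rng.uniform() / np.linalg.norm(delta)
    assert ell(y * (x + delta) @ theta) <= closed + 1e-12
# ... and the worst-case perturbation attains it exactly.
u_star = x - eps * y * theta / np.linalg.norm(theta)
assert abs(ell(y * u_star @ theta) - closed) < 1e-12
```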
Abstract
The generation of large-scale fine-grained datasets is experiencing an unprecedented surge, propelling the extensive deployment of modern statistical learning models across a wide range of safety-critical applications. These include recommendation and forecasting systems, drug discovery, fraud detection, and personalized healthcare procedures. However, despite their widespread usage, decision-makers harbor reservations when it comes to confidently relying on such models. These concerns arise due to the replicability crisis observed in scientific discoveries, where the quest for reliable data-driven findings faces challenges in terms of reproducibility. As a result, findings from one study frequently struggle to be replicated in other independent investigations. In addition, there is a growing apprehension regarding the vulnerability of modern statistical learning models to minor perturbations in input data, as well as their limited ability to effectively generalize to unseen populations. These challenges underscore the utmost importance of prioritizing the reliability and robustness of modern statistical learning methods. By addressing these key concerns, we can fortify the confident deployment of these methods across diverse domains.