LEVERAGING SPARSITY IN THEORETICAL AND APPLIED MACHINE LEARNING AND CAUSAL INFERENCE

by

Gregory Faletto

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

August 2023

Copyright 2023 Gregory Faletto

Dedication

To Anna, my partner through all the twists and turns, and a remarkable person. I am grateful for every day with you.

Acknowledgements

I am filled with immense gratitude for my time here at USC. In my time here I have become convinced that research and teaching are two of the most rewarding ways a person can spend their time in this world. I have made it here with the help of some extraordinary people.

First, my heartfelt appreciation goes to my advisor, Jacob. Your wisdom, kindness, and hard work have helped me learn how to manage a long-term project. You have taught me how to do research that is interesting for its own sake while never losing track of real-world impact. You naturally spread your passion for research and teaching to everyone around you.

I have been lucky to be surrounded by an inspiring community of smart, driven, and passionate graduate students. I want to give a special mention to my fellow statistician and office mate, Mohammad, who has been physically by my side for the last five years and made the ups and downs a little easier. I also want to extend my gratitude to the administrators, including Dean Garrett, Dina, Julie, Rebeca, Andy, and Karla, for helping me navigate difficult processes and ensuring everything runs smoothly.

I am deeply grateful to my committee—Jacob, Matteo, and Stas—for their support, guidance, and belief in me. To professors Robertas, Feng, and Inga, thank you for your guidance in teaching. I would have been lost without your advice and support. I must also thank Raman, Greys, Jinchi, Wen, Vishal, and Gourab for their leadership and dedication to making our department a great place to be.

I would be remiss if I didn't thank some of the people who gave me opportunities, advice, guidance, and support in applying to graduate school in the first place. This includes Prof. Blake Thornton at Washington University in St. Louis; Prof. Sanjog Misra at the University of Chicago; Prof. Nicolas Christou at University of California, Los Angeles; Prof. Tony Shaheen at California State University, Los Angeles; Prof. Carolyn Heinrich at Vanderbilt University; Jacob Evans at Live Nation; and Alex Copulsky at ZipRecruiter.

Lastly and most of all, I want to thank my partner, Anna, for her incredible support and understanding. Anna, I don't know how I would have made it here without you, and you mean the world to me.

Thank you all for being a part of my journey, and for making this dissertation possible.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Cluster Stability Selection
    2.1 Introduction
        2.1.1 Literature Review
        2.1.2 Outline of the Paper
    2.2 Stability Selection
    2.3 Motivating Theory
        2.3.1 The Lasso Selects Randomly Among Highly Correlated Features
        2.3.2 Theorem 2.3.1 Explains Stability Selection's "Vote-Splitting" Problem
    2.4 Cluster Stability Selection
        2.4.1 Description of Method
        2.4.2 Choice of Weights
            2.4.2.1 Simple Averaged Cluster Stability Selection
            2.4.2.2 Weighted Averaged Cluster Stability Selection
    2.5 Theoretical Properties of Cluster Stability Selection
        2.5.1 Cluster Stability Selection Solves The "Vote-Splitting" Problem
        2.5.2 Error Control Guarantees
    2.6 Simulations and Data Application
        2.6.1 Simulation Study: Proxies with Varied Noise Levels
            2.6.1.1 Known Clusters
        2.6.2 Simulation Study: Equal Noise Proxies
        2.6.3 Data Application
    2.7 Conclusion
        2.7.1 Future Work: Sum Scoring and Single-Index Models
Chapter 3: Predicting Rare Events by Shrinking Towards Proportional Odds
    3.1 Introduction
        3.1.1 Our Contributions
        3.1.2 Related Work
    3.2 Motivating Theory
        3.2.1 Preliminaries
        3.2.2 Theorem 3.2.2
        3.2.3 Theorem 3.2.3
        3.2.4 Theorem 3.2.4
    3.3 Predicting Rare Events by Shrinking Towards Proportional Odds
        3.3.1 Theoretical Analysis
    3.4 Experiments
        3.4.1 Synthetic Data: Sparse Differences Setting
        3.4.2 Synthetic Data: Dense Differences Setting
        3.4.3 Real Data Experiment 1: Soup Tasting
        3.4.4 Real Data Experiment 2: Diabetes
    3.5 Conclusion
Chapter 4: Fused Extended Two-Way Fixed Effects for Difference-in-Differences
    4.1 Introduction
        4.1.1 Our Contributions
        4.1.2 Related Work
        4.1.3 Notation
    4.2 Basic Setup and Assumptions
        4.2.1 Difference-in-Differences Assumptions
    4.3 Causal Estimands and Estimators
    4.4 Fused Extended Two-Way Fixed Effects
        4.4.1 Causal Estimators
    4.5 Theory
        4.5.1 Consistency
        4.5.2 Restriction Selection Consistency
        4.5.3 Asymptotic Normality
    4.6 Synthetic Experiments
        4.6.1 Estimation Error
        4.6.2 Restriction Selection Consistency
        4.6.3 Asymptotic Distribution
Bibliography
Appendix A: Cluster Stability Selection Appendix
    A.1 Additional Theory
        A.1.1 Low-Noise Proxies Are Good Selections For Prediction
        A.1.2 Optimal Weights for Cluster Representatives
    A.2 Additional Simulation Study Details
        A.2.1 Stability Metric
        A.2.2 Sparse Cluster Stability Selection
        A.2.3 More Simulation Plots
            A.2.3.1 Evaluating Cluster Stability Selection Weight Estimates
            A.2.3.2 Additional Methods We Evaluate
            A.2.3.3 More Plots from Section 2.6.1 Simulation Study
            A.2.3.4 More Plots from Section 2.6.2 Simulation Study
        A.2.4 More Data Application Results
    A.3 Proof of Theorem 2.3.1
        A.3.1 Proof of Statement (i)
        A.3.2 Proof of Statement (ii)
    A.4 Proofs of Supporting Results For Theorem 2.3.1
        A.4.1 Proof of Proposition A.3.3
        A.4.2 Proofs of Lemmas Stated in Proof of Theorem 2.3.1
    A.5 Other Results
        A.5.1 Proofs of Corollary 2.3.2, Theorem 2.5.1, and Theorem 2.5.3
        A.5.2 Statement of Theorem A.5.2 and Outline of Proofs of Theorems 2.5.5 and A.5.2
        A.5.3 Proofs of Proposition A.1.2, Corollary A.1.3, and Proposition A.1.5
    A.6 Proofs of Technical Lemmas
Appendix B: Predicting Rare Events by Shrinking Towards Proportional Odds Appendix
    B.1 More Simulation Results
        B.1.1 Ridge PRESTO
    B.2 Statement of Lemma B.2.1 and Proofs of Theorems 3.2.2 and 3.2.3
        B.2.1 Statement of Lemma B.2.1
        B.2.2 Proofs of Theorems 3.2.2 and 3.2.3
    B.3 Investigation of Assumption (3.5) and Proof of Theorem 3.2.4
        B.3.1 Investigating Whether $\lambda_{\min}\big(\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}\big)$ is Bounded Away From Zero
        B.3.2 Investigating Whether Assumption (3.5) is Plausible
        B.3.3 Proof of Theorem 3.2.4
        B.3.4 Proof of Lemma B.3.1
    B.4 Proof of Lemma B.2.1
        B.4.1 Calculating the Log Likelihood and Gradients
        B.4.2 Calculating the Hessian Matrices
        B.4.3 Calculation of the Fisher Information Matrices
        B.4.4 Mixed Block
        B.4.5 Beta Block
        B.4.6 Verifying the Asymptotic Distribution of Each Estimator
    B.5 Proof of Theorem 3.3.1
        B.5.1 Proof of Theorem 3.3.1
        B.5.2 Supporting Results for Proof of Theorem 3.3.1
    B.6 Estimating PRESTO
Appendix C: Fused Extended Two-Way Fixed Effects for Difference-in-Differences Appendix
    C.1 Additional Results For Synthetic Experiments
    C.2 Theorem C.2.1
    C.3 Proofs of Main Results
    C.4 Proof of Theorem C.2.1
    C.5 Proofs of Lemmas for Main Theorems
    C.6 Proofs of Lemmas From Proof of Theorem C.2.1
    C.7 Proofs of Technical Lemmas

List of Tables

3.1 Calculated p-values for one-tailed paired t-tests for uniform differences simulation setting of Section 3.4.2 testing the alternative hypothesis that PRESTO's rare probability MSE is less than each competitor method in each rarity setting. (Statistically significant p-values indicate better performance for PRESTO.)
3.2 Estimated rare class MSE for each method at each age cutoff in prediabetes real data experiment from Section 3.4.4.
4.1 Means and standard errors for the squared error of the estimator of the ATT (4.16) for each method across all 280 simulations in the synthetic experiment from Section 4.6.
4.2 p-values from paired one-tailed t-tests of the alternative hypothesis that the FETWFE ATT estimator (4.28) has a lower squared prediction error for the ATT (4.16) than each competitor method calculated across all 280 simulations in the synthetic experiment from Section 4.6. (Low p-values indicate better predictive performance for FETWFE.)
A.1 Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.1.
A.2 Table of p-values from paired t-tests of the alternative hypothesis that weighted averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1.
A.3 Table of p-values from paired t-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1.
A.4 Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.1.
A.5 Table of p-values from paired t-tests of the alternative hypothesis that weighted averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1.
A.6 Table of p-values from paired t-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1.
A.7 Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.1.
A.8 Table of p-values from one-tailed Z-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1.
A.9 Table of p-values from one-tailed Z-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1.
A.10 Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.1.
A.11 Table of p-values from one-tailed Z-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1.
A.12 Table of p-values from one-tailed Z-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1.
A.13 Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.2.
A.14 Table of p-values from paired t-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2.
A.15 Table of p-values from paired t-tests of the alternative hypothesis that weighted averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2.
A.16 Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.2.
A.17 Table of p-values from paired t-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2.
A.18 Table of p-values from paired t-tests of the alternative hypothesis that weighted averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2.
A.19 Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.2. Note that both simple averaged cluster stability selection and weighted averaged cluster stability selection had perfect stability in models of size 1 and 2 (that is, both methods selected the exact same models of sizes 1 and 2 for every one of the 2000 simulations).
A.20 Table of p-values from one-tailed Z-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. The p-values comparing simple averaged cluster stability selection and weighted averaged cluster stability selection are undefined because both methods had perfect stability scores across all 2000 simulations; see Table A.19. As a result, the Z-statistic defined in Theorem 9 of Nogueira, Sechidis, and Brown [132] has the indeterminate value 0/0.
A.21 Table of p-values from one-tailed Z-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. The p-values comparing simple averaged cluster stability selection and weighted averaged cluster stability selection are undefined because both methods had perfect stability scores across all 2000 simulations; see Table A.19. As a result, the Z-statistic defined in Theorem 9 of Nogueira, Sechidis, and Brown [132] has the indeterminate value 0/0.
A.22 Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.2.
A.23 Table of p-values from one-tailed Z-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2.
A.24 Table of p-values from one-tailed Z-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2.
A.25 Means of the test set mean squared errors of each method and each batch of model sizes across all 1000 replications in the data application in Section 2.6.3.
A.26 Stability metric from Nogueira, Sechidis, and Brown [132] for each method and each batch of model sizes across all 1000 replications in the data application in Section 2.6.3.
B.1 Similar to Table 3.1; calculated p-values for one-tailed paired t-tests for sparse differences simulation setting of Section 3.4.1 (statistically significant p-values indicate better performance for PRESTO).
B.2 Means and standard errors of empirical MSEs for each method in each of three intercept settings in the sparse differences synthetic experiment setting of Section 3.4.1.
B.3 Means and standard errors of empirical MSEs for each method in each of four intercept settings in the uniform differences synthetic experiment setting of Section 3.4.2.
B.4 Means and standard errors of empirical MSEs for each method in each of three intercept settings in the sparse differences synthetic experiment setting of Section 3.4.1 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.
B.5 Calculated p-values for one-tailed paired t-tests for sparse differences simulation setting of Section 3.4.1 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1. (Statistically significant p-values indicate better performance for PRESTO.)
C.1 Means and standard errors for the squared errors of estimates of ATT(r) by each method for each cohort in the simulation study from Section 4.6.
C.2 p-values for paired one-tailed t-tests of the alternative hypothesis that the FETWFE estimate of ATT(r) has lower squared error than the competitor methods for each cohort in the simulation study from Section 4.6.

List of Figures

2.1 Flow chart depicting stability selection.
2.2 Left: the top 20 selection proportions for stability selection with the lasso as the base procedure in a single simulated example. Right: the average mean squared error against selected model size for the lasso and stability selection across 2000 simulations in the setting from Example 1.
2.3 Left and center panels: plots of average mean squared error and stability against model size (measured by number of fitted coefficients) across all 1000 simulations for the simulation study from Section 2.6.1. Right panel: average MSE plotted against stability. Each point corresponds to one model size.
2.4 Same as Figure 2.3, but the methods that use clusters are provided with the known clusters, as described in Section 2.6.1.1. Note that the results for the methods that don't use clusters (lasso, protolasso, and stability selection) are identical to those from Figure 2.3.
2.5 Same as Figure 2.3, but displaying results for the simulation study from Section 2.6.2 (with methods provided with estimated clusters).
2.6 Results from the data application in Section 2.6.3. The left two panels are plots of average MSE and stability against model size (measured by number of fitted coefficients) for the methods from the simulation study in Section 2.6.1 across all 1000 random draws. The right panel plots average MSE against model size. In all cases, points on the plot represent averages across 5 model sizes to smooth the plot (for example, the leftmost points in the left plot are aggregated over model sizes 1 through 5).
3.1 Top left: MSE of estimated rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences simulation setting of Section 3.4.1, for intercept setting yielding rare class proportions of about 0.71% on average and sparsity 1/2.
Remaining plots: ratios of MSE for PRESTO divided by MSE of each other method for each of three sets of intercepts with sparsity 1/2 (PRESTO performs better if ratio is less than 1). All plots on log scale.
3.2 Same plots as in Figure 3.1, but for uniform differences synthetic experiment in Section 3.4.2.
3.3 Left: Estimated MSEs of estimated rare class probabilities for each method across 350 random draws of training and test sets in real data experiment from Section 3.4.3. Right: ratios of estimated MSE for PRESTO divided by MSE of each other method (PRESTO performs better if ratio is less than 1).
3.4 Estimated MSEs of estimated rare class probabilities for each method and each age cutoff across 49 random draws of training and test sets in real data experiment from Section 3.4.4.
4.1 Boxplots of squared errors for each treatment effect estimate across all 280 simulations. Vertical axis is on a log scale.
4.2 Boxplot displaying proportions of treatment effect restrictions successfully identified by FETWFE across each of the 280 simulations from the synthetic experiment in Section 4.6.
A.1 Boxplots of realized mean squared errors of estimated weights on proxies for Z (relative to optimal weights under Proposition A.1.5) from simulation study from Section 2.6.1.
A.2 Boxplots of realized mean squared errors of estimated weights on proxies for Z (relative to optimal weights under Proposition A.1.5) from simulation study from Section 2.6.2.
A.3 Additional results from the simulation study in Section 2.6.1 for average test set MSE against model size (measured by number of fitted coefficients) across all 2000 simulations when methods are provided with estimated clusters. Note that this plot is identical to the leftmost plot of Figure 2.3 with additional methods included.
A.4 Additional results from the simulation study in Section 2.6.1 for stability against model size (measured by number of fitted coefficients) across all 2000 simulations when methods are provided with estimated clusters. Note that this plot is identical to the center plot of Figure 2.3 with additional methods included.
A.5 Additional results from the simulation study in Section 2.6.1 plotting average test set MSE against model size when methods are provided with estimated clusters. Note that this plot is identical to the rightmost plot of Figure 2.3 with additional methods included.
A.6 Additional results from the simulation study in Section 2.6.1.1 for average test set MSE against model size (measured by number of fitted coefficients) across all 2000 simulations when methods are provided with known clusters. Note that this plot is identical to the leftmost plot of Figure 2.4 with additional methods included.
A.7 Additional results from the simulation study in Section 2.6.1.1 for stability against model size (measured by number of fitted coefficients) across all 2000 simulations when methods are provided with known clusters. Note that this plot is identical to the center plot of Figure 2.4 with additional methods included.
A.8 Additional results from the simulation study in Section 2.6.1.1 plotting average test set MSE against model size when methods are provided with known clusters. Note that this plot is identical to the rightmost plot of Figure 2.4 with additional methods included.
A.9 Additional results from the simulation study in Section 2.6.2 for average test set MSE against model size (measured by number of fitted coefficients) across all 2000 simulations with estimated clusters. Note that this plot is identical to the left plot of Figure 2.5 with additional methods included.
A.10 Additional results from the simulation study in Section 2.6.2 for stability against model size (measured by number of fitted coefficients) across all 2000 simulations with estimated clusters. Note that this plot is identical to the center plot of Figure 2.5 with additional methods included.
A.11 Additional results from the simulation study in Section 2.6.2 plotting average test set MSE against model size with estimated clusters. Note that this plot is identical to the right plot of Figure 2.5 with additional methods included.
A.12 Additional results from the simulation study in Section 2.6.2 for average test set MSE against model size (measured by number of fitted coefficients) across all 2000 simulations.
A.13 Additional results from the simulation study in Section 2.6.2 for stability against model size (measured by number of fitted coefficients) across all 2000 simulations.
A.14 Additional results from the simulation study in Section 2.6.2 plotting average test set MSE against model size.
A.15 Additional results from the data application in Section 2.6.3 for average test set MSE against model size (measured by number of fitted coefficients) across all 1000 random draws. Note that this plot is identical to the leftmost plot of Figure 2.6 with additional methods included.
A.16 Additional results from the data application in Section 2.6.3 for stability against model size (measured by number of fitted coefficients) across all 1000 random draws. Note that this plot is identical to the center plot of Figure 2.6 with additional methods included.
A.17 Additional results from the data application in Section 2.6.3 plotting average test set MSE against model size. Note that this plot is identical to the rightmost plot of Figure 2.6 with additional methods included.
B.1 MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/3. (These plots are for the two intercept settings that weren't shown in the main text for the sparsity setting of 1/3.)
B.2 Same as Figure 3.1, but for the simulations with sparsity 1/2.
B.3 Same as Figure B.1, but for the simulations with sparsity 1/2.
B.4 MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in uniform differences synthetic experiment setting of Section 3.4.2. (These plots are for two of the intercept settings that weren't shown in the main text.)
B.5 MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in uniform differences synthetic experiment setting of Section 3.4.2 for intercept setting of (0, 2.5, 4.5).
B.6 MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/3 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.
B.7 MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/2 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.
B.8 Boxplot of the estimated minimum eigenvalues of $\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}$ in Simulation Study A as described in Section B.3.1.
B.9 Plot of the average estimated minimum eigenvalues of $\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}$ in Simulation Study B as described in Section B.3.1 with each column of X having correlation 0 with all of the others.
B.10 Plot of the average estimated minimum eigenvalues of $\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}$ in Simulation Study B as described in Section B.3.1 with each column of X having correlation 0.25 with all of the others.
B.11 Plot of the average estimated minimum eigenvalues of $\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}$ in Simulation Study B as described in Section B.3.1 with each column of X having correlation 0.5 with all of the others.
B.12 Plot of the average estimated minimum eigenvalues of $\mathcal{I}_2 - \mathcal{I}_1 \mathcal{I}_1^{-1} \mathcal{I}_1^{\top}$ in Simulation Study B as described in Section B.3.1 with each column of X having correlation 0.75 with all of the others.
B.13 Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satisfied in the setting where each column of X has correlation 0 with all of the others.
B.14 Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satisfied in the setting where each column of X has correlation 0.25 with all of the others.
B.15 Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satisfied in the setting where each column of X has correlation 0.5 with all of the others.
B.16 Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satisfied in the setting where each column of X has correlation 0.75 with all of the others.
C.1 Boxplots of squared errors for each method's estimate of ATT(2) across all 280 simulations. Vertical axis is on a log scale.
C.2 Boxplots of squared errors for each method's estimate of ATT(3) across all 280 simulations. Vertical axis is on a log scale.
C.3 Boxplots of squared errors for each method's estimate of ATT(4) across all 280 simulations. Vertical axis is on a log scale.
C.4 Boxplots of squared errors for each method's estimate of ATT(5) across all 280 simulations. Vertical axis is on a log scale.
C.5 Boxplots of squared errors for each method's estimate of ATT(6) across all 280 simulations. Vertical axis is on a log scale.

Abstract

This dissertation presents three novel contributions to the fields of machine learning and causal inference. The unifying theme is leveraging sparsity to improve the interpretability and predictive accuracy of machine learning models, with applications to feature selection, classification with class imbalance, and causal inference. Cluster stability selection generalizes stability selection to overcome the "vote-splitting" problem that arises in data with highly correlated features. We prove for the first time the long-noted tendency of the lasso to select a single feature from a highly correlated cluster and provide theoretical guarantees for cluster stability selection. We also demonstrate cluster stability selection's performance in simulation studies and a GWAS data application. The second contribution, PRESTO, is a relaxation of the proportional odds model that flexibly leverages more common intermediate outcomes to estimate probabilities of rare events better than logistic regression or the proportional odds model. We justify PRESTO theoretically, prove that PRESTO consistently estimates decision boundary weights under a sparsity assumption, and show in synthetic and real data experiments that PRESTO estimates rare event probabilities better than logistic regression and the proportional odds model. Finally, the third study addresses the bias/variance tradeoff in treatment effect estimation using the difference-in-differences method. We propose a bridge fusion-penalized method to obtain data-driven estimates of treatment effects, allowing for model selection, estimation, and valid inference in a single step without data splitting. We prove the asymptotic normality of our estimated treatment effects under sparsity and demonstrate our method through simulation studies.

Chapter 1: Introduction

The works that comprise this dissertation cover a wide range of application areas, but the unifying theme is sparse estimation, especially of generalized linear models using a lasso penalty [169].
The chapters of this dissertation present many domains where practitioners may build predictive models containing a large number of parameters but there is good reason to believe that the true number of nonzero parameters is much smaller, including genomics, health and medicine, economics, education, psychology, public policy, and marketing. This may be because a very large number of predictors are measured but many of them are probably irrelevant, because a flexible model is desired but in practice the number of flexible parameters that is needed is likely to be small, or both.

Chapter 2 discusses cluster stability selection, a feature selection method. Stability selection [127] makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples, but it performs poorly in settings with clusters of highly correlated features. Cluster stability selection is designed to select clusters of features rather than individual features, overcoming this problem while retaining the theoretical guarantees of stability selection. This chapter is an updated version of the paper "Cluster Stability Selection," available on arXiv (https://arxiv.org/abs/2201.00494).

Chapter 3 addresses the problem of classification in settings with class imbalance. In settings where a rare outcome of interest is preceded by more common ordinal outcomes, a decision boundary corresponding to earlier outcomes may be much easier to estimate. If the proportional odds assumption holds, this decision boundary is identical to the decision boundary corresponding to the rare outcome in all but the intercept term. However, it is likely unrealistic to think this assumption will hold exactly. The proposed method PRESTO allows for sparse departures from the proportional odds assumption. This relaxes the rigid proportional odds assumption while still leveraging easier-to-estimate decision boundaries to improve estimation of rare class probabilities. This chapter is based on the paper "Predicting Rare Events by Shrinking Towards Proportional Odds," published in the Proceedings of the 40th International Conference on Machine Learning (https://proceedings.mlr.press/v202/faletto23a.html) and also available on arXiv (https://arxiv.org/abs/2305.18700).

Finally, in Chapter 4 I consider causal inference using the difference-in-differences method. The canonical two-period difference-in-differences model can be estimated via linear regression. Generalizing this regression estimation method to a larger number of time periods without introducing bias is possible only by adding many parameters to the model [194]. These parameters mostly follow an ordered structure over time, and there is good reason to believe that some of the parameters that are adjacent in time might actually be equal. But prior methods have only been able to leverage this structure if a practitioner is willing to make explicit assumptions about which of these parameters are equal—which restrictions to choose. The method proposed in Chapter 4, fused extended two-way fixed effects, instead uses fusion penalties to allow for automatic data-driven selections of the restrictions.

You might notice that most of this text is taken up by appendices, and mathematical proofs take up much of the appendices. On a personal note, I'd like to highlight some of my favorite proofs in this dissertation. I worked on proving Theorem 2.3.1 (and its supporting results) for about a year, with lots of help from my advisor Jacob. I learned quite a bit about math in the process of proving this result. One of my favorite proofs in this chapter is the proof of Proposition A.1.5(1) (in Appendix A.5.3), which uses a simple Lagrange multipliers argument relying only on undergraduate calculus. In the next chapter, probably my favorite result to prove was Lemma B.5.8, where I finally got to make use of a bit of random matrix theory after a long time looking for opportunities to do so. Finally, I was very happy to get the high-level idea behind the proof of Theorem C.2.1, which I outline at the beginning of Section C.4, while reading Section 3 of Tibshirani and Taylor [172]. (I should mention that this work is not the first to use the idea of inverting the generalized lasso penalty matrix, as discussed in Section 3 of [172], to derive theoretical results for a fused lasso-type penalty. The earliest work I'm familiar with that uses this idea is [146]; see Section 2.) The same idea underlies the proof of Theorem 3.3.1 in a more disguised way.

On that note, some other contributions I am particularly grateful for include Meinshausen and Bühlmann [127], Shah and Samworth [159], and Efron et al. [55], whose insights enabled Chapter 2. In Chapter 3, Theorem 3.3.1 simply would not have been proven without the groundwork laid by Ekvall and Bottai [56], PRESTO would not have been implemented without the coding work of Wurm, Rathouz, and Hanlon [195], and the idea for PRESTO would not have existed without Tibshirani et al. [170]. Similarly, Chapter 4 owes a lot to the difference-in-differences model of Wooldridge [194], the bridge regression theory of Kock [106] and Huang, Horowitz, and Ma [90], and the very helpful exposition of lasso theory presented by Hastie, Tibshirani, and Wainwright [79], in addition to Tibshirani et al. [170]. Lastly, of course none of this dissertation would have been possible without Tibshirani [169].

Chapter 2: Cluster Stability Selection

Abstract. Stability selection [127] makes any feature selection method more stable by returning only those features that are consistently selected across many subsamples. We prove (in what is, to our knowledge, the first result of its kind) that in data containing highly correlated proxies for an important latent variable, the lasso typically selects one proxy. As a result, stability selection with the lasso often fails to select any proxy due to what has been called the "vote-splitting" problem, resulting in selected models with worse predictive performance than the lasso alone.

We introduce cluster stability selection, a generalization of stability selection which exploits the practitioner's knowledge that highly correlated clusters of features exist in the data. We prove that cluster stability selection overcomes the "vote-splitting" problem, and we present generalizations of theoretical error-control guarantees from Meinshausen and Bühlmann [127] and Shah and Samworth [159] to show that cluster stability selection retains (generalizations of) the same guarantees. We propose two ways to take weighted averages of the features in each important cluster to form cluster representatives, and we show this yields better predictive models than previous proposals. We implement cluster stability selection in the publicly available cssr R package and demonstrate its efficacy in data from simulations and a genome-wide association study.

This chapter is an updated version of the paper "Cluster Stability Selection," available on arXiv (https://arxiv.org/abs/2201.00494).
2.1 Introduction

Stability, as characterized by Yu [200], holds when "statistical conclusions are robust or stable to appropriate perturbations to data." Yu and Kumbier [201] call stability one of "three core principles" necessary for "principled inquiry to extract reliable and reproducible information from data."

Stability selection adds stability to any base feature selection method. Meinshausen and Bühlmann focus on the lasso [169] as the base procedure, and we will do the same. Even with a fixed $\lambda$, the sets of features selected by the lasso can be unstable, particularly in the high-dimensional setting ($p \gg n$).

Figure 2.1: Flow chart depicting stability selection.

Figure 2.1 illustrates stability selection. The data are repeatedly randomly split into subsamples of size $\lfloor n/2 \rfloor$. On each subsample, the lasso (or any other feature selection method) is used to select a set of features. The proportion of subsamples in which each feature is selected is interpreted as an importance measure for that feature. Finally, the selected set returned by stability selection is the set of features whose selection proportion exceeds a predetermined threshold. This adds stability to the lasso, and it turns out that stability selection comes with guaranteed control of false discoveries under very mild assumptions [127].

However, stability selection has a structural problem in the case where observed features are highly correlated. This is the primary motivation of our work and has been noted before. In the words of Shah and Samworth [158], the problem is that "highly correlated variables... split the vote." ([104] and [101] make very similar points.) We will illustrate this point in a simple setting in Example 1, but first we briefly explain the errors-in-variables framework that we will use throughout the paper as a model for why observed features may be highly correlated.

Suppose that a variable Z is in the true model for y, but is not observed. Instead, $q \geq 2$ equally good proxies—Z plus a little noise—are observed. Any one of these proxies would be useful for out-of-sample predictive performance because of their high correlation with Z. The lasso will tend to choose one proxy on each fit, choosing uniformly at random among the proxies, so each proxy's selection proportion tends toward 1/q rather than 1. Stability selection's ranking of the importance of the features is then suboptimal for out-of-sample predictive performance. The following simulation study illustrates this problem. (We will explain this simulation in full detail in Section 2.6; for now, we omit details for the sake of exposition.)

Example 1. We simulate n = 200 observations of 100 features: q = 5 proxies that each have correlation 0.9 with a latent variable Z, 15 independent "weak signal features" that are in the true model for y (but with smaller coefficients than Z), and 80 independent "noise features." The left panel of Figure 2.2 shows the top 20 selection proportions after applying stability selection with the lasso on one simulated data set. Although any one proxy for Z would be a better selection for prediction than any other observed feature, the proxies for Z have lower selection proportions than many of the weak signal features. The right panel shows that this behavior substantially hurts the out-of-sample predictive performance of stability selection for most model sizes, particularly smaller model sizes since the features are ranked poorly. (The mean squared error measures the out-of-sample prediction error of a least squares model estimated on a separate data set using the selected set of features.)

Figure 2.2: Left: the top 20 selection proportions for stability selection with the lasso as the base procedure in a single simulated example. Right: the average mean squared error against selected model size for the lasso and stability selection across 2000 simulations in the setting from Example 1.

This is not just a problem of theoretical concern. Highly correlated features yielded by repeated noisy measurements of a latent signal arise in many fields of study.

1. In economics, practitioners observe repeated measurements—multiple noisy observations of the same latent signal, like answers to survey questions [152] or the private information known to bidders in auctions [115, 107].

2. In education, noisy measurements of latent academic ability (like test scores) may be used as predictive features [46, 47].

3. In psychology, investigators are often interested in disorders like depression which cannot be observed directly; instead, they are evaluated using, for example, the answers to questions in a standardized interview [10]. Different questions designed to gain information about the presence or severity of the same disorder are often interpreted as noisy proxies for the latent condition [145].

4. In the social sciences and humanities, predictors may include the personal interpretations of game participants [185], Amazon Mechanical Turk workers [122], or the general public [118]. Each person's assessments can be interpreted as a noisy measurement of a common underlying signal [80].

5. In biology, gene expression levels are used to predict health outcomes [165]. Some genes are known to share common biological pathways, and expression levels of genes in the same pathway tend to be highly correlated [78, 77, 155, 154, 50].

In Section 2.6.3, we will demonstrate how our proposed method can be applied to yet another setting with clustered features: genome-wide association studies (GWAS). It is common that multiple proxies for the same signal are observed in one way or another, and we should expect this to become more common in "big data" settings where as many features are observed as possible, such as multi-view data [166].

To address this problem, we propose cluster stability selection. In brief, given a set of known or estimated clusters $\{C_k\}$, we find the proportion of subsamples in which at least one feature from cluster $C_k$ was selected. Cluster $C_k$ is interpreted to be important (that is, the underlying signal common to all the features in $C_k$ is considered important) if this selection proportion is high. If $C_k$ is important, we construct a cluster representative by taking a weighted average of the features in $C_k$ and using this cluster representative for downstream regression tasks (rather than regressing on the individual features separately). We propose two ways to calculate the weights for these averages and demonstrate the advantages each weighting scheme enjoys.

We will show both theoretically and in simulation studies and a data application that cluster stability selection dominates both the lasso and stability selection in settings where there are clusters of highly correlated features in the data.
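Before comparing the methods further, the following minimal sketch makes the contrast between per-feature and cluster-level selection proportions concrete on data loosely in the spirit of Example 1. It is only an illustration under our own assumptions: the coefficient values, the penalty level lam, and the use of scikit-learn's Lasso are placeholder choices rather than the design of Section 2.6, and this is not the cssr package implementation.

```python
# Sketch: per-feature vs. cluster-level selection proportions (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, q, n_weak, n_noise = 200, 5, 15, 80

# Latent Z is not observed; we see q noisy proxies for it, plus independent
# weak-signal and noise features. Coefficients below are illustrative.
Z = rng.normal(size=n)
proxies = 0.9 * Z[:, None] + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=(n, q))
weak = rng.normal(size=(n, n_weak))
noise = rng.normal(size=(n, n_noise))
X = np.hstack([proxies, weak, noise])
y = 1.5 * Z + 0.5 * weak.sum(axis=1) + rng.normal(size=n)

cluster = list(range(q))        # indices of the proxies for Z
B, lam = 100, 0.1               # number of subsamples and lasso penalty (placeholders)
p = X.shape[1]
feature_props = np.zeros(p)     # per-feature selection proportions
cluster_prop = 0.0              # proportion of subsamples selecting at least one proxy

for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    coef = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
    selected = np.flatnonzero(coef != 0)
    feature_props[selected] += 1 / B
    cluster_prop += np.any(np.isin(cluster, selected)) / B

print("proxy selection proportions:", np.round(feature_props[:q], 2))
print("cluster selection proportion:", round(cluster_prop, 2))
```

In runs of this sketch, the individual proxy proportions tend to sit well below the cluster-level proportion, which is the gap between feature-level and cluster-level importance that cluster stability selection is designed to exploit.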
Cluster stability selection retains the benefits of stability selection—like higher stability of selected sets and fewer false selections—while avoiding "splitting the vote" and thereby hurting prediction accuracy. Further, when no clusters are present our proposed method reduces to a more powerful version of stability selection than previous proposals (even as it retains the same theoretical guarantees).

2.1.1 Literature Review

Investigations into the stability of learning algorithms more broadly go at least as far back as works like Devroye and Wagner [51] and Kearns and Ron [100], [30], and Lange et al. [110]. Saeys, Abeel, and Van de Peer [151]* proposed methods of aggregating feature selection methods across subsamples that are closely related to, but distinct from, stability selection. This has led to a stream of research in the machine learning community on ensemble feature selection methods that has had surprisingly little mutual interaction with the statistics literature related to Meinshausen and Bühlmann [127]; see Bolón-Canedo and Alonso-Betanzos [26] for a review. This stream has tended to be more methodologically flexible and diverse than the statistics literature, and has tended to emphasize empirical rather than theoretical justification.

A few authors propose similar methods to what we discuss. Kirk, Lewin, and Stumpf [104] and Alexander and Lange [3] discuss selection proportions for groups of features in stability selection, as we propose, but they do not consider the importance of individual features in the cluster, nor do they form cluster representatives, opting instead to select all features from correlated clusters. They also use the selection probability estimator proposed by Meinshausen and Bühlmann [127], whereas we propose a novel estimator of the selection probabilities. Finally, they do not examine the theoretical properties or predictive performance of the resulting selected models.

Gauraha [70] also considers stability selection when features are highly correlated. Gauraha proposes clustering the features, forming cluster representatives, and using the lasso regressed on cluster representatives as the base procedure for stability selection.

Beinrucker, Dogan, and Blanchard [11] develop extended stability selection. This method divides the data into subsamples of size $n/L$ for some $L \in \mathbb{N}$ rather than only $\lfloor n/2 \rfloor$, and additionally chooses random subsets of features to consider for each subsample. Beinrucker, Dogan, and Blanchard show through simulations that their method is helpful when predictors are highly correlated.

* The earliest version of Meinshausen and Bühlmann [127] was posted on arXiv around the same time that Saeys, Abeel, and Van de Peer [151] was published, so these appear to be roughly concurrent works.

In the presence of highly correlated features, one of the most frequently suggested ideas to practitioners is to simply drop all but one of the highly correlated features from the data; see for example Greene [73, Section 4.7.1] and James et al. [96, Section 3.3.3, p. 102]. This idea underlies the protolasso [144]. In this method, the feature within a cluster with the greatest marginal correlation with the response is chosen as a prototype for the cluster. The remaining features in each cluster are discarded, and the lasso is estimated on the prototypes. Related proposals form cluster representatives via averaging [34, 137], and other more sophisticated proposals exist as well; see Li and Vuong [116] and Schennach [152, Section 4.1].
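As a concrete illustration of these two ways of collapsing a cluster before running the lasso, the sketch below picks a protolasso-style prototype by marginal correlation and, alternatively, forms a simple averaged representative. The function names and toy data are our own illustration of the idea only, not the reference implementation of the protolasso [144] or of the averaging proposals.

```python
# Illustrative sketch: prototype vs. averaged cluster representative.
import numpy as np

def prototype_index(X, y, cluster):
    """Cluster member with the largest absolute marginal correlation with y
    (the protolasso-style prototype); the remaining members are discarded."""
    cors = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in cluster]
    return cluster[int(np.argmax(cors))]

def cluster_average(X, cluster):
    """Unweighted average of the cluster members (an averaged representative)."""
    return X[:, cluster].mean(axis=1)

# Toy data: three noisy copies of a latent signal plus unrelated features.
rng = np.random.default_rng(1)
Z = rng.normal(size=100)
proxies = Z[:, None] + 0.3 * rng.normal(size=(100, 3))
others = rng.normal(size=(100, 5))
X = np.hstack([proxies, others])
y = Z + rng.normal(size=100)
cluster = [0, 1, 2]

keep = [j for j in range(X.shape[1]) if j not in cluster]
X_proto = np.column_stack([X[:, prototype_index(X, y, cluster)]] + [X[:, j] for j in keep])
X_avg = np.column_stack([cluster_average(X, cluster)] + [X[:, j] for j in keep])
# Either X_proto or X_avg would then replace X when the lasso is fit.
```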
Many earlier works have noted, but not proven, that the lasso tends to select only one feature from a highly correlated cluster [55, 208, 204, 27, 95, 161, 24, 191, 5, 117]. Our primary theoretical contribution (Theorem 2.3.1) provides what is to our knowledge the first precise statement with proof of this observation.

Another stream of research engages with regularized estimation of models with measurement error, particularly in the high-dimensional setting [147, 119, 148, 164, 15, 13, 14, 205, 131]. In contrast to this literature, we are interested in stable model selection for good out-of-sample predictive performance, we focus on stabilizing existing estimation procedures rather than proposing a new one, and our method is designed to be useful when more than one noisy observation of the same lurking signal is available. Lastly, a more general setting than the one we consider is that of latent factor models, where observed features may be influenced by multiple latent variables rather than just one [25, 94, 20, 21].

2.1.2 Outline of the Paper

Below, we outline our main contributions as well as the structure of our paper.

1. In Section 2.3, we prove in a simple setting that the lasso tends to select one feature from a highly correlated cluster (Theorem 2.3.1). This leads to stability selection's "vote-splitting" problem, where no one feature from even an important cluster is likely to have a high selection proportion. We show that stability selection ranks features poorly due to the vote-splitting problem, both theoretically (Corollary 2.3.2) and later through simulation studies in Section 2.6.

2. Motivated by the vote-splitting problem, in Section 2.4 we propose cluster stability selection. Part of our proposal is a novel, stability-based approach to combining cluster members for downstream regression tasks (Section 2.4.2.2).

3. We explore the theoretical properties of cluster stability selection in Section 2.5. We show in the setting of Theorem 2.3.1 that in contrast to what we will see about stability selection in Corollary 2.3.2, cluster stability selection tends to rank clusters optimally for predictive performance whether the clusters are known (Theorem 2.5.1) or estimated (Theorem 2.5.3). We also generalize all of stability selection's theoretical guarantees from Shah and Samworth [159] (our Theorems 2.5.5 and A.5.2) and the main theoretical guarantee of Meinshausen and Bühlmann [127], showing that cluster stability selection retains analogous error control properties.

4. Finally, we demonstrate cluster stability selection's superior predictive performance and stability through simulation studies and a real data example in Section 2.6.

First, in Section 2.2 we review stability selection in more detail.

2.2 Stability Selection

We will discuss both stability selection as proposed by Meinshausen and Bühlmann [127] and the modification proposed by Shah and Samworth [159]. It will be easier to start with the Shah and Samworth formulation. We require a selection procedure

$$\hat{S}_n^{\lambda} := \hat{S}_n^{\lambda}(\mathbf{X}, \mathbf{y}), \qquad (2.1)$$

typically depending on a tuning parameter $\lambda$. The only required property of $\hat{S}_n^{\lambda}$ is that it maps a data set to a subset of $[p] := \{1, \ldots, p\}$ in a way that can be either random or deterministic conditional on the data. (Sometimes it will be convenient to suppress the $n$ from the notation and write $\hat{S}^{\lambda}$, or even suppress the $\lambda$ if the meaning is clear.) We focus on the lasso with one pre-selected $\lambda > 0$ as the base procedure.
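As a concrete example, such a lasso base procedure with a single pre-selected penalty can be written with the glmnet package as follows. This is a minimal sketch under our own naming (lasso_select is not a cssr or glmnet function).

library(glmnet)

# The base procedure: the set of features with nonzero lasso coefficients at a
# single, pre-selected penalty value lambda.
lasso_select <- function(X, y, lambda) {
  fit <- glmnet::glmnet(X, y, lambda = lambda)
  beta <- as.numeric(coef(fit, s = lambda))[-1]  # drop the intercept
  which(beta != 0)                               # indices of the selected features
}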
Similarly to Shah and Samworth, we define the selection probability of feature $j$ under $\hat{S}_n^{\lambda}$ as

$$p_{j,n,\lambda} := \mathbb{P}\left(j \in \hat{S}_n^{\lambda}\right) \qquad (2.2)$$

(where the randomness is with respect to both the data and any randomness in the selection procedure). Shah and Samworth frame these as the primary parameters of interest in stability selection.

As shown in Figure 2.1, in the first step of stability selection, $B$ subsamples $A_1, \ldots, A_B \subset [n]$ of size $\lfloor n/2 \rfloor$ are drawn, as well as subsamples $\bar{A}_b \subset [n]$ of the same size with $A_b \cap \bar{A}_b = \emptyset$.† For convenience, we abuse notation and write $\hat{S}^{\lambda}(A_b) := \hat{S}^{\lambda}\left(\mathbf{X}_{A_b}, \mathbf{y}_{A_b}\right)$, where $\mathbf{X}_{A_b}$ denotes the $\lfloor n/2 \rfloor \times p$ matrix obtained by selecting the rows of $\mathbf{X}$ indexed by $A_b$, and similarly for $\mathbf{y}_{A_b}$. Then for every $j \in [p]$, $p_{j,n,\lambda}$ is estimated by

$$\hat{\Pi}_B^{(SS)}(j) = \frac{1}{2B} \sum_{b=1}^{B} \left[ \mathbf{1}\left\{ j \in \hat{S}^{\lambda}(A_b) \right\} + \mathbf{1}\left\{ j \in \hat{S}^{\lambda}(\bar{A}_b) \right\} \right]. \qquad (2.3)$$

(Notice this estimator is unbiased for $p_{j,\lfloor n/2 \rfloor,\lambda}$ but not necessarily for $p_{j,n,\lambda}$.) Having computed each $\hat{\Pi}_B(j)$, the practitioner has available an importance measure for each feature, and, by extension, a ranking of the features (perhaps with ties). Then the selected set consists of all features such that $\hat{\Pi}_B^{(SS)}(j) \geq \tau$, with $\tau \in (0, 1)$ a pre-selected threshold.

† That is, if $n$ is even then $\bar{A}_b = [n] \setminus A_b$; if $n$ is odd, one random index is dropped from each $[n] \setminus A_b$ so that $|A_b| = |\bar{A}_b| = (n-1)/2 = \lfloor n/2 \rfloor$.

The original proposal by Meinshausen and Bühlmann [127] is more general in the sense that it allows for a finite set of tuning parameters $\Lambda$. The procedure is otherwise the same except that their theoretical results require using all $B = \binom{n}{\lfloor n/2 \rfloor}$ unique subsamples of size $\lfloor n/2 \rfloor$, focusing on the quantity

$$\hat{\Pi}^{(MB)}(j) := \max_{\lambda \in \Lambda} \left\{ \binom{n}{\lfloor n/2 \rfloor}^{-1} \sum_{b=1}^{\binom{n}{\lfloor n/2 \rfloor}} \mathbf{1}\left\{ j \in \hat{S}^{\lambda}(A_b) \right\} \right\}$$

for each feature $j$ (though in practice this is infeasible and they suggest that 100 subsamples works well). Shah and Samworth [159] call their modification that makes use of the sets $\bar{A}_b$ (and whose theoretical results do not require using all $\binom{n}{\lfloor n/2 \rfloor}$ subsets) complementary pairs subsampling and prove stronger theoretical guarantees than Meinshausen and Bühlmann's when $|\Lambda| = 1$.

We emphasize that the success of stability selection hinges on the quality of the rankings it assigns the features. We will frequently draw attention to these rankings.

2.3 Motivating Theory

We present theoretical results that will highlight stability selection's "vote-splitting" problem, which motivates cluster stability selection. In particular, we will show the following:

1. (Theorem 2.3.1.) When a highly correlated cluster of features is observed, the lasso tends to choose only one of them. (In particular, if two equally low-noise proxies for an important latent feature are observed, the lasso tends to select either one of them with equal probability.)

2. (Corollary 2.3.2.) In stability selection using the lasso as the base procedure, the selection proportions for all $q$ cluster members will therefore tend to $1/q$ in this setting (our theory focuses on the $q = 2$ case). Because of this "vote splitting," stability selection with the lasso tends to rank such features lower than directly observed features, even if these directly observed features are less important for predicting $\mathbf{y}$ than any one proxy would be.

As one might expect, low-noise proxies for an important latent signal are better selections for out-of-sample predictive performance than directly observed "weak signal" features, so the ranking yielded by stability selection is detrimental to prediction. (We prove this formally in Corollary A.1.3 in the online supplement.)
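Before turning to the theory, we note that the complementary pairs estimator (2.3) is straightforward to compute. The following is a minimal R sketch (not the cssr package implementation), reusing the hypothetical lasso_select() helper sketched above.

# Sketch of the complementary pairs subsampling estimator (2.3).
selection_proportions <- function(X, y, lambda, B = 50) {
  n <- nrow(X); p <- ncol(X); m <- floor(n / 2)
  counts <- numeric(p)
  for (b in seq_len(B)) {
    idx  <- sample(n)                   # shuffle the row indices
    A    <- idx[1:m]                    # subsample A_b of size floor(n/2)
    Abar <- idx[(m + 1):(2 * m)]        # complementary subsample, disjoint from A_b
    s1 <- lasso_select(X[A, , drop = FALSE],    y[A],    lambda)
    s2 <- lasso_select(X[Abar, , drop = FALSE], y[Abar], lambda)
    counts[s1] <- counts[s1] + 1
    counts[s2] <- counts[s2] + 1
  }
  counts / (2 * B)                      # estimated selection proportion of each feature
}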
2.3.1 TheLassoSelectsRandomlyAmongHighlyCorrelatedFeatures In this section we provide theoretical support for the idea that the lasso tends to select only one feature within a highly correlated cluster: (2.4) To the best of our knowledge, no existing theoretical result veries (2.4). It is not obvious how to map this informal statement into something that can be proven. For instance, we know that if the penalty parameter can be arbitrary, the lasso can typically yield selected sets of any size from 0 to minfn;pg. So ifp n, there exist some lasso selected sets in which none of the features in a highly correlated cluster are selected, and some in which all of the features are selected. The common belief in statements like (2.4) stems back at least as early as Efron et al. [55]. Efron et al. show that the lasso is closely related to their algorithm least angle regression (LARS), which adds features one at a time according to which feature is most highly correlated with the residual between the observed response and the current model. If two featuresX 1 andX 2 are highly correlated, as one of them is added to the active set (sayX 1 ),X 2 ’s correlation with the residual will tend to drop more than competitor features. ThereforeX 2 appears to have a disadvantage relative to competitor features. 14 Following this reasoning, we construct something closer to a provable mathematical statement from (2.4) by considering the path of selected features as the penalty decreases from innity: After one feature from a highly correlated cluster enters the lasso path, it is (asymptotically) very unlikely that another feature from the cluster will enter until much later in the lasso path (that is, after other reasonable features to select are exhausted). (2.5) Next we construct a setting in which (2.5) could hold, and in Theorem 2.3.1 we prove that it does. Suppose a responsey is observed that is generated from the linear model y = Z Z +X 3 +; (2.6) whereZ;X 3 , and are independent Gaussian random variables. Assume Z > 1, soZ is a more im- portant signal for predictingy than the “weak signal" featureX 3 . (We will fully specify the setup in a moment.) The practitioner observesy,X 3 , two noisy proxiesX 1 andX 2 forZ, and noise features X 4 ;:::;X p that are independent ofy. In particular,X 1 andX 2 are equal toZ plus a small amount of independent and identically distributed (i.i.d.) noise, 1 and 2 : X j :=Z + j ; j2 [2]: (2.7) (The amount of noise is “small" in the sense that Var( 11 ) is small.) ClearlyX 1 andX 2 form a highly correlated cluster. We also assume in this setting that Z is large enough (and Var( 11 ) is small enough) that the featuresX 1 andX 2 are more highly correlated with y thanX 3 is. ‡ In this setting, either ofX 1 orX 2 is a good selection if the goal of our model is a ‡ It turns out it will also be important thatZ is nottoo large—if it is, then the selected set of size 2 that is best for out-of-sample predictive performance could beX1 andX2, rather than one of these proxies andX3. Then (2.5) will not hold. 15 sparse selected set with good out-of-sample predictive performance (particularly sinceZ is not observed), but selecting both is redundant. That is, the best selected sets of size 2 arefX 1 ;X 3 g orfX 2 ;X 3 g. Theorem 2.3.1 says that with high probability the lasso path matches this behavior: after selecting one of X 1 orX 2 , the next feature to enter the lasso path isX 3 . Now we specify the setting precisely. Supposen i.i.d. 
draws (Z i ; i1 ; i2 ;X i3 ;:::;X ip ; i ); i2 [n] (2.8) are observed, with the variables having a multivariate Gaussian distribution. Each variable has mean 0 and is independent from the others, and we have Var(Z 1 ) = Var(X 13 ) = ::: = Var(X 1p ) = 1, Var( 11 ) = Var( 12 ) = 2 (n), and Var( 1 ) = 2 . DenoteZ := (Z 1 ;:::;Z n ) > ; j = ( 1j ;:::; nj ) > , j2 [2];X j := (X 1j ;:::;X nj ) > ,j2f3;:::;pg; and := ( 1 ;:::; n ) > . Proving Theorem 2.3.1 requires 2 (n) to vanish at a particular rate asn!1, so thatX 1 andX 2 approachZ (and the correlation of X 1 andX 2 approaches 1). In particular, let 2 (n) := 10 p n logn : (2.9) We will consider the lasso with scaled features [169]: ^ ()2 arg min 2R p 8 < : 1 2n y p X j=1 X j kX j k 2 j 2 2 + p X j=1 j j j 9 = ; : (2.10) Theorem 2.3.1. LetX j ,j2 [p] andy be as dened in (2.6) – (2.9). For constantst 0 2 (0; 1] (dened in Lemma A.4.11) andc 2 2 0; e1 8e 2 (dened in Equation A.4.23), assumen 100 is large enough to satisfy n logn > 5 + 2 c 2 max 1 4t 2 0 (2 + 2 ) 2 ; 2 12 + 2 ; 5 1 + 2 (2.11) 16 and n (logn) 3=2 > 3:61 5 + 2 c 2 : (2.12) Assume also that 3pc 8 (logn) 3=2 for somec 8 > 0. Then 1. the interval I(n) = 0 @ 1 + 10 2 (n); 1 + 19 10 s 2 + 2 c 2 (logn) 3=4 n 1=2 1 A (2.13) is a nonempty subset of (1; 2), and 2. for any Z 2 I(n), for the lasso path calculated as in (2.10) there exists a nite constantc 3 > 0 (free of all parameters in the setup) such that for (n) :=c 3 (logn) 3=2 n 1=4 (2.14) the rst two features to enter the lasso path areX 1 followed byX 3 with probability at least 1 2 0 B @ 2 Z + 1 + 2 | {z } =Var(y 1 ) 1 C A 7=2 (n): By exchangeability, the event where the rst two features to enter the lasso path areX 2 followed by X 3 occurs with equal probability. Therefore for any between the second and third knot of the lasso path, the two selected features will beX 3 and one of the proxy featuresX 1 orX 2 with probability at least 1 2 2 Z + 1 + 2 7=2 (n): Proof. See Appendix A.3 in the online supplement. 17 Theorem 2.3.1 diers from typical lasso selection consistency results in a couple of key ways. (See Theorem 11.3, and the broader discussion in Section 11.4, of [79] for a textbook-level summary of these results.) Most notably, classical lasso selection consistency results imply that the selection probabilities for each feature tend to either 0 or 1 asymptotically, while Theorem 2.3.1 shows that the selection probabilities ofX 1 andX 2 tend to 1=2. Classical results rely on an assumption that the response can be written as y =X +w for a sparse vector and noisew that is independent ofX, but this excludes our setting where =e 3 (the selector vector inR p with third entry equal to 1 and remaining entries equal to 0) and w = Z Z+ is correlated with the observedX 1 andX 2 . (Observe that ifZ were also observed, classical lasso theory would still not hold due to the high correlation ofX 1 andX 2 withZ, which violates the irrepresentable condition.) Some of the assumptions of Theorem 2.3.1 seem somewhat restrictive, but they can likely be relaxed. First, the assumption of joint normality can be replaced with a weaker assumption that the distributions are subgaussian. Indeed, in Section 2.6.1 we present simulation study results where the predictor variables are binary and the response variable is bounded from below, but the “vote-splitting" problem illustrated by Theorem 2.3.1 still arises. 
Second, the assumption that the noise level on the proxies exactly equals (2.9) can be relaxed within a bounded range of multiplicative constants without changing our proof strategy. This change could in turn allow for the noise levels of the proxies to dier (with their ratio lying within a bounded range), in which case the probability of each proxy being selected rst would tend to a dierent number rather than both tending to 1=2. That said, this simpler assumption is adequate for illustrating (2.5). Finally, though we allowp!1 the permissible rate of increase inp relative ton is slow. We can allowp to grow at a faster rate inn as long asp=n 1=4 ! 0 at the price of slowing the rate of convergence of the main result accordingly. Notice also that Theorem 2.3.1 is still valid forp > n in nite samples providedc 8 is large enough. Finally, we also note that results like Theorem 11.3 in Hastie, Tibshirani, and 18 Wainwright [79] require stronger assumptions for selection consistency ifp is xed, while our result is the same for either xedp orp!1. 2.3.2 Theorem2.3.1ExplainsStabilitySelection’s“Vote-Splitting"Problem Theorem 2.3.1 leads us to the problem with stability selection using the lasso in this setting: as demon- strated in Example 1, the probability that any one proxy will be chosen by the lasso is lower than the probability of choosing the weak signal featureX 3 . We formally summarize this observation in the fol- lowing result. Corollary2.3.2. InthesettingofTheorem2.3.1,considerapplyingstabilityselectionwiththebaseprocedure denedasfollows: theselectedsetoneachiterationisthersttwofeaturestoenterthelassopath. Assume Z satisestheassumptionsofTheorem2.3.1withbn=2clargeenoughtosatisfythesamplesizerequirementsof Theorem 2.3.1. Then for (n) dened in (2.14) (wherec 3 > 0 is as dened in Theorem 2.3.1) it holds that E h ^ (MB) (j) i =E h ^ (SS) B (j) i < 1 2 + 2 2 Z + 1 + 2 7=2 (n); j2 [2]; and E h ^ (MB) (3) i =E h ^ (SS) B (3) i > 1 4 2 Z + 1 + 2 7=2 (n): Proof. See Appendix A.5.1 in the online supplement. That is, asn!1 the selection proportion forX 3 yielded by stability selection tends to 1 and the selection proportions ofX 1 andX 2 tend towards 1=2, which leads stability selection to suer from the “vote-splitting" problem of ordering the features incorrectly for out-of-sample predictive performance. Though the base procedure dened in Corollary 2.3.2 is very specic, if the broader principle (2.5) holds, then for any choice of that is large enough to avoid selecting noise features with high probability the 19 lasso will tend to choose only one feature in a highly correlated cluster, and stability selection with the lasso will “split the vote." In the next section we will introduce cluster stability selection (and later, in Theorems 2.5.1 and 2.5.3, we will show that cluster stability selection does not suer from the “vote-splitting" problem). 2.4 ClusterStabilitySelection Motivated by the insights that Theorem 2.3.1 and Corollary 2.3.2 yield on the “vote-splitting" problem with stability selection, in this section we detail our proposed procedure, cluster stability selection, to address vote-splitting while preserving the desirable properties of stability selection. 2.4.1 DescriptionofMethod In brief, cluster stability selection ranks clusters of features rather than features themselves. Like stability selection, cluster stability selection requires data (X;y) along with a feature selection method ^ S and a set of tuning parameters . 
Although our procedure, like stability selection, allows for responses in general spacesY and any base feature selection method that can be characterized as in (2.1), we focus on Y =R and using the lasso for feature selection. In practice, if the base feature selection procedure is the lasso, we recommend selecting a single tuning parameter =fg, and we follow this approach in the simulation studies and data application in Section 2.6. However, using more than one tuning parameter could be helpful if the practitioner is more concerned about avoiding false negatives than false positives, since the procedure has more power as the set becomes larger. This is also a helpful approach if one is using a feature selection method where identifying a single best tuning parameter is dicult. Unlike stability selection, cluster stability selection also requires a partitioning of the features into clustersC =fC 1 ;:::;C K g as an input. The clusters may be known from domain knowledge, but if not, they can be estimated by methods including hierarchical clustering as well as those methods proposed by 20 Bondell and Reich [27], She [161], Shen and Huang [162], Bühlmann et al. [34], and Sharma, Bondell, and Helen Zhang [160] and Witten, Shojaie, and Zhang [191]. We investigate the performance of cluster stabil- ity selection both with known and estimated clusters theoretically in Section 2.5 and through simulation studies and a data application in Section 2.6. We use complementary pairs subsampling similar to the estimator (2.3) proposed by Shah and Sam- worth [159], except that we allow for an arbitrary nite set of parameters . Similarly to stability selection, we calculate the individual selection proportions for each featurej2 [p] for any2 , ^ B (j) := 1 2B B X b=1 " 1 ( j2 [ 2 ^ S (A b ) ) + 1 ( j2 [ 2 ^ S (A b ) )# : (2.15) Additionally, for everyk2 [K] we calculate the proportion of subsets in which at least one feature from clusterC k is selected for at least one2 : ^ B (C k ) := 1 2B B X b=1 " 1 ( C k \ [ 2 ^ S (A b )6=; ) + 1 ( C k \ [ 2 ^ S A b 6=; )# : (2.16) The cluster selection proportions ^ B (C k ) give us a ranked ordering of clusters (perhaps with ties), which are used to select a set of clusters dened by how far down the list of ranked clusters one chooses to go. Our heuristic recommendation is to pre-specify (or estimate) a desired model size s and select the clusters with thes highest sample proportions. This approach has been proposed previously by, for example, Beinrucker, Dogan, and Blanchard [11] and Kim and Sun [102]. Our Theorems 2.5.1 and 2.5.3, along with our simulation studies in Section 2.6, provide support for the idea that the tops clusters ranked by selection proportion will be better selections than the tops features selected by a base procedure. Alternative approaches for getting a selected set of clusters include pre-selecting a threshold selection proportion and selecting all clusters above the threshold, analogously to the approach of Meinshausen 21 and Bühlmann [127] and Shah and Samworth [159]. This reasonable approach is supported by our Theo- rems 2.5.5 (to come in Section 2.5) and A.5.2 (in Section A.5.2 of the supplement), which provide theoretical guarantees controlling the expected number of undesirable selected features given. Inverting these re- sults provides a guaranteeing a desired level of error control in expectation. However, choosing by inverting the theoretical guarantees of Theorem 2.5.5 tends to be conservative in practice, resulting in low power. 
This may be because the proof relies on Markov’s inequality, which tends to be fairly loose. On the other hand, selecting by inverting one of the tighter guarantees of Theorem A.5.2 (enabled by tighter versions of Markov’s inequality due to [159]) is cumbersome and requires distributional assumptions for the cluster selection proportions that are hard to verify. Ultimately cluster stability selection is a feature selection method, and its output includes a set of selected features ^ S (CSS) , not just clusters. One possibility is to select all of the features from every selected cluster. In a sense, this is more or less what we propose, and this is a reasonable solution if the practitioner’s only goal is identifying a sparse set of relevant features. However, there is a complication if the selected features will be used for a downstream regression task. Because each cluster member is assumed to be a representation of the same latent signalZ, the noise that distinguishes them is assumed to be irrelevant to predictingy. Estimating a separate regression coecient for each individual selected cluster member separately would result in extra degrees of freedom that could only hurt predictive performance. Even if our latent variable assumption does not hold exactly, estimating multiple coecients for highly correlated (nearly collinear) features will tend to result in overtting, increasing the estimation variance far more than it reduces bias. Another natural idea (also mentioned in the introduction) is to form a selected set by selecting one feature from each cluster. We explore a version of cluster stability selection along these lines in Appendix A.2.2 in the online supplement. However, it turns out that if you know a cluster exists, selecting one feature from the cluster is much worse for the predictive power of a downstream regression model than 22 taking a suitably weighted average of the cluster members. See Appendix A.1.2 in the online supplement, particularly Proposition A.1.5, for a theoretical analysis of this point, and see the results in Appendices A.2.3 and A.2.4 for results conrming this in our simulation studies and data application. Instead of choosing a single member from each cluster, we construct cluster representativesX rep C k for regression tasks by taking weighted averages of the cluster members, X rep C k := X j2C k w kj X j 2R n ; (2.17) wherew k = w kj j2C k 2 jC k j1 are weights constructed for each cluster from the individual fea- ture selection proportions ^ B (j) in one two ways. Insimpleaveragedclusterstabilityselection, the weights are given by w kj = 1 jC k j 8j2C k ; (2.18) and inweightedaveragedclusterstabilityselection, the weights are w kj = ^ B (j) P j 0 2C k ^ B (j 0 ) 8j2C k : (2.19) Briey, the rst proposal assigns equal weight to each feature in the cluster and the second proposal assigns weights to each feature in the cluster in proportion to how frequently it was selected. We interpretX rep C k as representing our best approximation of the lurking signalZ and we recommend using this one feature for downstream regression models rather than the cluster members. Cluster stability selection rejects any features that are assigned weight 0 by weighted averaged cluster stability selection. Notice that for a “cluster" of size one, both weighting schemes are equivalent and result in selecting the feature itself as the cluster representative, yielding results that are equivalent to stability selection. 
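To make the construction concrete, the following R sketch computes cluster selection proportions as in (2.16) and the two kinds of cluster representatives from (2.17)–(2.19). It is a simplified sketch rather than the cssr implementation: sel_mat is assumed to be a 0/1 matrix with one row per subsample fit (recording which features were selected, after taking the union over the tuning parameters), and clusters is a list of column-index vectors.

# Sketch: from a 0/1 selection matrix (rows = subsample fits, columns = features),
# compute per-feature proportions as in (2.15), cluster proportions as in (2.16),
# and simple or weighted averaged cluster representatives as in (2.17)-(2.19).
cluster_summaries <- function(X, sel_mat, clusters, weighted = TRUE) {
  feat_prop <- colMeans(sel_mat)                     # feature selection proportions (2.15)
  lapply(clusters, function(cl) {
    # proportion of fits selecting at least one member of the cluster (2.16)
    clus_prop <- mean(rowSums(sel_mat[, cl, drop = FALSE]) > 0)
    w <- if (weighted && sum(feat_prop[cl]) > 0) {
      feat_prop[cl] / sum(feat_prop[cl])             # weighted averaging (2.19)
    } else {
      rep(1 / length(cl), length(cl))                # simple averaging (2.18)
    }
    representative <- as.numeric(X[, cl, drop = FALSE] %*% w)  # cluster representative (2.17)
    list(selection_proportion = clus_prop, weights = w, representative = representative)
  })
}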
In Section 2.4.2 we will discuss each of the weighting schemes in more detail. 23 Algorithm 1 spells out the cluster stability selection procedure. For simplicity, we focus on weighted averaged cluster stability selection, but simple averaged cluster stability selection is identical except for the 24 calculation of the weights. Notice that cluster stability selection yields not only a set of selected features, but also a set of selected clusters and weights to form a cluster representative from each cluster. Algorithm1: Cluster Stability Selection (with weighted averaging) input : Data (X;y)2R np Y n ; clustersC =fC 1 ;:::;C K g partitioning [p]; a selection method ^ S , as in (2.1), and a set of tuning parameters ; a number of pairs of subsamples B2N; a desired maximum number of selected clusterss2f1;:::;Kg. initialization; forb 1toB do A b a random subsample of [n] of sizebn=2c ; A b a random subsample of [n]nA b of sizebn=2c ; for2 do compute ^ S (A b ) and ^ S (A b ) ; forj 1topdo calculate ^ B (j) as in (2.15) ; fork 1toK do calculate ^ B (C k ) as in (2.16) ; calculate the weightsw k as in (2.19) ; ~ C k fj2C k :w kj 6= 0g ; if n k : ^ B (C k ) = 1 o sthen ^ (s) min n 2f0; 1=(2B);:::; 1g : n k : ^ B (C k ) o s o ; else ^ (s) 1 ; I n k : ^ B (C k ) ^ (s) o ; ^ S (CSS) S k2I ~ C k ; output: ^ S (CSS) ; n ^ B (C k );w k o k2I 25 Typically the most computationally intensive step of Algorithm 1 will be executing ^ S on each sub- sample. Fortunately, this part of the algorithm is completely parallelizable, making the algorithm much more feasible. Further, the choice of model sizes and weighting scheme is completely independent of this computationally intensive step. The cssr R package exploits this structure by obtaining the feature and cluster selection proportions in a standalone (parallelized) function. Separate functions are available to analyze the output of this process and obtain selected sets. 2.4.2 ChoiceofWeights Below we discuss the advantages of our two proposed weighting schemes (2.18) and (2.19). 2.4.2.1 SimpleAveragedClusterStabilitySelection Weighting scheme (2.18) aligns with an intuitive idea for a practitioner who knows she has several noisy observations of the same quantity: just average them. It is analogous to the cluster representative lasso [34, 137], which takes a simple average of the features in each cluster and ts the lasso on those cluster rep- resentatives and the remaining original features, though we run the underlying feature selection method on the original features, forming the cluster representatives after the selection proportions (and relevant clusters) have been determined. By Proposition A.1.5 in the online supplement, simple averaging is optimal if the noise levels of each proxy are equal. If the practitioner is condent in the clustering used and believes the noise levels within clusters to be roughly equal, simple averaging could result in a more favorable bias-variance tradeo than estimating weights. This is particularly true if all of the proxies have roughly equal noise levels that are also high, making estimation of the individual weights noisy. 26 2.4.2.2 WeightedAveragedClusterStabilitySelection The intuition behind regressing against weighted averages of cluster members is that averaging several noisy proxies might improve our approximation of the latent signal. Proposition A.1.5 in the supplement yields the intuitive result that the optimal weights are higher for less noisy proxies and lower for higher noise proxies. 
If the underlying selection procedure is more likely to select lower noise proxies, the weights from weighted averaged cluster stability selection may resemble the optimal weights. We show empirical evidence in Section 2.6.1 that weighted averaged cluster stability selection with the lasso does seem to outperform either other weighting scheme when proxy noise levels vary signicantly. We also directly investigate how well weighted averaged cluster stability selection estimates the optimal weights in Ap- pendix A.1.2. Figure A.1 shows that in the context of our simulation study, weighted averaged cluster stability selection tends to estimate weights that are closer to optimal than simple averaging (though the simple averaged weights are surprisingly close to optimal). Since cluster stability selection only selects those cluster members that were selected in at least one subsample, weighted averaging allows features to be dropped from a cluster with nonzero probability. In Section 2.6.3 we present evidence that this occurs in practice in a real data example. This is useful particularly if the clusters are estimated and might include some irrelevant features. Now that we have introduced cluster stability selection, in Section 2.5 we will investigate its properties theoretically before demonstrating its ecacy in simulation studies and a real data application in Section 2.6. 27 2.5 TheoreticalPropertiesofClusterStabilitySelection In this section, we will show theoretically that cluster stability selection corrects the “vote-splitting" prob- lem with stability selection that we illustrated in Corollary 2.3.2 (Theorems 2.5.1 and 2.5.3) while retaining the desirable error-control properties of stability selection (Theorem 2.5.5). 2.5.1 ClusterStabilitySelectionSolvesThe“Vote-Splitting"Problem Recall that the primary motivation of cluster stability selection is to improve on stability selection in the case of highly correlated features. As one might expect, Corollary A.1.3 in the online supplement shows that in the setting of Theorem 2.3.1,X 1 andX 2 are both better rst selections than featureX 3 . Corollary 2.3.2 shows that due to the “vote-splitting" problem, stability selection unfortunately tends to rank feature X 3 ahead of featureX 1 andX 2 . In contrast, Theorem 2.5.1 provides evidence that cluster stability selection achieves the intended goal of ranking the latent signals correctly in this setting when the clusters are known. Theorem2.5.1. AssumeX j ,j2 [p]andyareasdenedin (2.6)– (2.9). Assumeforsomen 200,bn=2cis largeenoughtosatisfy (2.11)and (2.12), Z 2I(bn=2c)forI()asdenedin (2.13),andpc 8 (logbn=2c) 3=2 for somec 8 > 0. Then if the cluster assignmentsC 1 =f1; 2g;C 2 =f3g;C 3 =f4g;:::;C p1 =fpg are knownandthebaseprocedure ^ S selectstherstfeaturetoenterthelassopath,theclusterselectionproportions from cluster stability selection (Algorithm 1) satisfy E h ^ B (C 1 ) i > 1 4 2 Z + 1 + 2 7=2 (n) and E h ^ B (C k ) i < 4 2 Z + 1 + 2 7=2 (n) fork2f2;:::;p 1g; 28 where (n) is as dened in (2.14). If the base procedure selects the rst two features to enter the lasso path, E h ^ B (C k ) i > 1 4 2 Z + 1 + 2 7=2 (n) fork2f1; 2g, and E h ^ B (C k ) i < 4 2 Z + 1 + 2 7=2 (n) fork2f3;:::;p 1g: Proof. See Appendix A.5.1 in the online supplement. In contrast to Corollary 2.3.2, Theorem 2.5.1 shows that cluster stability selection tends to rank cluster C 1 ahead ofC 2 and the noise clusters, regardless of whether the base selection procedure selects the rst one or two features to enter the lasso path. 
Since the selected set of features is yielded from the selected clusters, this results in cluster stability selection producing better selected feature sets than stability se- lection in this context. We also see that the cluster rankings tend to be best for a model of sizes if the base selection procedure selectss features. This provides some justication for selecting the tops features (based on an estimated or desired model size s) rather than selecting features according to a selection proportion threshold. There are many real-world settings in which the clusters may be known. In education, an investigator may be interested in a student’s latent skill at some number of xed tasks, and a test may be designed with questions that correspond to one specic task; then the answers to all questions associated in the same task would form a known cluster. Similarly, in psychology, interview questions may correspond to a known disorder. In biology, the genes that share common biological pathways may be known; then these genes’ expression levels will form known clusters. However, in some settings, the features may be known to be clustered, but it may not be known what those clusters are. In Theorem 2.5.3 we will prove similar high-probability bounds in the case that the clusters are estimated rather than known. First we will dene the clustering procedure we will use (both in our theory and in our simulation studies). 29 Denition 2.5.2. Given an unlabeled data setX 2 R np , dene an (empirically estimated) distance metric between two featuresi;j2 [p] by one minus the absolute value of the cosine similarity: d(i;j) := 1 X > i X j kX i k 2 kX j k 2 ; i;j2 [p]: Then we dene the clusters ^ C(n;) to be those clusters estimated by complete linkage hierarchical clus- tering using this distance metric and a distance threshold. Now we are prepared to state our result. Theorem 2.5.3. Suppose we have an unlabeled data setX (1) 2R n 1 p generated according to (2.7) – (2.9) which we use to estimate clusters ^ C(n 1 ; 0:75) following the procedure from Denition 2.5.2. Then cluster stabilityselection(Algorithm1)isexecturedusingtheseestimatedclustersonanindependentlabeleddataset of sizen 2 200 generated according to (2.6) – (2.9). Assume Z 2 I(bn 2 =2c) forI() as dened in (2.13), pc 8 (logbn 2 c) 3=2 for somec 8 > 0, minfn 1 ;bn=2cg is large enough to satisfy (2.11) and (2.12), and n 1 12 ( ) 2 (2 logc 8 + 3 log logn 2 ); (2.20) where 2 (0; minf1=4;t 0 Z g) fort 0 dened in Lemma A.4.11. Then there exists a constantc 5 > 0 so that the clusters are identied correctly with probability at least 1 6 expfc 5 n 1 g. Finally, denote by ^ C 1 the estimated cluster containingX 1 and ^ C k the cluster containingX k+1 fork2 f2;:::;p 1g. If ^ S selects the rst feature to enter the lasso path, then for (n) dened in (2.14) E h ^ B ^ C 1 i > 1 6 expfc 5 n 1 g 4 2 Z + 1 + 2 7=2 (n 2 ) and E h ^ B ^ C k i < 6 expfc 5 n 1 g + 4 2 Z + 1 + 2 7=2 (n 2 ) fork2f2;:::;p 1g: 30 If ^ S selects the rst two features to enter the lasso path, E h ^ B ^ C k i > 1 6 expfc 5 n 1 g 4 2 Z + 1 + 2 7=2 (n 2 ) fork2f1; 2g, and E h ^ B ^ C k i < 6 expfc 5 n 1 g + 4 2 Z + 1 + 2 7=2 (n 2 ) fork2f3;:::;p 1g: Proof. See Appendix A.5.1 in the online supplement. Although the bounds provided by Theorem 2.5.3 contain extra terms corresponding to clustering un- certainty, we see that these terms decay exponentially fast in the number of unlabeled observations used for clusteringn 1 . Remark 2.5.4. 
The assumption that the clusters are either known or can be well-estimated by available unlabeled data is closely related to a stronger assumption made in themodel-X framework rst developed by Candès et al. [37], where the distribution ofX is assumed to be known (or, more generally, easily estimated with precision due to abundantly available unlabeled data). See Section 1.3.2 of Candès et al. [37], as well as the many subsequent papers that cite this seminal work, for examples of settings where this assumption is reasonable. One example they cite that is particularly relevant to cluster stability selection (because of the presence of highly correlated clusters of features and the particular importance of feature selection in this setting) is genome-wide association studies, the setting of our data application in Section 2.6.3. 2.5.2 ErrorControlGuarantees Having shown that cluster stability selection alleviates the “vote-splitting" problem, we will show in The- orem 2.5.5 that cluster stability selection retains analogous error control guarantees to those provided by both Meinshausen and Bühlmann [127] and Shah and Samworth [159]. For a feature selection procedure ^ S n executed once on a data set for each in a nite set of tuning parameters , dene the set of selected 31 clusters (that is, the set of clusters from which at least one feature was selected by ^ S n for at least one 2 ) ^ S ;C n := ( C2C :C\ [ 2 ^ S n 6=; ) ; whereC =fC 1 ;:::;C K g is any partitioning of [p] (known or estimated). We generalize the selection probability for a single feature and tuning parameter (2.2) and dene p C;n; :=P C2 ^ S ;C n : For a xed 2 [0; 1], let L := C k :p C k ;bn=2c; denote the set of clusters that have low selec- tion probability under ^ S ;C bn=2c , and letH := C k :p C k ;bn=2c; > denote the set of clusters with high selection probability. Denote by ^ S CSS;;C n; C the set of clusters with selection proportion greater than. (Though as discussed in Section 2.4.1 we recommend selecting features by pre-selecting a desired model sizes rather than a threshold, Theorem 2.5.5 allows for direct comparison to previous stability selection proposals, illustrating that cluster stability selection enjoys the same error control properties.) Theorem2.5.5. 1. For2 (1=2; 1], E ^ S CSS;;C n; \L 2 1 E ^ S ;C bn=2c \L : 2. Let ^ N CSS;;C n; :=Cn ^ S CSS;;C n; and let ^ N ;C n :=Cn ^ S ;C n . For2 [0; 1=2), E ^ N CSS;;C n; \H 1 1 2 E ^ N ;C bn=2c \H : We see from part (i) of Theorem 2.5.5 that for suitable choices of and such that=(2 1)< 1, the expected number of selected clusters inL (that is, low-selection-probability clusters) is strictly smaller 32 under cluster stability selection than under a single iteration of the base procedure ^ S ;C bn=2c . Similarly, part (ii) shows that cluster stability selection controls the number of high-selection-probability clusters that are missed by the base procedure. In Appendix A.5.2 we also present Theorem A.5.2, which similarly generalizes Theorem 2 and Equa- tions 7 and 8 of Shah and Samworth [159] to our setting. These results provide tighter bounds under stronger assumptions on the distributions of the selection proportions yielded by cluster stability selec- tion. 
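As a back-of-the-envelope illustration of inverting Theorem 2.5.5(i): if $E_{\text{base}}$ denotes (a bound on) the expected number of low-selection-probability clusters picked up by the base procedure, then requiring $\theta/(2\tau - 1) \cdot E_{\text{base}} \le V$ for a target $V$ gives $\tau \ge (1 + \theta E_{\text{base}}/V)/2$. The R sketch below uses hypothetical numbers, and, as noted in Section 2.4.1, thresholds chosen this way tend to be conservative in practice.

# Smallest threshold tau in (1/2, 1] for which the Theorem 2.5.5(i) bound is at
# most V, given theta and a bound E_base on the base procedure's expected number
# of low-selection-probability clusters. (Hypothetical inputs for illustration.)
tau_for_bound <- function(theta, E_base, V) {
  tau <- (1 + theta * E_base / V) / 2
  if (tau > 1) NA else max(tau, 0.5 + 1e-8)  # tau must lie in (1/2, 1]
}
tau_for_bound(theta = 0.2, E_base = 5, V = 1)  # returns 1 in this example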
We omit proofs for Theorems 2.5.5 and A.5.2 because they can be proven using identical arguments used to prove the corresponding results from Shah and Samworth [159] by replacing indicators of features being selected with indicators of clusters being selected (we spell this point out in more detail in Appendix A.5.2). Note that Theorems 2.5.5 and A.5.2 are valid forany partitioning of [p] into clusters (whether they are known in advance or estimated), though the results are more interesting if the clusters are meaningful. Also, these results generalize the results from the original papers in the sense that choosing (C 1 ;:::;C p ) = (f1g;:::;fpg) recovers the original results. In this special case, our proposal reduces to a a more powerful modied version of stability selection in its handling of the set , in the sense that 1 2B B X b=1 " 1 ( j2 [ 2 ^ S (A b ) ) + 1 ( j2 [ 2 ^ S A b )# max 2 ( 1 2B B X b=1 h 1 n j2 ^ S (A b ) o + 1 n j2 ^ S A b oi ) ; so for a xed threshold , more features will exceed in our proposal than in a complementary pairs subsampling version of the Meinshausen and Bühlmann [127] proposal. Notice that this increase in power comes at no price to the error control guarantees. Remark 2.5.6. The discussion in the last paragraph of Section 3.1 of Shah and Samworth [159] about how the guarantee of their Theorem 1(a) is at least as good as the Meinshausen and Bühlmann [127] 33 Theorem 1 guarantee, even when B = 1, applies to our Theorem 2.5.5(i) as well in the special case C =ff1g;:::;fpgg. We also point out that this does not require the same assumptions that (i) the base procedure is “no worse than random guessing" (in the sense that the average selection probability of the “signal features"S [p] under ^ S ;C bn=2c is at least as large as the average selection probability of the “noise features"N = [p]nS) and (ii) the distribution of the noise features is exchangeable; we only require the weaker assumption thatNL for =E ^ S ;C bn=2c =p. 2.6 SimulationsandDataApplication Having shown theoretically in Section 2.5 that cluster stability selection corrects the “vote-splitting" prob- lem suered by stability selection while retaining its error control properties, in this section we present results demonstrating cluster stability selection’s performance in settings where we would expect the vote- splitting issue to arise. First, in Section 2.6.1 we present the results of a simulation study to evaluate weighted averaged cluster stability selection in a setting where proxies forZ are observed that have un- equal correlations withZ. We then evaluate simple averaged cluster stability selection in a setting where the proxies are all equally correlated withZ in Section 2.6.2. In both cases, we evaluate results both when the clusters are estimated (as in the setting of Theorem 2.5.3) and known (as in Theorem 2.5.1). (See also Appendix A.2.3 in the online supplement for more in-depth results that show that simple averaged stability selection performs well in the simulation study in Section 2.6.1, though it slightly trails weighted averaged cluster stability selection, and the reverse is true for the second simulation study.) Finally, in Section 2.6.3 we investigate the performance of cluster stability selection on a GWAS data set. 2.6.1 SimulationStudy: ProxieswithVariedNoiseLevels We randomly generate data in a similar way to (2.6) – (2.9), but with ten “weak signal" features andq = 9 proxies forZ. 
Further, three of these arestrongproxies with a correlation of 0:9 withZ and six of them are 34 weakproxies with a correlation of 0:6 withZ. Recall that Theorem 2.5.3 suggests that in this setting cluster stability selection should rank the latent signals eectively, while according to Corollary 2.3.2 stability selection is likely to rank the features suboptimally. Also, by Proposition A.1.5 in the online supplement it is suboptimal to weight the strong proxies and weak proxies equally, so this setting should be well-suited for weighted averaged cluster stability selection. We repeat the following procedure 2000 times: 1. For`2f1; 2g andn 1 =n 2 = 200, the latentZ (`) 2R n ` and observed (X (`) ;10 ;X (`) ;11 ;:::;X (`) ;100 )2 R n ` 91 are generated independently entry-wise as standard Gaussian random variables. 2. Added noise variables ( (`) 1 ; (`) 2 ; (`) 3 )2R n ` 3 are generated independently as entry-wise Gaussian random variables with mean 0 and variance 19=81, so that fori2 [n ` ] andj2 [3],Z (`) i andX (`) ij := Z (`) i + (`) ij have correlation E h Z (`) i Z (`) i + (`) ij i s E Z (`) i 2 E Z (`) i + (`) ij 2 = 1 p 1 [1 + 19=81] = 0:9: Then ( (`) 4 ;:::; (`) 9 )2R n ` 6 are generated similarly so thatX (`) j :=Z (`) + (`) j are Gaussian with correlation 0:6 withZ (`) forj2f4;:::; 9g, and we formX (`) = (X (`) 1 ;X (`) 2 ;:::;X (`) 100 ). 3. The unlabeledX (1) is used to estimate clusters ^ C(n 1 ; 0:5) following the procedure from Denition 2.5.2. 4. The responsey (2) is generated by (2) = 1:5Z (2) + 10 X j=1 j X (2) (j+9) and y (2) = (2) + (2) ; 35 where j = 1= p j and (2) N 0; 2 I n 2 , with 2 determined so that the signal-to-noise ratio is 3: 2 = k (2) k 2 2 =n 2 3 : 5. We use the labeled data (X (2) ;y (2) ) to obtain selected sets using the lasso, stability selection, weighted averaged cluster stability selection, the protolasso (which we described in the introduc- tion), and the cluster representative lasso (which we described in Section 2.4.2.1). (a) The same set of estimated clusters is provided to cluster stability selection, the protolasso, and the cluster representative lasso § . (b) Stability selection uses complementary pairs subsampling, and both stability selection and clus- ter stability selection useB = 100 pairs of subsamples ¶ of sizebn 2 =2c = 100 with the lasso with penalty chosen in advance by cross-validation (separately for each simulation) as the base feature selection method. 6. For each method, selected sets of each sizes2 [11] are obtained in the following ways: (a) Lasso and protolasso: the rst feature set of sizes to appear in the lasso path. (b) Cluster representative lasso: same as the above, counting the cluster representative as a single “feature." (c) Stability selection: thes features with the greatest selection proportions. (d) Simple and weighted averaged cluster stability selection: similar to stability selection, but ^ B (j) is used to selects cluster representatives instead of using ^ B (j) to selects features. § Note that we are using a modied version of the cluster representative lasso; Bühlmann et al. [34] estimate the clusters using their own proposed method. ¶ The results are similar but noisier using the smaller number of subsamples recommended by Shah and Samworth [159]. 36 For stability selection and cluster stability selection, sets are not always dened for everys due to ties in selection proportions. 7. 
Finally, a training set (X (train) ;y (train) ) of 1000 observations is independently generated in the same way as (X (2) ;y (2) ). For each method and for every model size dened in that method,y (train) is regressed against the selected features (or cluster representatives) fromX (train) using ordinary least squares (OLS), and the mean squared error (MSE) of the resulting training set predictions ^ y (train) compared to (train) is calculated. We implement stability selection, the cluster stability selection methods, the protolasso, and the clus- ter representative lasso using the cssr R package [58], and we implement the lasso using the glmnet R package [68]. After completing this procedure 2000 times, the average MSE is calculated for each method for every model sizes (including in the average for each method only those iterations where the model of sizes for that method is dened). We then evaluate the stability of each method across the 2000 simulations using the metric proposed by Nogueira, Sechidis, and Brown [132], which equals 1 if the selected set yielded by a given method is identical across every simulation and has expected value 0 for a “null" feature selection method that selects features at random. (We describe the metric in more detail in Section A.2.1 in the online supplement.) The results are shown in Figure 2.3. (Larger versions of the plots in Figure 2.3 displaying results for more methods, including simple averaged cluster stability selection, appear in Appendix A.2.3. Also, Table A.1 in the online supplement includes the means and standard errors of the MSEs for each method at each model size, and Table A.7 displays the corresponding results for the stability metric.) We see in the leftmost plot that cluster stability selection has clearly better predictive performance than any of the competitor methods. Table A.2 displaysp-values for pairedt-tests of the alternative hypothesis that weighted averaged 37 1 2 3 4 2 4 6 8 10 12 No. Fitted Coefficients MSE 0.4 0.6 0.8 1.0 2 4 6 8 10 No. Fitted Coefficients NSB Stability 1 2 3 4 0.4 0.6 0.8 1.0 NSB Stability MSE Method Cluster Rep. Lasso (est. clusts) Lasso Protolasso (est. clusts) Stability Selection Weighted Averaged CSS (est. clusts) Figure 2.3: Left and center panels: plots of average mean squared error and stability against model size (measured by number of tted coecients) across all 1000 simulations for the simulation study from Sec- tion 2.6.1. Right panel: average MSE plotted against stability. Each point corresponds to one model size. cluster stability selection has signicantly lower MSE than the other methods for each model size; all results are statistically signicant at the 10 4 level. From the center panel of Figure 2.5, we see that cluster stability selection is also clearly more stable than all methods, though cluster stability selection models of size one are never dened due to ties in selection proportions. Table A.8 in the online supplement providesp-values for aZ-test of the alternative hypothesis that weighted averaged stability selection has a higher stability than each competitor method in Figure 2.5 and each model size; allp-values are less than 10 4 . Corollary 2.3.2 suggests that stability selection will consistently select a weak signal feature ahead of a proxy forZ due to the vote-splitting problem, and we see evidence here that that is the case. 
Stability selection has very high stability for models of size one because it tends to chooseX 10 (the most important weak signal feature) rst, but referring back to the leftmost panel, we see that this is undesirable for predictive performance. Similarly, the results for the lasso in the left and center panels are consistent with 38 Theorem 2.3.1. The lasso models of size 1 have fairly good predictive performance (and clearly outperform stability selection) because the lasso tends to select a single cluster member rst. But the lasso models of size one have very low stability because the lasso tends to select a particular cluster member randomly with uniform probability. In general, practitioners may be interested in simultaneous predictive performance and stability, so we also consider what tradeo exists between the two across model sizes and methods. The rightmost panel of Figure 2.3 plots out-of-sample MSE against stability for each method and each model size. All of the points closest to the bottom right corner of the graph (forming the Pareto frontier of the tradeo between stability and predictive performance) come from cluster stability selection, so it simultaneously dominates the competitors on both predictive performance and stability. 2.6.1.1 KnownClusters To evaluate the performance of cluster stability selection when clusters are known, we also consider a setting identical to the above except that the protolasso, cluster representative lasso and weighted averaged cluster stability selection are provided with the correct clustersf[9];f10g;f11g;:::;f100gg, as in the setting of Theorem 2.5.1. We then evaluate these methods on the same 2000 feature selection and training sets as in the previous section and compare these results to the previous results for the lasso and stability selection. Similarly to the previous section, we expect cluster stability selection to rank the clusters well in this setting due to Theorem 2.5.1, while Corollary 2.3.2 suggests stability selection will not perform as well. The results are displayed in Figure 2.4. Again, we display larger plots with more methods as well as tables of results in Section A.2.3 of the online supplement. We see that the performance of the cluster representative lasso and cluster stability selection methods improve noticeably both in terms of predictive 39 1 2 3 4 2 4 6 8 10 12 No. Fitted Coefficients MSE 0.4 0.6 0.8 1.0 2 4 6 8 10 No. Fitted Coefficients NSB Stability 1 2 3 4 0.4 0.6 0.8 1.0 NSB Stability MSE Method Cluster Rep. Lasso Lasso Protolasso Stability Selection Weighted Averaged CSS Figure 2.4: Same as Figure 2.3, but the methods that use clusters are provided with the known clusters, as described in Section 2.6.1.1. Note that the results for the methods that don’t use clusters (lasso, protolasso, and stability selection) are identical to those from Figure 2.3. performance and stability. Interestingly, the gap between cluster stability selection and the cluster repre- sentative lasso shrinks both in terms of MSE and stability (though Tables A.5 and A.11 show that weighted averaged cluster stability selection still dominates the cluster representative lasso by statistically signi- cant margins at the 0:05 level in both metrics at every model size, with the exceptions that models of size 1 that are never dened for cluster stability selection and the methods’ stabilities are statistically sized for models of size 3). 
The fact that the cluster representative lasso’s performance degrades more when clusters are estimated suggests that cluster stability selection may be more “robust" to misspecied clusters than the cluster representative lasso. 2.6.2 SimulationStudy: EqualNoiseProxies In this section, we conduct a second simulation study where all of the cluster members have equal correla- tion withZ. This is the same simulation study used in Example 1, though we will present results for more 40 methods. The setting is the same as the simulation study from Section 2.6, except that the design matrices X (`) are constructed with 15 weak signal features with coecients 1= p j and one cluster ofq = 5 prox- ies which all have correlation 0:9 withZ. The remaining 80 features inX (`) are noise features. Again, Theorem 2.5.3 and Corollary 2.3.2 suggest that cluster stability selection will rank the latent signals more eectively than stability selection, though the simple averaged weights are optimal in this case rather than unequal weights (Proposition A.1.5). 1 2 3 4 2 4 6 8 10 12 14 16 No. Fitted Coefficients MSE 0.25 0.50 0.75 1.00 2 4 6 8 10 12 14 16 No. Fitted Coefficients NSB Stability 1 2 3 4 0.25 0.50 0.75 1.00 NSB Stability MSE Method Cluster Rep. Lasso (est. clusts) Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Stability Selection Figure 2.5: Same as Figure 2.3, but displaying results for the simulation study from Section 2.6.2 (with methods provided with estimated clusters). The results are shown in Figure 2.5. We again see that both of the cluster stability selection methods have clearly better predictive performance than any of the competitor methods other than the cluster rep- resentative lasso, which they beat more narrowly. (Table A.14 in the online supplement shows that simple averaged cluster stability selection has signicantly lower prediction error than the cluster representative lasso at the 0.01 level at all model sizes except for 2, where the methods are statistically tied, and 1, where no simple averaged cluster stability selection models exist. In Table A.20 we see that simple averaged 41 cluster stability selection has signicantly higher stability than the cluster representative lasso at the 10 4 level for all model sizes except 1.) It is perhaps not surprising that the cluster representative lasso performs well because simple averaging of cluster members is optimal for equal noise proxies (Proposition A.1.5). As in the simulation from Section 2.6.2, we also investigate the performance of each method in this setting when the methods that use clusters receive the known clusters. In this setting, the clusters are fairly easy to identify due to the high correlation of the clustered features, so the results are overall very similar to those depicted in Figure 2.5 with estimated clusters. For this reason, we do not display the results in the main text, but the results are shown in Figures A.12, A.13, and A.14 in the online supplement. 2.6.3 DataApplication We demonstrate cluster stability selection on an open genome-wide association study data set [173, 4] collected fromn = 1058 accessions of Arabidopsis thaliana, a small owering plant that has been widely studied. The feature matrixX contains SNP data—X ij = 0 if in accessioni both alleles at base position j take on the more common value, and X ij = 1 if both alleles take on the more rare value. 
The goal of GWAS is to accurately identify as many SNPs as possible that are associated with variation in a given phenotype, so feature selection methods are important in this setting. Further, GWAS data is a natural application of cluster stability selection because nearby SNPs tend to be highly correlated due to linkage disequilibrium, which is caused by a variety of mechanisms; see Nordborg et al. [133] and Kim et al. [103] for discussion of linkage disequilibrium specically in Arabidopsis thaliana. As a result, clusters of highly correlated SNPs can be identied. For the response, we use the logarithm of the measured owering time (in days) at 10 C [4]. ∥ Prior to evaluating our methods on the data, we pre-process the data using standard methods [37, 157]. We screen out SNPs where the less common allele appears in fewer than 1% of observations (that ∥ The imputed genotype was downloaded from https://aragwas.1001genomes.org/ and the phenotypes were down- loaded viahttps://arapheno.1001genomes.org/phenotype/261/. 42 is, the minor allele frequency is less than 1%). We do not have to screen for missing values or incorrect SNP position labeling because the data set as it is available online has already been cleaned and imputed using standard methods. The Hardy-Weinberg equilibrium test is commonly used for screening SNPs, but it is not applicable for our data because Arabidopsis thaliana is almost always homozygous. Finally, for computational speed we retained only the rst 1000 SNPs that remained after screening. We repeat the following procedure (similar to the procedure from the simulation studies) 1000 times. We randomly divide the data into feature selection and model estimation sets of 423 observations (about 40% of the data for each) and a test set of the remaining 212 observations. In each iteration we estimate clusters of features using complete linkage hierarchical clustering on the non-test-set data with distance metric one minus the absolute value of the sample correlations (cosine similarity after centering) between the SNPs with a cuto of 0:5, similarly to Candès et al. [37]. Then for every feature selection method, we use the feature selection set to obtain selected sets of sizesf1;:::; 60g, providing the estimated clusters to those methods that make use of them. Next, for every method and every model size, we use the model estimation set to estimate linear models by OLS using the selected features. Finally, we use these models to generate predictions for the test set, and evaluate the MSE of the predictions against the actual values. (Out-of-sample MSE will be directly of interest in many settings, and in GWAS data this can also be seen as a proxy measure for how accurately each method identies SNPs that are associated with the response phenotype.) After all simulations are complete, we also evaluate the stability of each method across all simulations, again using the metric proposed by Nogueira, Sechidis, and Brown [132]. Figure 2.6 shows the results. (Since the iterations are each draws from the same data set and therefore dependent, we do not provide standard errors orp-values for the results. We provide larger versions of the plots in Figure 2.6 with more methods in Appendix A.2.4 in the online supplement, as well as tables of the average mean squared errors and stabilities. 
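The screening and clustering steps described above can be sketched as follows. This is a minimal sketch with assumed object names (snps is the 0/1 SNP matrix restricted to the non-test observations); the distance is one minus the absolute sample correlation, as in the description above.

# Sketch of the SNP preprocessing described above: drop SNPs with minor allele
# frequency below 1%, then cluster with complete linkage hierarchical clustering
# using one minus the absolute sample correlation as the distance, cut at 0.5.
cluster_snps <- function(snps, maf_min = 0.01, cutoff = 0.5) {
  freq <- colMeans(snps)                          # frequency of the rarer-coded genotype
  keep <- which(pmin(freq, 1 - freq) >= maf_min)  # minor allele frequency screen
  d  <- as.dist(1 - abs(cor(snps[, keep, drop = FALSE])))
  hc <- hclust(d, method = "complete")            # complete linkage clustering
  memberships <- cutree(hc, h = cutoff)           # cut the dendrogram at the cutoff
  split(keep, memberships)                        # clusters as lists of original column indices
}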
In the left plot, we see that the overall best-performing models are yielded by weighted averaged cluster stability selection, with simple averaged cluster stability 43 0.033 0.035 0.037 0.039 0.041 10 20 30 40 50 60 No. Fitted Coefficients MSE 0.2 0.3 0.4 0.5 0.6 10 20 30 40 50 No. Fitted Coefficients NSB Stability 0.033 0.035 0.037 0.039 0.041 0.2 0.3 0.4 0.5 0.6 NSB Stability MSE Method Cluster Rep. Lasso Lasso Protolasso Simple Averaged CSS Stability Selection Weighted Averaged CSS Figure 2.6: Results from the data application in Section 2.6.3. The left two panels are plots of average MSE and stability against model size (measured by number of tted coecients) for the methods from the simulation study in Section 2.6.1 across all 1000 random draws. The right panel plots average MSE against model size. In all cases, points on the plot represent averages across 5 model sizes to smooth the plot (for example, the leftmost points in the left plot are aggregated over model sizes 1 through 5). 44 selection not far behind. The relatively poor predictive performance of stability selection compared to both cluster stability selection methods seems to indicate that in this setting the “vote-splitting" problem illustrated by Corollary 2.3.2 arises even though the SNP covariates are subgaussian rather than jointly Gaussian. The middle plot of Figure 2.6 shows the stability of each method. At most model sizes (and in par- ticular at the model sizes that are best for predictive performance, in the size range of 15 – 30), simple averaged cluster stability selection is the most stable. Note that unlike in the simulation studies, weighted averaged cluster stability selection has appreciably dierent stability that simple averaged cluster stability selection, because in this setting weighted averaged cluster stability selection assigns weight 0 to some cluster members with appreciably high probability. Examining the rightmost plot in Figure 2.6, we see that the cluster stability selection with averaging methods are among the best models for predictive performance, and are more stable than competing meth- ods with similar predictive performance. The three smallest model sizes for the cluster representative lasso are on the Pareto frontier, with higher stability than the CSS methods and relatively impressive predic- tive performance, though these represent smaller models with fewer discoveries than the cluster stability selection models on the Pareto frontier. 2.7 Conclusion Stability is a fundamental property in practice and for replicability in science. Practitioners might rea- sonably hope that if they were to receive another random sample from the same distribution, the main ndings of their statistical analysis would be minimally altered. Stability selection adds stability to the lasso, leading to fewer false selections. However, we have shown that the vote-splitting problem can cause stability selection to miss important features leading to poorly predicting models when there are highly correlated clusters of features. Cluster stability selection allows 45 practitioners to exploit knowledge of clustered features in the data to enjoy the benets of stability selec- tion while still maintaining the lasso’s ability to select important clustered features. 
2.7.1 FutureWork: SumScoringandSingle-IndexModels In the introduction we discussed several settings with clustered features where cluster stability selection could be useful for feature selection, whether for its own sake (like in GWAS, where the goal is identify- ing SNPs associated with a phenotype) or to improve a predictive model. In this work we only use the estimated weights (2.19) from weighted average cluster stability selection to improve estimation in down- stream prediction tasks. However, representations of latent variables as linear combinations of observed features are of direct interest in other contexts. For example, in the introduction we mentioned that in psychology the answers to interview questions are interpreted as proxies for a latent variable, like whether a patient has a disorder. The answers to these questions are often quantied and combined linearly to create ascore intended to better measure the latent variable [63]. Frequently the variables are simply added (or, equivalently up to scaling, averaged), in what is called sum scoring [10]. This can be justied both on grounds of methodological simplicity and by an assumption that each noisy measurement is on the same scale and has the same error variance [125], but this practice has come under recent criticism for being overly simplistic [66, 145, 178, 108, 163]. One richer alternative is the congeneric model [72], which allows each noisy measurement to have a unique error variance. Congenericscoring orweightedsumscoring attempts to combine the noisy measure- ments with a vector of unequal weightsw to better approximate the latent variable of interest [125, 189], as Proposition A.1.5 in the online supplement suggests is optimal in this setting. Because the estimated weights (2.19) in weighted averaged cluster stability selection attempt to approximate a latent signal using a weighted average of features as in (2.17), it is closely related to congeneric scoring. In a setting where one or more latent variables of interestZ are believed to be a predictors for a responsey, weighted averaged 46 cluster stability selection could be used to provide data-driven estimates of the weightsw in congeneric scoring. (In contrast, simple averaged cluster stability selection, as in Equation 2.18, takes an approach that is equivalent to classical sum scoring). Such a responsey could be a measurable outcome that is believed to be inuenced by the latent presence or severity of a disorder. As we discuss in Section 2.4.1, cluster sta- bility selection is motivated by settings where the noise that distinguishes the correlated observed features in a cluster is unrelated to the latentZ or predicting the responsey, so cluster stability selection will best estimate the weights in congeneric scoring in settings where this assumption is appropriate. Sum scoring is related tosingle-indexmodels (see [123] for a textbook-level discussion), where given a matrix of covariatesX, a vector of responsesy is modeled as y =f(Xw) +; wherew is a vector of weights, f() is an arbitrary (typically assumed smooth) function, and is in- dependent random noise. In this model,Xw is assumed to model a latent one-dimensional variableZ associated withy (though in a potentially nonlinear way), andw is coerced to have unit norm for iden- tiability. Again, such a model aligns with our setting ifZ is a latent variable associated with a cluster of observed variables. 
Thenw could be estimated as the weights learned by weighted averaged cluster stability selection with a base feature selection procedure that accommodates such a nonlinear model. Single-index models can be viewed as a special case of projection pursuit [67], where the response is modeled as a sum of smooth functions of linear combinations of features: y = K X k=1 f k (Xw k ) +: 47 We can apply cluster stability selection in this more general setting if we assume that eachf k is a function only of features in a clusterC k corresponding to a latent variableZ k , so that the above projection pursuit model can be seen as approximating a generalized additive model [76] of the form y = K X k=1 f k (Z k ) +: Applying weighted average cluster stability selection with an appropriate nonlinear feature selection method would then yield estimates of the weightsw k for each cluster. Acknowledgements We thank Saharon Rosset for a helpful discussion. This work was supported in part by NSF CAREER Award DMS-1653017. 48 Chapter3 PredictingRareEventsbyShrinkingTowardsProportionalOdds Abstract. Training classiers is dicult with severe class imbalance, but many rare events are the culmi- nation of a sequence with much more common intermediate outcomes. For example, in online marketing a user rst sees an ad, then may click on it, and nally may make a purchase; estimating the probability of purchases is dicult because of their rarity. We show both theoretically and through data experiments that the more abundant data in earlier steps may be leveraged to improve estimation of probabilities of rare events. We present PRESTO, a relaxation of the proportional odds model for ordinal regression. Instead of estimating weights for one separating hyperplane that is shifted by separate intercepts for each of the estimated Bayes decision boundaries between adjacent pairs of categorical responses, we estimate sepa- rate weights for each of these transitions. We impose an L1 penalty on the dierences between weights for the same feature in adjacent weight vectors in order to shrink towards the proportional odds model. We prove that PRESTO consistently estimates the decision boundary weights under a sparsity assump- tion. Synthetic and real data experiments show that our method can estimate rare probabilities in this setting better than both logistic regression on the rare category, which fails to borrow strength from more abundant categories, and the proportional odds model, which is too inexible. 49 This chapter is based on the paper “Predicting Rare Events by Shrinking Towards Proportional Odds," forthcoming in the proceedings of the 40th International Conference on Machine Learning (ICML 2023) and also available on arXiv (https://arxiv.org/abs/2305.18700). 3.1 Introduction Estimating probabilities of rare events is known to be dicult due to class imbalance. However, sometimes these events are the culmination of a sequential process with intermediate outcomes. For example: 1. In online marketing, a customer is rst served an ad, then may click on it, then may indicate interest in making a purchase (by “liking" the product, for example), and nally may make a purchase. 2. In health and medicine, many outcomes can be encoded as ordered categorical variables, like re- ported quality of life and disease progression [134]. 3. Sales of high-price durable goods typically follow a sales funnel [54]. For example, when buying a car often a potential buyer rst comes in to see a car, may take a test drive, and nally may buy the car. 
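Each of these funnels can be represented as a single ordered categorical response recording the furthest stage a unit reaches. A hypothetical R encoding of the first example (the stage names and proportions below are illustrative, not data):

# Hypothetical ordered encoding of the marketing funnel: each user is recorded
# at the furthest stage reached, and later stages are much rarer.
stages <- c("impression", "click", "interest", "purchase")
y <- factor(sample(stages, size = 10000, replace = TRUE,
                   prob = c(0.90, 0.07, 0.02, 0.01)),
            levels = stages, ordered = TRUE)
table(y)    # the final stage, the one of interest, is severely imbalanced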
In many of these cases, the intermediate events are much more common than the rare events. Though these intermediate events may not be of direct interest, if the features that contribute to the probability of advancing through earlier classes also contribute to the probability of advancing through later classes, then the more abundant intermediate events can be leveraged to improve estimation of the rare event probabilities. The proportional odds model [124], also called the ordered logit model [36, Section 15.9.1], satises, for ordinal outcomesk2f1;:::;K 1g, log P (ykjx) P (y>kjx) = k + > x; (3.1) 50 where2 R p is a vector of weights andx2 R p is a vector of features. This implies that for allk2 f1;:::;K 1g p k (x) :=P (ykjx) =F k + > x ; (3.2) where F () is the logistic cumulative distribution function, F (t) = expftg=[1 + expftg]. Notice that k + > x is the Bayes decision boundary for the binary random variable 1fykgj x. This problem could instead be cast asK 1 binary classication problems of the form (3.2) for adjacent classes: log P (ykjx) P (y>kjx) = k + > k x; k2f1;:::;K 1g: (3.3) The condition that the weight vectors k of the separating hyperplanes in (3.3) are all equal, as in (3.1), has been called theproportionaloddsassumption [124] or theparallelregressionassumption [73, Section 18.3.2]. One way to motivate this model is by supposing that the response is driven by a latent (unobserved) variableU, U = > x +; (3.4) where has a standard logistic distribution and is independent ofx. Responsek is observed if and only if k U i < k1 (where we dene 0 :=1 and K :=1). This model leads to (3.2). (See Section 3.3.2 of [2] for a more detailed explanation.) Because the proportional odds model assumes that the decision boundaries between adjacent classes are all governed by the same hyperplane dened by (separated only by dierent intercepts k ), it assumes that the decision boundary between any two classes perfectly explains the decision boundary between any two other classes, other than an intercept term. If a rare event has much more common intermediate events before it, this model can therefore be very useful for better estimating the parameters of the model, and therefore better estimating the rare event probabilities. However, it could be that the proportional odds 51 assumption is too rigid to be realistic, because observed features may have varying inuence at dierent decision boundaries. For example: 1. In online marketing, users may click on an ad only to realize that the product is not what they were expecting, resulting in a particularly low probability of purchase. 2. For expensive goods like a home or car, potential buyers may express interest by going on a tour or taking a test drive purely out of curiosity; this may be distinct from their level of interest in actually making a purchase. 3. Students may place weights on dierent factors when deciding whether to apply to graduate school than they did when deciding whether to apply to an undergraduate program—they may have more appealing alternatives to additional schooling, they may face new nancial or personal constraints because they are older, etc. In each of these settings, if specic features vary in relevance for dierent decision boundaries while other features have about the same inuence at every boundary, the proportional odds assumption may be too strong. Violations or relaxations of the proportional odds assumption along the lines of (3.3) have previously been considered by, for example, Brant [31]. 
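To make concrete what such a relaxation means for the data-generating process, the R sketch below simulates from (3.3): each adjacent-class boundary has its own weight vector, category probabilities are differences of the cumulative probabilities in (3.2), and the cumulative probabilities must remain monotone in k (the boundaries must not cross) on the support of x. The parameter values are illustrative and are not those used in our experiments.

# Simulate from the relaxed model (3.3) with K = 4 ordered classes.
set.seed(1)
n <- 1000; p <- 4; K <- 4
X <- matrix(runif(n * p, -1, 1), n, p)
alpha <- c(0, 3, 5)                              # intercepts: later classes are rarer
beta <- rbind(rep(0.5, p),                       # beta_1
              c(0.5, 0.5, 0.0, 0.0),             # beta_2: sparse difference from beta_1
              c(0.5, 0.5, 0.0, 0.5))             # beta_3: sparse difference from beta_2
cum_prob <- sapply(1:(K - 1), function(k) plogis(alpha[k] + X %*% beta[k, ]))
cum_prob <- cbind(cum_prob, 1)                   # P(y <= K | x) = 1
stopifnot(all(apply(cum_prob, 1, diff) >= 0))    # decision boundaries must not cross
class_prob <- cbind(cum_prob[, 1], t(apply(cum_prob, 1, diff)))
y <- apply(class_prob, 1, function(pr) sample.int(K, size = 1, prob = pr))
mean(y == K)                                     # the final class is rare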
Peterson and Harrell Jr [139] developed partial proportionaloddsmodels, which allow the proportional odds assumption to hold for some features but not others, an idea previously mentioned by Armstrong and Sloan [7]. (See Section 3.6.1 of [2] for a textbook- level discussion). These relaxations have not been widely adopted because tting separate weights for each outcome is too exible unlessp(K 1)n and all classes are reasonably common (and we discuss additional diculties of this kind of model in Sections 3.3.1 and 3.4.1). 52 3.1.1 OurContributions In this paper we propose relaxing the proportional odds assumption as in (3.3), but controlling the amount of relaxation by placing` 1 penalties on the dierences in weights corresponding to the same features in adjacent k vectors, in a way that is reminiscent of the fused lasso [170]. This model allows us to borrow strength from outcomes where data is much more abundant to improve rare probability estimates when outcomes are much more rare without making the strong assumption that the weights in these models are exactly equal. In particular, it allows for the proportional odds model to hold for some specic features in some adjacent pairs of decision boundaries, but not others. We formalize the intuitive argument we outline above—that the proportional odds model allows for precise estimation of the vector as long as at least one decision boundary is surrounded by reason- ably well-balanced outcomes, and this allows for improved estimation of rare probabilities at the end of the sequence—through theoretical results in Section 3.2. Motivated by this argument but skeptical of the proportional odds assumption holding exactly, we proposePRESTO in Section 3.3 and prove that it consis- tently estimates 1 ;:::; K1 under a sparsity assumption in Section 3.3.1. In Section 3.4 we demonstrate through synthetic and real data experiments thatPRESTO can outperform both logistic regression on the rare class and the proportional odds model, both in settings where the dierences in adjacent k vectors are sparse, as PRESTO assumes, and in settings where these dierences are not sparse. Before we move on from the introduction, we review related literature. 3.1.2 RelatedWork The diculty of classication with class imbalance has been well-known for decades. Johnson and Khosh- goftaar [97] provide a recent review focusing on deep learning methods for handling class imbalance, and they also provide references for many other ways of dealing with class imbalance. One particularly closely 53 related work is Owen [135], which explores how logistic regression handles a vanishingly rare class. A particularly popular approach, SMOTE [41], has its own recent review paper [62]. Tutz and Gertheiss [175] discuss the possibility of penalizing dierences in weights between adjacent models, including briey proposing an` 1 penalty between weights in corresponding categories for pro- portional hazard models, though this is not the focus of their article and they only mention the idea very briey without investigating it. Wurm, Rathouz, and Hanlon [195] propose a generalization of a proportional odds model (and imple- ment it in the R packageordinalNet) that allows for the possibility that adjacent categories have equal (or very close) weights, but their method diers from ours. 
The most closely related model Wurm, Rathouz, and Hanlon propose is an over-parameterizedsemi-parallel model with both a matrix of separate parame- ters for each level, an approach reminiscent of Peterson and Harrell Jr [139]. This results in more exible, less structured models than our approach, which assumes similarity between adjacent k vectors. Further, Wurm, Rathouz, and Hanlon [195] do not investigate the theoretical properties of their model, or the use of their model for improving estimates of rare event probabilities. Ugba, Mörlein, and Gertheiss [176] and Pößnecker and Tutz [141] implement an ` 2 rather than ` 1 penalty between weights in models for adjacent decision boundaries. However, these works also focus on ordinal regression more generally, while we focus both theoretically and in simulations on leveraging common classes to improve estimated probabilities of rare events. Further, the` 1 penalty, which imposes sparse dierences, allows the proportional odds assumption to hold for some features and decision bound- aries and not others, while the` 2 ridge penalty used by Ugba, Mörlein, and Gertheiss [176] (and previously proposed by [175, Section 4.2.2]) relaxes the proportional odds assumption for all features but regularizes the relaxation. The` 2 group lasso penalty used by Pößnecker and Tutz [141] can impose the proportional odds assumption for a given feature either at all decision boundaries or none of them, making it less exible thanPRESTO. 54 Besides the fused and generalized [172] lasso, our work relates more specically to the generalized fused lasso [85, 198]. Xin et al. [198] in particular propose and analyze an algorithm to solve a class of optimization problems similar to the PRESTO optimization problem, (3.7), with ` 1 fusion penalties. In contrast to the present work, Xin et al. [198] focus almost entirely on the properties of their algorithm. Further, PRESTO lies outside their class of optimization problems because PRESTO directly penalizes only the coecients in the rst decision boundary, not all of the coecients. This distinction is central to our proof strategy for Theorem 3.3.1; Xin et al. [198] do not prove the consistency of their method. Viallon et al. [183] provide theoretical results for the generalized fused lasso specically in the cases of linear and logistic regression, though not for ordinal regression. Viallon et al. [182] prove theoretical results for a broader class of generalized linear models that still does not include the proportional odds model or a generalization like PRESTO. Lastly, Ekvall and Bottai [56] prove theoretical results for a class of models in which PRESTO can be expressed, and indeed we leverage their results in proving our own theory, though they do not directly consider fusion penalties. 3.2 MotivatingTheory We present the following theoretical results to motivatePRESTO. The thrust of our motivation is as follows: 1. Logistic regression does arbitrarily badly as class imbalance worsens (Theorem 3.2.2). 2. However, as one would expect, a logistic regression model’s ability to estimate probabilities improves when the parameters are known (Theorem 3.2.3). 3. The proportional odds model allows for precise estimation of as long as two adjacent classes are reasonably common, even if the remaining classes are arbitrarily rare (Theorem 3.2.4). 4. 
Taking 2 and 3 together, our conclusion is that we can better estimate probabilities of rare events by using a method that leverages data from decision boundaries between abundant classes to better 55 estimate decision boundaries near rare classes. (Both the proportional odds model and PRESTO leverage the data in this way.) Before we present our results, we discuss the metrics we will use in our results and some of the as- sumptions we will make. 3.2.1 Preliminaries Our goal is to characterize and compare the prediction error of estimated conditional probabilities of a rare class from both logistic regression and the proportional odds model. There are many settings where estimating rare probabilities accurately (as opposed to, for example, predicting class labels accurately) is important. For example, in online advertising, advertisers bid on the price to display an ad to a given user. Advertisers could bid optimally if they knew the true probability each user would click a given ad, so they’d like to estimate these probabilities as precisely as possible [81, 203]. Another example is public policy, where scarce resources may be allocated based on estimated probability of bad outcomes [186]. To prioritize optimally, precisely estimated probabilities are needed, not just accurate labels. A natural metric in an estimation setting is mean squared error, E h (^ (x)(x)) 2 i , where (x) is the actual probability of a rare event conditional onx and ^ (x) is an estimate. Further, we leverage asymptotic statistics and present results for large-sample estimators. We dene the notions of asymptotic mean squared error we will use below: Denition3.2.1. Let ^ n be a maximum likelihood estimator for a parameter2R from a sample size ofn. Under regularity conditions, the sequence of random variablesf p n( ^ n )g converges in distribution to a Gaussian random variable. Then we dene theasymptoticmeansquarederror of ^ n to be (suppressing n from the notation) Asym:MSE( ^ ) :=E lim n!1 p n h ^ n i 2 : 56 Asymptotic metrics are commonly used to compare the performance of estimators. The asymptotic relative eciency of two estimators is the ratio of their asymptotic variances, Asym:Var( ^ ) := Var lim n!1 p n h ^ n i ; which is equal to Asym:MSE( ^ n ) for the (asymptotically unbiased) maximum likelihood estimators we consider. See Section 10.1.3 of Casella and Berger [39], Section 8.2 of Vaart [177], or Section 4.4.5 of Greene [73] for textbook-level discussions. The asymptotic MSE could also be used as an estimator of the MSE for large (but nite)n, under the heuristic reasoning that for largen, MSE( ^ ) = 1 n E p n h ^ i 2 1 n E lim n!1 p n h ^ i 2 = 1 n Asym:MSE( ^ ): See Section 4.4 of Greene [73], Section 7.3 of Hansen [75], or Section 3.5 of Wooldridge [193] for more discussion of this kind of nite-sample estimation using asymptotic quantities. We briey present and discuss some of our assumptions. • AssumptionX(A): The random vectorsx i 2R p are independent and identically distributed (iid) fori2f1;:::;ng, each with probability measuredF (x) with measurable, bounded supportS AR p , with Cov (X) positive denite. • Assumption Y (K): The response y i 2 f1;:::;Kg is distributed conditionally onx i as in the proportional odds model (3.1). (Note that ifK = 2, this is equivalent to the logistic regression model.) All classes have positive probability for allx on the support ofx i (equivalently, the intercepts strictly dier: 1 <:::< K1 .) 
57 AssumptionX(A) allows a very broad class of distributions, including both discrete and continuous random variables. Notice that the boundedness assumption withinX(A) implies that for ~ X := (1;X) (where 1 is ann-vector of ones), the matrixE[ ~ X ~ X > ] has a nite maximum eigenvalue. When we will refer to it, we call it max and writeAssumptionX(A; max ). From (3.2) we see that in the proportional odds model if the intercepts strictly dier ( 1 <:::< K1 ) then for anyx all of the classes have conditional probability strictly between 0 and 1. That said, if the support ofX is unbounded then all of the probabilities for individual classes can become arbitrarily close to 0 or 1. Under AssumptionX(A), however, we can strictly bound quantities like sup x2S f k (x)g (where k (x) :=p k (x)p k1 (x) =P(y =kjx)) away from 0 or 1. Theorem 3.2.3 holds under AssumptionX([0;1) p ), though for any boundedS R p , there is some nitea one could add to each coordinate to shiftS to a subset of [0;1) p ; Theorem 3.2.3 would then apply to these translated features. 3.2.2 Theorem3.2.2 It is well-known that class imbalance poses a major challenge for classiers. Theorem 3.2.2 exhibits this concretely for logistic regression. Theorem 3.2.2. Assume X(R p ; max ) and Y (2) hold. Let (x) := P(y = 2 j x), and assume that sup x2S (x) = rare for some rare 1=2. Then 1. for any xedv2R p+1 , 1 kvk 2 2 Asym:MSE (^ ; ^ > )v 1 max rare ; and 2. for any xedz2S, 58 Asym:MSE ^ (z) (z) 1 rare rare 1 max : Proof. Provided in Section B.2.2. To give an example of applying part 1 of this result, consider the choicev = (0; 1; 0;:::; 0). Then we have that Asym:MSE ^ 1 1=( max rare ), so ^ 1 (or any other estimated coecient) has arbitrarily large asymptotic mean squared error as rare vanishes. Part 2 shows that the same thing happens to the asymptotic mean squared error for the estimated probabilities of the logistic regression estimator, when scaled by(z). 3.2.3 Theorem3.2.3 Theorem 3.2.3 suggests a possible way to circumvent the problem of class imbalance. We compare the typical logistic regression intercept estimate ^ to the quasi-estimated estimator ^ q obtained when one estimates only the intercept of the logistic regression model with a known. We also compare the resulting estimators of conditional probabilities for anyz2 R p : the usual logistic regression estimator ^ (z) and ^ q (z), the estimator when is known. Theorem 3.2.3 proves the reasonable intuition that ^ q must be a better estimator than ^ , and likewise for ^ q (z) and ^ (z). Theorem 3.2.3. AssumeX([0;1) p ; max ) andY (2) hold. Let(x) := P(y = 2j x), and let min := inf x2S f(x)^ 1(x)g. Then 1. Asym:MSE(^ ) Asym:MSE(^ q ) [Asym:MSE(^ q )] 2 59 where := 4 2 min (1 min ) 2 kE [X]k 2 2 max ; and 2. For anyz2R p nfz g, where z := E [X(X)[1(X)]] E [(X)[1(X)]] ; it holds that Asym:MSE(^ q (z))< Asym:MSE(^ (z)): (Forz , the above holds with rather than<.) Proof. Provided in Section B.2.2. Examining the rst result, it is sensible that the lower bound for the gap between the asymptotic variances of the two estimators vanishes as min vanishes because if minf 1 (x); 1 1 (x)g becomes very small on the bounded support, then the imbalance between the two classes potentially becomes very large, and estimating the intercept becomes dicult regardless of whether or not is known. As the class balance improves ( min becomes closer to its upper bound 1=2), the guaranteed gap between Asym:MSE(^ ) and Asym:MSE(^ q ) becomes larger. 
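These results can also be seen empirically. The short R sketch below (purely illustrative; it does not compute the asymptotic quantities in the theorems) compares the ordinary logistic MLE of the intercept with the estimator that treats the slope vector as known by passing the true linear predictor as an offset. Theorem 3.2.3 says the latter should have smaller error, and making alpha more negative makes the positive class rarer so the gap can be watched directly.

# Compare the full logistic MLE of the intercept with the "beta known"
# estimator of Theorem 3.2.3, implemented via an offset. Illustrative only.
set.seed(1)
sim_once <- function(alpha, beta = c(1, -1), n = 5000) {
  X <- matrix(runif(n * 2), n, 2)
  y <- rbinom(n, 1, plogis(alpha + X %*% beta))
  a_full  <- coef(glm(y ~ X, family = binomial))[[1]]
  a_known <- coef(glm(y ~ 1, family = binomial,
                      offset = drop(X %*% beta)))[[1]]
  c(full = a_full, known_beta = a_known, rare_rate = mean(y))
}
res <- replicate(200, sim_once(alpha = -6))           # positive class is very rare
rowMeans((res[c("full", "known_beta"), ] + 6)^2)      # MSE of each intercept estimator
mean(res["rare_rate", ])                              # observed rarity of the positive class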
In addition to formally verifying intuition, Theorem 3.2.3 also quanties the estimation gap between the rare intercept estimators in terms of noteworthy parameters and shows that the assumptions needed for this intuition to hold are minimal. 60 3.2.4 Theorem3.2.4 Theorem 3.2.3 suggests that if only we could estimate very well, we could improve our estimated proba- bilities even in the face of class imbalance. Theorem 3.2.4 suggests a way to leverage abundant data among other classes to do this. In the proportional odds model (3.1),R p is partitioned into regions with separating hyperplanes dened by, which we note are Bayes decision boundaries: forx2 R p such that k + > x = 0, we have p k (x) = 1=2. Consider the setting of ordered categorical data generated by the proportional odds model with cate- gories 1 and 2 similarly common over the support of a bounded distribution ofx i and categories 3;:::;K all rare. In this setting, for many of the observed values ofx i , the probabilities of being in class 1 or 2 will both be close to 1/2. Intuitively it should be relatively easy to estimate and 1 , the parameters that dene the Bayes decision boundary between classes 1 and 2, and thereforep 1 (x i ). Theorem 3.2.3 suggests this should help us in estimating the rare class probabilities. In Theorem 3.2.4, we prove that even if class K becomes arbitrarily rare, as long as the rst two classes are reasonably well balanced, the proportional odds model still learns quite well. Theorem3.2.4. AssumeX(R p ) andY (3) hold. Assume for allx2S it holds thatj k (x) 1=2j for k2f1; 2g for some 2 (0; 1=2) and letM := sup x2S kxk 2 (notice thatX(R p ) ensures thatM <1). Suppose sup x2S f 3 (x)g = rare , where rare is no greater than min 8 > > < > > : 1 2 1 2 1 2 + ; min I 2 I 1 I > 1 I 1 1 3M 2 (M + 2) 9 > > = > > ; ; (3.5) where min () denotes the minimum eigenvalue of andI 2 I 1 I > 1 I 1 1 is a symmetric matrix composed of termsfromtheFisherinformationmatrixfortheproportionaloddsmodel(seethedenitionsofthesetermsin 61 Equations B.2.1, B.2.2, and B.2.3 in the appendix). Then there existsC <1 not depending on rare such that for any xedv2R p , 1 kvk 2 2 Asym:MSE v > ^ prop. odds C: Proof. Provided in Section B.3.3. Theorem 3.2.4 shows that in contrast to logistic regression, the proportional odds model still learns within a xed precision even as rare vanishes. Remark 3.2.5. We briey discuss the upper bound (3.5). For this bound to make sense, it must hold that the symmetric matrixI 2 I 1 I > 1 I 1 1 is positive denite so that its minimum eigenvalue is strictly positive. The matrixS :=I I 1 I > 1 I 1 1 is the Schur complement ofI 1 1 =M 1 in the submatrix 0 B B @ I 1 1 I > 1 I 1 I 1 C C A (3.6) of the Fisher information matrixI prop. odds (;) for the proportional odds model (see Lemma B.2.1 in the appendix). Note (3.6) is a principal submatrix of the positive deniteI prop. odds (;), so is positive denite by Observation 7.1.2 in Horn and Johnson [88]. From (B.2.1) we also know thatI 1 1 > 0, soS is positive denite by Theorem 1.12 in Zhang [202]. It seems plausible that I 2 I 1 I > 1 I 1 1 =S I 1 I > 1 I 1 1 is also positive denite becauseI is the inverse of the asymptotic covariance matrix of ^ ideal , the maxi- mum likelihood estimator of when 1 and 2 are known. 
We expect that Cov( ^ ideal ) would be small (and the eigenvalues ofI would be large) in this setting because we can estimate well due to the abundant 62 observations in classes 1 and 2 (ensured if is not too large), so we should be able to learn the decision boundary between these classes well. If the eigenvalues ofI are indeed large, it might be reasonable to expectI 2 I 1 I > 1 I 1 1 to be positive denite. In Sections B.3.1 and B.3.2 in the appendix, we present more detailed analysis as well as the results of synthetic experiments that indicate that it is plausible both that I 2 I 1 I > 1 I 1 1 is positive denite and that the upper bound (3.5) is reasonable. 3.3 PredictingRareEventsbyShrinkingTowardsproportionalOdds Theorems 3.2.3 and 3.2.4 suggest a path to improve estimated probabilities for a rare event that is at the end of an ordered sequence: use the more common events that come before it to improve the estimation of the decision boundary aecting the rare class. In practice, however, the proportional odds model assumption is strong and unlikely to hold in many settings. PRESTO allows for this assumption to be relaxed; instead of assuming the vectors governing the decision boundaries are identical, we assume they are in general dierent, but with dierences that are (approximately) sparse. One concrete model to motivate this is a relaxation of (3.4) along the lines of (3.3). Suppose thatU 1 := U as dened in (3.4) (with 1 =), and it still holds that an observation is in class 1 ifU 1 1 . However, fork2f2;:::;K 1g, outcomek is observed if and only if k U k < k1 + > k x for sparse vectors 2 ;:::; K1 2R p satisfying k = k k1 , soU k =U k1 + > k x fork2f2;:::;K 1g. Note that this is within the scope of (3.3), but we assume a structure on the diering k vectors rather than allowing for arbitrary dierences. 63 Assuming sparse dierences in adjacent k vectors in this way suggests the following optimization problem for dataX = (x 1 ;:::;x n ) > andy = (y 1 ;:::;y n ): arg min ; 1 n n X i=1 log F y i + > y i x i F y i 1 + > y i 1 x i + n 0 @ p X j=1 j j1 j + p X j=1 K1 X k=2 j jk j;k1 j 1 A ; (3.7) where we dene K :=1; 0 :=1 and 0 := 0. The penalties on thej j1 j terms are sucient to regularize all of the weights given the penalties on the dierence terms starting from the 1 vector, improving parameter estimation. Like the proportional odds model and the generalized lasso [172] opti- mization problem, (3.7) is strictly convex if and only if y i + > y i x i > y i 1 + > y i 1 x i for alli [142]. This can be violated if the decision boundaries, which are not parallel, cross in the support ofX. In Section 3.4.1, we discuss the practical issues this presents when implementing relaxed proportional odds models like PRESTO, and in the next section, we prove PRESTO is consistent relying in part on an assumption that these decision boundaries do not cross in the support ofX. See Appendix B.6 for details on how we estimatePRESTO in practice. 3.3.1 TheoreticalAnalysis In this section, we present Theorem 3.3.1, which shows thatPRESTO is a consistent estimator of 1 ;:::; K1 under suitable assumptions. Before stating Theorem 3.3.1, we present and briey discuss some of the new assumptions we will make. • AssumptionS(s;c): The distribution ofy i jx i is distributed according to thePRESTO likelihood (3.7), where the true coecients = > 1 ; > 2 ;:::; > K1 > 2 R p(K1) are s-sparse (have s nonzero entries for a xeds not increasing inn orp). Further,k k 1 c for a xedc. 
64 • AssumptionT (c): For all small enough > 0, for all2 R p(K1) withk k 1 it holds that none of the decision boundaries dened by and the true 1 ;:::; K1 cross inS. Also, max k2f1;:::;K1g j k jc. The xed sparsity assumption S(s;c) is helpful theoretically and also because without it in higher dimensions it becomes increasingly dicult to have nonparallel decision boundaries that do not cross. The rst part of AssumptionT (c) can be interpreted to mean that none of the decision boundaries cross “too closely" toS. Other than these aspects, AssumptionsS(s;c) andT (c) are mild. Theorem 3.3.1. In a setting with xedK 3 andp = p n !1 asn!1 and satisfyingp n C 1 n C 2 for someC 1 > 0 andC 2 2 (0; 1), consider estimatingPRESTO with penalty n =C 3 log(p n [K 1])=n for someC 3 > 0. Suppose AssumptionX(R pn ) holds and there is someC 4 <1 such that sup x2S kxk 1 C 4 and AssumptionsS(s;C 4 ) andT (C 4 ) hold. Assume for some xedb> 0 it holds that min := min k2f1;:::;Kg min ( k ) > b, where k := E x i x > i jy i =k . Then PRESTO is a consistent estimator of 1 ;:::; K1 . Proof. Provided in Section B.5. Theorem 3.3.1 shows that under fairly mild regularity conditions and a sparsity assumption in a high- dimensional setting,PRESTO consistently estimates all of the decision boundaries. That is, it is consistent both if the proportional odds assumption holds and in more exible settings, where the proportional odds model is unrealistic, under sparsity. Theorem 3.2.3 suggests this should be helpful for estimating rare class probabilities. The proof of Theorem 3.3.1 leverages recent theory developed for` 1 -penalized ordinal regression [56]. 65 1e−06 1e−05 1e−04 1e−03 PRESTO Logit PO Rare Probability MSE Rare Proportion: 0.7091% 0.1 1.0 10.0 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 0.7091% 0.1 1.0 10.0 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 0.4348% 0.1 0.3 1.0 3.0 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 1.1664% Figure 3.1: Top left: MSE of estimated rare class probabilities for each method across alln = 2500 observa- tions, across 700 simulations, in sparse dierences simulation setting of Section 3.4.1, for intercept setting yielding rare class proportions of about 0:71% on average and sparsity 1=2. Remaining plots: ratios of MSE for PRESTO divided by MSE of each other method for each of three sets of intercepts with sparsity 1=2 (PRESTO performs better if ratio is less than 1). All plots on log scale. 66 3e−06 1e−05 3e−05 1e−04 3e−04 PRESTO Logit PO Rare Probability MSE Rare Proportion: 0.616% 0.1 0.3 1.0 3.0 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 0.616% 0.01 0.10 1.00 10.00 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 0.364% 0.1 0.3 1.0 3.0 Logit PO MSE Ratio (PRESTO/Other) Rare Proportion: 0.994% Figure 3.2: Same plots as in Figure 3.1, but for uniform dierences synthetic experiment in Section 3.4.2. 67 3.4 Experiments To illustrate the ecacy of PRESTO, we conduct two synthetic experiments and also examine two real data sets. In Section 3.4.1, we generate randomy that have conditional probabilities based on a relaxation of the proportional odds model with sparse dierences between adjacent decision boundary parameter vectors, rather than parameterizing all decision boundaries with the same. This setting is well-suited to PRESTO. 
In Section 3.4.2, we show that PRESTO also performs well in a less favorable setting, where the dierences between adjacent decision boundaries are instead dense; nonetheless, PRESTO still out- performs logistic regression and proportional odds models. In Section 3.4.3 we compare the performance of PRESTO to logistic regression and the proportional odds model at estimating rare probabilities in a real data experiment. Finally, in Section 3.4.4 we conduct a second real data experiment on a data set of patients diagnosed with diabetes, where we vary the rarity of the outcome of interest. See Section B.6 of the appendix for all implementation details. The code generating all plots and tables is available at https://github.com/gregfaletto/presto. 3.4.1 SyntheticData: SparseDierencesSetting We repeat the following procedure for 700 simulations. First we generate data usingn = 2500,p = 10, andK = 4. We draw a randomX2 [1; 1] np , whereX ij Uniform(1; 1) for alli2f1;:::;ng and j 2f1;:::;pg. Theny2f1;:::;Kg n is generated according to a relaxation of the proportional odds model; instead of (3.1), we generate probabilities according to (3.3) where the k are generated in the following way for sparsity settings of 2 f1=3; 1=2g: rst, we generate 1 by taking the vector (0:5;:::; 0:5) > , but setting all of the entries equal to 0 randomly with probability 1 for each entry 68 independently. Then we set k = k1 + k ; k2f2;:::;K 1g; where k 2R p are iid random vectors for eachk2f2;:::;K 1g generated according to the following distribution: kj = 8 > > > > > > > > > < > > > > > > > > > : 0; with probability 1; 0:5; with probability=2; 0:5; with probability=2; j2f1;:::;pg: We consider three possible sets of intercepts: = (0; 3; 5); (0; 3:5; 5:5), and (0; 4; 6), so that the rst two categories are common and the remaining categories are rare. The nal rare class is the one of interest; in the three settings, the average proportions of observations falling in the rare class are 1:00%, 0:62%, and 0:37%, respectively, for the = 1=3 setting and 1:17%, 0:71%, and 0:43% for the = 1=2 setting. The fact that the decision boundaries may cross in the support ofX, which would mean that for suchx some class probabilities are dened to be negative, puts practical limits on the magnitude of k in simulations. (See Section 3.6.1 of [2] for a discussion of this point.) Also, for this reason, in each simulation we check whether or not the conditional probabilities are positive for each class for every sampledx; if not, we generate new 2 ;:::; K1 for a limited number of iterations, ending the simulation study in failure if no suitable k can be found in a reasonable number of attempts. The parameters we used generated positive probabilities for all observations across all simulations. We then estimate a model for each method; for logistic regression, we estimate the binary classication problem of whether or not each observation is in classK, and for proportional odds andPRESTO, we t a full model on allK responses. ForPRESTO, we use 5-fold cross-validation to choose a value of n among 20 choices, selecting the n with the best out-of-fold Brier score (other metrics, like negative log likelihood, failed because some values of n in some folds resulted in models yielding negative probabilities, so these other metrics were undened). 
The 20 candidate values of n are generated in the following way: the 69 largest n value, (20) n , is the smallest n for which all of the estimated sparse dierences equal 0; the smallest n value is set to (1) n = 0:01 (20) n , and the remaining n values are generated at equal intervals on a logarithmic scale between these two values. Each of these models yields estimated probabilities that each observation lies in classK. In the nal step of each simulation run, we compute the mean squared error of these estimated probabilities for each method. In Figure 3.1, we show boxplots of the empirical mean squared errors for each method in the setting where the rare class is observed in 0:71% of observations when = 1=2. In order to see how the methods compare pairwise on each simulation, we also show boxplots of the ratio between the mean squared error of PRESTO and the other two methods in each of the three simulation settings. We also conduct one- tailed pairedt-tests of the alternative hypothesis that the mean MSE for PRESTO is lower than each of the competitor methods in each setting; all 12 of thep-values (provided in Table B.1 of Appendix B.1) are below 0:01. Finally, in Appendix B.1 we also provide the means and standard errors for the MSE of each method in each simulation setting in Table B.2, as well as boxplots like the one in the top left corner of Figure 3.1 for the other two intercept settings and all boxplots for the = 1=3 setting. We see thatPRESTO typically estimates these rare probabilities better than logistic regression, which despite being correctly specied struggles with class imbalance and does not draw strength from estimating the easier decision boundary between classes 1 and 2, and the proportional odds model, whose assumptions are not satised in this setting. 3.4.2 SyntheticData: DenseDierencesSetting In real data sets the dierences between adjacent decision boundary parameter vectors may not always be exactly sparse, so we conduct another synthetic experiment in the same way as in Section 3.4.1, except 1j Uniform(:5;:5) and each kj Uniform(:5;:5), iid acrossj2f1;:::;pg andk2f2;:::;K 70 1g. We also add an extra intercept setting of (0; 2:5; 4:5). This yields average rare class proportions of 0:99%, 0:62%, and 0:36% using the same intercepts as the experiments in Section 3.4.1 and 1:60% in the new intercept setting. The uniformly distributed dierences can be considered “approximately" sparse in the sense that while no deviations will exactly equal 0, some will be large and important to estimate, and some will be essentially negligible. Figure 3.2 and Table 3.1 summarize the results, along with additional gures and tables in Appendix B.1. We again see thatPRESTO outperforms both competitor methods by statistically signicant margins. Table 3.1: Calculatedp-values for one-tailed pairedt-tests for uniform dierences simulation setting of Section 3.4.2 testing the alternative hypothesis that PRESTO’s rare probability MSE is less than each com- petitor method in each rarity setting. (Statistically signicant p-values indicate better performance for PRESTO). Rare Class Proportion Logitp-value POp-value 1.6% < 1e-10 < 1e-10 0.99% < 1e-10 < 1e-10 0.62% < 1e-10 < 1e-10 0.36% < 1e-10 0.000242 3.4.3 RealDataExperiment1: SoupTasting We conduct a real data experiment using the soup data set from the R ordinal package [143]. 
The data come from a study [44] of participants who tasted soups and responded whether they thought each soup was a reference product they had previously been familiarized with or a new test product. The respondents also stated how sure they were in their response on a three-level scale, yielding a total ofK = 6 possible ordered outcomes for n = 1847 observations. The outcome of interest corresponds to the respondent being sure the tasted soup was the reference and is observed in 228 observations (about 12% of the total). All of the features are categorical, and after one-hot encoding we have p = 22 binary features related to the soup, the respondent, and the testing environment ∗ . This may be a promising setting for PRESTO ∗ The categorical predictorsPRODID andRESP are omitted because in some splits not all levels of these features are observed in the training set, making it impossible to estimate parameters for these features. 71 because, while the responses have a well-dened ordering, it’s plausible that dierent features could have varying impacts at dierent levels of respondent certainty. We complete the following procedure 350 times: rst, we randomly split the data into training (90% of the data) and test (10%) sets. We estimate models using PRESTO, logistic regression, and the proportional odds model on the training data and evaluate on the test set. We are interested in the accuracy of the rare class probabilities, but we can’t evaluate rare probability MSE directly since we don’t observe the true probabilities. Brier score could be a reasonable proxy, but it is known to be a poor metric in the presence of class imbalance [17]. Instead we estimate rare probability MSE using the following procedure. For each method, we sort the estimated test set rare class probabilities in ascending order and assign the observations into 10 bins: the rst 1=10 observations go in the rst bin, and so on. Then we estimate the mean squared error of the estimated probabilities by 1 n P n i=1 (^ (i) 1 o b(i) ) 2 , where ^ (i) 1 is the estimated rare class probability for observation i and o b(i) is the observed rare class proportion in the bin containing observationi. This is similar to expected calibration error [129], though we use squared error rather than absolute error. 10 equal frequency bins follows the default of the R CalibratR package that implements expected calibration error [153]. By this metric, the mean error forPRESTO is 0.0096, 0.0157 for logistic regression and 0.0135 for pro- portional odds. Figure 3.3 displays boxplots of the results as in the synthetic experiments which indicate thatPRESTO typically outperforms the other methods. We do not reportp-values or standard errors since the observed samples are dependent (random splits of the same data set). 3.4.4 RealDataExperiment2: Diabetes We present another real data experiment using the data setPreDiabetes from the RMLDataR package [91]. This data set containsn = 3059 observations of patients who were eventually diagnosed with diabetes. Each observation consists of the age at which the patient was diagnosed with prediabetes and diabetes 72 0.00 0.01 0.02 0.03 PRESTO Logit PO Estimated Rare Probability MSE 0.3 0.5 1.0 Logit PO Estimated MSE Ratio (PRESTO/Other) Figure 3.3: Left: Estimated MSEs of estimated rare class probabilities for each method across 350 random draws of training and test sets in real data experiment from Section 3.4.3. 
Right: ratios of estimated MSE for PRESTO divided by MSE of each other method (PRESTO performs better if ratio is less than 1).

as well as p = 5 covariates. Given an age a, we form an ordinal variable based on the patient's status of non-diabetic, prediabetic, or diabetic at age a - 1. We do this for ages a ∈ {30, 35, 40, ..., 65}. The number of patients diagnosed with diabetes increases with a, so varying a allows us to change the rarity of the rarest class in a natural way. 0.92% of patients in the data were diagnosed with diabetes before age a = 30, and 50.93% of the patients were diagnosed with diabetes before age a = 65. We use PRESTO, logistic regression, and the proportional odds model to estimate the probability that each patient was diagnosed with diabetes before age a for each a. Much like our soup tasting data application, in each setting we take repeated random splits of the data, using 90% of the data selected at random for training and 10% for testing. In each iteration we again evaluate each method on the test data using the same estimator for mean squared error of the estimated rare class probabilities. We repeat this procedure 49 times in each of the 8 settings. We display the results in a plot in Figure 3.4. We also provide the mean MSEs for each method at each age cutoff in Table 3.2. We see that PRESTO outperforms both logistic regression and the proportional odds model in all of these settings. (For age cutoffs a = 29 and below we were unable to estimate the proportional odds model on all subsamples because of the difficulty of having at least one observation from each class in both the training and test sets for 49 random draws.)

Table 3.2: Estimated rare class MSE for each method at each age cutoff in the prediabetes real data experiment from Section 3.4.4.

Age cutoff   PRESTO     Logit      PO
30           0.000943   0.009609   0.009217
35           0.005013   0.023658   0.021740
40           0.017307   0.046828   0.048123
45           0.060896   0.115189   0.117525
50           0.124000   0.211083   0.213059
55           0.234906   0.336009   0.340130
60           0.353615   0.413534   0.418179
65           0.345535   0.448015   0.447290

Figure 3.4: Estimated MSEs of estimated rare class probabilities for each method and each age cutoff across 49 random draws of training and test sets in real data experiment from Section 3.4.4.

PRESTO seems to outperform the
For example, we set the rst de- cision boundary as the one that is directly penalized, with dierences from this boundary assumed to be sparse. This makes sense if the classes become increasingly rare and the rst decision boundary is the most balanced. However, it may make more sense to directly penalize whichever decision boundary has the best balance of observed responses on each side. Penalizing the dierences from this boundary might improve estimation since this decision boundary ought to be the easiest to estimate. This could improve estimation in settings like the real data experiment from Section 3.4.3 where the most balanced decision boundary is closer to the center of the responses. Also, in cases where the nal categories are very rare, a better bias/variance tradeo might be achieved by reimposing the proportional odds assumption, imposing an exact equality constraint for the last few 76 decision boundaries. In these settings, data might be too rare to hope for better estimation by relaxing the proportional odds assumption even with regularization. Lastly, in Section B.6 of the appendix we discuss possible faster approaches than the one used in the present work for solving the PRESTO optimization problem. 77 Chapter4 FusedExtendedTwo-WayFixedEectsforDierence-in-Dierences Abstract. The dierence-in-dierences method allows estimation of causal eects even under some forms of selection bias by comparing the change in the response variable for treated units to the change over the same time period for control units. This comparison can be calculated via a two-way xed eects lin- ear regression. However, recent research has found that the canonical dierence-in-dierences model provides biased treatment eect estimates when cohorts of units start receiving treatment at dierent times. Wooldridge [194] shows that this bias can be removed by adding more parameters to the regression model, but such models often have an excessive number of parameters to estimate relative to the number of observations, potentially trading large bias for substantial variance. This can be mitigated by assuming some of the parameters are equal, but one risks reintroducing bias if these assumptions are ad hoc. We propose a sparse shrinkage method, fused extended two-way xed eects (FETWFE), that fuses some of the parameters of the unbiased two-way xed eects model towards each other, nding data-driven esti- mates for which parameters are equal. We prove that under an appropriate sparsity assumption, FETWFE combines the parameters correctly with probability tending to one. We prove the consistency and asymp- totic normality of our estimated treatment eects, so our method allows for model selection, estimation, and asymptotically valid condence intervals in a single step without data splitting. We demonstrate our method in synthetic data experiments and a real data application. 78 4.1 Introduction In recent years, researchers and practitioners in a variety of disciplines have become increasingly inter- ested in causal inference with observational data. In general this is challenging because the dierence in observed outcomes between treated and control units may be explained by unobserved variables (“con- founding"). Causal inference methods must account for this bias to isolate the eect of the treatment. Thedierence-in-dierences method is an increasingly popular method for causal inference in observa- tional panel data settings, particularly in economics. SupposeN units are observed atT = 2 times. 
At time t = 2, some of the units receive a treatment while others do not. We will use the potential outcomes frame- work [130, 150], with the following notation. Fori2f1;:::;Ng =: [N] andt2 [T ], denote the treatment status byW i :W i = 2 for units who are treated at time 2 andW i = 0 for never-treated units. Lety (it) (W) be the response that would be observed for uniti at timet if they received treatmentW . Our goal will be to estimate the eect of the treatment on the treated group at time 2, := E[y i2 (2)y i2 (0)j W i = 2]. The fundamental problemof causalinference [86, 93] is that we never observe both potential outcomes for any one unit, so we must estimateE[y i2 (0)jW i = 2], the expected untreated potential outcome for units that receive the treatment. Consider the expected dierence in the changes of outcome from timet = 1 tot = 2 for the treated and untreated groups: E [y i2 (2)y i1 (2)jW i = 2]E [y i2 (0)y i1 (0)jW i = 0]: (4.1) We might expect the second dierence to only relate to any possible change in external conditions from time 1 to 2. If these changes aect both the treated and untreated units in the same way in expectation, we can isolate the eect of the treatment on the treated group by subtracting this term from the change in the treated group. 79 More formally, (4.1) is equal to under two assumptions. First, we will assume there isnoanticipation of the treatment from treated units, in the sense thatE[y i1 (2)y i1 (0)jW i = 2] = 0 almost surely—the potential outcome at time 1 is not aected by the treatment to come. Second, we will assume parallel or commontrends between the treated and untreated groups:E[y i2 (0)y i1 (0)jW i = 2] =E[y i2 (0)y i1 (0)j W i = 0] almost surely, so the expected change in the untreated potential outcome from time 1 to time 2 is independent of treatment assignment. Together, these assumptions imply that (4.1) is equal to: E [y i2 (2)y i1 (2)jW i = 2]E [y i2 (0)y i1 (0)jW i = 0] (a) = E [y i2 (2)jW i = 2]E [y i1 (0)jW i = 2]E [y i2 (0)y i1 (0)jW i = 0] (b) =E [y i2 (2)jW i = 2] +E[y i2 (0)y i1 (0)jW i = 0]E[y i2 (0)jW i = 2] E [y i2 (0)y i1 (0)jW i = 0] =E[y i2 (2)y i2 (0)jW i = 2] =; (4.2) where in step (a) we used the no-anticipation assumption and step (b) follows from re-arranging the common trends assumption. Since (4.1) depends on only the potential outcomes corresponding to the observed treatments, under these assumptions one can consistently estimate (4.1) by substituting the appropriate sample means for the conditional expectations. This calculation can equivalently [149, Section 2.4] be carried out by estimating the coecient in two-way xed eects linear regression y it = i + t + 1fW i = 2g1ft = 2g + it 80 by ordinary least squares (OLS), where i are xed eects corresponding to units, t are xed eects corresponding to time, 1fg is an indicator variable that equals 1 on the event and 0 otherwise, and it is a noise term. Having adopted a linear regression framework with time xed eects, we might want to generalize this method to investigate the eects of treatments over more than two time periods. Suppose we observe N units atT 2 times, where our theoretical results will assumeN grows asymptotically whileT is xed. We assume abalanced panel where all units are observed at all times, so we haveNT observations. Suppose R T 1 cohorts start receiving a treatment at timesr2R =fr 1 ;r 2 ;:::;r R gf2;:::;Tg. (This is sometimes called thestaggeredinterventions setting.) 
For example, we might want to investigate the effect on responses of a law or policy that is implemented state-by-state across time [35, 71, 28], or the effect of a product or service that is released across different regions over time [49]. We consider settings where the treatment is an absorbing state: once a unit starts being treated, they continue receiving treatment thereafter. We have a baseline comparison for all cohorts because no cohort enters at time $t = 1$, and we assume some units are never treated so we have a "control" group available at all times. For each unit, we encode the treatment information in a random variable $W_i \in \{0\} \cup \mathcal{R}$, where $W_i = r$ if unit $i$ is in cohort $r \in \mathcal{R}$ or $W_i = 0$ if unit $i$ is never treated.

In this setting, one might wonder whether, under suitable generalizations of the no anticipation and common trends assumptions, it is possible to consistently estimate the effect of the treatment by carrying out the linear regression
\[
y_{it} = \alpha_i + \gamma_t + \tau \sum_{r \in \mathcal{R}} \mathbf{1}\{W_i = r\} \mathbf{1}\{t \geq r\} + \epsilon_{it}. \quad (4.3)
\]
Even if the effect of treatment differs for different cohorts or with time since the start of treatment, one might hope that the estimated treatment effect under this model might aggregate these effects in a reasonable way to estimate the average treatment effect. However, a slew of recent research [49, 167, 71, 35, 28] shows that regression (4.3) generally results in biased estimates of the average treatment effect and proposes alternative estimators. Most of these estimators depart from OLS estimation, but Wooldridge [194] shows that linear regression can still be unbiased provided that enough parameters are estimated. Wooldridge proposes estimating a linear model on cohort dummy variables, time dummies, covariates, treatment indicators, and interactions of those terms: for all $i \in [N]$ and $t \in [T]$,
\[
y_{it} = \alpha + \gamma_t + X_i^\top (\kappa + \zeta_t) + \sum_{r \in \mathcal{R}} \mathbf{1}\{W_i = r\} \left( \eta_r + X_i^\top \rho_r + \mathbf{1}\{t \geq r\} \left( \tau_{rt} + \dot{X}_{(ir)}^\top \delta_{rt} \right) \right) + \epsilon_{it}, \quad (4.4)
\]
where $\gamma_t$ are fixed effect coefficients for times $\{2, \dots, T\}$, $X_i \in \mathbb{R}^d$ is a set of time-invariant (pre-treatment) covariates, $\kappa \in \mathbb{R}^d$ are coefficients, $\zeta_t \in \mathbb{R}^d$ and $\rho_r \in \mathbb{R}^d$ are time-specific and cohort-specific coefficients (coefficients on the interactions between the covariates and the time and cohort dummies), $\eta_r$ are cohort fixed effect coefficients (collinearity is avoided because of the presence of untreated units), $\tau_{rt}$ are base treatment effects, $\delta_{rt} \in \mathbb{R}^d$ are coefficients governing the interactions between treatment dummies and covariates, and $\dot{X}_{(ir)} := X_i - \mathbb{E}[X_i \mid W_i = r]$ are covariate vectors that are centered with respect to their cohort means. Wooldridge calls this the extended two-way fixed effects model [194, Equation 6.33] and shows that it is unbiased under generalizations of the no-anticipation and common trends assumptions (see Assumptions CNAS and CCTS in Section 4.2.1).

This is a mostly satisfactory solution, but as Wooldridge [194, Section 6.5] points out, the regression (4.4) may contain a large number of parameters to estimate if $R$, $T$, or $d$ is moderately large relative to $N$. In this case, ordinary least squares estimation of (4.4) may lead to very imprecise estimates of at least some of the parameters, trading the bias problem that model (4.3) has for a high-variance problem. Wooldridge [194] suggests some possible ways to mitigate this problem by imposing restrictions in order to reduce the number of parameters to be estimated. For example, one might assume that the treatment effects only vary across cohorts and not with time, so that $\tau_{rt} = \tau_r$ for all $r \in \mathcal{R}$, $t \geq r$.
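As a rough illustration of what estimating a specification in the spirit of (4.4) involves, the sketch below builds cohort-by-time treatment-cell dummies and cohort-centered covariates for a hypothetical long-format data frame and fits OLS with lm(). The column names and the formula are illustrative simplifications rather than the exact specification (4.4); in particular, the treatment–covariate interaction terms $\delta_{rt}$ could be added by also interacting the cell dummies with the centered covariate.

```r
# Hedged sketch of an OLS fit in the spirit of the extended TWFE regression (4.4).
# `df` is a hypothetical long-format panel with columns y, unit, time, cohort
# (0 = never treated, otherwise the period in which treatment starts), and a
# single covariate x; none of these names come from the dissertation's code.
df$x_c  <- df$x - ave(df$x, df$cohort)        # covariate centered within cohort
df$post <- as.integer(df$cohort > 0 & df$time >= df$cohort)
# One dummy per treated (cohort, time) cell; all untreated observations get "none"
df$cell <- relevel(factor(ifelse(df$post == 1,
                                 paste0("r", df$cohort, "_t", df$time),
                                 "none")),
                   ref = "none")

# The `cell` coefficients play the role of the base treatment effects tau_rt;
# the delta_rt terms could be added analogously via cell:x_c interactions.
etwfe_fit <- lm(y ~ factor(time) * x + factor(cohort) * x + cell, data = df)
```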
However, imposing these restrictions in anadhoc way could reintroduce bias if the wrong restrictions are imposed, or leave excessive variance if not enough restrictions are imposed. (Another work that considers such restrictions is [28].) 4.1.1 OurContributions We view the task of reducing the dimensionality of (4.4) through the lens of sparse regularization. In Section 4.4, we propose the fused extended two-way xed eects (FETWFE) model, which uses` q fusion penalties forq2 (0; 2] (bridge regression) to shrink certain sets of parameters in (4.4) towards each other. We structure these fusion penalties to shrink towards sensible restrictions, allowing the specic restrictions to be selected in a hands-o, data-driven way. We show how FETWFE can be used to estimate causal estimands in Section 4.3. In Section 4.5, we prove that under the same generalizations of the no-anticipation and common trends assumptions used by Wooldridge [194], along with a suitable sparsity assumption and regularity conditions, FETWFE learns the correct restrictions with probability tending to one (Theorem 4.5.2). FETWFE then estimates only the needed parameters, reducing estimation variance. We also prove that our proposed causal estimators are consistent (Theorem 4.5.1) and asymptotically normal (Theorem 4.5.3). Under the assumptions of Theorem 4.5.3 FETWFE is anoracleestimator [59, 207]: the asymptotic variance of its estimates depend only on the parameters of the smaller restricted model, and it converges at the same 1= p N rate as an OLS-estimated model on the true restricted model even if the number of covariates grows to innity withN. The data-driven restrictions FETWFE learns improve parameter estimation while preserving asymptotic unbiasedness and consistency, and allowing for the construction of asymptotically valid condence intervals, all in a single step without data splitting. In 83 Section 4.6 we demonstrate the superior performance of FETWFE in our setting against extended two-way xed eects (ETWFE), bridge-penalized ETWFE, and a slightly more exible model than (4.3) in synthetic data experiments. First, we discuss related work and some of our notation, and then we establish our setup and basic assumptions in Section 4.2. 4.1.2 RelatedWork Besides the works already cited, as noted previously, there has been an abundance of recent econometric research on dierence-in-dierences and it is not possible to cite all of the work in this stream; see [149] and De Chaisemartin and d’Haultfoeuille [48] for recent reviews, and [53] for one example of an even more recent approach. [22] also proposes a machine learning framework for dierence-in-dierences, though they do not mention fusion penalties. Recent work in the machine learning community on the related synthetic controls method for causal inference in panel data include [60, 52, 16, 87, 121, 1]. Other related works for causal inference in the panel data setting that use dierent approaches include [174, 192, 184, 43, 114, 23, 40, 33, 6], and [38]. Lastly, our work contributes to the much broader topic of estimating heterogeneous treatment eects (where in our setting there is heterogeneity in treatment eects not just across cohorts or time but also across the distribution of covariates). In Section 4.3, we cite a few prominent works in this stream of literature. Outside of causal inference, our approach is clearly closely related to both the fused lasso [170] and the generalized lasso [172]; see (4.21). 
Our work also draws much from the bridge regression theory of Kock [106], and, in turn, [90]. [98] proposes a linear panel data model with a fused lasso-like penalty, though not for causal inference. [136] propose a related Bayesian approach. [82] also propose a penalized panel data model that combines categories, but not in the causal inference setting. Similarly, [109] is a recently 84 proposed shrinkage estimator for linear panel data models (though not for causal inference) and cites more related work in this stream. FETWFE also relates more specically to the generalized fused lasso [85, 198]. Like FETWFE, the generalized fused lasso involves applying ` q penalties to the absolute dierences between a variety of kinds of pairs of coecients, though the generalized fused lasso only uses theq = 1 case. An algorithm designed to solve an optimization problem related to our optimization problem (4.21) was proposed by Xin et al. [198]. However, their optimization problem uses only` 1 penalties, as opposed to our bridge penalties, and Xin et al. [198] focus mostly on the properties of their algorithm. Some works that prove theoretical results for the generalized fused lasso include Viallon et al. [183] and Viallon et al. [182]. Viallon et al. [183] prove consistency for the generalized fused lasso under dierent assumptions from ours, including a restriction to the xedp case. They prove selection consistency and asymptotic normality only for an adaptive [207] version of their method. Again, they do not prove results for a more general bridge penalty. Viallon et al. [182] prove similar results (with similar limitations) for a broader class of generalized linear models. 4.1.3 Notation Throughout we will use subscripts like y (it) to refer to the i th unit at the t th time,y (i) 2 R T to refer to all response observations corresponding to unit i, and so on. For matrices, Z (it) refers to the row corresponding to uniti at timet,Z (i)j refers to theT observations of thej th feature in uniti, and so on. The covariates are time invariant, and it will be more convenient to refer to the covariates for uniti with the notationX i . 85 Recall that for a sequence of random variablesfX N g and a sequence of positive real numbersfa N g, we say thatX N =O P (a N ) if for every > 0 there exists a niteM > 0 andN > 0 such that for all N >N , P X N a N >M <: Ifa N ! 0,X N =O P (a N ) is sucient forX N p ! 0 by the following argument: it is enough to show that for every > 0 and > 0, there exists anN such thatN > N impliesP (jX N j>) < . Fix > 0. Sincea N ! 0, for any > 0 there exists anN such thatja N j < =M for allN > N . Then for all N > maxfN ;N g, P (jX N j>) =P X n a N > ja N j M <: 4.2 BasicSetupandAssumptions We will expand on the setup from the introduction. For eachi2 [N], assume y (i) =E[y (i) jW i ;X i ] +c i 1 T +u (i) ; wherec i 2R andu (i) 2R T are both random variables and 1 T is aT -vector with 1 in every entry. Assum- ing unobserved random eectsc i is a common way to allow for unobserved, time-invariant heterogeneity across units while maintaining a typical regression assumption of independent and identically distributed units; see [138, Section 26.5] for a textbook description and [106, Section 2.2] and [194, Theorem 6.2] for closely related works that adopt this model. 
Assumption (F1): $\mathbb{E}[c_i \mid W_i, X_i] = 0$, $\mathbb{E}[u_{(i)} \mid c_i, W_i, X_i] = 0$, $\mathrm{Var}(c_i \mid W_i, X_i) = \sigma_c^2$, and $\mathrm{Var}(u_{(i)} \mid c_i, W_i, X_i) = \sigma^2 I_T$ almost surely, where $\sigma_c^2$ and $\sigma^2$ are assumed to be known and finite and $I_T \in \mathbb{R}^{T \times T}$ is the identity matrix.

Under Assumption (F1) we can use a generalized least squares transform to create a data set with uncorrelated rows. Let $\epsilon_{(i)} := c_i 1_T + u_{(i)}$, and observe that for every $i \in [N]$, almost surely $\mathrm{Var}(\epsilon_{(i)} \mid W_i, X_i) = \sigma^2 I_T + \sigma_c^2 1_T 1_T^\top =: \sigma^2 \Sigma \in \mathbb{R}^{T \times T}$. Under Assumption (F1), for every $i \in [N]$, almost surely
\[
\mathbb{E}\left[\Sigma^{-1/2} \epsilon_{(i)} \mid W_i, X_i\right] = 0 \quad \text{and} \quad \mathrm{Var}\left(\Sigma^{-1/2} \epsilon_{(i)} \mid W_i, X_i\right) = \Sigma^{-1/2}\, \mathrm{Var}(\epsilon_{(i)} \mid W_i, X_i)\, \Sigma^{-1/2} = \sigma^2 I_T. \quad (4.5)
\]
We can carry out this transformation for any known, finite, symmetric positive definite conditional covariance matrix $\mathrm{Var}(\epsilon_{(i)} \mid W_i, X_i)$. All of our results can easily be adapted to apply for any such covariance, though we focus on the specified setting for concreteness.

Next we make an assumption on the joint distribution of the potential outcomes, treatment, and covariates. Let
\[
\pi_r(x) := \mathbb{P}(W_i = r \mid X_i = x), \quad r \in \{0\} \cup \mathcal{R}, \quad (4.6)
\]
be the conditional probability of a unit's treatment status.

Assumption (F2): The $N$ joint random variables $(W_i, X_i, c_i, u_{(i)})_{i=1}^N$ are independent and identically distributed (iid). Further, for all $r \in \{0\} \cup \mathcal{R}$, $\pi_r(X_i) \in (0, 1)$ almost surely, and $\pi_r(\cdot)$ does not vary with $N$. Finally, $\mathbb{E}[X_i X_i^\top \mid W_i = r]$ is finite and positive definite for all $r \in \{0\} \cup \mathcal{R}$.

Notice that we have made no assumptions on the relationship between the treatment assignments and the potential outcomes for a given unit; we do so in the next section.

4.2.1 Difference-in-Differences Assumptions

We will use the generalizations of the no anticipation and common trends assumptions described in the introduction—Assumptions (CNAS) and (CCTS), respectively—that were proposed by [194]. For $i \in [N]$, $t \in [T]$, and $r \in \mathcal{R}$, denote by $y_{(it)}(r)$ the potential outcome for unit $i$ at time $t$ if they were in cohort $r$, and let $y_{(it)}(0)$ be the potential outcome for a unit if they were never treated. These potential outcomes are well-defined since treatment probabilities are always between 0 and 1 under Assumption (F2).

Assumption (CNAS) ("conditional no-anticipation with staggered interventions"): For treatment cohorts $r \in \mathcal{R}$, almost surely
\[
\mathbb{E}\left[y_{(it)}(r) - y_{(it)}(0) \mid W_i = r, X_i\right] = 0 \quad \forall\, i \in [N],\ r \in \mathcal{R},\ t < r.
\]
This requires that before treatment, each individual's expected outcome is identical to what it would have been if the unit were never treated. As Wooldridge [194, Section 6.3] points out, assuming that $y_{(it)}(r) = y_{(it)}(0)$ almost surely for $t < r$ is sufficient for (CNAS). Strengthening the typical no-anticipation assumption to hold conditional on $X_i$ allows us to estimate heterogeneous treatment effects that vary with $X_i$. See further discussion in Wooldridge [194, Sections 5.3, 6.1, 6.3]. Assumptions (F2) and (CNAS) are sufficient for the stable unit treatment value assumption [93, Section 1.6].

Assumption (CCTS) ("conditional common trends with staggered interventions"): Almost surely
\[
\mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid W_i, X_i\right] = \mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid X_i\right], \quad i \in [N],\ t \in \{2, \dots, T\}.
\]
This requires that each unit's untreated potential outcomes $y_{(it)}(0)$ follow the same trend over time in expectation regardless of when or if they are treated. Part of the appeal of the difference-in-differences method in general, and FETWFE in particular, is that Assumption (CCTS) is weaker than an unconfoundedness assumption [93, Def. 3.6] that is common in causal inference, which requires that treatment assignment is independent of all of the potential outcomes conditional on $X$.
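The whitening step in (4.5) only requires a symmetric inverse square root of $\Sigma$, which can be computed from an eigendecomposition. The sketch below uses illustrative values of $T$, $N$, $\sigma^2$, and $\sigma_c^2$ (the latter two treated as known, as in Assumption (F1)), checks that the transformed errors have conditional covariance $\sigma^2 I_T$, and forms the stacked transform $I_N \otimes \Sigma^{-1/2}$ that reappears as $G_N$ in (4.20).

```r
# Hedged sketch of the whitening transform in (4.5); T_, N_, sigma2, and sigma2_c
# are illustrative values (sigma^2 and sigma_c^2 are treated as known, per (F1)).
T_ <- 4; N_ <- 3; sigma2 <- 1; sigma2_c <- 0.5
Sigma <- diag(T_) + (sigma2_c / sigma2) * matrix(1, T_, T_)  # sigma^2 * Sigma = Var(eps_i)
eig   <- eigen(Sigma, symmetric = TRUE)
Sigma_inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)

# Whitened errors have conditional covariance sigma^2 * I_T:
round(Sigma_inv_sqrt %*% (sigma2 * Sigma) %*% Sigma_inv_sqrt, 10)

# Stacked over units, the same transform applied to y and Z is the Kronecker
# product I_N (x) Sigma^{-1/2}, i.e., the matrix G_N used later in (4.20):
G_N <- diag(N_) %x% Sigma_inv_sqrt
dim(G_N)   # (N*T) x (N*T)
```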
Under Assumption (CCTS) there can still be systematic differences in the potential outcomes between treated and untreated units conditional on $X$, provided that those differences are time-invariant [149, Section 2.3]. Causal inference on observational data typically requires untestable assumptions like unconfoundedness [93, Section 12.2.3] or (CCTS), which is only testable if we already assume no anticipation [194, Section 7]. So making these assumptions as weak as possible is an important goal.

Finally, because we are including covariates in our model, we require an assumption on the functional forms of the conditional distributions of the untreated potential outcomes, the trends $y_{(it)}(0) - y_{(i1)}(0)$, and the treatment effects.

Assumption (LINS): Assume we observe $d_N$ covariates, where $d_N$ may be fixed or increasing in $N$. For any fixed $x \in \mathbb{R}^{d_N}$ in the support of the random vector $X_i$, there exist fixed parameters $\alpha \in \mathbb{R}$, $\{\eta_r\}_{r \in \mathcal{R}} \subseteq \mathbb{R}$, $\kappa \in \mathbb{R}^{d_N}$, $\{\rho_r\}_{r \in \mathcal{R}} \subseteq \mathbb{R}^{d_N}$, $\{\gamma_t\}_{t \in \{2, \dots, T\}} \subseteq \mathbb{R}$, $\{\zeta_t\}_{t \in \{2, \dots, T\}} \subseteq \mathbb{R}^{d_N}$, $\{\tau_{rt}\}_{r \in \mathcal{R}, t \geq r} \subseteq \mathbb{R}$, and $\{\delta_{rt}\}_{r \in \mathcal{R}, t \geq r} \subseteq \mathbb{R}^{d_N}$ such that almost surely for all $i \in [N]$
\[
\mathbb{E}\left[y_{(i1)}(0) \mid W_i = 0, X_i = x\right] = \alpha + x^\top \kappa, \quad (4.7)
\]
\[
\mathbb{E}\left[y_{(i1)}(0) \mid W_i = r, X_i = x\right] = \alpha + \eta_r + x^\top (\kappa + \rho_r) \quad \forall\, r \in \mathcal{R}, \quad (4.8)
\]
\[
\mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid X_i = x\right] = \gamma_t + x^\top \zeta_t, \quad t \in \{2, \dots, T\}, \text{ and} \quad (4.9)
\]
\[
\mathbb{E}\left[y_{(it)}(r) - y_{(i1)}(r) \mid W_i = r, X_i = x\right] - \mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid W_i = 0, X_i = x\right] = \tau_{rt} + \dot{x}_r^\top \delta_{rt}, \quad r \in \mathcal{R},\ t \geq r, \quad (4.10)
\]
where $\dot{x}_r := x - \mathbb{E}[X_i \mid W_i = r]$ are covariate vectors that are centered with respect to their cohort means.

Notice the similarity of the quantity on the left side of (4.10) to the traditional difference-in-differences estimand (4.1). Wooldridge [194] shows that Assumption (LINS), along with Assumptions (CNAS) and (CCTS), makes (4.4) an unbiased estimator of the potential outcomes (and, in turn, the treatment effects); see also our derivation in Section C.4.

4.3 Causal Estimands and Estimators

Next we will discuss a variety of causal estimands we hope to estimate. The building block of our estimands will be
\[
\tau_{\mathrm{CATT}}(r, x, t) := \mathbb{E}\left[y_{(it)}(r) - y_{(it)}(0) \mid W_i = r, X_i = x\right], \quad i \in [N],\ r \in \mathcal{R},\ t \in \{r, \dots, T\}, \quad (4.11)
\]
the conditional average treatment effect on the treated units, which matches definition (6.23) in [194]. Estimating $\tau_{\mathrm{CATT}}(r, x, t)$ is useful if we are interested in identifying which units are most affected by the treatment at a specific time. Broadly, there has been increasing interest in conditional average (or heterogeneous) treatment effects over the last decade or so as machine learning methods and large data sets have made estimation of these effects more feasible. Applications include personalized medicine [61, 65, 112, 64], personalized education [128], targeted marketing [83, 9, 57, 199, 29], and social science and public policy [197].

This estimand turns out to be estimable by regression (4.4). As we already mentioned, (4.4) is an unbiased estimator of the regression estimands from Assumption (LINS) (see Section C.4). Further, using a similar derivation to the one yielding (4.2), we have that for all $r \in \mathcal{R}$ and $t \geq r$,
\[
\begin{aligned}
\tau_{rt} + \dot{x}_r^\top \delta_{rt} &\overset{(a)}{=} \mathbb{E}\left[y_{(it)}(r) - y_{(i1)}(r) \mid W_i = r, X_i = x\right] - \mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid W_i = 0, X_i = x\right] \\
&\overset{(b)}{=} \mathbb{E}\left[y_{(it)}(r) \mid W_i = r, X_i = x\right] - \mathbb{E}\left[y_{(i1)}(0) \mid W_i = r, X_i = x\right] - \mathbb{E}\left[y_{(it)}(0) - y_{(i1)}(0) \mid W_i = 0, X_i = x\right] \\
&\overset{(c)}{=} \mathbb{E}\left[y_{(it)}(r) \mid W_i = r, X_i = x\right] - \mathbb{E}\left[y_{(it)}(0) \mid W_i = r, X_i = x\right] = \tau_{\mathrm{CATT}}(r, x, t), \quad (4.12)
\end{aligned}
\]
where in (a) we used (4.10), in (b) we used (CNAS), and (c) follows from rearranging (CCTS). The flexibility of (4.4) in allowing for treatment effects that vary with cohort and over time addresses the bias of the overly rigid model (4.3).
Further, we can use (4.11) as a building block to construct a variety of causal estimands. We can start by averaging the treatment effects across a cohort to define the conditional average treatment effect on the treated units in cohort $r$:
\[
\tau_{\mathrm{CATT}}(r, x) := \frac{1}{T - r + 1} \sum_{t=r}^{T} \mathbb{E}\left[y_{(it)}(r) - y_{(it)}(0) \mid W_i = r, X_i = x\right] = \frac{1}{T - r + 1} \sum_{t=r}^{T} \tau_{\mathrm{CATT}}(r, x, t), \quad r \in \mathcal{R}. \quad (4.13)
\]
This estimand represents the average treatment effect across time for a unit in cohort $r$ with covariates $x$. Next, we can get a conditional average treatment effect by marginalizing (4.13) across treatment status:
\[
\tau_{\mathrm{CATT}}(x) := \mathbb{E}\left[\tau_{\mathrm{CATT}}(W_i, X_i) \mid W_i \neq 0, X_i = x\right] = \sum_{r \in \mathcal{R}} \tilde{\pi}_r(x)\, \tau_{\mathrm{CATT}}(r, x), \quad (4.14)
\]
where for any $r \in \mathcal{R}$
\[
\tilde{\pi}_r(x) := \mathbb{P}(W_i = r \mid W_i \neq 0, X_i = x) = \left( \sum_{r' \in \mathcal{R}} \pi_{r'}(x) \right)^{-1} \pi_r(x) \quad (4.15)
\]
is the probability a unit is in cohort $r$ given that they are treated, conditional on $x$. This estimand is closely related to many previously considered heterogeneous treatment effects; one of many works that considers a similar estimand (outside of the specific context of difference-in-differences) is Wager and Athey [187, Equation 1].

We can also marginalize these conditional treatment effects over the distribution of $X_i$. One of the most basic causal parameters of interest is the average effect of treatment on the treated group (ATT), which we defined in the introduction in (4.1) for the less flexible canonical difference-in-differences setup that requires stronger assumptions than our setting. We can express this estimand as
\[
\tau_{\mathrm{ATT}} := \mathbb{E}\left[\tau_{\mathrm{CATT}}(X_i) \mid W_i \neq 0\right] = \mathbb{E}\left[\left. \sum_{r \in \mathcal{R}} \tilde{\pi}_r(X_i)\, \tau_{\mathrm{CATT}}(r, X_i)\ \right|\ W_i \neq 0\right]. \quad (4.16)
\]
Estimand (4.16) is similar to estimand (6.40) proposed by Wooldridge [194], though we weight the cohort average treatment effects $\tau_{\mathrm{CATT}}(r, X_i)$ by the probabilities $\tilde{\pi}_r(X_i)$ rather than taking an average where each average treatment effect in each time period is weighted equally. The virtues of estimand $\tau_{\mathrm{ATT}}$ include summarizing the effect of the treatment in a single number and potentially being a less noisy quantity to estimate. Estimating $\tau_{\mathrm{ATT}}$ will involve taking a weighted average over many estimated coefficients rather than relying on a single, less precise, estimated coefficient. In contrast, more specific estimands like $\tau_{\mathrm{CATT}}(r, x, t)$ will typically be difficult to estimate with precision because their estimators will depend heavily on a small subset of the observed data (though this problem will be mitigated by our fusion penalty, which allows us to borrow information from nearby estimators).

Our last two estimands similarly marginalize $\tau_{\mathrm{CATT}}(W_i, X_i)$ and $\tau_{\mathrm{CATT}}(W_i, X_i, t)$ over the distribution of $X$:
\[
\tau_{\mathrm{ATT}}(r, t) := \mathbb{E}\left[y_{(it)}(r) - y_{(it)}(0) \mid W_i = r\right] = \mathbb{E}\left[\tau_{\mathrm{CATT}}(r, X_i, t) \mid W_i = r\right], \quad r \in \mathcal{R},\ t \geq r, \quad (4.17)
\]
\[
\tau_{\mathrm{ATT}}(r) := \frac{1}{T - r + 1} \sum_{t=r}^{T} \mathbb{E}\left[y_{(it)}(r) - y_{(it)}(0) \mid W_i = r\right] = \frac{1}{T - r + 1} \sum_{t=r}^{T} \tau_{\mathrm{ATT}}(r, t) = \mathbb{E}\left[\tau_{\mathrm{CATT}}(r, X_i) \mid W_i = r\right], \quad r \in \mathcal{R}. \quad (4.18)
\]
Estimand (4.17) matches the estimand from [194, Equation 6.2] and is also closely related to the cohort-specific average treatment effect on the treated (3) from [167] (though we use the acronym CATT to refer to conditional, not cohort, effects) and estimands from [71] and [35]. Estimand (4.18) is identical to estimand (3.7) in Callaway and Sant'Anna [35], and is also the estimand of estimator (6.41) in Wooldridge [194]. Both (4.17) and (4.18) also fall in the framework of estimation targets of Borusyak, Jaravel, and Spiess [28]. A toy numerical illustration of the aggregation in (4.14) and (4.15) follows.
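In the sketch below, every number is hypothetical; it simply evaluates (4.15) and then (4.14) for a unit with covariates $x$.

```r
# Toy sketch of the aggregation in (4.14)-(4.15); every number here is hypothetical.
pi_x     <- c("0" = 0.40, "2" = 0.25, "3" = 0.35)  # pi_r(x) for r in {0} union R
catt_rx  <- c("2" = 1.2, "3" = 0.8)                # hypothetical tau_CATT(r, x) values
pi_tilde <- pi_x[names(catt_rx)] / sum(pi_x[names(catt_rx)])  # condition on W_i != 0
catt_x   <- sum(pi_tilde * catt_rx)                # tau_CATT(x) as in (4.14)
catt_x
```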
Notice that $\tau_{\mathrm{ATT}}$ can be obtained not just by marginalizing $\tau_{\mathrm{CATT}}(X_i)$ over the distribution of $X$, but also by marginalizing $\tau_{\mathrm{ATT}}(W_i)$ over the distribution of cohort assignments:
\[
\tau_{\mathrm{ATT}} = \mathbb{E}\left[\tau_{\mathrm{CATT}}(W_i, X_i) \mid W_i \neq 0\right] = \mathbb{E}\left[\mathbb{E}\left[\tau_{\mathrm{CATT}}(W_i, X_i) \mid W_i\right] \mid W_i \neq 0\right] = \sum_{r \in \mathcal{R}} \tilde{\pi}_r\, \mathbb{E}\left[\tau_{\mathrm{CATT}}(W_i, X_i) \mid W_i = r\right] = \sum_{r \in \mathcal{R}} \tilde{\pi}_r\, \tau_{\mathrm{ATT}}(r),
\]
where $\tilde{\pi}_r := \mathbb{P}(W_i = r \mid W_i \neq 0) = \mathbb{E}[\tilde{\pi}_r(X_i) \mid W_i \neq 0]$ is the marginal probability of a unit being in cohort $r$ given that it is treated. This shows that estimand (4.16) matches estimand (3.11) from Callaway and Sant'Anna [35]. This formulation will also be useful for estimation.

4.4 Fused Extended Two-Way Fixed Effects

We will carry out regression on the same design matrix $Z$ proposed by [194] to estimate regression (4.4) by OLS, but we will add fusion penalties to leverage our belief that some of the parameters are equal to each other. This design matrix contains $R$ columns of cohort dummies, $T - 1$ columns of time dummies for times $\{2, \dots, T\}$, $d_N$ covariates (where, unlike [194], we will allow the number of covariates $d_N$ to grow asymptotically with $N$ subject to regularity conditions), $\mathcal{W} := \sum_{r \in \mathcal{R}} (T - r + 1)$ treatment dummies for each possible base treatment effect $\tau_{rt}$, and $d_N(R + T - 1 + \mathcal{W})$ interactions between covariates and the cohort dummies, time dummies, and treatment dummies. As is typical in penalized regression, we will center $y$ before estimating our regression. It is not necessary to estimate the intercept term in order to estimate the treatment effects, but if this is desired, one can estimate an unpenalized intercept by $\hat{\alpha} = \bar{y} - \bar{X}^\top \hat{\beta}^{(q)}$, where $\bar{y}$ and $\bar{X}$ are the sample means (of the response and of the design matrix columns, respectively) and $\hat{\beta}^{(q)}$ is the estimated coefficient vector [79, Section 2.2].

Let $p_N$ be the number of columns of $Z$; that is, $Z \in \mathbb{R}^{NT \times p_N}$. We will assume $p_N \leq NT$ (see Assumption R2 later), but as we discussed in the introduction, it is clear that $p_N = R + T - 1 + d_N + \mathcal{W} + d_N(R + T - 1 + \mathcal{W})$ may be fairly large compared to $NT$, particularly if $d_N$ is not very small, which could lead to imprecise treatment effect estimates if (4.4) is estimated by OLS. Wooldridge [194, Section 6.5] suggests alleviating this problem by assuming restrictions in the model—assuming some of these parameters equal each other. Some of the possibilities Wooldridge suggests include assuming that the treatment-covariate interactions $\delta_{rt}$ are fixed across time or assuming the within-cohort treatment effects $\tau_{rt}$ are fixed across time. Goodman-Bacon [71] also considers the possibility of treatment effects that are fixed across time. Another weaker assumption Wooldridge [194] proposes is assuming that different treatment effects do not need to be estimated for every individual time period since treatment starts; instead, these times could be condensed into "early treated" and "late treated" effects, for example.

These suggestions seem sensible, but making these simplifying assumptions without strong justification may re-introduce the bias that was removed by adding these parameters to begin with. In settings where a practitioner has no clear reason (besides a desire to improve estimation precision) to assume that these parameters may be equal, we instead propose a data-driven approach to estimate which, if any, of these parameters may be equal. We will assign $\ell_q$ fusion [170] penalties to the differences between estimated parameters which may be equal, and then carry out a penalized linear regression; a toy illustration of such a fusion penalty is given below.
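The following sketch (with made-up numbers, and $q = 0.5$ as used later in Section 4.6) applies an $\ell_q$ fusion penalty to the first differences of a single cohort's treatment effects over time: differences that are exactly zero contribute nothing to the penalty, which is what allows the optimization to fuse equal adjacent parameters.

```r
# Toy sketch of an l_q fusion penalty on adjacent differences (hypothetical values).
q     <- 0.5
tau_r <- c(1.0, 1.0, 1.0, 1.4, 1.4)      # one cohort's treatment effects over time
D_blk <- diff(diag(length(tau_r)))       # rows compute tau_{r,t} - tau_{r,t-1}
abs(D_blk %*% tau_r)^q                   # only one difference is nonzero
sum(abs(D_blk %*% tau_r)^q)              # this block's contribution to ||D_N beta||_q^q
```

The full differences matrix $D_N$ defined below stacks many such blocks (across cohorts, time effects, covariates, and their interactions) along with direct penalties on a handful of base terms.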
For example, for the treatment eects, we propose penalizing the absolute dierences between treat- ment eects within the same cohort at adjacent times,j r k ;t r k ;t1 j, and we propose penalizing a similar term for interaction eects,j r k ;t;j r k ;t1;j j separately for eachj2 [d N ]. This leverages our belief that 94 some of these eects may be xed across time while allowing the optimization problem to learn from the data which restrictions to choose. We also propose penalizing the absolute dierences between initial treatment eects for each cohort, j r k ;r k r k1 ;r k1 j, and likewise we penalize the absolute dierences between adjacent initial interaction eectsj r k ;r k ;j r k1 ;r k1 ;j j for eachj. This is motivated by the idea that perhaps time since beginning treatment could be the only dimension in which treatment eect changes. It is also natural to penalize the adjacent cohort xed eects absolute dierencesj r k r k1 j and time xed eects absolute dierences j t t1 j towards each other, with analogous penalties for the interaction termsj r k j r k1 ;j j and j tj t1;j j. Finally, we may also want to regularize the coecients directly, as is common in regressions with large numbers of parameters to estimate relative to the number of observations. Because of the fusion penalties we employ, regularizing only a handful of base terms allows for this regularization. We can also penalize all of the base covariate coecients directly, allowing us to screen out irrelevant covariates asd N grows. In summary, we propose estimating a model like (4.4) but with a penalty term p(;q) := R X k=2 j r k r k1 j q +j r R j q + T X t=3 j t t1 j q +j T j q + d N X j=1 j j j + R X k=2 j r k ;j r k1 ;j j q +j r R ;j j q + T X t=3 j tj t1;j j q +j Tj j q ! +j r 1 ;r 1 j q + R X k=2 0 @ j r k ;r k r k1 ;r k1 j q + X tr k +1 j r k ;t r k ;t1 j q 1 A + d N X j=1 2 4 j r 1 ;r 1 ;j j q + R X k=2 0 @ j r k ;r k ;j r k1 ;r k1 ;j j q + X tr k +1 j r k ;t;j r k ;t1;j j q 1 A 3 5 ; where2R p N collects all of the coecients. We can more compactly express this as p(;q) =kD N k q q ; (4.19) 95 whereD N 2R p N p N is a suitably dened dierences matrix which we later dene explicitly in (C.3.3). It is much easier to see that these candidate restrictions are plausible due to the time-structured nature of the parameters than it is to know exactly which restrictions to choose. FETWFE allows practitioners to leverage this structure to improve estimation eciency without requiring them to choose restrictions by hand, which could introduce bias if too many restrictions are chosen or compromise eciency if too few restrictions are chosen. Finally, we also need to account for the random eects discussed in Section 4.2. As suggested by (4.5), in practice we will arrange the rows ofZ to contain consecutive blocks ofT rows for each observation and carry out a regression ony andZ transformed by G N :=I N 1=2 2R NTNT ; (4.20) where denotes the Kronecker product (recall that was dened in Section 4.2). We will also center the responsey, as is common in penalized regression. That is, our optimization problem is ^ (q) := arg min 2R p N kG N ( _ yZ)k 2 2 + N kD N k q q ; (4.21) where _ y :=yy is the centered response and N is a tuning parameter. Finally, we denote the regression estimand as N := arg min 2R p N kG N ( _ yZ)k 2 2 and (4.22) :=E[y it ]E[Z (it) ] > N : As mentioned earlier, it turns out that the components of N are the quantities ; ; , and so on from Assumption (LINS). 
4.4.1 Causal Estimators

As shown in (4.12), under our assumptions $\tau_{\mathrm{CATT}}(r, x, t) = \tau_{rt} + (x - \mathbb{E}[X_i \mid W_i = r])^\top \delta_{rt}$ for any fixed $x$. When at least one unit is observed in cohort $r$, it is natural to estimate this quantity by
\[
\hat{\tau}_{\mathrm{CATT}}(r, x, t) := \hat{\tau}^{(q)}_{rt} + \left(x - \bar{X}_r\right)^\top \hat{\delta}^{(q)}_{rt}, \quad r \in \mathcal{R},\ t \geq r, \quad (4.23)
\]
where $\hat{\tau}^{(q)}_{rt}$ and $\hat{\delta}^{(q)}_{rt}$ are the estimators of $\tau_{rt}$ and $\delta_{rt}$ from $\hat{\beta}^{(q)}$ as defined in (4.21) and
\[
\bar{X}_r = \frac{1}{N_r} \sum_{\{i:\, W_i = r\}} X_i
\]
is the sample mean of the covariates for units observed in cohort $r$, with $N_r := \sum_{i=1}^N \mathbf{1}\{W_i = r\}$ the total number of units in cohort $r$. The forms of (4.13) and (4.14) then naturally suggest the estimators
\[
\hat{\tau}_{\mathrm{CATT}}(r, x) := \frac{1}{T - r + 1} \sum_{t=r}^{T} \hat{\tau}_{\mathrm{CATT}}(r, x, t), \quad r \in \mathcal{R}, \text{ and} \quad (4.24)
\]
\[
\hat{\tau}_{\mathrm{CATT}}(x) := \sum_{r \in \mathcal{R}} \frac{\hat{\pi}_r(x)}{\sum_{r' \in \mathcal{R}} \hat{\pi}_{r'}(x)}\, \hat{\tau}_{\mathrm{CATT}}(r, x), \quad (4.25)
\]
where $\hat{\pi}_r(x)$ is a consistent estimator of the conditional cohort probabilities $\pi_r(x)$. We will discuss what properties this estimator $\hat{\pi}_r(x)$ should have in more detail in Section 4.5.1.

Next we discuss estimating the average treatment effects marginalized over $X$. As Wooldridge [194, Sections 6.3 and 5.3] points out (and we show formally in our setting in the proof of Theorem 4.5.1), because the treatment effects are interacted with the covariates centered with respect to their cohort means, we can estimate the average treatment effects using only the $\hat{\tau}^{(q)}_{rt}$ terms. In particular, we propose the estimators
\[
\hat{\tau}_{\mathrm{ATT}}(r, t) := \hat{\tau}^{(q)}_{rt}, \quad r \in \mathcal{R},\ t \geq r, \text{ and} \quad (4.26)
\]
\[
\hat{\tau}_{\mathrm{ATT}}(r) := \frac{1}{T - r + 1} \sum_{t=r}^{T} \hat{\tau}^{(q)}_{rt} \quad (4.27)
\]
for estimands (4.17) and (4.18), respectively. Estimator (4.27) corresponds to estimator (6.41) proposed by [194], and estimator (4.26) was also proposed by Wooldridge [194, Section 6.3].

Finally, to estimate (4.16), the average treatment effect marginalized across cohorts and covariates $\tau_{\mathrm{ATT}}$, we could marginalize (4.25) over the distribution of the covariates, but instead we will marginalize (4.27) over the distribution of cohort probabilities by using the observed cohort proportions to estimate the marginal cohort probabilities:
\[
\hat{\tau}_{\mathrm{ATT}} := \sum_{r \in \mathcal{R}} \frac{N_r}{N_{\mathrm{treat}}}\, \hat{\tau}_{\mathrm{ATT}}(r), \quad (4.28)
\]
where $N_{\mathrm{treat}} := \sum_{i=1}^N \mathbf{1}\{W_i \neq 0\} = \sum_{r' \in \mathcal{R}} N_{r'}$ is the total number of treated units. This allows us to avoid making an assumption about the functional form of $x \mapsto \pi_r(x)$ unless we are estimating conditional average treatment effects.

4.5 Theory

Before we present our theoretical results, we will state a sparsity assumption that we will require along with some additional regularity conditions. We will require fewer assumptions to prove consistency alone than we will to prove selection consistency and asymptotic normality. We emphasize that the conceptually most important assumptions are (CNAS) and (CCTS), which validate the difference-in-differences approach, and Assumption S($s_N$), defined below, which assumes sparsity in a sense that matches our penalty (4.19). All of our results will also require Assumption (LINS). It may be possible to relax this assumption so that linearity only has to hold approximately, along the lines of, e.g., Condition ASTE in [12]. (At a high level, we might expect that if approximate linearity holds, so that the error due to misspecification is on the same order as the estimation error of the population least squares model, we can achieve convergence at the same rate as when the linearity assumption holds exactly.)

4.5.1 Consistency

Assumption S($s_N$): Assume the vector $D_N \beta_N$ (where the differences matrix $D_N$ associated with our penalty (4.19) is defined in Equation (C.3.3)) is sparse, with indices $S \subseteq [p_N]$ nonzero and the remaining indices in $[p_N] \setminus S$ equal to 0, where $|S| = s_N$.
Assumption S(s N ) allows us to have sparsity in the appropriate sense: the dierences between param- eters that we penalized in (4.19) are sparse, and irrelevant covariates have true coecient equal to 0. That is, all buts N of the restrictions that we expect to hold will in fact hold, though crucially Assumption S(s N ) does not require us to know which restrictions to select, or even how many restrictions to select—the value ofs N does not need to be known. Our consistency and selection consistency theorems will allows N to grow withN subject to certain constraints in later assumptions. The remaining regularity conditions for Theorem 4.5.1 are more routine. Assumption(R1): For eachi2 [N],E[X 4 (i)j ]<1 for allj2 [d N ] andE[u 4 (it) ]<1 for allt2 [T ]. Further, for someK <1, sup 1N1 8 < : 1 p N T X t=1 p N X j=1 E h (G N ZD 1 N ) 2 (1t)j i 9 = ; K: 99 We show later (Lemma C.3.2) thatD N is invertible. The matrixG N ZD 1 N is of interest because it turns out this is the design matrix we will actually use for estimation; see the proof of Theorem C.2.1 for details. Because the singular values ofG N andD 1 N are bounded away from 0 and from above for allN even if d N !1 (see Lemmas C.4.2 and C.6.2), this is qualitatively similar to making the same statement about Z itself. Before we state Assumption (R2), we will dene some needed notation. For any matrixA2R Np N , denote the empirical Gram matrix by ^ (A): ^ (A) := 1 N A > A2R p N p N : (4.29) Lete 1N ande 2N be the smallest and largest eigenvalues of ^ (Z). (Note that ^ (Z) is random, soe 1N and e 2N are as well.) Assumption(R2): Assumee 1N is upper-bounded by a constant for allN andh 0 N a:s: ! 0, where h 0 N := r p N + N s N Ne 1N (4.30) and a:s: ! denotes almost sure convergence. Also, assume lim N!1 N p s N N = 0. [106] points out that (4.30) requires the design matrix to be full rank, so this assumption requires that p N grows slower thanN, N does not grow too quickly, ande 1N does not vanish too quickly. Also,Z being full rank requires at leastd N + 1 units in each cohort (see Lemma C.7.1 in the appendix), so (R2) requires that none of the marginal cohort probabilities are too small relative to the growth rate ofd N as N!1, and that the distribution ofX is well-behaved asN!1. Finally, this assumption also requires N not to increase too quickly. Assumption(R3): There exist constants 0 0, j^ ATT (r;t) ATT (r;t)j =O P minfh N ;h 0 N g ; j^ ATT (r) ATT (r)j =O P minfh N ;h 0 N g ; and j^ ATT ATT j p ! 0; where h N := 1 e 1N r p N N (4.31) andh 0 N is as dened in (4.30). Further, ifd N =d is xed, j^ CATT (r;x;t) CATT (r;x;t)j =O P minfh N ;h 0 N g ; j^ CATT (r;x) CATT (r;x)j =O P minfh N ;h 0 N g ; andif ^ (x) = (^ 0 (x); ^ r 1 (x);:::; ^ r R (x))isaconsistentestimatoroftheconditionaltreatmentprobability r (x) (4.6) (in the sense that ^ r (x) p ! r (x) for any xedx for allr2f0g[R), j^ CATT (x) CATT (x)j p ! 0: Proof. Provided in Section C.3 of the appendix. 101 Theorem 4.5.1 shows that all of our causal estimators are consistent and characterizes the rate of con- sistency for the cohort-specic treatment eect estimators, as well as the cohort- and time-specic esti- mators. To interpret the rate of convergence minfh N ;h 0 N g, consider the upper boundh N from (4.31). The minimum eigenvaluee 1N is bounded away from 0 with high probability under relatively mild assumptions in ourp N NT setting (see our more detailed discussion later in Section 4.5.3 about Assumption R6). 
ThenO P (minfh N ;h 0 N g) =O P p p N =N (see Lemma C.7.2 in the appendix). In general we will focus more onq2 (0; 1) as required by Theorems 4.5.2 and 4.5.3, but we note that Theorem 4.5.1 holds for anyq> 0. This is useful because the convex lasso (q = 1) and ridge (q = 2) opti- mization problems, for example, may be much more tractable on very large data sets than the nonconvex bridge estimator withq< 1. To give a concrete example of an estimator ^ (x), consider a setting where the conditional cohort membership probabilities are generated by a multinomial logit model. In the setting of the second half of Theorem 4.5.1 whered N =d is xed, the multinomial logit model is consistent for the class probabilities by standard maximum likelihood theory (see, for example, Theorem 13.2 from [193]). If this model is overly simplistic ord N !1, more sophisticated machine learning methods with theoretical convergence guarantees would also work, like the` 1 -penalized classiers of Levy and Abramovich [113]. Lastly, we point out that under Theorem 4.5.1 the conditional cohort membership probabilities and treatment eects can be estimated on the same data sets; that is, the full data set can be used for both. Intuitively, this is because these estimators are used for the distinct tasks of estimating the cohort-specic eects CATT (r;x) and estimating the weights used to combine them in a weighted average for CATT (x). For technical details, see the proof of Theorem 4.5.1 in Appendix C.3. 102 4.5.2 RestrictionSelectionConsistency Again, we rst state some additional needed regularity conditions and then present our selection con- sistency result, Theorem 4.5.2. Theorem 4.5.2 shows that with probability tending towards one FETWFE identies the correct restrictions, resulting in fusing together parameters that equal each other, screening out irrelevant covariates, and improving estimation eciency. Assumption (R4): There exists a positive constante max <1 such thate 2N e max almost surely (recall from Section 4.5.1 thate 2N is the largest eigenvalue of ^ (Z)). A sucient condition for Assumption (R4) is that the distribution ofX has bounded support in the sense thatkX i k 2 B almost surely for someB <1. Assumption(R5): N e 2q 1N q N q p 2q N a:s: !1: We will briey examine how Assumption (R5) interacts with Assumption (R2). For simplicity, assume the minimum eigenvalue e 1N c for somec> 0 with probability tending towards 1, so that we can ignore e 1N asymptotically (using a similar argument to the one in the proof of Lemma C.7.2 in the appendix). Then Assumption (R5) is N 1 N q=2 p 1q=2 N !1: If we also assumes N =s is xed, we have from Assumption (R2) that N N ! 0 and lim N!1 N N 1=2 = 0: 103 We will requireq2 (0; 1), and we see that under these assumptions for (R5) and (R2) to hold simultaneously we requireN q=2 p 1q=2 N to grow slower thanN 1=2 ; that is,p N must grow slower than N 1=2 N q=2 ! 2=(2q) = (N 1q ) 1=(2q) =N (1q)=(2q) : Even ife 1N is not bounded away from 0 with high probability, Assumption (R5) requires thate 1N does not vanish too quickly. As we discussed in reference to Assumption (R2), this requires that the probability of being in any one cohort does not vanish too quickly asN!1. It also requires that N not decrease too quickly. Notice that other than these relatively mild regularity conditions on the eigenvalues of ^ (Z), we do not require a condition like the irrepresentable condition [204] or neighborhood stability condition [126]. 
However, our restriction selection consistency result will only work forq < 1; that is, we do not prove selection consistency of FETWFE with a lasso (q = 1) penalty. Theorem4.5.2 (Selection consistency). AssumethatAssumptions(CNAS),(CCTS),and(LINS)hold,aswell as Assumptions (F1), (F2), S(s N ), and (R1) - (R5). For the FETWFE estimator ^ (q) dened in (4.21), dene ^ S := n j2 [p N ] : (D N ^ (q) ) j 6= 0 o to be the set of nonzero values inD N ^ (q) . Recall from Assumption S(s N ) that S =fj2 [p N ] : (D N N ) j 6= 0g: Then for anyq2 (0; 1), lim N!1 P ^ S =S = 1. Proof. Provided in Section C.3 of the appendix. 104 Theorem 4.5.2 shows that FETWFE fullls its promise to identify the correct restrictions in accordance with our` q penalty structure, as discussed in Section 4.4. This serves our goal of improving estimation by avoiding unnecessarily separately estimating parameters that are in fact equal. FETWFE also screens out irrelevant covariates and interactions, even ifd N !1. The proof of Theorem 4.5.2 relies on an extension of Theorem 2 of [106] (Proposition C.4.4) to show that not only does the bridge estimator exclude false selections, it also selects all of the relevant components of D N N with probability tending to 1. Note that this is slightly stronger than convergence in probability, and that Proposition C.4.4 requires no added assumptions from Theorem 2 of [106]. One consequence of Theorem 4.5.2 is that if ATT (r;t) = 0 for anyr, then lim N!1 P (^ ATT (r;t) = 0) = 1; which is stronger than convergence in probability to 0. The same is true of all of the other causal estimators. 4.5.3 AsymptoticNormality Our asymptotic convergence result (Theorem 4.5.3 below) requires one additional regularity condition. Assumption(R6): The minimum eigenvalue of E h ^ G N ZD 1 N i (4.32) is greater than or equal to a xed> 0 and 1 e 1N 2o P p N N ! : (4.33) 105 The assumption that the minimum eigenvalue of (4.32) is bounded away from 0 is fairly mild since we already require ^ (Z) to be full rank and for its minimum eigenvalue to not vanish too quickly, and the minimum singular values ofD N andG N are bounded from below by positive constants; see Lemmas C.4.2 and C.6.2 in the appendix. Condition (4.33) is another assumption that requirese 1N to not vanish too quickly, which is qualita- tively similar to Assumptions (R2) and (R5). Holdings =s N xed (as in Theorem 4.5.3) and ignoring the distinction between almost sure convergence and convergence in probability, (4.33) is stronger than (R2) since (R2) requires N Ne 1N ! 0 while (4.33) requires N p Ne 1N ! 0: However, ife 1N is bounded away from 0 with high probability, using a similar argument to the one in the proof of Lemma C.7.2 in the appendix, (4.33) is essentially equivalent to the other requirement in (R2) that lim N!1 N p s N =N = 0. If the smallest eigenvalue ofE h ^ (Z) i is bounded away from 0, one can show that the asymptotic distribution for the minimum eigenvalue of ^ (Z) has support bounded away from 0 in our setting whereX has nite fourth moments andp N NT [120]. If we also assume thatX is subgaussian, under Theorem 5.39 from [181]e 1N is bounded away from 0 with high probability for nite N large enough relative top N . Theorem 4.5.3 (Asymptotic Normality of FETWFE). In addition to the assumptions of Theorem 4.5.2, as- sume that Assumption (R6) holds and Assumption S(s) holds for a xed s. 
Then for any $r \in \mathcal{R}$ and $t \in \{r, \dots, T\}$ with $\tau_{\mathrm{ATT}}(r, t) \neq 0$,
\[
\sqrt{\frac{NT}{\hat{v}^{\mathrm{ATT}, r, t}_N}} \left( \hat{\tau}_{\mathrm{ATT}}(r, t) - \tau_{\mathrm{ATT}}(r, t) \right) \overset{d}{\to} \mathcal{N}(0, 1),
\]
and if $\tau_{\mathrm{ATT}}(r) \neq 0$,
\[
\sqrt{\frac{NT}{\hat{v}^{\mathrm{ATT}, r}_N}} \left( \hat{\tau}_{\mathrm{ATT}}(r) - \tau_{\mathrm{ATT}}(r) \right) \overset{d}{\to} \mathcal{N}(0, 1),
\]
where $\hat{v}^{\mathrm{ATT}, r, t}_N$ and $\hat{v}^{\mathrm{ATT}, r}_N$, defined in (C.3.5) and (C.3.6), are finite-sample variance estimators. (Both variance estimators depend only on the known $\sigma^2$ and the observed data.) Lastly, $\hat{\tau}_{\mathrm{ATT}}(r, t)$ and $\hat{\tau}_{\mathrm{ATT}}(r)$ are both oracle estimators [59, 207]: if the estimands are nonzero, then $\sqrt{NT}(\hat{\tau}_{\mathrm{ATT}}(r, t) - \tau_{\mathrm{ATT}}(r, t))$ and $\sqrt{NT}(\hat{\tau}_{\mathrm{ATT}}(r) - \tau_{\mathrm{ATT}}(r))$ both converge in distribution to mean-zero Gaussian random variables with variances that depend only on the $s$ parameters of the model with all restrictions correctly identified. If the estimands are 0, $\sqrt{NT}(\hat{\tau}_{\mathrm{ATT}}(r, t) - \tau_{\mathrm{ATT}}(r, t))$ and $\sqrt{NT}(\hat{\tau}_{\mathrm{ATT}}(r) - \tau_{\mathrm{ATT}}(r))$ both converge in probability to 0.

Proof. Provided in Section C.3 of the appendix.

Theorem 4.5.3 leverages novel (to the best of our knowledge) extensions of Theorem 2 from [106]. We cannot test the null hypothesis of treatment effects equaling 0 under Theorem 4.5.3, since these test statistics are not asymptotically normal if the treatment effect is 0, but we can construct asymptotically valid confidence intervals for parameters that are nonzero.

Theorems 4.5.2 and 4.5.3 (and, more directly, Theorem C.2.1 in the supplement) show that FETWFE is an oracle procedure: it identifies the correct restrictions with probability tending to 1; when the estimands are nonzero it converges at a $1/\sqrt{N}$ rate even if $p_N \to \infty$; and its asymptotic covariance matrix depends only on the $s$ parameters of the correct model. That is, even if $p_N \to \infty$, FETWFE estimates the model with the same asymptotic efficiency as an OLS-estimated ETWFE model that knows the $p_N - s$ correct restrictions. FETWFE leverages the possible restrictions in the ETWFE model to improve efficiency without requiring the practitioner to make assumptions on which restrictions to choose, which could induce bias if the wrong restrictions are chosen or compromise efficiency if not enough restrictions are chosen.

A similar proof strategy to the one used to prove Theorem 4.5.3 should also work to prove the asymptotic normality of $\hat{\tau}_{\mathrm{CATT}}(r, x, t)$ and $\hat{\tau}_{\mathrm{CATT}}(r, x)$ for a fixed $x$. Further, using Slutsky's theorem in conjunction with these strategies should work to prove the asymptotic normality of $\hat{\tau}_{\mathrm{ATT}}$ and $\hat{\tau}_{\mathrm{CATT}}(x)$ for a fixed $x$.

4.6 Synthetic Experiments

To test the efficacy of FETWFE under our assumptions, we conduct synthetic data experiments in R using the simulator package [19]. We choose parameters that bear resemblance to the data application from Section III of Goodman-Bacon [71] ($N = 51$, $T = 33$, $R = 12$, $d_N = 3$), though with fewer cohorts for simplicity of presentation, more covariates to increase $p_N$ in compensation for the smaller $R$, and more units to avoid realizations where no units are assigned to a cohort by random chance. We generate data with $N = 120$ units, $T = 30$ time periods, $R = 5$ cohorts entering at times $\{2, \dots, 6\}$, and $d_N = 12$ features, which results in a total of $p_N = 2209$ columns and $NT = 3600$ observations.

To generate a $\beta_N$ such that $D_N \beta_N$ is sparse, we generate a single random sparse $\theta_N \in \mathbb{R}^{p_N}$, then transform this into a coefficient vector $\beta_N = D_N^{-1} \theta_N$ to use across all simulations. We generate $\theta_N$ by taking a $p_N$-vector of all 0 entries and setting each entry equal to 2 randomly with probability 0.1. We set the sign of each term randomly, but since $\beta_N$ is a sum of terms in $\theta_N$, we bias the signs to avoid treatment effects that are too close to 0: individual nonzero terms in $\theta_N$ are positive with probability 0.6 and negative otherwise. (A short sketch below verifies the column count $p_N$ and illustrates this construction.)
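As a quick check, the sketch below tallies the column blocks of $Z$ described in Section 4.4 ($R$ cohort dummies, $T - 1$ time dummies, $d_N$ covariates, $\mathcal{W}$ treatment dummies, and $d_N(R + T - 1 + \mathcal{W})$ interactions) for this configuration, recovering $p_N = 2209$, and draws a sparse $\theta_N$ of the kind just described. The construction of $D_N$ itself (Equation C.3.3) is omitted, so the mapping $\beta_N = D_N^{-1} \theta_N$ is only indicated in a comment; the seed and object names are illustrative.

```r
# Verify the number of columns p_N for the Section 4.6 configuration.
R_ <- 5; T_ <- 30; d_ <- 12
cohorts <- 2:6                                  # treatment start times r in R
W_ <- sum(T_ - cohorts + 1)                     # one treatment dummy per (r, t >= r)
p_N <- R_ + (T_ - 1) + d_ + W_ + d_ * (R_ + (T_ - 1) + W_)
p_N                                             # 2209

# Draw a sparse theta_N as described above: entries are +/- 2 with probability 0.1,
# positive with probability 0.6. (beta_N = solve(D_N, theta_N), with D_N as in
# Equation C.3.3, is omitted here.)
set.seed(123)
nonzero <- rbinom(p_N, 1, 0.1)
signs   <- ifelse(rbinom(p_N, 1, 0.6) == 1, 1, -1)
theta_N <- 2 * nonzero * signs
sum(theta_N != 0)
```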
On each of 280 simulations, we randomly generateN independent realizations ofX (i) N (0;I d N ) for the time-invariant covariates. We randomly assign treatments with probabilities ( 0 ; 1 ;:::; R ) = (1=(R + 1);:::; 1=(R + 1)), where r := 8 > > > > < > > > > : P(W i = 0); r = 0; P(W i =r); r2R; (4.34) 108 are the marginal probabilities of treatment assignments. To ensure the model is estimable, in the rare instances where there are no untreated units or one cohort has no observed units, we draw another set of random assignments. After drawing covariates and treatment assignments, we constructZ. Then we drawN independent random eectsc i N (0; 1) as well asNT independent noise termsu (it) N (0; 1) to generate (i) = c i 1 T +u (i) and nally generatey =Z N + () . For each method, we centery as _ y =yy and multiply both _ y and the design matrix used by each method on the left byG N = I N 1=2 as described in Section 4.2. Besides FETWFE, we also consider three competitor methods. The rst is ETWFE, estimated by a simple linear regression onZ. We also consider bridge regression onZ directly penalizing rather than penalizingD N (BETWFE). For both FETWFE and BETWFE, we follow the experiments of [106] and use q = 0:5, selecting the penalty N over a grid of 100 values (equally spaced on a logarithmic scale) using BIC. We implement bridge regression for both BETWFE and FETWFE using the R grpreg package [32]. Finally, we also consider a slightly more exible version of (4.3), where we estimate the model y it = r + t +X + X r2R r 1fW i =rg1ftrg + it by OLS (TWFE_COVS). (Notice that it is not possible to add unit-specic covariates to Model (4.3) because this would induce collinearity. Fitting the model instead on cohort xed eects, like ETWFE, also matches, e.g., Equation 3.2 in [35].) The inclusion of covariates makes it more plausible that Assumptions (CNAS) and (CCTS) might hold in this model. Separate treatment eects for each cohort, while insucient to avoid bias and capture the variation in treatment eects, should also improve estimation relative to (4.3) and allows for the estimation of cohort-specic treatment eects to compare to the other methods. Given 109 that approximately 90% of the restrictions hold due to our construction of N , one might hope this highly restricted model could be somewhat competitive with FETWFE. 4.6.1 EstimationError We start by evaluating the error of each method in estimating the average treatment eect (4.16). The assumptions of Theorem 4.5.1 are satised, so we expect FETWFE to estimate the treatment eects more accurately than ETWFE, which will suer from high variance, BETWFE, which does not assume the correct form of sparsity, and TWFE_COVS, which is biased. We estimate the average treatment eect (4.16) using the FETWFE estimator (4.28), as well as analogous estimators using the competitor estimated regression coecients and observed cohort counts. We calculate the squared error of each average treatment eect estimate for each method on each iteration. Boxplots of the results are displayed in Figure 4.1. We also provide the means and standard errors for the squared errors of each method in Table 4.1. Table 4.2 containsp-values for paired one-tailedt-tests of the alternative hypothesis that the squared error for FETWFE is less than the squared error for each competitor method; all results show signicantly better performance for FETWFE at the 0.01 signicance level. 
These results show that even if most of the restrictions we would naturally consider hold and we are only interested in the overall average treatment effect (4.16), a simple model like TWFE_COVS is a bad estimator. This aligns with the conclusions of previous works, as we discussed in the introduction. We see that ETWFE has better estimation error, though in practice it still does not perform as well as FETWFE because it estimates about ten times as many parameters as are needed, leading to imprecise, high-variance estimates.

Figure 4.1: Boxplots of squared errors (ATT MSE) of the treatment effect estimates for each method (BETWFE, ETWFE, FETWFE, TWFE_COVS) across all 280 simulations. The vertical axis is on a log scale.

Table 4.1: Means and standard errors for the squared error of the estimator of the ATT (4.16) for each method across all 280 simulations in the synthetic experiment from Section 4.6.

FETWFE             ETWFE           BETWFE          TWFE_COVS
0.0108 (0.000914)  0.124 (0.0484)  0.262 (0.0187)  0.442 (0.0361)

Table 4.2: p-values from paired one-tailed t-tests of the alternative hypothesis that the FETWFE ATT estimator (4.28) has a lower squared prediction error for the ATT (4.16) than each competitor method, calculated across all 280 simulations in the synthetic experiment from Section 4.6. (Low p-values indicate better predictive performance for FETWFE.)

ETWFE     BETWFE    TWFE_COVS
0.00986   2e-32     5.04e-27

In Section C.1 of the appendix, we also present prediction error results for the cohort-specific average treatment effects $\tau_{\mathrm{ATT}}(r)$ as defined in (4.18) for each of the five cohorts. These results show that FETWFE outperforms the competitor methods for these estimands as well.

4.6.2 Restriction Selection Consistency

Next, we examine the extent to which we can trust FETWFE to select the correct restrictions, as we know it does with asymptotically high probability under Theorem 4.5.2. In the same experiment as before, on each simulation we calculate the percentage of restrictions FETWFE successfully identifies among the treatment parameters. In particular, we calculate the percentage of entries of $D_N \beta_N$ that correspond to the treatment coefficients $\tau_{rt}$ that FETWFE successfully estimates as 0.

Figure 4.2: Boxplot displaying the proportion of treatment effect restrictions successfully identified by FETWFE across each of the 280 simulations from the synthetic experiment in Section 4.6.

On average across the 280 simulations, FETWFE successfully identifies 88.6% of the restrictions, with a standard error of 0.0589%. Figure 4.2 shows a boxplot of the proportion of restrictions successfully identified across all 280 simulations. This suggests that FETWFE is effective at identifying restrictions in practice, which contributes to FETWFE's success in estimating treatment effects that we saw in Section 4.6.1.

4.6.3 Asymptotic Distribution

Finally, we investigate the asymptotic distribution of the scaled estimator of the cohort average treatment effects for the FETWFE estimator. We conduct a second simulation study with 175 iterations in the same way as the previous one, but with $N = 4000$, $T = 5$, $R = 2$, and $d_N = 2$ in order to get a better sense of the asymptotic behavior. We use the variance estimator $\hat{v}^{\mathrm{ATT}, r}_N$ from Theorem 4.5.3, defined in (C.3.6), to calculate Z-statistics for the cohort average treatment effects for each cohort. On each simulation, we construct a nominal 95% confidence interval in the usual way for each estimated cohort average treatment effect, as sketched below.
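The interval construction and the coverage computation reported next are standard; the sketch below spells them out for a single cohort, with illustrative values standing in for the estimates $\hat{\tau}_{\mathrm{ATT}}(r)$, the variance estimates $\hat{v}^{\mathrm{ATT}, r}_N$, and the true effect across simulations.

```r
# Hedged sketch: nominal 95% intervals from Theorem 4.5.3 and their coverage.
# tau_hat and v_hat stand in for the per-simulation estimates and variance
# estimates for one cohort; tau_true is the true cohort average treatment effect.
# Illustrative values are generated so the sketch runs on its own.
set.seed(1)
n_sim    <- 175; NT <- 4000 * 5; tau_true <- 1
v_hat    <- rep(2, n_sim)
tau_hat  <- tau_true + rnorm(n_sim, sd = sqrt(v_hat / NT))
z        <- qnorm(0.975)
half     <- z * sqrt(v_hat / NT)                 # half-width from sqrt(NT / v_hat) scaling
lower    <- tau_hat - half
upper    <- tau_hat + half
coverage <- mean(lower <= tau_true & tau_true <= upper)
coverage  # close to 0.95 when the asymptotic approximation is accurate
```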
After all simulations are complete, we calculate the percentage of condence intervals that contained the true average treatment eect. We nd coverage rates of 0.926 for the rst cohort (standard error 0.021) and 0.971 for the second cohort (standard error 0.013). Overall, the coverage rates are fairly close to the nominal 95% level in this niteN setting. 113 Bibliography [1] Alberto Abadie and Jérémy L’hour. “A penalized synthetic control estimator for disaggregated data”. In: Journal of the American Statistical Association 116.536 (2021), pp. 1817–1834. [2] Alan Agresti. Analysis of ordinal categorical data. Vol. 656. John Wiley & Sons, 2010. [3] David H. Alexander and Kenneth Lange. “Stability selection for genome-wide association”. In: Genetic Epidemiology 35.7 (2011), pp. 722–728.issn: 07410395.doi: 10.1002/gepi.20623. [4] Carlos Alonso-Blanco, Jorge Andrade, Claude Becker, Felix Bemm, Joy Bergelson, Karsten M Borgwardt, Jun Cao, Eunyoung Chae, Todd M Dezwaan, Wei Ding, et al. “1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana”. In: Cell 166.2 (2016), pp. 481–491. [5] Mohammed El Anbari and Abdallah Mkhadri. “Penalized regression combining the L1 norm and a correlation based penalty”. In: Sankhya B 76.1 (2014), pp. 82–102.issn: 09768394.doi: 10.1007/s13571-013-0065-4. [6] Dmitry Arkhangelsky, Guido W Imbens, Lihua Lei, and Xiaoman Luo. “Double-robust two-way-xed-eects regression for panel data”. In: arXiv preprint arXiv:2107.13737 (2021). [7] Ben G Armstrong and Margaret Sloan. “Ordinal regression models for epidemiologic data”. In: American Journal of Epidemiology 129.1 (1989), pp. 191–204. [8] Taylor B. Arnold and Ryan J. Tibshirani. “Ecient Implementations of the Generalized Lasso Dual Path Algorithm”. In: Journal of Computational and Graphical Statistics 25.1 (2016), pp. 1–27. issn: 15372715.doi: 10.1080/10618600.2015.1008638. arXiv: 1405.3222. [9] Eva Ascarza. “Retention futility: Targeting high-risk customers might be ineective”. In: Journal of Marketing Research 55.1 (2018), pp. 80–98. [10] Aaron T Beck, Robert A Steer, and Gregory Brown. “Beck depression inventory–II”. In: Psychological assessment (1996). [11] Andre Beinrucker, Ürün Dogan, and Gilles Blanchard. “Extensions of stability selection using subsamples of observations and covariates”. In: Statistics and Computing 26.5 (2016), pp. 1059–1077.issn: 15731375.doi: 10.1007/s11222-015-9589-y. arXiv: 1407.4916. 114 [12] Alexandre Belloni, Victor Chernozhukov, and Christian Hansen. “Inference on treatment eects after selection among high-dimensional controls”. In: The Review of Economic Studies 81.2 (2014), pp. 608–650. [13] Alexandre Belloni, Victor Chernozhukov, Abhishek Kaul, Mathieu Rosenbaum, and Alexandre B. Tsybakov. Pivotal Estimation Via Self-Normalization for High-Dimensional Linear Models With Errors In Variables. 2017. arXiv: 1708.08353.url: https://arxiv.org/abs/1708.08353. [14] Alexandre Belloni, Mathieu Rosenbaum, and Alexandre B Tsybakov. “Linear and conic programming estimators in high dimensional errors-in-variables models”. In: Journal of the Royal Statistical Society. Series B: Statistical Methodology 79.3 (2017), pp. 939–956.issn: 14679868.doi: 10.1111/rssb.12196. arXiv: 1408.0241. [15] Alexandre Belloni, Mathieu Rosenbaum, and Alexandre B. Tsybakov. “An l1, l2, l innity-regularization approach to high-dimensional errors-in-variables models”. In: Electronic Journal of Statistics 10.2 (2016), pp. 1729–1750.issn: 19357524.doi: 10.1214/15-EJS1095. 
[16] Alexis Bellot and Mihaela van der Schaar. “Policy Analysis using Synthetic Controls in Continuous-Time”. In: Proceedings of the 38th International Conference on Machine Learning. Ed. by Marina Meila and Tong Zhang. Vol. 139. Proceedings of Machine Learning Research. PMLR, July 2021, pp. 759–768.url: https://proceedings.mlr.press/v139/bellot21a.html. [17] Riccardo Benedetti. “Scoring rules for forecast verication”. In: Monthly Weather Review 138.1 (2010), pp. 203–211. [18] Peter J Bickel, Ya’Acov Ritov, and Alexandre B Tsybakov. “Simultaneous Analysis of LASSO and Dantzig Selector”. In: The Annals of Statistics 37.4 (2009), pp. 1705–1732.doi: 10.1214/08-AOS620. [19] Jacob Bien. “The simulator: an engine to streamline simulations”. In: arXiv preprint arXiv:1607.00021 (2016). [20] Xin Bing, Florentina Bunea, Yang Ning, and Marten Wegkamp. “Adaptive estimation in structured factor models with applications to overlapping clustering”. In: Annals of Statistics 48.4 (2020), pp. 2055–2081.issn: 21688966.doi: 10.1214/19-AOS1877. arXiv: 1704.06977. [21] Xin Bing, Florentina Bunea, and Marten Wegkamp. “Inference in latent factor regression with clusterable features”. In: Bernoulli 28.2 (2022), pp. 997–1020. [22] Matthew Blackwell, Adam Glynn, Hanno Hilbig, and Connor Halloran Phillips. “Dierence-in-dierences Designs for Controlled Direct Eects”. In: (2022). [23] Marianne Bléhaut, Xavier D’Haultfoeuille, Jérémy L’Hour, and Alexandre B Tsybakov. “An alternative to synthetic control for models with many covariates under sparsity”. In: arXiv preprint arXiv:2005.12225 (2020). [24] Malgorzata Bogdan, Ewout van den Berg, Weijie Su, and Emmanuel J Candès. “Statistical estimation and testing via the sorted L1 norm”. In: arXiv preprint arXiv:1310.1969 (2013). 115 [25] Kenneth A Bollen. Structural equations with latent variables. Wiley series in probability and mathematical statistics. Applied probability and statistics section. Oxford, England: John Wiley & Sons, 1989.isbn: 0-471-01171-1 (Hardcover).doi: 10.1002/9781118619179. [26] Verónica Bolón-Canedo and Amparo Alonso-Betanzos. “Ensembles for feature selection: A review and future trends”. In: Information Fusion 52.November 2018 (2019), pp. 1–12.issn: 15662535.doi: 10.1016/j.inffus.2018.11.008. [27] Howard D Bondell and Brian J Reich. “Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR”. In: Biometrics 64 (2008), pp. 115–123.doi: 10.1111/j.1541-0420.2007.00843.x. [28] Kirill Borusyak, Xavier Jaravel, and Jann Spiess. Revisiting Event Study Designs: Robust and Ecient Estimation. 2023. arXiv: 2108.12419[econ.EM]. [29] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. “Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising.” In: Journal of Machine Learning Research 14.11 (2013). [30] Olivier Bousquet and André Elissee. “Stability and Generalization”. In: Journal of Machine Learning Research 2.3 (2002), pp. 499–526.issn: 15324435.doi: 10.1162/153244302760200704. [31] Rollin Brant. “Assessing proportionality in the proportional odds model for ordinal logistic regression”. In: Biometrics (1990), pp. 1171–1178. [32] Patrick Breheny and Jian Huang. “Penalized methods for bi-level variable selection”. In: Statistics and its interface 2.3 (2009), p. 369. [33] Nicholas Brown and Kyle Butts. “Generalized Imputation Estimators for Factor Models”. In: (2022). 
[34] Peter Bühlmann, Philipp Rütimann, Sara Van De Geer, and Cun-Hui Zhang. “Correlated variables in regression: clustering and sparse estimation”. In: Journal of Statistical Planning and Inference 143.11 (2013), pp. 1835–1858. [35] Brantly Callaway and Pedro HC Sant’Anna. “Dierence-in-dierences with multiple time periods”. In: Journal of Econometrics 225.2 (2021), pp. 200–230. [36] A Colin Cameron and Pravin K Trivedi. Microeconometrics: methods and applications. Cambridge university press, 2005. [37] Emmanuel Candès, Yingying Fan, Lucas Janson, and Jinchi Lv. “Panning for gold: ‘model-X’ knockos for high dimensional controlled variable selection”. In: Journal of the Royal Statistical Society. Series B: Statistical Methodology 80.3 (2018), pp. 551–577.issn: 14679868.doi: 10.1111/rssb.12265. arXiv: 1610.02351. 116 [38] Carlos Carvalho, Ricardo Masini, and Marcelo C Medeiros. “ArCo: An articial counterfactual approach for high-dimensional panel time-series data”. In: Journal of econometrics 207.2 (2018), pp. 352–380. [39] George Casella and Roger L Berger. Statistical inference. Cengage Learning, 2021. [40] Matias D Cattaneo, Yingjie Feng, Filippo Palomba, and Rocio Titiunik. “Uncertainty Quantication in Synthetic Controls with Staggered Treatment Adoption”. In: arXiv preprint arXiv:2210.05026 (2022). [41] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. “SMOTE: synthetic minority over-sampling technique”. In: Journal of articial intelligence research 16 (2002), pp. 321–357. [42] Yiqun Chen, Sean Jewell, and Daniela Witten. “More powerful selective inference for the graph fused lasso”. In: Journal of Computational and Graphical Statistics (2022), pp. 1–11. [43] Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. “An exact and robust conformal inference method for counterfactual and synthetic controls”. In: Journal of the American Statistical Association 116.536 (2021), pp. 1849–1864. [44] Rune Haubo Bojesen Christensen, Graham Cleaver, and Per Bruun Brockho. “Statistical and Thurstonian models for the A-not A protocol with and without sureness”. In: Food Quality and Preference 22.6 (2011), pp. 542–549.issn: 09503293.doi: 10.1016/j.foodqual.2011.03.003. [45] Gauss M Cordeiro and Peter McCullagh. “Bias Correction in Generalized Linear Models”. In: Journal of the Royal Statistical Society: Series B (Methodological) 53.3 (1991), pp. 629–643.doi: 10.1111/j.2517-6161.1991.tb01852.x. [46] Flavio Cunha, James Heckman, and Salvador Navarro. “Separating uncertainty from heterogeneity in life cycle earnings”. In: oxford Economic papers 57.2 (2005), pp. 191–261. [47] C.S. Davis. Statistical Methods for the Analysis of Repeated Measurements. Online access with purchase: Springer. Springer, 2002.isbn: 9780387953700.url: https://books.google.com/books?id=MIwSjX4UW60C. [48] Clément De Chaisemartin and Xavier d’Haultfoeuille. Two-way xed eects and dierences-in-dierences with heterogeneous treatment eects: A survey. Tech. rep. National Bureau of Economic Research, 2022. [49] Clément De Chaisemartin and Xavier d’Haultfoeuille. “Two-way xed eects estimators with heterogeneous treatment eects”. In: American Economic Review 110.9 (2020), pp. 2964–2996. [50] Marcel Dettling and Peter Bühlmann. “Finding predictive gene groups from microarray data”. In: Journal of Multivariate Analysis 90.1 (2004), pp. 106–131. [51] L. Devroye and T. Wagner. “Distribution-free performance bounds for potential function rules”. In: IEEE Transactions on Information Theory 25.5 (1979), pp. 
601–604.doi: 10.1109/TIT.1979.1056087. 117 [52] Nick Doudchenko, Khashayar Khosravi, Jean Pouget-Abadie, Sébastien Lahaie, Miles Lubin, Vahab Mirrokni, Jann Spiess, and guido imbens guido. “Synthetic Design: An Optimization Approach to Experimental Design with Synthetic Controls”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 8691–8701.url: https://proceedings.neurips.cc/paper_files/paper/2021/file/48d23e87eb98cc2227b5a8c33fa00680- Paper.pdf. [53] Arindrajit Dube, Daniele Girardi, Oscar Jorda, and Alan M Taylor. A local projections approach to dierence-in-dierences event studies. Tech. rep. National Bureau of Economic Research, 2023. [54] Brendan Andrew Duncan and Charles Peter Elkan. “Probabilistic Modeling of a Sales Funnel to Prioritize Leads”. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’15. Sydney, NSW, Australia: Association for Computing Machinery, 2015, pp. 1751–1758.isbn: 9781450336642.doi: 10.1145/2783258.2788578. [55] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. “Least Angle Regression”. In: The Annals of Statistics 32.2 (2004), pp. 407–499.issn: 0090-5364.doi: 10.1214/009053604000000067. [56] K.O. Ekvall and M. Bottai. “Concave likelihood-based regression with nite-support response variables”. In: Biometrics March (2022), pp. 1–12.issn: 0006-341X.doi: 10.1111/biom.13760. arXiv: 2203.04582. [57] Paul B Ellickson, Wreetabrata Kar, and James C Reeder III. “Estimating marketing component eects: Double machine learning from targeted digital promotions”. In: Marketing Science (2022). [58] Gregory Faletto and Jacob Bien. cssr: Cluster Stability Selection. R package version 0.1.5. 2023.url: https://gregfaletto.github.io/cssr-project/. [59] Jianqing Fan and Runze Li. “Variable selection via nonconcave penalized likelihood and its oracle properties”. In: Journal of the American statistical Association 96.456 (2001), pp. 1348–1360. [60] Vivek Farias, Andrew Li, and Tianyi Peng. “Learning Treatment Eects in Panels with General Intervention Patterns”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc., 2021, pp. 14001–14013.url: https://proceedings.neurips.cc/paper_files/paper/2021/file/7504adad8bb96320eb3afdd4df6e1f60- Paper.pdf. [61] ALVAN R FEINSTEIN, JOEL F RUBINSTEIN, and WALTER A RAMSHAW. “Estimating prognosis with the aid of a conversational-mode computer program”. In: Annals of Internal Medicine 76.6 (1972), pp. 911–921. [62] Alberto Fernández, Salvador Garcia, Francisco Herrera, and Nitesh V Chawla. “SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary”. In: Journal of articial intelligence research 61 (2018), pp. 863–905. 118 [63] Jessica K. Flake, Jolynn Pek, and Eric Hehman. “Construct Validation in Social and Personality Research: Current Practice and Recommendations”. In: Social Psychological and Personality Science 8.4 (2017), pp. 370–378.issn: 19485514.doi: 10.1177/1948550617693063. [64] Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. “Subgroup identication from randomized clinical trial data”. In: Statistics in medicine 30.24 (2011), pp. 2867–2880. [65] Jennifer Frankovich, Christopher A Longhurst, and Scott M Sutherland. “Evidence-based medicine in the EMR era”. 
In: N Engl J Med 365.19 (2011), pp. 1758–1759. [66] Eiko I Fried and Randolph M Nesse. “Depression sum-scores don’t add up: why analyzing specic depression symptoms is essential”. In: BMC medicine 13.1 (2015), pp. 1–11. [67] Jerome H Friedman and Werner Stuetzle. “Projection pursuit regression”. In: Journal of the American statistical Association 76.376 (1981), pp. 817–823. [68] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent”. In: Journal of Statistical Software 33.1 (2010), pp. 1–22. url: http://www.jstatsoft.org/v33/i01/. [69] E. García-Portugués. Notes for Nonparametric Statistics. Version 6.9.0. ISBN 978-84-09-29537-1. 2023.url: https://bookdown.org/egarpor/NP-UC3M/. [70] Niharika Gauraha. “Stability Feature Selection using Cluster Representative LASSO”. In: Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2016). Scitepress, 2016, pp. 381–386.doi: 10.5220/0005827003810386. [71] Andrew Goodman-Bacon. “Dierence-in-dierences with variation in treatment timing”. In: Journal of Econometrics 225.2 (2021), pp. 254–277. [72] James M Graham. “Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them”. In: Educational and psychological measurement 66.6 (2006), pp. 930–944. [73] William H. Greene. Econometric Analysis. 7th. Pearson Education, 2012. [74] A Dilek Güngör. “Some bounds for the product of singular values”. In: Int. J. Contemp. Math. Sciences 2.26 (2007), pp. 1285–1292. [75] B. Hansen. Econometrics. Princeton University Press, 2022.isbn: 9780691235899.url: https://books.google.com/books?id=Pte7zgEACAAJ. [76] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1990.isbn: 9780412343902.url: https://books.google.com/books?id=qa29r1Ze1coC. [77] Trevor Hastie, Robert Tibshirani, David Botstein, and Patrick Brown. “Supervised harvesting of expression trees”. In: Genome Biology 2.1 (2001), pp. 1–12. 119 [78] Trevor Hastie, Robert Tibshirani, Michael B Eisen, Ash Alizadeh, Ronald Levy, Louis Staudt, Wing C Chan, David Botstein, and Patrick Brown. “’Gene shaving’as a method for identifying distinct sets of genes with similar expression patterns”. In: Genome biology 1.2 (2000), pp. 1–21. [79] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. “Statistical learning with sparsity”. In: Monographs on statistics and applied probability 143 (2015), p. 143. [80] Andrew F Hayes and Klaus Krippendor. “Answering the Call for a Standard Reliability Measure for Coding Data”. In: Communication Methods and Measures 1.1 (2007), pp. 77–89.issn: 1931-2458. doi: 10.1080/19312450709336664. [81] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. “Practical lessons from predicting clicks on ads at facebook”. In: Proceedings of the eighth international workshop on data mining for online advertising. 2014, pp. 1–9. [82] Phillip Heiler and Jana Mareckova. “Shrinkage for categorical regressors”. In: Journal of Econometrics 223.1 (2021), pp. 161–189. [83] Günter J Hitsch and Sanjog Misra. “Heterogeneous treatment eects and optimal targeting policy evaluation”. In: Available at SSRN 3111957 (2018). [84] Marek Hlavac. stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.3. Social Policy Institute. 
Bratislava, Slovakia, 2022.url: https://CRAN.R-project.org/package=stargazer. [85] Holger Höing, Harald Binder, and Martin Schumacher. “A coordinate-wise optimization algorithm for the Fused Lasso”. In: arXiv preprint arXiv:1011.6409 (2010). [86] Paul W Holland. “Statistics and causal inference”. In: Journal of the American statistical Association 81.396 (1986), pp. 945–960. [87] Alex Hollingsworth and Coady Wing. “Tactics for design and inference in synthetic control studies: An applied example using high-dimensional data”. In: Available at SSRN 3592088 (2020). [88] Roger A. Horn and Charles R. Johnson. Matrix Analysis. 2nd ed. Cambridge University Press, 2012.doi: 10.1017/9781139020411. [89] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.doi: 10.1017/CBO9780511840371. [90] Jian Huang, Joel L. Horowitz, and Shuangge Ma. “Asymptotic properties of bridge estimators in sparse high-dimensional regression models”. In: Annals of Statistics 36.2 (2008), pp. 587–613.issn: 00905364.doi: 10.1214/009053607000000875. [91] Gary Hutson, Asif Laldin, and Isabella Velásquez. MLDataR: Collection of Machine Learning Datasets for Supervised Machine Learning. R package version 0.1.3. 2022.url: https://CRAN.R-project.org/package=MLDataR. 120 [92] Sangwon Hyun, Max G’Sell, and Ryan J Tibshirani. “Exact post-selection inference for the generalized lasso path”. In: Electronic Journal of Statistics 12.1 (2018), pp. 1053–1097. [93] Guido Imbens and Donald B. Rubin. Causal inference for statistics, social, and biomedical sciences: an introduction. Cambridge University Press, 2015. [94] Alan J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classication, and Manifold Learning. 1st ed. Springer-Verlag New York, 2008. [95] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. “Group Lasso with Overlap and Graph Lasso”. In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML ’09. Montreal, Quebec, Canada: Association for Computing Machinery, 2009, pp. 433–440. isbn: 9781605585161.doi: 10.1145/1553374.1553431. [96] G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer US, 2021.isbn: 9781071614174.url: https://books.google.com/books?id=g5gezgEACAAJ. [97] Justin M. Johnson and Taghi M. Khoshgoftaar. “Survey on deep learning with class imbalance”. In: Journal of Big Data 6.1 (2019), p. 27.doi: 10.1186/s40537-019-0192-5. [98] Yousef Kaddoura and Joakim Westerlund. “Estimation of Panel Data Models with Random Interactive Eects and Multiple Structural Breaks when T is Fixed”. In: Journal of Business & Economic Statistics (2022), pp. 1–13. [99] Alexandros Kalousis, Julien Prados, and Melanie Hilario. “Stability of Feature Selection Algorithms”. In: Proceedings of the Fifth IEEE International Conference on Data Mining. ICDM ’05. USA: IEEE Computer Society, 2005, pp. 218–225.isbn: 0769522785.doi: 10.1109/ICDM.2005.135. [100] Michael Kearns and Dana Ron. “Algorithmic Stability and Sanity-Check Bounds for Leave-One-out Cross-Validation”. In: Proceedings of the Tenth Annual Conference on Computational Learning Theory. COLT ’97. Nashville, Tennessee, USA: Association for Computing Machinery, 1997, pp. 152–162.isbn: 0897918916.doi: 10.1145/267460.267491. [101] J. T. Kent. “Discussion of “Stability Selection"”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.4 (2010). 
[102] Kipoong Kim and Hokeun Sun. “Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data”. In: BMC Bioinformatics 20.1 (2019).issn: 14712105.doi: 10.1186/s12859-019-3040-x. [103] Sung Kim, Vincent Plagnol, Tina T. Hu, Christopher Toomajian, Richard M. Clark, Stephan Ossowski, Joseph R. Ecker, Detlef Weigel, and Magnus Nordborg. “Recombination and linkage disequilibrium in Arabidopsis thaliana”. In: Nature Genetics 39.9 (2007), pp. 1151–1155. issn: 10614036.doi: 10.1038/ng2115. [104] Paul D.W. Kirk, Alexandra M. Lewin, and Michael P.H. Stumpf. “Discussion of “Stability Selection"”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.4 (2010). 121 [105] Seyoon Ko, Donghyeon Yu, and Joong-Ho Won. “Easily parallelizable and distributable class of algorithms for structured sparsity, with optimal acceleration”. In: Journal of Computational and Graphical Statistics 28.4 (2019), pp. 821–833. [106] Anders Bredahl Kock. “Oracle ecient variable selection in random and xed eects panel data models”. In: Econometric Theory 29.1 (2013), pp. 115–152. [107] Elena Krasnokutskaya. “Identication and estimation of auction models with unobserved heterogeneity”. In: Review of Economic Studies 78.1 (2011), pp. 293–327.issn: 1467937X.doi: 10.1093/restud/rdq004. [108] Megan Kuhfeld and James Soland. “Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses.” In: Psychological Methods 27.2 (2022), p. 234. [109] Soonwoo Kwon. “Optimal Shrinkage Estimation of Fixed Eects in Linear Panel Data Models”. In: EliScholar–A Digital Platform for Scholarly Publishing at Yal e (2021), p. 1. [110] Tilman Lange, Mikio Braun, Volker Roth, and Joachim Buhmann. “Stability-Based Model Selection”. In: Advances in Neural Information Processing Systems. Ed. by S. Becker, S. Thrun, and K. Obermayer. Vol. 15. MIT Press, 2003.url: https://proceedings.neurips.cc/paper/2002/file/37d097caf1299d9aa79c2c2b843d2d78-Paper.pdf. [111] Erich Leo Lehmann. Elements of large-sample theory. Springer, 1999. [112] LJ Lesko. “Personalized medicine: elusive dream or imminent reality?” In: Clinical Pharmacology & Therapeutics 81.6 (2007), pp. 807–816. [113] Tomer Levy and Felix Abramovich. “Generalization error bounds for multiclass sparse linear classiers”. In: Journal of Machine Learning Research 24.151 (2023), pp. 1–35. [114] Kathleen T Li and David R Bell. “Estimation of average treatment eects with panel data: Asymptotic theory and implementation”. In: Journal of Econometrics 197.1 (2017), pp. 65–75. [115] Tong Li, Isabelle Perrigne, and Quang Vuong. “Conditionally independent private information in OCS wildcat auctions”. In: Journal of Econometrics 98.1 (2000), pp. 129–161.issn: 03044076.doi: 10.1016/S0304-4076(99)00081-0. [116] Tong Li and Quang Vuong. “Nonparametric Estimation of the Measurement Error Model Using Multiple Indicators”. In: Journal of Multivariate Analysis 65.2 (1998), pp. 139–165.issn: 0047259X. doi: 10.1006/jmva.1998.1741. [117] Yuan Li, Benjamin Mark, Garvesh Raskutti, and Rebecca Willett. “Graph-based regularization for regression problems with highly-correlated designs”. In: SIAM Journal on Mathematics of Data Science 2.2 (2020), pp. 480–504.doi: 10.1137/19M1287365. arXiv: 1803.07658. 122 [118] Chris J Lintott, Kevin Schawinski, Anže Slosar, Kate Land, Steven Bamford, Daniel Thomas, M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, et al. 
“Galaxy Zoo: morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey”. In: Monthly Notices of the Royal Astronomical Society 389.3 (2008), pp. 1179–1189. [119] Po-Ling Loh and Martin J. Wainwright. “High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity”. In: Ann. Statist. 40.3 (June 2012), pp. 1637–1664. doi: 10.1214/12-AOS1018. [120] Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur. “Distribution of eigenvalues for some sets of random matrices”. In: Matematicheskii Sbornik 114.4 (1967), pp. 507–536. [121] Ricardo Masini and Marcelo C Medeiros. “Counterfactual Analysis With Articial Controls: Inference, High Dimensions, and Nonstationarity”. In: Journal of the American Statistical Association 116.536 (2021), pp. 1773–1788. [122] Winter Mason and Siddharth Suri. “Conducting behavioral research on Amazon’s Mechanical Turk”. In: Behavior Research Methods 44.1 (2012), pp. 1–23.issn: 1554351X.doi: 10.3758/s13428-011-0124-6. [123] P. McCullagh and J.A. Nelder. Generalized Linear Models, Second Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1989.isbn: 9780412317606. url: https://books.google.com/books?id=h9kFH2%5C_FfBkC. [124] Peter McCullagh. “Regression models for ordinal data”. In: Journal of the Royal Statistical Society: Series B (Methodological) 42.2 (1980), pp. 109–127. [125] Daniel McNeish and Melissa Gordon Wolf. “Thinking twice about sum scores”. In: Behavior Research Methods 52.6 (2020), pp. 2287–2305.issn: 15543528.doi: 10.3758/s13428-020-01398-0. [126] Nicolai Meinshausen and Peter Bühlmann. “High-dimensional graphs and variable selection with the lasso”. In: (2006). [127] Nicolai Meinshausen and Peter Bühlmann. “Stability selection”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.4 (2010), pp. 417–473. [128] Marilyn Murphy, Sam Redding, and Janet Twyman. Handbook on personalized learning for states, districts, and schools. IAP, 2016. [129] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. “Obtaining well calibrated probabilities using bayesian binning”. In: Twenty-Ninth AAAI Conference on Articial Intelligence. 2015. [130] Jerzy Neyman. “On the application of probability theory to agricultural experiments. Essay on principles”. In: Ann. Agricultural Sciences (1923), pp. 1–51. [131] Linh Nghiem and Cornelis Potgieter. “Simulation-selection-extrapolation: Estimation in high-dimensional errors-in-variables models”. In: Biometrics 75.4 (2019), pp. 1133–1144.issn: 15410420.doi: 10.1111/biom.13112. arXiv: 1808.10477. 123 [132] Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. “On the stability of feature selection algorithms”. In: Journal of Machine Learning Research 18 (2018), pp. 1–54.issn: 15337928.url: http://jmlr.org/papers/v18/17-514.html.. [133] Magnus Nordborg, Justin O. Borevitz, Joy Bergelson, Charles C. Berry, Joanne Chory, Jenny Hagenblad, Martin Kreitman, Julin N. Maloof, Tina Noyes, Peter J. Oefner, Eli A. Stahl, and Detlef Weigel. “The extent of linkage disequilibrium in Arabidopsis thaliana”. In: Nature Genetics 30.2 (2002), pp. 190–193.issn: 10614036.doi: 10.1038/ng813. [134] Colleen M Norris, William A Ghali, L Duncan Saunders, Rollin Brant, Diane Galbraith, Peter Faris, Merril L Knudtson, Approach Investigators, et al. “Ordinal regression model and the linear regression model were superior to the logistic regression models”. In: Journal of clinical epidemiology 59.5 (2006), pp. 
448–456. [135] Art B Owen. “Innitely Imbalanced Logistic Regression.” In:JournalofMachineLearningResearch 8.4 (2007). [136] Jong Hee Park and Soichiro Yamauchi. “Change-Point Detection and Regularization in Time Series Cross-Sectional Data Analysis”. In: Political Analysis 31.2 (2023), pp. 257–277. [137] Mee Young Park, Trevor Hastie, and Robert Tibshirani. “Averaged gene expressions for regression”. In: Biostatistics 8.2 (2007), pp. 212–227. [138] M. Hashem Pesaran. Time Series and Panel Data Econometrics. OUP Catalogue 9780198759980. Oxford University Press, 2015.url: https://ideas.repec.org/b/oxp/obooks/9780198759980.html. [139] Bercedis Peterson and Frank E Harrell Jr. “Partial proportional odds models for ordinal response variables”. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 39.2 (1990), pp. 205–217. [140] Iosif Pinelis and Raymond Molzon. “Optimal-order bounds on the rate of convergence to normality in the multivariate delta method”. In: Electronic Journal of Statistics 10.1 (2016), pp. 1001–1063.doi: 10.1214/16-EJS1133. [141] Wolfgang Pößnecker and Gerhard Tutz. A General Framework for the Selection of Eect Type in Ordinal Regression. 2016.url: http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-26912-0. [142] John W Pratt. “Concavity of the log likelihood”. In: Journal of the American Statistical Association 76.373 (1981), pp. 103–106.issn: 1537274X.doi: 10.1080/01621459.1981.10477613. [143] R. H. B. Christensen. ordinal—Regression Models for Ordinal Data. 2019.url: https://CRAN.R-project.org/package=ordinal. [144] Stephen Reid and Robert Tibshirani. “Sparse regression and marginal testing using cluster prototypes”. In: Biostatistics 17.2 (Nov. 2016), pp. 364–376.issn: 1465-4644.doi: 10.1093/biostatistics/kxv049. eprint: https://academic.oup.com/biostatistics/article-pdf/17/2/364/6692798/kxv049.pdf. 124 [145] Mijke Rhemtulla, Riet van Bork, and Denny Borsboom. “Worse than measurement error: Consequences of inappropriate latent variable measurement models”. In: Psychological Methods 25.1 (2020), p. 30.url: https://pubmed.ncbi.nlm.nih.gov/23956527/. [146] Cristian R Rojas and Bo Wahlberg. “On change point detection using the fused lasso method”. In: arXiv preprint arXiv:1401.5408 (2014). [147] Mathieu Rosenbaum and Alexandre B Tsybakov. “Sparse recovery under matrix uncertainty”. In: Annals of Statistics 38.5 (2010), pp. 2620–2651.issn: 00905364.doi: 10.1214/10-AOS793. [148] Mathieu Rosenbaum and Alexandre B. Tsybakov. “Improved matrix uncertainty selector”. English. In: From probability to statistics and back: high-dimensional models and processes. A Festschrift in honor of Jon A. Wellner. Including papers from the conference, Seattle, WA, USA, July 28–31, 2010. Beachwood, OH: IMS, Institute of Mathematical Statistics, 2013, pp. 276–290.isbn: 978-0-940600-83-6.doi: 10.1214/12-IMSCOLL920. [149] Jonathan Roth, Pedro HC Sant’Anna, Alyssa Bilinski, and John Poe. “What’s trending in dierence-in-dierences? A synthesis of the recent econometrics literature”. In: Journal of Econometrics (2023). [150] Donald B Rubin. “Estimating causal eects of treatments in randomized and nonrandomized studies.” In: Journal of educational Psychology 66.5 (1974), p. 688. [151] Yvan Saeys, Thomas Abeel, and Yves Van de Peer. “Robust feature selection using ensemble feature selection techniques”. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2008, Antwerp, Belgium, September 15-19, 2008, Proceedings, Part II 19. Springer. 
2008, pp. 313–325. [152] Susanne M Schennach. “Recent Advances in the Measurement Error Literature”. In: Annual Review of Economics 8 (2016), pp. 341–377.doi: 10.1146/annurev-economics-080315-015058. [153] Johanna Schwarz and Dominik Heider. “GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making”. In: Bioinformatics 35.14 (Nov. 2018), pp. 2458–2465.issn: 1367-4803.doi: 10.1093/bioinformatics/bty984. eprint: https://academic.oup.com/bioinformatics/article-pdf/35/14/2458/28913220/bty984.pdf. [154] Mark R. Segal, Kam D. Dahlquist, and Bruce R. Conklin. “Regression Approaches for Microarray Data Analysis”. In: Journal of Computational Biology 10.6 (2004), pp. 961–980.issn: 1066-5277. doi: 10.1089/106652703322756177. [155] Mark R Segal, Kam D Dahlquist, and Bruce R Conklin. “Regression approaches for microarray data analysis”. In: Journal of Computational Biology 10.6 (2003), pp. 961–980. [156] Robert J. Sering. Approximation theorems of mathematical statistics. [Nachdr.] Wiley series in probability and mathematical statistics : Probability and mathematical statistics. New York, NY [u.a.]: Wiley, 1980. XIV, 371.isbn: 0471024031.url: http: //gso.gbv.de/DB=2.1/CMD?ACT=SRCHA&SRT=YOP&IKT=1016&TRM=ppn+024353353&sourceid=fbw_bibsonomy. 125 [157] M Sesia, C Sabatti, and E J Candès. “Gene hunting with hidden Markov model knockos”. In: Biometrika 106.1 (2019), pp. 1–18.issn: 14643510.doi: 10.1093/biomet/asy033. [158] Rajen D. Shah and Richard J. Samworth. “Discussion of ‘Correlated variables in regression: Clustering and sparse estimation’ by Peter Bühlmann, Philipp Rütimann, Sara van de Geer and Cun-Hui Zhang”. In: Journal of Statistical Planning and Inference 143.11 (2013), pp. 1866–1868. doi: 10.1016/j.jspi.2013.05.022. [159] Rajen D. Shah and Richard J. Samworth. “Variable selection with error control: another look at stability selection”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75.1 (2013), pp. 55–80.doi: 10.1111/j.1467-9868.2011.01034.x. [160] Dhruv B Sharma, Howard D Bondell, and Hao Helen Zhang. “Consistent Group Identication and Variable Selection in Regression With Correlated Predictors”. In: Journal of Computational and Graphical Statistics 22.2 (2013), pp. 319–340.doi: 10.1080/15533174.2012.707849. [161] Yiyuan She. “Sparse regression with exact clustering”. In: Electronic Journal of Statistics 4 (2010), pp. 1055–1096.doi: 10.1214/10-EJS578. [162] Xiaotong Shen and Hsin Cheng Huang. “Grouping pursuit through a regularization solution surface”. In: Journal of the American Statistical Association 105.490 (2010), pp. 727–739.issn: 01621459.doi: 10.1198/jasa.2010.tm09380. [163] James Soland, Megan Kuhfeld, and Kelly Edwards. “How survey scoring decisions can inuence your study’s results: A trip through the IRT looking glass.” In: Psychological Methods (2022). [164] Øystein Sørensen, Arnoldo Frigessi, and Magne Thoresen. “Measurement error in Lasso: Impact and likelihood bias correction”. In: Statistica Sinica 25.2 (2015), pp. 809–829.issn: 10170405.doi: 10.5705/ss.2013.180. [165] Therese Sørlie, Robert Tibshirani, Joel Parker, Trevor Hastie, J. S. Marron, Andrew Nobel, Shibing Deng, Hilde Johnsen, Robert Pesich, Stephanie Geisler, and et al. “Repeated observation of breast tumor subtypes in independent gene expression data sets”. In: Proceedings of the National Academy of Sciences 100.14 (2003), pp. 8418–8423.doi: 10.1073/pnas.0932692100. [166] Jiangwen Sun, Jin Lu, Tingyang Xu, and Jinbo Bi. 
“Multi-view sparse co-clustering via proximal alternating linearized minimization”. In: International Conference on Machine Learning. PMLR. 2015, pp. 757–766. [167] Liyang Sun and Sarah Abraham. “Estimating dynamic treatment eects in event studies with heterogeneous treatment eects”. In: Journal of Econometrics 225.2 (2021), pp. 175–199. [168] Qiang Sun, Kean Ming Tan, Han Liu, and Tong Zhang. “Graphical Nonconvex Optimization via an Adaptive Convex Relaxation”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, July 2018, pp. 4810–4817.url: http://proceedings.mlr.press/v80/sun18c.html. 126 [169] Robert Tibshirani. “Regression Shrinkage and Selection via the Lasso”. In: Journal of the Royal Statistical Society. Series B: Statistical Methodology 58.1 (1996), pp. 267–288. [170] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1 (2005), pp. 91–108. [171] Ryan J Tibshirani. “The lasso problem and uniqueness”. In: Electronic Journal of Statistics 7.1 (2013), pp. 1456–1490.issn: 19357524.doi: 10.1214/13-EJS815. arXiv: 1206.0313. [172] Ryan J Tibshirani and Jonathan Taylor. “The solution path of the generalized lasso”. In: The annals of statistics 39.3 (2011), pp. 1335–1371. [173] Matteo Togninalli, Ümit Seren, Dazhe Meng, Jorey Fitz, Magnus Nordborg, Detlef Weigel, Karsten Borgwardt, Arthur Korte, and Dominik G Grimm. “The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog”. In: Nucleic Acids Research 46.D1 (Oct. 2017), pp. D1150–D1156.issn: 0305-1048.doi: 10.1093/nar/gkx954. eprint: https://academic.oup.com/nar/article-pdf/46/D1/D1150/23162835/gkx954.pdf. [174] Panagiotis Toulis and David C Parkes. “Long-term Causal Eects via Behavioral Game Theory”. In: Advances in Neural Information Processing Systems. Ed. by D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett. Vol. 29. Curran Associates, Inc., 2016.url: https://proceedings.neurips.cc/paper_files/paper/2016/file/af4732711661056eadbf798ba191272a- Paper.pdf. [175] Gerhard Tutz and Jan Gertheiss. “Regularized regression for categorical data”. In: Statistical Modelling 16.3 (2016), pp. 161–200. [176] Ejike R Ugba, Daniel Mörlein, and Jan Gertheiss. “Smoothing in ordinal regression: An application to sensory data”. In: Stats 4.3 (2021), pp. 616–633. [177] A.W. van der Vaart.AsymptoticStatistics. Asymptotic Statistics. Cambridge University Press, 2000. isbn: 9780521784504.url: https://books.google.com/books?id=UEuQEM5RjWgC. [178] Riet Van Bork, Mijke Rhemtulla, Klaas Sijtsma, and Denny Borsboom. “A causal theory of error scores.” In: Psychological Methods (2022). [179] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth. ISBN 0-387-95457-0. New York: Springer, 2002.url: http://www.stats.ox.ac.uk/pub/MASS4. [180] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. isbn: 9781108415194.url: https://books.google.com/books?id=J-VjswEACAAJ. [181] Roman Vershynin. “Introduction to the non-asymptotic analysis of random matrices”. In: Compressed Sensing: Theory and Applications. Ed. by Yonina C. Eldar and GittaEditors Kutyniok. 
Cambridge University Press, 2012, pp. 210–268.doi: 10.1017/CBO9780511794308.006. 127 [182] Vivian Viallon, Sophie Lambert-Lacroix, Hölger Hoeing, and Franck Picard. “On the robustness of the generalized fused lasso to prior specications”. In: Statistics and Computing 26.1-2 (2016), pp. 285–301. [183] Vivian Viallon, Sophie Lambert-Lacroix, Holger Höing, and Franck Picard. “Adaptive generalized fused-lasso: Asymptotic properties and applications”. In: (2013). [184] Davide Viviano and Jelena Bradic. Dynamic covariate balancing: estimating treatment eects over time. 2021. arXiv: 2103.01280[econ.EM]. [185] Luis Von Ahn and Laura Dabbish. “Labeling images with a computer game”. In: Proceedings of the SIGCHI conference on Human factors in computing systems. 2004, pp. 319–326. [186] Till Von Wachter, Marianne Bertrand, Harold Pollack, Janey Rountree, and Brian Blackwell. “Predicting and preventing homelessness in Los Angeles”. In: California Policy Lab and University of Chicago Poverty Lab (2019). [187] Stefan Wager and Susan Athey. “Estimation and inference of heterogeneous treatment eects using random forests”. In: Journal of the American Statistical Association 113.523 (2018), pp. 1228–1242. [188] Hadley Wickham. ggplot2 Elegant Graphics for Data Analysis. Use R! Cham: Springer International Publishing, 2016.isbn: 978-3-319-24275-0.doi: 10.1007/978-3-319-24277-4. [189] Keith F. Widaman and William Revelle. “Thinking thrice about sum scores, and then some more about measurement and analysis”. In: Behavior Research Methods 0.0123456789 (2022).issn: 15543528.doi: 10.3758/s13428-022-01849-w. [190] CO Wilke. “cowplot: streamlined plot theme and plot annotations for ‘ggplot2’. R package version 1.1.1”. In: https://CRAN. R-project. org/package= cowplot (2020). [191] Daniela M Witten, Ali Shojaie, and Fan Zhang. “The cluster elastic net for high-dimensional regression with unknown variable grouping”. In: Technometrics 56.1 (2014), pp. 112–122.issn: 00401706.doi: 10.1080/00401706.2013.810174. [192] Sam Witty, Kenta Takatsu, David Jensen, and Vikash Mansinghka. “Causal Inference using Gaussian Processes with Structured Latent Confounders”. In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, July 2020, pp. 10313–10323.url: https://proceedings.mlr.press/v119/witty20a.html. [193] Jerey M Wooldridge. Econometric analysis of cross section and panel data. MIT press, 2010. [194] Jerey M Wooldridge. “Two-way xed eects, the two-way mundlak regression, and dierence-in-dierences estimators”. In: Available at SSRN 3906345 (2021). [195] Michael J. Wurm, Paul J. Rathouz, and Bret M. Hanlon. “Regularized Ordinal Regression and the ordinalNet R Package”. In: Journal of Statistical Software 99.6 (2021), pp. 1–42. 128 [196] Michael J Wurm, Paul J Rathouz, and Bret M Hanlon. “Regularized ordinal regression and the ordinalNet R package”. In: arXiv preprint arXiv:1706.05003 (2017). [197] Yu Xie, Jennie E Brand, and Ben Jann. “Estimating heterogeneous treatment eects with observational data”. In: Sociological methodology 42.1 (2012), pp. 314–347. [198] Bo Xin, Yoshinobu Kawahara, Yizhou Wang, and Wen Gao. “Ecient generalized fused lasso and its application to the diagnosis of Alzheimer’s disease”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 28. 1. 2014. [199] Jeremy Yang, Dean Eckles, Paramveer Dhillon, and Sinan Aral. “Targeting for long-term outcomes”. 
In: arXiv preprint arXiv:2010.15835 (2020).
[200] Bin Yu. "Stability". In: Bernoulli 19.4 (2013), pp. 1484–1500. doi: 10.3150/13-BEJSP14.
[201] Bin Yu and Karl Kumbier. "Veridical data science". In: Proceedings of the National Academy of Sciences 117.8 (2020), pp. 3920–3929. issn: 0027-8424. doi: 10.1073/pnas.1901326117. eprint: https://www.pnas.org/content/117/8/3920.full.pdf.
[202] F. Zhang. The Schur Complement and Its Applications. Numerical Methods and Algorithms. Springer, 2005. isbn: 9780387242712. url: https://books.google.com/books?id=Wjd8%5C_AwjiIIC.
[203] Weinan Zhang, Shuai Yuan, and Jun Wang. "Optimal real-time bidding for display advertising". In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014, pp. 1077–1086.
[204] Peng Zhao and Bin Yu. "On Model Selection Consistency of Lasso". In: Journal of Machine Learning Research 7 (2006), pp. 2541–2563.
[205] Zemin Zheng, Yang Li, Chongxiu Yu, and Gaorong Li. "Balanced estimation for high-dimensional measurement error models". In: Computational Statistics & Data Analysis 126 (2018), pp. 78–91.
[206] Yunzhang Zhu. "An augmented ADMM algorithm with application to the generalized lasso problem". In: Journal of Computational and Graphical Statistics 26.1 (2017), pp. 195–204.
[207] Hui Zou. "The adaptive lasso and its oracle properties". In: J. Amer. Statist. Assoc. 101.476 (2006), pp. 1418–1429. issn: 0162-1459. url: https://doi.org/10.1198/016214506000000735.
[208] Hui Zou and Trevor Hastie. "Regularization and variable selection via the elastic net". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005), pp. 301–320.

Appendix A

Cluster Stability Selection Appendix

In Appendix A.1, we present additional theoretical results that motivate cluster stability selection and our weighting schemes. Appendix A.2 contains more details on the stability metric proposed by Nogueira, Sechidis, and Brown [132] that we use in our simulation studies and data application, and more details and results than appear in Section 2.6 of the main paper. Appendix A.3 contains the proof of Theorem 2.3.1, and the proofs of supporting results for Theorem 2.3.1 are presented in Appendix A.4. Appendix A.5 contains the proofs of Corollary 2.3.2, Theorem 2.5.1, and Theorem 2.5.3; the statement of Theorem A.5.2 and outlines of proofs of Theorems 2.5.5 and A.5.2; and the proofs of Proposition A.1.2, Corollary A.1.3, and Proposition A.1.5. Finally, in Appendix A.6 we present the proofs of various technical lemmas.

A.1 Additional Theory

Section A.1.1 contains results showing theoretically that proxies for important latent features can be more important predictors than directly observed "weak signal" features. We present results identifying the optimal weights for cluster representatives under our latent variable model in Section A.1.2.

A.1.1 Low-Noise Proxies Are Good Selections For Prediction

A practitioner who is strictly interested in features that appear in the true data-generating process may be uninterested in low-noise proxies for important latent features, but we will show that they are useful for out-of-sample predictive performance. We will make use of a more general data-generating process than (2.6). Suppose $q$ proxies are observed,

$$X_j := Z + \epsilon_j, \qquad j \in [q], \qquad (A.1.1)$$

with (possibly different) noise variances: $\epsilon_j \sim N(0, \sigma_j^2 I_n)$, $j \in [q]$.
Suppose

$$(Z_i, X_{i,q+1}, \ldots, X_{ip}, \epsilon_i, \epsilon_{i1}, \ldots, \epsilon_{iq}), \qquad i \in [n], \qquad (A.1.2)$$

are independent Gaussian random variables, and

$$y = \beta_Z Z + \sum_{j=q+1}^{p} \beta_j X_j + \epsilon, \qquad (A.1.3)$$

with $\beta_{q+1}, \ldots, \beta_p \in \mathbb{R}$. Now we will define a notion of prediction risk that we will use to compare features as selections.

Definition A.1.1 (Prediction risk of a single feature). Assume the setup of (A.1.1), (A.1.2), and (A.1.3). Let $\tilde{Z}, \tilde{X}_{q+1}, \ldots, \tilde{X}_p, \tilde{\epsilon}$, and $\tilde{\epsilon}_1, \ldots, \tilde{\epsilon}_q$ be i.i.d. copies of the corresponding variables, and define $\tilde{y}$ and $\tilde{X}_1, \ldots, \tilde{X}_q$ analogously. The prediction risk $R(j)$ of a model using only feature $j$ is the expected out-of-sample mean squared error of the predictions:

$$R(j) := \mathbb{E}\left[\frac{1}{n}\left\|\tilde{y} - \hat{\beta}_j \tilde{X}_j\right\|_2^2\right],$$

where $\hat{\beta}_j = X_j^\top y / (X_j^\top X_j)$ is the ordinary least squares (OLS) coefficient.

We show that proxies are better selections than directly observed features (in the sense of reducing this prediction risk) if the signal strength of the latent feature is large enough and the noise added to the latent feature is not too large.

Proposition A.1.2. Assume* the setup of (A.1.1), (A.1.2), and (A.1.3). Then for any $j \in [q]$ and any $k \in \{q+1, \ldots, p\}$,

$$R(j) = R(k) - \frac{n-1}{n^2}\left(\frac{\beta_Z^2}{1 + \sigma_j^2} - \beta_k^2\right).$$

We see from Proposition A.1.2 that the risk of a proxy is less than the risk of a weak signal feature if (and only if)

$$\frac{\beta_Z^2}{1 + \sigma_j^2} - \beta_k^2 > 0 \iff \frac{\beta_Z^2}{\beta_k^2} > 1 + \sigma_j^2.$$

This implies that in the setting of Theorem 2.3.1, $X_1$ or $X_2$ is a better choice for a model of size 1 than $X_3$:

Corollary A.1.3. In the setting of Theorem 2.3.1, $R(1) = R(2) < R(3)$.

(For the proofs of these results, see Appendix A.5.3.) Together with Corollary 2.3.2, these results show that in the setting of Theorem 2.3.1, stability selection provides a worse ranking of the features for predictive performance than the lasso.

A.1.2 Optimal Weights for Cluster Representatives

Our goal in this section is to characterize what weights are optimal for taking weighted averages of proxy features for latent variables. Proposition A.1.5 will motivate our weighting schemes for cluster stability selection. First we will need a notion of prediction risk for cluster representatives.

* It is worth noting that this result readily generalizes further to cases of arbitrary numbers of latent signals and so on, but this becomes both notationally inconvenient and beyond the scope of what is needed in this paper.

Definition A.1.4 (Prediction risk of cluster representatives). Assume the same setup as Definition A.1.1. For a cluster $C \subseteq [p]$ and arbitrary weights $w \in \Delta^{|C|-1}$, recall the definition of the cluster representative (2.17), and define $\tilde{X}^{\mathrm{rep}}_C$ as an out-of-sample draw (using the same weights). Define the prediction risk of a model containing only a cluster representative constructed using these weights as

$$R(C, w) := \mathbb{E}\left[\frac{1}{n}\left\|\tilde{y} - \hat{\beta} \tilde{X}^{\mathrm{rep}}_C\right\|_2^2\right], \qquad (A.1.4)$$

where $\hat{\beta} = (X^{\mathrm{rep}}_C)^\top y \,/\, \left((X^{\mathrm{rep}}_C)^\top X^{\mathrm{rep}}_C\right)$.

Proposition A.1.5 identifies the optimal weights for proxy features for minimizing this prediction risk.

Proposition A.1.5. Assume the setup of (A.1.1), (A.1.2), and (A.1.3). Consider regressing $y$ against a weighted average of the $q$ proxies, as in (2.17). Then

1. the weights that minimize the prediction risk defined in (A.1.4) are
$$w_j^* = \frac{1/\sigma_j^2}{\sum_{j'=1}^{q} 1/\sigma_{j'}^2}, \qquad j \in [q];$$

2. the minimum risk (using the optimal weights $w^* = \{w_j^*\}_{j \in [q]}$) is
$$E_{\mathrm{ideal}} + \frac{n-1}{n^2} \cdot \frac{\beta_Z^2}{1 + \sum_{j=1}^{q} 1/\sigma_j^2},$$
where $E_{\mathrm{ideal}}$, defined in (A.5.15), is the prediction risk of regressing against $Z$ directly; and

3. for any $k \in \{q+1, \ldots, p\}$,
$$R([q], w^*) = R(k) - \frac{n-1}{n^2}\left(\frac{\beta_Z^2 \sum_{j=1}^{q} 1/\sigma_j^2}{1 + \sum_{j=1}^{q} 1/\sigma_j^2} - \beta_k^2\right).$$

Proof. Provided in Appendix A.5.3.
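As a concrete illustration of Propositions A.1.2 and A.1.5, the following is a minimal Monte Carlo sketch (not part of the original text). The sample size, noise variances, and coefficient values are illustrative assumptions chosen so that the conditions of the propositions hold; the helper functions are hypothetical.

```r
# Minimal Monte Carlo sketch of Propositions A.1.2 and A.1.5 (illustrative values
# only): single proxies X_j = Z + eps_j and a directly observed weak-signal
# feature X_k are compared, as lone regressors, with the inverse-variance-weighted
# average of the proxies.
set.seed(1)
n <- 100
sigma2 <- c(0.25, 1, 4)                    # assumed noise variances of q = 3 proxies
beta_Z <- 1.5; beta_k <- 0.5               # strong latent signal, weak direct signal
w_star <- (1 / sigma2) / sum(1 / sigma2)   # optimal weights from Proposition A.1.5(i)

one_rep <- function() {
  draw <- function() {
    Z <- rnorm(n); Xk <- rnorm(n)
    X <- sapply(sigma2, function(s2) Z + rnorm(n, sd = sqrt(s2)))
    list(X = X, Xk = Xk, y = beta_Z * Z + beta_k * Xk + rnorm(n))
  }
  tr <- draw(); te <- draw()               # training draw and independent test draw
  risk <- function(x_tr, x_te) {           # single-feature OLS without intercept,
    b <- sum(x_tr * tr$y) / sum(x_tr^2)    # as in Definition A.1.1
    mean((te$y - b * x_te)^2)
  }
  c(proxy_1 = risk(tr$X[, 1], te$X[, 1]),  # low-noise proxy for Z
    weak    = risk(tr$Xk, te$Xk),          # directly observed weak-signal feature
    rep_avg = risk(tr$X %*% w_star, te$X %*% w_star))  # weighted cluster representative
}
rowMeans(replicate(2000, one_rep()))       # Monte Carlo estimates of the prediction risks
```

With these assumed values, $\beta_Z^2/\beta_k^2 = 9$ exceeds both $1 + \sigma_1^2$ and $(1 + \sum_j 1/\sigma_j^2)/(\sum_j 1/\sigma_j^2)$, so the estimated risks should order as the propositions predict: the weighted representative lowest, then the low-noise proxy, then the weak feature.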
Proposition A.1.5(i) leads to the unsurprising conclusion that it is optimal to include all proxies in a weighted average but place lower weight on proxies that have higher noise. Weighted averaged cluster stability selection should do this because proxies that are more correlated with the latent signal (and therefore $y$) should have higher selection probabilities, and therefore higher expected selection proportions in cluster stability selection. In Appendix A.2.3.1 we present figures demonstrating how well each of our cluster stability selection weighting schemes approximates these optimal weights, and we will see that weighted averaged cluster stability selection does indeed best approximate the weights when the noise levels of the proxies are unequal.

Proposition A.1.5(ii) shows that adding one more proxy with finite noise variance $\sigma_j^2$ always reduces the prediction risk when optimal weights are used. Also, note that if any one proxy has noise variance tending towards 0, the optimal weight on that proxy tends toward 1, the prediction risk tends toward $E_{\mathrm{ideal}}$, and the quantity on the right side of (A.1.5) tends toward 1.

From Proposition A.1.5(iii) we see that $R([q], w^*) < R(k)$ if and only if

$$\frac{\beta_Z^2 \sum_{j=1}^{q} 1/\sigma_j^2}{1 + \sum_{j=1}^{q} 1/\sigma_j^2} - \beta_k^2 > 0 \iff \frac{\beta_Z^2}{\beta_k^2} > \frac{1 + \sum_{j=1}^{q} 1/\sigma_j^2}{\sum_{j=1}^{q} 1/\sigma_j^2}. \qquad (A.1.5)$$

By comparison of Proposition A.1.5(iii) to Proposition A.1.2, we see that the signal strength of $Z$ does not have to be as high for the prediction risk of $X^{\mathrm{rep}}_{[q]}$ to be less than any $R(k)$, $k \in \{q+1, \ldots, p\}$, as it does for any one $R(j)$, $j \in [q]$, to be less than $R(k)$, since if $0 < \sigma_j^2 < \infty$ for all $j$ it holds that

$$\frac{1 + \sum_{j'=1}^{q} 1/\sigma_{j'}^2}{\sum_{j'=1}^{q} 1/\sigma_{j'}^2} < 1 + \sigma_j^2 \qquad \forall\, q > 1,\ \forall\, j \in [q].$$

See also Park, Hastie, and Tibshirani [137] and Bühlmann et al. [34] (particularly Propositions 4.2 and 4.3) for other theoretical results about the benefits of averaging correlated features under different assumptions from ours.

A.2 Additional Simulation Study Details

In this section we include additional details, results, and plots from the simulation studies and data application. We begin by describing the stability metric due to Nogueira, Sechidis, and Brown [132] that we use in the simulation studies and data application in Section A.2.1. In Section A.2.2 we describe a third weighting scheme for cluster stability selection, sparse cluster stability selection, which we include in the additional plots of simulation study results in Section A.2.3 and data application results in Section A.2.4, along with two further competitor methods: the elastic net [208] with mixing parameter $\alpha = 0.5$ and stability selection with the elastic net as the base procedure.

A.2.1 Stability Metric

There is a literature dating back to at least Kalousis, Prados, and Hilario [99] attempting to quantify the stability of feature selection methods. We choose to evaluate stability using the metric proposed relatively recently by Nogueira, Sechidis, and Brown [132]. Briefly, the metric works as follows (we adopt in part the notation of [132]). For each method and for each model size $s$, our simulations generate $M \leq 2000$ different selected sets (the number of selected sets may be smaller than 2000 for the stability selected methods due to ties in selection proportions making sets of a fixed $s$ sometimes undefined). These selected sets are encoded in binary $M \times p$ matrices $Z$, where $Z_{ij} = 1$ if feature $j$ was selected by a given method with a given model size on simulation $i$, and 0 otherwise.
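As a minimal sketch of this encoding (not part of the original text; the selected sets below are hypothetical placeholders):

```r
# Encode M selected sets (one per simulation) into the binary M x p matrix Z
# described above. The example sets are hypothetical placeholders.
p <- 6
selected_sets <- list(c(1, 2), c(1, 3), c(2, 3))     # M = 3 selected sets of size s = 2
M <- length(selected_sets)
Z <- matrix(0L, nrow = M, ncol = p)
for (i in seq_len(M)) Z[i, selected_sets[[i]]] <- 1L
p_hat <- colMeans(Z)                                  # per-feature selection frequencies
```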
Note that each row of each $Z$ contains exactly $s$ ones for the lasso, elastic net, protolasso, and stability selection. For the cluster representative lasso and cluster stability selection methods, each row may have more than $s$ ones, since every feature in a selected cluster with nonzero weights is considered to be selected. Each column of $Z$ can be interpreted as containing $M$ realizations of a Bernoulli random variable, with sample mean $\hat{p}_j$ equal to the mean of the column and unbiased sample variance

$$\hat{\sigma}_j^2 := \frac{M}{M-1}\, \hat{p}_j (1 - \hat{p}_j).$$

The average of these values, $\frac{1}{p}\sum_{j=1}^{p} \hat{\sigma}_j^2$, can be thought of as a measure of the instability of a feature selection method at a given model size. Nogueira, Sechidis, and Brown propose as a stability measure the quantity

$$\hat{\Phi}(Z) := 1 - \frac{\frac{1}{p}\sum_{j=1}^{p} \hat{\sigma}_j^2}{\frac{\bar{s}}{p}\left(1 - \frac{\bar{s}}{p}\right)},$$

where $\bar{s}$ is the average number of features selected in each row. The minus sign in front of the right term makes this a measure of stability, rather than instability. The 1 on the left ensures that this measure is at most 1, and the denominator of the term on the right is the sample variance of a feature selection method that selects features completely at random with probability $\bar{s}/p$. This ensures that the metric has expected value 0 for such a "null" feature selection method, and for feature selection methods that are at least as good as random guessing, the metric is stochastically bounded between 0 and 1.

Simpler stability metrics exist, but we choose this metric because of the useful theoretical properties it enjoys [132]. Most notably, it is easier to compare this metric across different model sizes, because the expected value of this metric for a "null" feature selection method is always 0, regardless of model size; it turns out this is not true for simpler stability metrics. Further, Nogueira, Sechidis, and Brown provide an asymptotic $Z$-test for their metric (when each selected set is independent), which we use to compare the stability of competing methods in the simulation studies (though not in the GWAS data application, where the selected sets are dependent because they are yielded from different subsamples of the same fixed data set).

A.2.2 Sparse Cluster Stability Selection

We propose sparse cluster stability selection as a third weighting scheme in addition to the two proposed in Section 2.4.1. For each feature $j$ in a cluster $C_k$, we assign weights as

$$w_{kj} = \frac{\mathbb{1}\left\{ j \in \arg\max_{j' \in C_k} \hat{\Theta}_B(j') \right\}}{\left| \arg\max_{j' \in C_k} \hat{\Theta}_B(j') \right|}, \qquad (A.2.1)$$

where $\hat{\Theta}_B(j')$ denotes the selection proportion of feature $j'$ across the subsamples. That is, we assign equal weight to all of the most frequently selected individual cluster members, and 0 weight to the rest. In particular, if one cluster member was selected most frequently with no ties, we assign weight 1 to that cluster member.

Sparse cluster stability selection removes all but the most frequently selected individual features from each cluster. Its output resembles stability selection in the sense that sparse cluster stability selection often returns only clusters of size 1 (if the most frequently selected feature within a cluster is not tied with another feature in the cluster). This weighting scheme is in a way analogous to the protolasso [144], but the cluster prototype is the most frequently selected feature across all subsamples rather than the cluster member with the greatest marginal correlation with the response. (One advantage cluster stability selection enjoys over the protolasso is that cluster stability selection does not require a notion of correlation with the response.)
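A minimal sketch of the weighting rule (A.2.1), not from the original text; the selection proportions below are hypothetical values for the members of a single cluster.

```r
# Sparse cluster stability selection weights for one cluster C_k, computed from
# the members' selection proportions (hypothetical values).
sel_prop <- c(0.62, 0.88, 0.88, 0.15)          # selection proportion of each member of C_k
winners  <- which(sel_prop == max(sel_prop))   # the arg max set in (A.2.1)
w <- numeric(length(sel_prop))
w[winners] <- 1 / length(winners)              # equal weight on the most frequently selected members
w                                              # 0.0 0.5 0.5 0.0
```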
This weighting scheme has the virtue of representing the important signals with as few features as possible. This could be particularly advantageous if the clusters are estimated, rather than known, and might be too large; sparse cluster stability selection may reject irrelevant noise features in a cluster. This is also useful if the clusters are known in advance, but sparsity is desired to aid interpretability, because individual features are expensive to measure, or for any number of other reasons. However, if all of the cluster members are genuine proxies as in our model, the weights in sparse cluster stability selection are always suboptimal by Proposition A.1.5(i) (except for the very special case where all the noise levels are equal and every feature happens to tie in selection proportion). Our additional results in Sections A.2.3 and A.2.4 suggest that in practice, sparsity does indeed come at a steep price to both out-of-sample predictive performance and stability, so in general we recommend against using sparse cluster stability selection in settings where predictive accuracy is important.

A.2.3 More Simulation Plots

In this section we present larger versions of the plots from our simulation studies that appear in Sections 2.6.1 and 2.6.2 of the main paper, but first we present two figures investigating how well each cluster stability selection weighting method estimates the weights.

A.2.3.1 Evaluating Cluster Stability Selection Weight Estimates

Proposition A.1.5 provides the optimal weights $w^*$ on each proxy member (in the sense of minimizing prediction risk) when taking a weighted average of the cluster members to approximate $Z$. On each simulation run in both simulation studies, we compared the weights estimated by cluster stability selection under each weighting scheme, $\hat{w}$, to the optimal weights $w^*$, calculating the empirical mean squared error $\frac{1}{|C|}\|\hat{w} - w^*\|_2^2$, where $|C|$ is the size of the cluster (9 for the simulation study in Section 2.6.1, 5 for the simulation study in Section 2.6.2).

In Figure A.1, we display boxplots of the mean squared errors of the estimated weights from each of the three weighting schemes in the simulation study from Section 2.6.1. Because the proxies for $Z$ have unequal noise, simple averaging is not optimal under Proposition A.1.5. We see that the weights estimated by weighted averaged cluster stability selection are indeed better estimates of the optimal weights. Sparse cluster stability selection typically estimates "weights" by placing all of the weight on a single strong proxy, and we see clearly that this is much worse than simple averaging. (In the plots to come, we will see that this also leads to much worse predictive performance.) The outlier results with lower MSE occur when there is a tie in selection proportions between more than one proxy, so sparse cluster stability selection takes an equal average of more than one proxy. The outlier results with higher MSE occur on the rare occasions where sparse cluster stability selection happens to select a weak proxy rather than a strong proxy by random chance.

Figure A.1: Boxplots of realized mean squared errors of estimated weights on proxies for $Z$ (relative to optimal weights under Proposition A.1.5) from the simulation study in Section 2.6.1.
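For concreteness, a minimal sketch of the weight-error computation summarized in Figure A.1 (not from the original text); the estimated weights and noise variances below are hypothetical.

```r
# Empirical MSE of estimated cluster weights relative to the optimal weights of
# Proposition A.1.5 (hypothetical inputs).
sigma2 <- c(0.25, 0.25, 1, 1, 4)                 # assumed proxy noise variances
w_hat  <- c(0.30, 0.28, 0.18, 0.16, 0.08)        # weights estimated by one CSS weighting scheme
w_star <- (1 / sigma2) / sum(1 / sigma2)         # optimal weights
mse_w  <- mean((w_hat - w_star)^2)               # (1/|C|) * ||w_hat - w_star||_2^2
```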
Figure A.2: Boxplots of realized mean squared errors of estimated weights on proxies for $Z$ (relative to optimal weights under Proposition A.1.5) from the simulation study in Section 2.6.2.

Figure A.2 displays boxplots of the observed weight mean squared errors from the simulation study in Section 2.6.2. In this case, simple averaging is optimal, so we see that simple averaged cluster stability selection always has an MSE of 0. Weighted averaging in this context can only hurt, but weighted averaged cluster stability selection nonetheless estimates the optimal (equal) weights fairly well since the expected selection proportions for all of the proxies are equal.

Taken together, these results explain the performance gap we will see in the coming figures and tables between sparse cluster stability selection and the other cluster stability selection methods. The protolasso uses very similar weighting to sparse cluster stability selection, which explains part of the protolasso's poor predictive performance in all of the simulation studies. On the other hand, since the cluster representative lasso uses equal weighting, we can infer from these results that it will approximate $Z$ as well as simple averaged cluster stability selection.

A.2.3.2 Additional Methods We Evaluate

We include all three cluster stability selection methods (weighted averaged, simple averaged, and sparse) in each of these plots, as well as the elastic net with mixing parameter $\alpha = 0.5$ and stability selection with the elastic net (again with $\alpha = 0.5$) as the base procedure. We implement the elastic net using the R glmnet package [68], and we implement all three cluster stability selection methods as well as stability selection with the elastic net as the base procedure using the R cssr package [58]. Since the different cluster stability selection weighting schemes differ only in the way they combine the results (see Algorithm 1), on each simulation all three methods rely on the same shared feature and cluster selection proportions to reduce uninformative variance across simulations and compare the weighting schemes as directly as possible.

The elastic net is a natural competitor method to consider because it is also motivated in part by settings with clusters of highly correlated features. In contrast to the lasso, the elastic net's mixture of $\ell_1$ and $\ell_2$ penalties has the effect of tending to select all of the features in a highly correlated cluster; see Section 2.3 of Zou and Hastie [208]. Further, it seems plausible that stability selection with the elastic net as the base procedure could solve the vote-splitting problem because all of the members of a highly correlated cluster that tend to be selected together will tend to have high selection proportions. However, we will see in the plots to come that both the elastic net and stability selection with the elastic net as the base procedure turn out to be uncompetitive with cluster stability selection in our setting.
For this reason, stability selection with the elastic net base procedure fails as well: though using the elastic net as the base procedure for stability selection may partially alleviate the vote-splitting problem, ranking all of the members of an important cluster highly results in models that repeat redundant cluster members while omitting useful weak signal features. Note that the cluster representative lasso generally performs much better than the elastic net, and simple and weighted averaged cluster stability selection outperform all of these methods. A.2.3.3 MorePlotsfromSection2.6.1SimulationStudy Figures A.3, A.4, and A.5 are larger versions of the plots from Figure 2.3 in the main paper that also include results from sparse cluster stability selection, the elastic net (with mixing paramter = 0:5), and stability selection with the elastic net as the base procedure from the simulation study from Section 2.6.1 where the methods use estimated clusters. Figures A.6, A.7, and A.8 are larger versions of the plots from Figure 2.4 (with additional methods) in the setting where the clusters are assumed to be known. The reason that weighted averaged cluster stability selection does not dramatically outperform simple averaged cluster stability selection in Figures A.3 and A.6 may be that the simple averaged weights are not so far from optimal in this setting, as we discussed in reference to Figure A.1. Table A.1 displays the means and standard errors of the MSE values for each method and each model size across all 2000 simulations in Section 2.6.1 in the case that the clusters are estimated. Table A.2 con- tains p-values for one-tailed paired t-tests of the alternative hypothesis that weighted averaged cluster stability selection has a lower mean MSE than each of the competitor methods at each model size. Table A.3 is similar, but containsp-values for tests that simple averaged cluster stability selection outperforms the competitors methods. In both of these tables, the rst row is blank because because no cluster stability 142 selection model of size one existed for either cluster stability selection method due to ties in cluster selec- tion proportions. Tables A.4, A.5, and A.6 present analogous results for the case when the methods that use clusters are provided with the known clusters. Table A.7 displays the estimated stability (with standard error) for each metric and model size according to the metric from Nogueira, Sechidis, and Brown [132]. Table A.8 containsp-values for one-tailedZ-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each of the competitor methods at each model size, and Table A.9 makes the same comparisons for simple averaged stability selection. Finally, Tables A.10 - A.12 display the same results for the setting where the clusters are known. Table A.1: Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.1. 
SA CSS WA CSS SS CRL Protolasso Lasso 1 NaN (NA) NaN (NA) 4.18 (0.0344) 3.27 (0.00543) 3.38 (0.00459) 3.38 (0.00465) 2 2.22 (0.0173) 2.13 (0.017) 3.62 (0.0325) 2.34 (0.00655) 2.47 (0.00664) 2.89 (0.00896) 3 1.81 (0.00907) 1.73 (0.00941) 3.15 (0.0262) 1.94 (0.00599) 2.07 (0.00613) 2.38 (0.00933) 4 1.5 (0.00506) 1.42 (0.0054) 2.57 (0.0247) 1.66 (0.00539) 1.79 (0.00527) 1.96 (0.00654) 5 1.26 (0.00356) 1.18 (0.00384) 2.04 (0.0224) 1.43 (0.00507) 1.56 (0.00485) 1.67 (0.0057) 6 1.07 (0.00305) 0.99 (0.0033) 1.58 (0.0181) 1.25 (0.0048) 1.38 (0.00465) 1.44 (0.00524) 7 0.909 (0.0028) 0.827 (0.00304) 1.25 (0.0144) 1.08 (0.00451) 1.22 (0.0043) 1.24 (0.00478) 8 0.765 (0.00269) 0.685 (0.00298) 0.971 (0.0099) 0.943 (0.00417) 1.07 (0.00405) 1.07 (0.00453) 9 0.654 (0.00269) 0.575 (0.00299) 0.771 (0.00617) 0.814 (0.00382) 0.938 (0.00358) 0.912 (0.00422) 10 0.562 (0.00272) 0.485 (0.00304) 0.634 (0.0041) 0.706 (0.00352) 0.829 (0.00328) 0.777 (0.00394) 11 0.498 (0.00288) 0.422 (0.00314) 0.533 (0.00329) 0.622 (0.00334) 0.735 (0.00316) 0.662 (0.00367) A.2.3.4 MorePlotsfromSection2.6.2SimulationStudy Figures A.9 - A.14 are analogous to Figures A.3 - A.8 in the previous subsection; please refer to the descrip- tions there. Similarly, Tables A.13 - A.24 are analogous to tables A.1 - A.12 in the previous section. 143 1 2 3 4 2 4 6 8 10 12 No. Fitted Coefficients MSE Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.3: Additional results from the simulation study in Section 2.6.1 for average test set MSE against model size (measured by number of tted coecients) across all 2000 simulations when methods are pro- vided with estimated clusters. Note that this plot is identical to the leftmost plot of Figure 2.3 with addi- tional methods included. 144 0.4 0.6 0.8 1.0 2 4 6 8 10 No. Fitted Coefficients NSB Stability Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.4: Additional results from the simulation study in Section 2.6.1 for stability against model size (measured by number of tted coecients) across all 2000 simulations when methods are provided with estimated clusters. Note that this plot is identical to the center plot of Figure 2.3 with additional methods included. 145 1 2 3 4 0.4 0.6 0.8 1.0 NSB Stability MSE Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.5: Additional results from the simulation study in Section 2.6.1 plotting average test set MSE against model size when methods are provided with estimated clusters. Note that this plot is identical to the rightmost plot of Figure 2.3 with additional methods included. Table A.2: Table ofp-values from pairedt-tests of the alternative hypothesis that weighed averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1. 
SA CSS SS CRL Protolasso Lasso 1 2 1.36e-11 6.72e-06 9.25e-05 2.42e-08 1.29e-16 3 4.44e-73 3.34e-48 1.49e-35 5.54e-75 5.25e-84 4 6.95e-190 1.11e-109 3.92e-131 7.63e-244 4.53e-264 5 5.5e-290 4.94e-136 5.58e-215 0 0 6 0 2.41e-133 4.07e-292 0 0 7 0 2.48e-122 0 0 0 8 0 1.19e-131 0 0 0 9 0 6.22e-167 0 0 0 10 0 2.25e-201 0 0 0 11 0 2.48e-195 0 0 0 146 1 2 3 4 2 4 6 8 10 12 No. Fitted Coefficients MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.6: Additional results from the simulation study in Section 2.6.1.1 for average test set MSE against model size (measured by number of tted coecients) across all 2000 simulations when methods are pro- vided with known clusters. Note that this plot is identical to the leftmost plot of Figure 2.4 with additional methods included. 147 0.4 0.6 0.8 1.0 2 4 6 8 10 No. Fitted Coefficients NSB Stability Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.7: Additional results from the simulation study in Section 2.6.1.1 for stability against model size (measured by number of tted coecients) across all 2000 simulations when methods are provided with known clusters. Note that this plot is identical to the center plot of Figure 2.4 with additional methods included. 148 1 2 3 4 0.4 0.6 0.8 1.0 NSB Stability MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.8: Additional results from the simulation study in Section 2.6.1.1 plotting average test set MSE against model size when methods are provided with known clusters. Note that this plot is identical to the rightmost plot of Figure 2.4 with additional methods included. Table A.3: Table ofp-values from pairedt-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1. WA CSS SS CRL Protolasso Lasso 1 2 1 1.13e-05 0.0253 5.04e-06 5.9e-15 3 1 1.03e-44 1.82e-16 2.21e-52 2.23e-71 4 1 1.27e-98 4.46e-79 1.35e-188 6.95e-224 5 1 6.15e-116 1.96e-135 2.18e-300 0 6 1 2.89e-103 1.76e-191 0 0 7 1 3.93e-83 1.15e-227 0 0 8 1 3.2e-74 1.08e-273 0 0 9 1 2.24e-67 2.81e-268 0 0 10 1 3.29e-60 1.97e-260 0 0 11 1 6.88e-24 2.24e-223 0 6.47e-261 149 Table A.4: Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.1. 
SA CSS WA CSS SS CRL Protolasso Lasso 1 NaN (NA) NaN (NA) 4.18 (0.0344) 3.24 (0.0053) 3.38 (0.00459) 3.38 (0.00465) 2 2.2 (0.014) 2.15 (0.016) 3.62 (0.0325) 2.27 (0.00453) 2.42 (0.00485) 2.89 (0.00896) 3 1.79 (0.00838) 1.75 (0.00879) 3.15 (0.0262) 1.81 (0.00383) 1.97 (0.00409) 2.38 (0.00933) 4 1.49 (0.00506) 1.45 (0.00514) 2.57 (0.0247) 1.51 (0.00354) 1.66 (0.00369) 1.96 (0.00654) 5 1.24 (0.00359) 1.2 (0.00369) 2.04 (0.0224) 1.27 (0.00314) 1.42 (0.0034) 1.67 (0.0057) 6 1.04 (0.00268) 0.998 (0.00284) 1.58 (0.0181) 1.07 (0.00288) 1.22 (0.00314) 1.44 (0.00524) 7 0.866 (0.0023) 0.828 (0.00247) 1.25 (0.0144) 0.898 (0.00264) 1.05 (0.00294) 1.24 (0.00478) 8 0.722 (0.00221) 0.684 (0.00235) 0.971 (0.0099) 0.757 (0.00266) 0.916 (0.00292) 1.07 (0.00453) 9 0.6 (0.00232) 0.562 (0.00246) 0.771 (0.00617) 0.643 (0.00282) 0.807 (0.00305) 0.912 (0.00422) 10 0.509 (0.0026) 0.47 (0.00272) 0.634 (0.0041) 0.555 (0.00288) 0.722 (0.00315) 0.777 (0.00394) 11 0.449 (0.00265) 0.411 (0.00277) 0.533 (0.00329) 0.49 (0.00288) 0.658 (0.00316) 0.662 (0.00367) Table A.5: Table ofp-values from pairedt-tests of the alternative hypothesis that weighed averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1. SA CSS SS CRL Protolasso Lasso 1 2 2.03e-08 4.42e-07 2.1e-08 2.69e-20 8.61e-16 3 6.28e-47 4.19e-47 2.51e-15 2.95e-72 5.94e-77 4 1.14e-136 3.17e-106 1.43e-33 6.19e-203 6.9e-243 5 2.99e-190 1.97e-133 5.91e-58 0 0 6 2.85e-250 1.59e-127 3.88e-101 0 0 7 1.71e-279 1.58e-128 7.6e-124 0 0 8 2.08e-306 2.79e-145 1.18e-147 0 0 9 6.42e-323 6.37e-184 2.27e-163 0 0 10 0 6.99e-241 4.87e-186 0 0 11 0 2.23e-233 5.66e-163 0 0 150 Table A.6: Table ofp-values from pairedt-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1. WA CSS SS CRL Protolasso Lasso 1 2 1 4.64e-07 0.757 5.13e-19 4.62e-15 3 1 1.52e-45 0.000972 1.67e-57 3.16e-72 4 1 3.19e-101 1.17e-05 5.28e-157 4.34e-220 5 1 1.97e-124 4.44e-11 1.81e-262 0 6 1 1.81e-113 6.9e-25 0 0 7 1 4.39e-108 2.84e-33 0 0 8 1 4.77e-111 1.35e-44 0 0 9 1 1.26e-128 1.77e-57 0 0 10 1 1.08e-151 1.38e-70 0 0 11 1 3.82e-117 1.11e-55 0 0 Table A.7: Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.1. 
SA CSS WA CSS SS CRL Protolasso Lasso 1 NA (NA) NA (NA) 1 (0) 0.738 (0.0044) 0.308 (0.00229) 0.307 (0.00233) 2 0.834 (0.00585) 0.834 (0.00585) 0.691 (0.019) 0.767 (0.00259) 0.513 (0.00672) 0.486 (0.0031) 3 0.76 (0.00577) 0.76 (0.00579) 0.661 (0.00845) 0.723 (0.00184) 0.49 (0.00407) 0.641 (0.00414) 4 0.738 (0.00323) 0.738 (0.00323) 0.63 (0.00536) 0.697 (0.00183) 0.486 (0.0035) 0.696 (0.00447) 5 0.724 (0.00232) 0.725 (0.00232) 0.622 (0.00373) 0.683 (0.00173) 0.489 (0.00302) 0.686 (0.00407) 6 0.72 (0.00198) 0.72 (0.00199) 0.628 (0.00296) 0.675 (0.00166) 0.496 (0.00271) 0.672 (0.00358) 7 0.723 (0.00171) 0.723 (0.00171) 0.638 (0.00258) 0.673 (0.00158) 0.508 (0.00239) 0.664 (0.00314) 8 0.735 (0.00167) 0.735 (0.00167) 0.661 (0.00227) 0.676 (0.00149) 0.526 (0.00221) 0.664 (0.00286) 9 0.742 (0.00175) 0.743 (0.00175) 0.678 (0.00232) 0.686 (0.00148) 0.547 (0.00201) 0.667 (0.00256) 10 0.748 (0.00196) 0.748 (0.00196) 0.695 (0.00242) 0.694 (0.00159) 0.562 (0.00197) 0.673 (0.00236) 11 0.741 (0.00222) 0.741 (0.00222) 0.705 (0.0027) 0.698 (0.00177) 0.576 (0.00206) 0.681 (0.00225) 151 Table A.8: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1. SA CSS SS CRL Protolasso Lasso 1 2 0.5 3.66e-13 4.27e-26 4.54e-285 0 3 0.487 1.74e-22 5.92e-10 0 1.14e-62 4 0.487 2.55e-67 3.78e-28 0 8.91e-15 5 0.479 8.9e-122 7.26e-47 0 1.34e-16 6 0.482 3.68e-147 2e-68 0 1.07e-31 7 0.483 1.95e-164 1.22e-103 0 3.21e-61 8 0.489 3.88e-152 8.22e-151 0 3.49e-103 9 0.489 3.68e-110 1.71e-134 0 6.05e-133 10 0.492 7.15e-65 5.69e-100 0 5.59e-130 11 0.495 3.43e-25 1.48e-52 0 3.86e-80 Table A.9: Table of p-values from one-tailed Z-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.1. WA CSS SS CRL Protolasso Lasso 1 2 0.5 3.66e-13 4.27e-26 4.54e-285 0 3 0.513 2.01e-22 6.89e-10 0 1.13e-62 4 0.513 3.47e-67 5.21e-28 0 1.07e-14 5 0.521 2.01e-121 1.56e-46 0 1.8e-16 6 0.518 7.57e-147 3.98e-68 0 1.49e-31 7 0.517 4.86e-164 3.2e-103 0 5.19e-61 8 0.511 7.44e-152 1.88e-150 0 5.48e-103 9 0.511 6.44e-110 3.77e-134 0 1.08e-132 10 0.508 9.45e-65 8.54e-100 0 8.23e-130 11 0.505 3.82e-25 1.76e-52 0 4.65e-80 152 Table A.10: Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.1. 
SA CSS WA CSS SS CRL Protolasso Lasso 1 NA (NA) NA (NA) 1 (0) 0.954 (0.00516) 0.308 (0.00229) 0.307 (0.00233) 2 1 (0) 0.989 (0.00446) 0.691 (0.019) 0.978 (0.00128) 0.555 (0.00595) 0.486 (0.0031) 3 0.946 (0.00306) 0.941 (0.00329) 0.661 (0.00845) 0.936 (0.00117) 0.557 (0.00401) 0.641 (0.00414) 4 0.916 (0.00196) 0.913 (0.00203) 0.63 (0.00536) 0.905 (0.00123) 0.565 (0.00337) 0.696 (0.00447) 5 0.897 (0.00155) 0.895 (0.00161) 0.622 (0.00373) 0.884 (0.00121) 0.58 (0.00289) 0.686 (0.00407) 6 0.884 (0.0013) 0.882 (0.00135) 0.628 (0.00296) 0.87 (0.00119) 0.599 (0.00254) 0.672 (0.00358) 7 0.878 (0.00114) 0.877 (0.00117) 0.638 (0.00258) 0.862 (0.00115) 0.622 (0.00228) 0.664 (0.00314) 8 0.877 (0.00112) 0.876 (0.00114) 0.661 (0.00227) 0.858 (0.00124) 0.643 (0.00229) 0.664 (0.00286) 9 0.88 (0.00129) 0.879 (0.00131) 0.678 (0.00232) 0.855 (0.00148) 0.654 (0.00254) 0.667 (0.00256) 10 0.877 (0.0017) 0.876 (0.00172) 0.695 (0.00242) 0.847 (0.00171) 0.657 (0.00281) 0.673 (0.00236) 11 0.862 (0.0019) 0.861 (0.00192) 0.705 (0.0027) 0.833 (0.00186) 0.65 (0.00292) 0.681 (0.00225) Table A.11: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1. SA CSS SS CRL Protolasso Lasso 1 2 0.994 1.69e-52 0.0119 0 0 3 0.824 2.14e-210 0.0782 0 0 4 0.795 0 0.000353 0 0 5 0.755 0 8.12e-09 0 0 6 0.769 0 8.23e-12 0 0 7 0.789 0 3.43e-19 0 0 8 0.76 0 5.2e-26 0 0 9 0.728 0 1.87e-33 0 0 10 0.664 0 4.7e-33 0 0 11 0.639 0 1.56e-26 0 0 153 Table A.12: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.1. WA CSS SS CRL Protolasso Lasso 1 2 0.0062 2.52e-59 8.88e-65 0 0 3 0.176 1.82e-220 0.00269 0 0 4 0.205 0 3.61e-06 0 0 5 0.245 0 2.56e-11 0 0 6 0.231 0 9.5e-15 0 0 7 0.211 0 5.42e-23 0 0 8 0.24 0 8.01e-30 0 0 9 0.272 0 6.15e-37 0 0 10 0.336 0 1.15e-35 0 0 11 0.361 0 1.6e-28 0 0 1 2 3 4 2 4 6 8 10 12 14 16 No. Fitted Coefficients MSE Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.9: Additional results from the simulation study in Section 2.6.2 for average test set MSE against model size (measured by number of tted coecients) across all 2000 simulations with estimated clusters. Note that this plot is identical to the left plot of Figure 2.5 with additional methods included. 154 0.00 0.25 0.50 0.75 1.00 2 4 6 8 10 12 14 16 No. Fitted Coefficients NSB Stability Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.10: Additional results from the simulation study in Section 2.6.2 for stability against model size (measured by number of tted coecients) across all 2000 simulations with estimated clusters. Note that this plot is identical to the center plot of Figure 2.5 with additional methods included. 155 1 2 3 4 0.00 0.25 0.50 0.75 1.00 NSB Stability MSE Method Cluster Rep. Lasso (est. clusts) Elastic Net Lasso Protolasso (est. clusts) Simple Averaged CSS (est. clusts) Sparse CSS (est. 
clusts) SS (Elastic Net) Stability Selection Weighted Averaged CSS (est. clusts) Figure A.11: Additional results from the simulation study in Section 2.6.2 plotting average test set MSE against model size with estimated clusters. Note that this plot is identical to the right plot of Figure 2.5 with additional methods included. 156 1 2 3 4 2 4 6 8 10 12 14 16 No. Fitted Coefficients MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.12: Additional results from the simulation study in Section 2.6.2 for average test set MSE against model size (measured by number of tted coecients) across all 2000 simulations. 157 0.25 0.50 0.75 1.00 2 4 6 8 10 12 14 16 No. Fitted Coefficients NSB Stability Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.13: Additional results from the simulation study in Section 2.6.2 for stability against model size (measured by number of tted coecients) across all 2000 simulations. 158 1 2 3 4 0.25 0.50 0.75 1.00 NSB Stability MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.14: Additional results from the simulation study in Section 2.6.2 plotting average test set MSE against model size. 159 Table A.13: Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.2. SA CSS WA CSS SS CRL Protolasso Lasso 1 3.29 (0.169) 3.31 (0.183) 4.58 (0.0272) 3.43 (0.00453) 3.76 (0.00454) 3.76 (0.00459) 2 2.43 (0.0129) 2.44 (0.0129) 4.06 (0.0225) 2.49 (0.00505) 2.83 (0.00547) 3.36 (0.00824) 3 2.01 (0.00815) 2.02 (0.00815) 3.63 (0.0201) 2.05 (0.00436) 2.38 (0.00461) 2.9 (0.0105) 4 1.7 (0.00508) 1.71 (0.00508) 3.19 (0.0216) 1.75 (0.00392) 2.08 (0.00417) 2.5 (0.00882) 5 1.48 (0.004) 1.49 (0.00401) 2.73 (0.0226) 1.53 (0.00376) 1.86 (0.00399) 2.16 (0.00693) 6 1.29 (0.00344) 1.3 (0.00346) 2.27 (0.0214) 1.33 (0.00354) 1.67 (0.00383) 1.89 (0.00558) 7 1.13 (0.00303) 1.14 (0.00304) 1.9 (0.0194) 1.18 (0.00346) 1.52 (0.00376) 1.67 (0.00502) 8 0.993 (0.00291) 1 (0.00293) 1.57 (0.0161) 1.03 (0.00319) 1.38 (0.00362) 1.49 (0.00448) 9 0.87 (0.00283) 0.879 (0.00284) 1.34 (0.0133) 0.91 (0.00315) 1.26 (0.00356) 1.32 (0.00415) 10 0.762 (0.00276) 0.771 (0.00277) 1.14 (0.01) 0.807 (0.00311) 1.16 (0.00346) 1.18 (0.00388) 11 0.672 (0.00271) 0.681 (0.00272) 0.966 (0.0067) 0.72 (0.00309) 1.07 (0.0035) 1.05 (0.0037) 12 0.595 (0.00281) 0.604 (0.00283) 0.856 (0.0055) 0.638 (0.00305) 0.998 (0.00345) 0.942 (0.00344) 13 0.531 (0.00295) 0.54 (0.00296) 0.758 (0.00417) 0.571 (0.00301) 0.932 (0.00344) 0.845 (0.00332) 14 0.483 (0.00306) 0.492 (0.00307) 0.676 (0.00358) 0.512 (0.00302) 0.877 (0.00337) 0.758 (0.00324) 15 0.437 (0.00303) 0.446 (0.00304) 0.61 (0.00332) 0.468 (0.00301) 0.829 (0.00331) 0.681 (0.00317) 16 0.404 (0.00303) 0.413 (0.00303) 0.559 (0.00348) 0.426 (0.00289) 0.789 (0.00328) 0.614 (0.00312) Table A.14: Table ofp-values from pairedt-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. 
WA CSS SS CRL Protolasso Lasso 1 2 4.02e-14 1.38e-10 0.662 1.19e-45 2.3e-38 3 8.18e-53 1.4e-91 0.00226 1.17e-164 1.06e-134 4 1.5e-114 1.49e-184 1.92e-08 0 1.46e-312 5 8.72e-165 9.19e-223 3.59e-25 0 0 6 1.54e-204 1e-207 1.05e-26 0 0 7 1.62e-219 1.43e-186 3.32e-41 0 0 8 1.52e-243 2.36e-177 2.06e-38 0 0 9 1.96e-257 9.72e-179 1.63e-47 0 0 10 1.63e-265 5.58e-198 6.23e-54 0 0 11 3.05e-278 1.43e-281 1.65e-62 0 0 12 1.35e-284 6.17e-301 5.68e-58 0 0 13 6.56e-284 0 1.43e-51 0 0 14 1.04e-294 0 4.07e-30 0 0 15 2.68e-288 0 2.96e-29 0 0 16 9.02e-290 2e-301 6.51e-19 0 0 160 Table A.15: Table ofp-values from pairedt-tests of the alternative hypothesis that weighed averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. SA CSS SS CRL Protolasso Lasso 1 2 1 1.54e-10 1 3.69e-45 3.92e-38 3 1 6.29e-91 0.0676 6.63e-162 8.25e-133 4 1 7.76e-183 0.000289 0 6.97e-308 5 1 3.06e-220 3.8e-16 0 0 6 1 3.38e-204 2.44e-15 0 0 7 1 1.7e-182 5.59e-27 0 0 8 1 2.32e-172 4.1e-23 0 0 9 1 1.09e-172 3.86e-29 0 0 10 1 4.9e-190 3.33e-34 0 0 11 1 5.96e-269 5.54e-42 0 0 12 1 9.27e-286 2.22e-36 0 0 13 1 0 5.96e-32 0 0 14 1 0 2.25e-15 0 0 15 1 0 5.44e-14 0 0 16 1 6.99e-279 6.39e-07 0 0 Table A.16: Means and standard errors of the mean squared errors of each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.2. SA CSS WA CSS SS CRL Protolasso Lasso 1 NaN (NA) NaN (NA) 4.58 (0.0272) 3.43 (0.00453) 3.76 (0.00454) 3.76 (0.00459) 2 2.44 (0.0133) 2.44 (0.0135) 4.06 (0.0225) 2.49 (0.00505) 2.83 (0.00547) 3.36 (0.00824) 3 2.01 (0.00754) 2.02 (0.00758) 3.63 (0.0201) 2.05 (0.00436) 2.38 (0.00461) 2.9 (0.0105) 4 1.71 (0.00527) 1.72 (0.00528) 3.19 (0.0216) 1.75 (0.00392) 2.08 (0.00417) 2.5 (0.00882) 5 1.48 (0.00388) 1.49 (0.0039) 2.73 (0.0226) 1.53 (0.00376) 1.86 (0.00399) 2.16 (0.00693) 6 1.29 (0.00347) 1.3 (0.00348) 2.27 (0.0214) 1.33 (0.00354) 1.67 (0.00383) 1.89 (0.00558) 7 1.13 (0.00311) 1.14 (0.00313) 1.9 (0.0194) 1.18 (0.00346) 1.52 (0.00376) 1.67 (0.00502) 8 0.99 (0.00289) 1 (0.00291) 1.57 (0.0161) 1.03 (0.00319) 1.38 (0.00362) 1.49 (0.00448) 9 0.869 (0.00284) 0.878 (0.00285) 1.34 (0.0133) 0.91 (0.00315) 1.26 (0.00356) 1.32 (0.00415) 10 0.763 (0.0028) 0.772 (0.00281) 1.14 (0.01) 0.807 (0.00311) 1.16 (0.00346) 1.18 (0.00388) 11 0.673 (0.00279) 0.682 (0.00279) 0.966 (0.0067) 0.72 (0.00309) 1.07 (0.0035) 1.05 (0.0037) 12 0.596 (0.00285) 0.606 (0.00286) 0.856 (0.0055) 0.638 (0.00305) 0.998 (0.00345) 0.942 (0.00344) 13 0.532 (0.00298) 0.541 (0.00299) 0.758 (0.00417) 0.571 (0.00301) 0.932 (0.00344) 0.845 (0.00332) 14 0.477 (0.00294) 0.486 (0.00295) 0.676 (0.00358) 0.512 (0.00302) 0.877 (0.00337) 0.758 (0.00324) 15 0.439 (0.00306) 0.448 (0.00307) 0.61 (0.00332) 0.468 (0.00301) 0.829 (0.00331) 0.681 (0.00317) 16 0.407 (0.00309) 0.415 (0.0031) 0.559 (0.00348) 0.426 (0.00289) 0.789 (0.00328) 0.614 (0.00312) 161 Table A.17: Table ofp-values from pairedt-tests of the alternative hypothesis that simple averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2. 
WA CSS SS CRL Protolasso Lasso 1 2 7.13e-10 1.53e-15 0.0325 7.53e-61 9.97e-43 3 3.7e-47 1.1e-101 0.0104 5.96e-165 6.88e-140 4 1.41e-111 5.34e-184 1.46e-08 0 1.67e-307 5 1.37e-162 1.26e-213 3.22e-19 0 0 6 7.29e-194 1.11e-204 4.23e-26 0 0 7 1.77e-219 1.32e-185 7.69e-37 0 0 8 1.22e-237 1.18e-177 1.61e-46 0 0 9 1.48e-249 1.92e-179 4.13e-49 0 0 10 1.01e-263 2.08e-199 2.51e-56 0 0 11 4.23e-272 3.2e-254 9.7e-64 0 0 12 5.05e-277 3.51e-317 1.1e-52 0 0 13 6.25e-279 0 1.15e-47 0 0 14 8.01e-286 0 9.91e-38 0 0 15 2.24e-291 0 1.05e-25 0 0 16 7.64e-280 0 1.09e-15 0 0 Table A.18: Table ofp-values from pairedt-tests of the alternative hypothesis that weighed averaged cluster stability selection has a lower MSE than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2. SA CSS SS CRL Protolasso Lasso 1 2 1 1.91e-15 1 1.07e-61 1.66e-42 3 1 5.42e-101 0.188 5.96e-161 6.58e-138 4 1 3.21e-182 0.000139 0 1.29e-302 5 1 4.7e-211 7.22e-11 0 0 6 1 2.93e-201 4.77e-15 0 0 7 1 1.76e-181 3.9e-23 0 0 8 1 9.76e-173 4.54e-29 0 0 9 1 1.81e-173 6.58e-31 0 0 10 1 2.57e-191 1.41e-36 0 0 11 1 2.02e-242 3.3e-42 0 0 12 1 3.48e-301 3.43e-32 0 0 13 1 0 5.4e-29 0 0 14 1 0 1.68e-20 0 0 15 1 0 7.73e-12 0 0 16 1 7.57e-317 3.36e-05 0 0 162 Table A.19: Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are estimated in the simulation study in Section 2.6.2. Note that both simple averaged cluster stability selection and weighted averaged cluster stability selection had perfect stability in models of size 1 and 2 (that is, both methods selected the exact same models of sizes 1 and 2 for every one of the 2000 simulations). SA CSS WA CSS SS CRL Protolasso Lasso 1 1 (0) 1 (0) 0.946 (0.0361) 0.984 (0.00313) 0.185 (0.00115) 0.185 (0.00117) 2 1 (0) 1 (0) 0.693 (0.015) 0.96 (0.00213) 0.468 (0.00626) 0.32 (0.00148) 3 0.915 (0.00403) 0.915 (0.00403) 0.65 (0.00778) 0.896 (0.00171) 0.489 (0.00383) 0.452 (0.00232) 4 0.877 (0.00284) 0.877 (0.00284) 0.61 (0.00525) 0.853 (0.00187) 0.503 (0.00354) 0.549 (0.00292) 5 0.841 (0.00225) 0.841 (0.00225) 0.592 (0.00415) 0.814 (0.00182) 0.503 (0.0031) 0.592 (0.00288) 6 0.815 (0.00198) 0.814 (0.00198) 0.583 (0.00351) 0.79 (0.00178) 0.515 (0.00291) 0.614 (0.00301) 7 0.797 (0.00183) 0.797 (0.00183) 0.573 (0.00299) 0.77 (0.00178) 0.521 (0.00269) 0.62 (0.00291) 8 0.783 (0.00173) 0.782 (0.00174) 0.575 (0.00263) 0.759 (0.00172) 0.531 (0.00257) 0.622 (0.00276) 9 0.775 (0.00169) 0.775 (0.00169) 0.576 (0.0024) 0.751 (0.00174) 0.539 (0.00252) 0.627 (0.00259) 10 0.771 (0.0017) 0.77 (0.0017) 0.585 (0.00218) 0.742 (0.00177) 0.544 (0.00243) 0.631 (0.00241) 11 0.764 (0.00174) 0.764 (0.00174) 0.597 (0.00209) 0.732 (0.00182) 0.549 (0.00253) 0.635 (0.0023) 12 0.757 (0.00188) 0.756 (0.00188) 0.603 (0.00207) 0.727 (0.0019) 0.547 (0.00252) 0.64 (0.00225) 13 0.747 (0.00211) 0.747 (0.00212) 0.613 (0.0021) 0.718 (0.00196) 0.546 (0.00258) 0.643 (0.00224) 14 0.731 (0.00225) 0.731 (0.00225) 0.622 (0.00227) 0.709 (0.00207) 0.54 (0.00256) 0.645 (0.00225) 15 0.718 (0.00232) 0.718 (0.00232) 0.627 (0.00241) 0.694 (0.00211) 0.533 (0.00255) 0.644 (0.0023) 16 0.698 (0.00235) 0.697 (0.00235) 0.626 (0.00246) 0.68 (0.00211) 0.524 (0.00255) 0.643 (0.00236) A.2.4 MoreDataApplicationResults See Figures A.15, A.16, and A.17 for larger versions of the plots from Figure 2.6 in the main paper that also include results from sparse cluster stability selection, the elastic net (with mixing parameter = 
0:5), and stability selection with the elastic net as the base procedure. Note that the new methods are not competitive in test set MSE or in stability (this is especially clear in Figure A.17, where we simultaneously account for stability and test set MSE). Table A.25 contains the average test set mean squared error of each method for each batch of model sizes across all 1000 simulations, and Table A.26 contains the stabilities of each method for all model sizes. 163 Table A.20: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. Thep-values comparing simple averaged cluster stability selection and weighted averaged cluster stability selection is undened because both meth- ods had perfect stability scores across all 2000 simulations; see Table A.19. As a result, the Z-statistic dened in Theorem 9 of Nogueira, Sechidis, and Brown [132] has the indeterminate value 0=0. WA CSS SS CRL Protolasso Lasso 1 NaN 0.069 2.21e-07 0 0 2 NaN 1.13e-93 7.05e-79 0 0 3 0.5 6.34e-201 4.28e-06 0 0 4 0.484 0 2.33e-13 0 0 5 0.5 0 4.66e-20 0 0 6 0.464 0 4.4e-20 0 0 7 0.465 0 1.45e-25 0 0 8 0.468 0 4.63e-22 0 0 9 0.47 0 1.13e-22 0 0 10 0.491 0 3.55e-32 0 0 11 0.475 0 2.01e-36 0 0 12 0.478 0 5.42e-29 0 0 13 0.481 0 6.26e-24 0 1.29e-250 14 0.483 5.86e-260 1.33e-13 0 8.82e-163 15 0.489 2.36e-162 1.14e-14 0 1.98e-112 16 0.484 1.45e-99 8.66e-09 0 2.21e-60 164 Table A.21: Table of p-values from one-tailed Z-tests of the alternative hypothesis that weighted aver- aged cluster stability selection has higher stability than each competitor method at each model size when the clusters are estimated in the simulation study in Section 2.6.2. Thep-values comparing simple aver- aged cluster stability selection and weighted averaged cluster stability selection is undened because both methods had perfect stability scores across all 2000 simulations; see Table A.19. As a result, theZ-statistic dened in Theorem 9 of Nogueira, Sechidis, and Brown [132] has the indeterminate value 0=0. SA CSS SS CRL Protolasso Lasso 1 NaN 0.069 2.21e-07 0 0 2 NaN 1.13e-93 7.05e-79 0 0 3 0.5 6.34e-201 4.28e-06 0 0 4 0.516 0 3.34e-13 0 0 5 0.5 0 4.66e-20 0 0 6 0.536 0 1.04e-19 0 0 7 0.535 0 3.65e-25 0 0 8 0.532 0 1.04e-21 0 0 9 0.53 0 2.38e-22 0 0 10 0.509 0 4.56e-32 0 0 11 0.525 0 4.83e-36 0 0 12 0.522 0 1.01e-28 0 0 13 0.519 0 1.08e-23 0 1.03e-249 14 0.517 3.99e-259 1.9e-13 0 3.75e-162 15 0.511 4.32e-162 1.4e-14 0 3.36e-112 16 0.516 2.83e-99 1.09e-08 0 3.85e-60 Table A.22: Stability metric from Nogueira, Sechidis, and Brown [132] with standard errors for each method and each model size across all 2000 simulations when clusters are known in the simulation study in Section 2.6.2. 
SA CSS WA CSS SS CRL Protolasso Lasso 1 NA (NA) NA (NA) 0.946 (0.0361) 0.984 (0.00313) 0.185 (0.00115) 0.185 (0.00117) 2 0.991 (0.00596) 0.991 (0.00596) 0.693 (0.015) 0.96 (0.00213) 0.468 (0.00626) 0.32 (0.00148) 3 0.916 (0.00404) 0.916 (0.00404) 0.65 (0.00778) 0.896 (0.00171) 0.489 (0.00383) 0.452 (0.00232) 4 0.874 (0.00291) 0.874 (0.00291) 0.61 (0.00525) 0.853 (0.00187) 0.503 (0.00354) 0.549 (0.00292) 5 0.838 (0.00216) 0.837 (0.00216) 0.592 (0.00415) 0.814 (0.00182) 0.503 (0.0031) 0.592 (0.00288) 6 0.813 (0.002) 0.813 (0.002) 0.583 (0.00351) 0.79 (0.00178) 0.515 (0.00291) 0.614 (0.00301) 7 0.795 (0.0018) 0.795 (0.0018) 0.573 (0.00299) 0.77 (0.00178) 0.521 (0.00269) 0.62 (0.00291) 8 0.784 (0.00175) 0.784 (0.00175) 0.575 (0.00263) 0.759 (0.00172) 0.531 (0.00257) 0.622 (0.00276) 9 0.775 (0.00173) 0.775 (0.00173) 0.576 (0.0024) 0.751 (0.00174) 0.539 (0.00252) 0.627 (0.00259) 10 0.77 (0.00173) 0.77 (0.00173) 0.585 (0.00218) 0.742 (0.00177) 0.544 (0.00243) 0.631 (0.00241) 11 0.763 (0.00177) 0.763 (0.00177) 0.597 (0.00209) 0.732 (0.00182) 0.549 (0.00253) 0.635 (0.0023) 12 0.757 (0.00191) 0.757 (0.0019) 0.603 (0.00207) 0.727 (0.0019) 0.547 (0.00252) 0.64 (0.00225) 13 0.747 (0.0021) 0.747 (0.0021) 0.613 (0.0021) 0.718 (0.00196) 0.546 (0.00258) 0.643 (0.00224) 14 0.736 (0.00219) 0.736 (0.00219) 0.622 (0.00227) 0.709 (0.00207) 0.54 (0.00256) 0.645 (0.00225) 15 0.716 (0.00232) 0.716 (0.00232) 0.627 (0.00241) 0.694 (0.00211) 0.533 (0.00255) 0.644 (0.0023) 16 0.696 (0.00239) 0.696 (0.00239) 0.626 (0.00246) 0.68 (0.00211) 0.524 (0.00255) 0.643 (0.00236) 165 Table A.23: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that simple averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2. WA CSS SS CRL Protolasso Lasso 1 2 0.5 1.05e-76 3.99e-07 0 0 3 0.5 1.31e-202 1.15e-06 0 0 4 0.5 0 2.97e-10 0 0 5 0.486 0 6.57e-17 0 0 6 0.488 0 5.5e-18 0 0 7 0.488 0 5.68e-22 0 0 8 0.489 0 5.33e-24 0 0 9 0.49 0 1.93e-22 0 0 10 0.5 0 2.69e-30 0 0 11 0.492 0 1e-34 0 0 12 0.493 0 8.65e-29 0 0 13 0.494 0 9.22e-25 0 1.98e-254 14 0.5 6.97e-290 1.35e-19 0 1.73e-185 15 0.495 1.99e-156 6.75e-13 0 2.13e-107 16 0.5 1.27e-93 2.07e-07 0 5.95e-56 Table A.24: Table ofp-values from one-tailedZ-tests of the alternative hypothesis that weighted averaged cluster stability selection has higher stability than each competitor method at each model size when the clusters are known in the simulation study in Section 2.6.2. SA CSS SS CRL Protolasso Lasso 1 2 0.5 1.05e-76 3.99e-07 0 0 3 0.5 1.31e-202 1.15e-06 0 0 4 0.5 0 2.97e-10 0 0 5 0.514 0 9.57e-17 0 0 6 0.512 0 7.49e-18 0 0 7 0.512 0 7.79e-22 0 0 8 0.511 0 7.02e-24 0 0 9 0.51 0 2.48e-22 0 0 10 0.5 0 2.69e-30 0 0 11 0.508 0 1.32e-34 0 0 12 0.507 0 1.05e-28 0 0 13 0.506 0 1.06e-24 0 2.58e-254 14 0.5 6.97e-290 1.35e-19 0 1.73e-185 15 0.505 2.83e-156 7.48e-13 0 2.87e-107 16 0.5 1.27e-93 2.07e-07 0 5.95e-56 166 0.033 0.035 0.037 0.039 0.041 5 10 15 20 25 30 35 40 45 50 55 60 No. Fitted Coefficients MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.15: Additional results from the data application in Section 2.6.3 for average test set MSE against model size (measured by number of tted coecients) across all 1000 random draws. Note that this plot is identical to the leftmost plot of Figure 2.6 with additional methods included. 
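The stability values reported for the data application (Table A.26 and the "NSB Stability" axes of Figures A.16 and A.17), like those in the simulation tables above, use the estimator of Nogueira, Sechidis, and Brown [132]. For reference, the sketch below gives a minimal implementation of that point estimate from a binary selection matrix; it is not the code used to produce the reported results, and it omits the standard errors and one-tailed Z-tests, which additionally require the asymptotic variance from Theorem 9 of that paper.

```r
# Minimal implementation of the stability point estimate of Nogueira, Sechidis,
# and Brown [132]. S is a binary matrix with one row per repetition and one
# column per feature; S[m, f] = 1 if feature f was selected on repetition m.
# (Not the code used for the reported results; standard errors and Z-tests,
# which need the asymptotic variance, are omitted.)
nsb_stability <- function(S) {
  M <- nrow(S); d <- ncol(S)
  p_hat <- colMeans(S)                       # selection frequency of each feature
  s2 <- (M / (M - 1)) * p_hat * (1 - p_hat)  # unbiased per-feature selection variances
  k_bar <- mean(rowSums(S))                  # average size of the selected set
  1 - mean(s2) / ((k_bar / d) * (1 - k_bar / d))
}

# Toy usage: 1000 repetitions over 50 features, each selecting 10 features at
# random from a pool of 20 plausible features.
set.seed(1)
S <- t(replicate(1000, { sel <- numeric(50); sel[sample(20, 10)] <- 1; sel }))
nsb_stability(S)  # values closer to 1 indicate more stable selection
```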
Table A.25: Means of the test set mean squared errors of each method and each batch of model sizes across all 1000 replications in the data application in Section 2.6.3. SA CSS WA CSS SS CRL Protolasso Lasso 1 - 5 0.0385 0.0379 0.0384 0.0373 0.0363 0.0365 6 - 10 0.0357 0.035 0.0361 0.0353 0.0347 0.0351 11 - 15 0.0345 0.0338 0.0351 0.0343 0.034 0.0345 16 - 20 0.0339 0.0334 0.0345 0.0338 0.0337 0.0341 21 - 25 0.0335 0.0332 0.0344 0.0336 0.0336 0.034 26 - 30 0.0334 0.0333 0.0343 0.0336 0.0337 0.034 31 - 35 0.0336 0.0336 0.0342 0.0338 0.034 0.034 36 - 40 0.0338 0.0339 0.0345 0.0342 0.0343 0.0341 41 - 45 0.0345 0.0369 0.0347 0.0346 0.0347 0.0342 46 - 50 0.0349 0.038 0.0349 0.0352 0.0352 0.0344 51 - 55 0.0355 0.0372 0.0351 0.0355 0.0357 0.0346 56 - 60 0.0367 0.0407 0.0354 0.0361 0.0362 0.0349 167 0.2 0.3 0.4 0.5 0.6 5 10 15 20 25 30 35 40 45 50 55 No. Fitted Coefficients NSB Stability Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.16: Additional results from the data application in Section 2.6.3 for stability against model size (measured by number of tted coecients) across all 1000 random draws. Note that this plot is identical to the center plot of Figure 2.6 with additional methods included. Table A.26: Stability metric from Nogueira, Sechidis, and Brown [132] for each method and each batch of model sizes across all 1000 replications in the data application in Section 2.6.3. SA CSS WA CSS SS CRL Protolasso Lasso 1 - 5 0.336 0.326 0.386 0.628 0.406 0.399 6 - 10 0.395 0.333 0.349 0.553 0.324 0.332 11 - 15 0.441 0.353 0.329 0.482 0.304 0.298 16 - 20 0.456 0.366 0.328 0.434 0.296 0.287 21 - 25 0.451 0.37 0.329 0.382 0.294 0.28 26 - 30 0.431 0.37 0.335 0.333 0.296 0.276 31 - 35 0.416 0.372 0.344 0.294 0.296 0.276 36 - 40 0.407 0.374 0.35 0.265 0.298 0.274 41 - 45 0.398 0.378 0.353 0.242 0.305 0.271 46 - 50 0.396 0.384 0.358 0.229 0.314 0.268 51 - 55 0.392 0.388 0.362 0.222 0.327 0.264 56 - 60 0.394 0.393 0.367 0.216 0.343 0.262 168 0.033 0.035 0.037 0.039 0.041 0.2 0.3 0.4 0.5 0.6 NSB Stability MSE Method Cluster Rep. Lasso Elastic Net Lasso Protolasso Simple Averaged CSS Sparse CSS SS (Elastic Net) Stability Selection Weighted Averaged CSS Figure A.17: Additional results from the data application in Section 2.6.3 plotting average test set MSE against model size. Note that this plot is identical to the rightmost plot of Figure 2.6 with additional methods included. 169 A.3 ProofofTheorem2.3.1 A.3.1 ProofofStatement(i) Note that 100 p n logn < 19 5 s 2 + 2 4c 2 (logn) 3=4 n 1=2 () 500 19 s 4c 2 2 + 2 < (logn) 5=4 () exp ( 1000 19 r c 2 2 + 2 4=5 ) <n: Becausec 2 < (e 1)=(8e 2 ) and this quantity is decreasing in 2 0, a sucient condition for this is n> 81> exp 8 < : 1000 19 r e 1 8e 2 2 ! 4=5 9 = ; ; so 10 2 (n) = 100 p n logn < 19 10 s 2 + 2 c 2 (logn) 3=4 n 1=2 holds for n 100. It remains to show that I(n) (1; 2) under our assumptions. It is clear that 1 + 10 2 (n)> 1. To see that the upper bound ofI(n) is less than 2, note that since (logn) 3=4 n 1=2 < r c 2 3:61 (5 + 2 ) = 10 19 r c 2 5 + 2 from (2.12), we have 19 10 s 2 + 2 c 2 (logn) 3=4 n 1=2 < s 2 + 2 5 + 2 < 1; so we have that Z 2 (1; 2) for alln satisfying the assumptions of Theorem 2.3.1. 170 A.3.2 ProofofStatement(ii) Our proof strategy will be to walk through the lasso path as decreases from1 and the rst two features enter to show that a few events are sucient forX 3 to be the second feature to enter the lasso path. 
We then show that one of these events holds with probability tending towards 1=2 and the rest hold with probability tending towards 1. Then the nal result will come from a union bound. Throughout this proof we will refer to the Karush-Kuhn-Tucker (KKT) conditions 1 nkX j k 2 X > j y 3 X `=1 X ` kX ` k 2 ^ ` () ! +s j = 0; 8j2 [p] (A.3.1) where ^ ` () is the lasso estimated coecient for feature` at and s j 2 8 > > > > < > > > > : n sgn ^ j () o ; ^ j ()6= 0 [1; 1]; ^ j () = 0 ; 8j2 [3]: Consider 1 = max j ( X > j y nkX j k 2 ) : (A.3.2) For 1 , ^ () = 0 is a solution to (2.10) because it satises (A.3.1) with s j = X T j y nkX j k 2 ; ands j 2 [1; 1] as long as 1 . The rst feature enters the active set for< 1 , and is the featurej attaining the maximum in (A.3.2). That is, the rst feature to enter the active set is the one with the largest absolute cosine similarity withy. Consider the cosine similarities (uncentered sample correlations) ^ R jy = X > j y kX j k 2 kyk 2 ; j2 [p]; ^ R 12 = X > 2 X 1 kX 2 k 2 kX 1 k 2 : (A.3.3) 171 Dene the events A 12 := n ^ R 1y > ^ R 2y o ; A 13 := n ^ R 1y > ^ R 3y o ; S 1 := n ^ R 1y > 0; ^ R 2y > 0; ^ R 3y > 0; ^ R 12 > 0 o ; and A 3j := n ^ R 3y > ^ R jy o ; j2f4;:::;pg: Note that underA 12 \A 13 \ S 1 T p j=4 A 3j , X 1 is the rst feature to enter the active set and s 1 = sgn(X > 1 y) = 1. Next we will consider the second feature to enter the active set. Denote by supp ^ () [3] the active set at. Let 2 be the rst (greatest)< 1 where supp ^ () 6= supp ^ ( 1 ) . At each knot in the lasso path, a feature may either enter or leave the active set. We show thatX 1 cannot leave the active set before another feature enters: LemmaA.3.1. Therstfeaturetoenterthelassopathcannotleaveatthesecondknot. (Thatis,atthesecond knot in the lasso path, a second feature enters the model with probability one.) (The proofs of all lemmas stated in the proof of this theorem are provided in Appendix A.4.2.) That is, at 2 ,X 1 is never removed from the active set; instead, eitherX 2 orX 3 enters. In particular, for X 2 andX 3 there exist knots (2) 2 and (3) 2 determined by the KKT conditions (A.3.1) such that the next feature to enter the active set is arg max i2f2;3g n (i) 2 : (i) 2 < 1 o [171]. Therefore to show thatX 3 enters the lasso path second, it is enough to show max j2f2;4;5;:::;pg n (j) 2 o < (3) 2 < 1 : (A.3.4) 172 We will calculate (j) 2 ,j2f2;:::;pg to determine which feature enters next. We will show in Lemma A.3.2 that on the event A 3 := n ^ R 3y ^ R 13 ^ R 1y 0 o ; we have (3) 2 = kyk 2 n ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 : (A.3.5) Forj2f2; 4; 5;:::;pg, dene the events E (j) 1 := ( ^ R 1y ^ R jy 1 ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) ; E (j) 2 := ( ^ R 1y + ^ R jy 1 + ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) : We show thatA 3 T j2f2;4;5;:::;pg E (j) 1 T j2f2;4;5;:::;pg E (j) 2 , along with the other events so far, is sucient to ensure thatX 3 enters the active set second: LemmaA.3.2. UndertheeventA 12 \A 13 \S 1 \A 3 T p j=4 A 3j T j2f2;4;5;:::;pg E (j) 1 T j2f2;4;5;:::;pg E (j) 2 , (A.3.4) and (A.3.5) hold, and the rst two features in the lasso selection path areX 1 followed byX 3 . We will be almost done if we can show thatA 12 occurs with probability tending towards 1=2 and the rest of the events occur with probability tending towards 1. It is straightforward thatP(A 12 ) = 1=2 for alln by exchangeability ofX 1 andX 2 . However, the eventE (2) 1 complicates our analysis. Specically, sinceE[ ^ R 1y ^ R 3y ] = 1y (n) 3y > 0 and asymptotically 1 ^ R 13 ! 
1 13 = 1 andE[1 ^ R 12 ] = 1 12 (n)! 0, forE (2) 1 to hold we need ^ R 1y ^ R 2y to vanish slowly enough inn. To sort this out, we dene A 12 (n) := n ^ R 1y ^ R 2y >(n) o and ~ E (2) 1 (n) := ( (n) 1 ^ R 12 > ^ R 1y ^ R 3y 1 ^ R 13 ) (A.3.6) 173 for some function : N! R ++ . Note thatA 12 (n)\ ~ E (2) 1 (n) impliesA 12 \E (2) 1 , so we can change our focus to bounding the probability of the event A 12 (n)\A 13 \S 1 \A 3 p \ j=4 A 3j \ ~ E (2) 1 (n) p \ j=4 E (j) 1 \ j2f2;4;:::;pg E (j) 2 : As discussed, if(n) goes to 0 too quickly, ~ E (2) 1 (n) could fail to hold with high probability. But asn!1, ^ R 1y ^ R 2y will concentrate around its expectation, 0, with high probability, so (n) will need to tend towards 0 quickly enough forA 12 (n) to hold with high probability, Soon we will concern ourselves with a good choice of(n), but for now, the following result allows us to boundP (A 12 (n)) for an arbitrary. PropositionA.3.3. Suppose 0 B B B B B B @ y i X i1 X i2 1 C C C C C C A N 0 B B B B B B @ 0; 2 6 6 6 6 6 6 4 yy 1y 2y 1y 11 12 2y 12 22 3 7 7 7 7 7 7 5 1 C C C C C C A ; 8i2 [n]; (A.3.7) are n i.i.d. draws with yy > 1y = 2y > 11 = 22 > 12 1. Lety := (y 1 ;:::;y n ) > and X j := (X 1j ;:::;X nj ) > ;j2 [3]. Assumen 100. Then for any> 0, P ^ R 1y ^ R 2y p n ~ + 1463 ~ + 14 3 yy n 1=2 ; (A.3.8) where ~ is dened in (A.4.10), the cosine similarities are as specied in (A.3.3), and is the distribution function for a standard Gaussian random variable. Proof. Provided in Appendix A.4.1. 174 To briey summarize how we prove Proposition A.3.3, note that ^ R 1y ^ R 2y is a well-behaved functional of the jointly Gaussian data which, when scaled by p n, converges to a normal distribution by the delta method. Our proof relies on a Berry-Esseen-type result for the delta method due to Pinelis and Molzon [140]. Because in our setting the correlation betweenX 1 andX 2 varies withn, in our setting 1=~ = O((n logn) 1=4 ), so examining the argument of () in (A.3.8) we see we will require a choice of =(n) that goes to 0 quickly enough that(n)n 3=4 (logn) 1=4 ! 0; thenP (A 12 (n))! 1=2. Meanwhile, all of the events besidesA 12 (n) can be shown to hold if the cosine similarities concentrate around their expectations due to the following lemma: LemmaA.3.4. Under the assumptions of Theorem 2.3.1, the following identities and inequalities hold: 12 (n)> 1 2 >(n); (A.3.9) 1y (n) = 2y (n)> 3y > 3(n)> 0; (A.3.10) 1y (n) 3y 2(n); and (A.3.11) 1y (n) 3y 19 5 (logn) 1=4 (n); (A.3.12) where (n) := s ( 2 Z + 1 + 2 ) logn 4c 2 n = 1 20 s 2 Z + 1 + 2 c 2 log(n) 2 (n); (A.3.13) wherec 2 2 0; e1 8e 2 is dened in (A.4.23). 175 Examining the denitions of eventsA 13 ,S 1 ,A 3 , ~ E 1 (), andE 2 , one can show that for a suitably chosen (n) (that is,(n) not vanishing too quickly), Lemma A.3.4 implies that all of these events will hold if the relevant cosine similarities are within(n) of their expectations. Denote this event by F n := n ^ R 1y 1y (n) <(n) o \ n ^ R 2y 2y (n) <(n) o \ n ^ R 3y 3y <(n) o \ n ^ R 12 12 (n) <(n) o \ n ^ R 13 <(n) o p \ j=4 n ^ R jy <(n) o p \ j=4 n ^ R 1j <(n) o : Later we prove thatF n holds with high probability under our assumptions using a concentration inequality on the cosine similarities from Sun et al. [168]. LemmaA.3.5. Under the assumptions of Theorem 2.3.1, for(n) as dened in (A.3.14) P(F n ) 1 12c 8 (logn) 3=2 n 1=4 : We also prove our claim thatA 13 ,S 1 ,A 3 , ~ E 1 (), andE 2 hold underF n . LemmaA.3.6. 
Under the assumptions of Theorem 2.3.1, F n 0 @ A 13 \S 1 \A 3 3 \ j=4 A 3j \ ~ E (2) 1 (n) p \ j=4 E (j) 1 \ j2f2;4;5;:::;pg E (j) 2 1 A for(n) dened in (A.3.14) below. It turns out that forF n ~ E 1 (n) to hold, we require(n) to be no less than (that is, go to 0 no faster than) (n) := 2 2 + 19 5 (logn) 1=4 (n)[(n) + 1 12 (n)]: (A.3.14) Due to the decaying variance of the noise onX 1 andX 2 dened in (2.9), under our assumptions 1 12 (n) tends to 0 at the same rate (up to log terms) as(n) (in practice, this is how we chose the form of 2 (n)). So 176 (n) as dened in (A.3.14) goes to 0 at a rate equal to 2 (n) (again, up to log terms). Since for(n) dened in (A.3.13) we have 2 (n) =O (logn=n), this denition of(n) also allows(n)n 3=4 (logn) 1=4 ! 0 as required earlier forP (A 12 (n))! 1=2. In particular, in the following lemma, we work out the bound onP(A 12 (n)) from Proposition A.3.3 for (n) as dened in (A.3.14). Lemma A.3.7. Under the assumptions of Theorem 2.3.1, there exists a nite constantc 6 > 0 such that for (n) as dened in (A.3.14), P (A 12 (n) c )< 1 2 +c 6 2 Z + 1 + 2 7=2 (logn) 3=2 n 1=4 : (It was at this point in the derivation process that we chose(n) to match the rates of convergence ofP (A 12 (n)) andP (F n ). If (n) were to vanish faster than our specied rate, (n) would also go to 0 faster, so n would not have to grow large as quickly forA 12 (n) to hold, which would make the rate of convergence ofP (A 12 (n)) faster. But this change would requiren to grow faster forF n to continue to hold with high probability, so the rate of convergence ofP (F n ) would be slower. Our choice of(n) balances these considerations, allowing for the fastest overall rate of convergence up to log terms.) Finally, using Lemma A.3.5, Lemma A.3.7, and the union bound, we bound the probability of the event from Lemma A.3.2: P A 12 (n)\A 13 \S 1 \A 3 \ ~ E 1 (n)\E 2 P (A 12 (n)\F n ) 1P (A 12 (n) c )P (F c n ) > 1 2 c 6 2 Z + 1 + 2 7=2 (logn) 3=2 n 1=4 12c 8 (logn) 3=2 n 1=4 1 2 c 3 2 Z + 1 + 2 7=2 (logn) 3=2 n 1=4 177 forc 3 :=c 6 + 12c 8 2 7=2 <c 6 + 2c 8 (since Z > 1). A.4 ProofsofSupportingResultsForTheorem2.3.1 Appendix A.4.1 contains the proof of Proposition A.3.3 and Appendix A.4.2 contains the proofs of the remaining lemmas stated in the proof of Theorem 2.3.1. These proofs require more supporting lemmas, the proofs of which (which are mostly technical, or just algebraic manipulations) are contained in Appendix A.6. A.4.1 ProofofPropositionA.3.3 Our approach will be to establish a Berry-Esseen-type nite sample bound for the delta method applied to ^ R 1y ^ R 2y considered as a nonlinear function of certain sample moments. In particular, we will apply Theorem 2.11 in Pinelis and Molzon [140], which we state here for completeness. TheoremA.4.1 (Theorem 2.11 in Pinelis and Molzon [140]). LetX be a Hilbert space and letg :X!R be a Borel-measurable functional. Suppose that kr 2 g(x)k op M 8x2X withkxk 2 < for some2 (0;1) and someM 2 (0;1); (A.4.1) wherekk op is the operator norm. Let V 1 ;V 2 ;:::;V n be i.i.d. zero-mean random vectors with ~ := EjL(V 1 )j 2 1=2 > 0 andEjL(V 1 )j 3 <1, whereL :X ! R is the linear functional that is the rst derivative ofg at the origin (the tangent plane). 
Then for allz2R, P p ng(V ) ~ z (z) C p n : 178 whereV :=n 1 P n i=1 V i , () is the distribution function of the standard normal distribution, and C :=k 0 +k 1 EjL(V 1 )j 3 EjL(V 1 )j 2 3=2 + 0 B @k 20 +k 21 EjL(V 1 )j 3 1=3 EjL(V 1 )j 2 1=2 1 C AEkV 1 k 2 + 0 B @k 30 +k 31 EjL(V 1 )j 3 1=3 EjL(V 1 )j 2 1=2 1 C A EkV 1 k 3 2=3 +k ; (A.4.2) where k 0 := 0:13925; k 1 := 2:33554; (k 20 ;k 21 ;k 30 ;k 31 ) := M 2~ 2 2 1=6 ; 2 + 2 2=3 n 1=6 ; (8=) 1=6 n 1=3 ; 2 n 1=2 ! ; and k := min 8 > < > : EkV 1 k 2 2 n 1=2 ; 2 EkV 1 k 2 3=2 +EkV 1 k 3 =n 1=2 3 n 9 > = > ; : (A.4.3) Remark A.4.2. To be precise, Pinelis and Molzon do not state smoothness condition (A.4.1) in Theorem 2.11, but rather the following smoothness condition: there exists a continuous linear functionalL :X ! R such that jg(x)L(x)j M 2 kxk 2 2 8x2X withkxk 2 : However, at the top of page 1007, the authors state that (A.4.1) andg(0) = 0 are sucient conditions for this smoothness condition to hold. We will begin by dening the relevant sample moments along with the nonlinear functiong and its linear approximationL that we will make use of. Let V i := X 2 i1 11 ;X 2 i2 22 ;y 2 i yy ;X i1 y i 1y ;X i2 y i 2y ; i2 [n] (A.4.4) 179 and V := 1 n n X i=1 V i = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 n 1 P n i=1 X 2 i1 11 n 1 P n i=1 X 2 i2 22 n 1 P n i=1 y 2 i yy n 1 P n i=1 X i1 y i 1y n 1 P n i=1 X i2 y i 2y 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 : We show the following inequalities hold: LemmaA.4.3. EkV 1 k 2 < 12 2 yy ; (A.4.5) and EkV 1 k 3 < 140 3 yy : (A.4.6) (The proofs of all lemmas are contained in Appendix A.6.) Dene g : ( 11 ;1) ( 22 ;1) ( yy ;1)R 2 !R as g(u) := u 4 + 1y p (u 1 + 11 )(u 3 + yy ) u 5 + 2y p (u 2 + 22 )(u 3 + yy ) 1y p 11 yy 2y p 22 yy ! (A.4.7) =) g(V ) = n 1 P n i=1 X i1 y i q n 1 P n i=1 X 2 i1 n 1 P n i=1 y 2 i n 1 P n i=1 X i2 y i q n 1 P n i=1 X 2 i2 n 1 P n i=1 y 2 i 1y p 11 yy 2y p 22 yy ! = ^ R 1y ^ R 2y ; 180 the dierence of the cosine similarities as dened in (A.3.3). Let :=EV = 02R 5 . We have rg(u) = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 u 4 + 1y p u 3 +yy 1 2 (u 1 + 11 ) 3=2 u 5 + 2y p u 3 +yy 1 2 (u 2 + 22 ) 3=2 1 2 u 5 + 2y p u 2 + 22 u 4 + 1y p u 1 + 11 (u 3 + yy ) 3=2 1 p (u 1 + 11 )(u 3 +yy ) 1 p (u 2 + 22 )(u 3 +yy ) 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 =) rg() = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 1 2 1y p yy 3 11 1 2 2y p yy 3 22 0 1 p 11 yy 1 p 22 yy 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 : (A.4.8) Next, the tangent plane tog at 0 is given by L(u) :=g(0) +rg(0) > u = 0 1 2 1y p yy 3 11 u 1 + 1 2 2y p yy 3 22 u 2 + 1 p 11 yy u 4 1 p 22 yy u 5 = 1 2 p yy 2y 3=2 22 u 2 1y 3=2 11 u 1 ! + 1 p yy u 4 p 11 u 5 p 22 ; (A.4.9) in particular, L(V 1 ) = 1 2 p yy 2y 3=2 22 X 2 12 22 1y 3=2 11 X 2 11 11 ! + 1 p yy X 11 y 1 1y p 11 X 12 y 1 2y p 22 ; 181 so L(EV 1 ) = 1 2 p yy 2y 3=2 22 X 2 12 22 1y 3=2 11 X 2 11 11 ! + 1 p yy X 11 y 1 1y p 11 X 12 y 1 2y p 22 = 1 2 p yy 2y 3=2 22 EX 2 12 22 1y 3=2 11 EX 2 11 11 ! + 1 p yy EX 11 y 1 1y p 11 EX 12 y 1 2y p 22 = 0: Later we will prove the following result: LemmaA.4.4. Under the assumptions of Proposition A.3.3, forL : (1;1) 3 R 2 !R dened in (A.4.9) andV dened in (A.4.4), ~ := q EjL(V 1 )j 2 = r (1 12 ) 3 2 1y + 2 1y 12 + 2 2 0; p 2 i ; (A.4.10) where 12 is the correlation (population cosine similarity) betweenX 1 andX 2 , and 1y is the correlation betweeny andX 1 . 
Further,EjL(V 1 )j 3 is nite (in particular,EjL(V 1 )j 3 16 p 2=), and EjL(V 1 )j 3 1=3 EjL(V 1 )j 2 1=2 3 r 8 : (A.4.11) In order to apply Theorem A.4.1, we must show thatg() satises (A.4.1); we show this is the case in the proof of the following lemma: Lemma A.4.5. Under the assumptions of Proposition A.3.3, the functionalg : ( 11 ;1) ( 22 ;1) ( yy ;1)R 2 !R as dened in (A.4.7) satises (A.4.1) with = 11 =2 andM := 36 yy : 182 Then Theorem A.4.1 yields that for allz2R P p n( ^ R 1y ^ R 2y ) ~ z ! (z) C p n =) P ^ R 1y ^ R 2y ~ z p n (z) + C p n () P ^ R 1y ^ R 2y p n ~ + C p n ; where := ~ z= p n may equal any real number sincez may equal any real number. We upper boundC in the following result: LemmaA.4.6. Under the assumptions of Proposition A.3.3, C < 3 yy 1462:717 ~ + 13:859 ; where ~ is dened in (A.4.10). This yields (A.3.8). A.4.2 ProofsofLemmasStatedinProofofTheorem2.3.1 Proof of Lemma A.3.1. Suppose (without loss of generality) that featureX 1 enters the lasso path rst. Dene cross 2 := X > 1 y nkX 1 k 2 s 1 : The featureX 1 would be removed from the active set beforeX 2 orX 3 were added to the active set under the event n (2) 2 _ (3) 2 < cross 2 < 1 o [171]. 183 But since featureX 1 entered rst, 1 = max j ( X > j y nkX j k 2 ) = X > 1 y nkX 1 k 2 = X > 1 y nkX 1 k 2 s 1 = cross 2 ; so the probability of this event is 0. Proof of Lemma A.3.2. We will rst verify that (A.3.5) holds under the specied events. Then we will use (A.3.5) to show that (A.3.4) holds as well. We will want to know the coecient ^ 1 () onX 1 in the part of the lasso path before the second feature enters the active set. On the eventA 12 \A 13 \S 1 T p j=4 A 3j , by the KKT conditions (A.3.1) for 2 [ 2 ; 1 ] it holds that s 1 = 1 nkX 1 k 2 X > 1 y X 1 kX 1 k 2 ^ 1 () () ^ 1 () =n X > 1 y nkX 1 k 2 =kyk 2 ^ R 1y n; (A.4.12) 184 and ^ j () = 0 forj6= 1. Now we are prepared to show (A.3.5). Using (A.3.1), for allj2f2;:::;pg, for s j 2f1; 1g the knot (j) 2 satises 0 = 1 nkX j k 2 X > j y X 1 kX 1 k 2 ^ 1 (j) 2 + (j) 2 s j (a) () (j) 2 = 1 nkX j k 2 X > j y X 1 kX 1 k 2 ^ 1 (j) 2 = 1 nkX j k 2 X > j y X 1 kX 1 k 2 h kyk 2 ^ R 1y n (j) 2 i = X > j y nkX j k 2 X > j X 1 nkX j k 2 kX 1 k 2 kyk 2 ^ R 1y + X > j X 1 kX j k 2 kX 1 k 2 (j) 2 = kyk 2 n ^ R jy ^ R 1j ^ R 1y + ^ R 1j (j) 2 ; where (a) follows becauses j 2f1; 1g takes on the value that ensures (j) 2 is positive and we used (A.4.12) and the cosine similarities from (A.3.3). So ifA 3 holds, we can write (3) 2 = kyk 2 n ^ R 3y ^ R 13 ^ R 1y + ^ R 13 (3) 2 (b) = kyk 2 n ^ R 3y ^ R 13 ^ R 1y + ^ R 13 (3) 2 () (3) 2 = kyk 2 n ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 ; which is (A.3.5), where we usedA 3 holding in (b). Now we seek to verify (A.3.4) on the assumed events. To see that the right side of (A.3.4) holds, note that (3) 2 < 1 () kyk 2 n ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 < kyk 2 n ^ R 1y () ^ R 3y ^ R 13 ^ R 1y < ^ R 1y 1 ^ R 13 () ^ R 3y < ^ R 1y ; 185 which is true underA 13 . Next we show the left inequality in (A.3.4). 
Observe that for anyj2f2; 4;:::;pg we can write (j) 2 = max s j 2f1;1g 8 < : n 1 kyk 2 ^ R jy ^ R 1j ^ R 1y s j ^ R 1j 9 = ; ; so under the assumed events we have that for anyj2f2; 4;:::;pg (j) 2 < (3) 2 () max s j 2f1;1g 8 < : n 1 kyk 2 ^ R jy ^ R 1j ^ R 1y s j ^ R 1j 9 = ; < kyk 2 n ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 () max s j 2f1;1g ( ^ R jy ^ R 1j ^ R 1y s j ^ R 1j ) < ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 : (A.4.13) We can write the argument of the left side of (A.4.13) as ^ R jy ^ R 1j ^ R 1y s j ^ R 1j = s j ^ R 1j ^ R 1y + ^ R jy s j ^ R 1y s j ^ R 1j = ^ R 1y + ^ R jy s j ^ R 1y s j ^ R 1j and similarly the right side is ^ R 3y ^ R 13 ^ R 1y 1 ^ R 13 = ^ R 1y + ^ R 3y ^ R 1y 1 ^ R 13 : 186 Using this, we have that on the eventA 12 \A 13 \S 1 \A 3 \S 1 T p j=4 A 3j , (A.3.4) is equivalent to (A.4.13) holding for everyj2f2; 4;:::;pg. We conclude by noting that the event (A.4.13) is equivalent to max s j 2f1;1g ( ^ R jy s j ^ R 1y s j ^ R 1j ) < ^ R 3y ^ R 1y 1 ^ R 13 () ( ^ R jy ^ R 1y 1 ^ R 1j < ^ R 3y ^ R 1y 1 ^ R 13 ) \ ( ^ R jy + ^ R 1y 1 ^ R 1j < ^ R 3y ^ R 1y 1 ^ R 13 ) () ( ^ R 1y ^ R jy 1 ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) \ ( ^ R 1y + ^ R jy 1 + ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) =E (j) 1 \E (j) 2 : Proof of Lemma A.3.4. We begin by stating a few results we will require. The proofs of these lemmas are provided in Appendix A.6. Remark A.4.7. The inequality max 2 12 + 2 ; 5 1 + 2 > 9 4 5 + 2 along with 2 Z < 4 (from Statement i) yields the following inequalities from (2.11) and (2.12) that we will make use of throughout the proof: n logn > 2 Z + 1 + 2 c 2 max 1 4t 2 0 (2 + 2 ) 2 ; 2 12 + 2 ; 5 1 + 2 ; 9 4 2 Z + 1 + 2 (A.4.14) and n (logn) 3=2 > 3:61 c 2 2 Z + 1 + 2 : (A.4.15) 187 LemmaA.4.8. Under the assumptions of Theorem 2.3.1, (n)< min ( 5 19 (logn) 1=4 ; 1 p 20 (1 + 2 ) ;t 0 2 Z + 1 + 2 ) ; (A.4.16) where(n)isdenedin (A.3.13). Thisimpliesthat 12(n) p 3 + 2 > 1 5 . Further,thefollowinginequalities hold: q 1 + 2 (n) 1< 20 r c 2 2 Z + 1 + 2 (n) logn ; and (A.4.17) 2 (n)< 1; (A.4.18) where 2 (n) is dened in (2.9). LemmaA.4.9. Under the assumptions of Theorem 2.3.1, 1< q 1 + 2 (n) 1 2(n) p 3 + 2 < Z < 5 5 19 (logn) 1=4 (n) : Lemma A.4.10. Under the assumptions of Theorem 2.3.1, the random variablesy i ;X i1 ;X i2 , andX 3i are distributed as 0 B B B B B B B B B B @ y i X i1 X i2 X 3i 1 C C C C C C C C C C A N 0 B B B B B B B B B B @ 0; 2 6 6 6 6 6 6 6 6 6 6 4 2 Z + 1 + 2 Z Z 1 Z 1 + 2 (n) 1 0 Z 1 1 + 2 (n) 0 1 0 0 1 3 7 7 7 7 7 7 7 7 7 7 5 1 C C C C C C C C C C A ; 8i2 [n]: (A.4.19) 188 It follows that the cosine similarity matrix is 0 B B B B B B B B B B @ yy 1y (n) 2y (n) 3y 1y (n) 11 12 (n) 13 2y (n) 12 (n) 22 23 3y 13 23 33 1 C C C C C C C C C C A = 0 B B B B B B B B B B B B B @ 1 Z q ( 2 Z +1+ 2 )(1+ 2 (n)) Z q ( 2 Z +1+ 2 )(1+ 2 (n)) 1 p 2 Z +1+ 2 Z q ( 2 Z +1+ 2 )(1+ 2 (n)) 1 1 1+ 2 (n) 0 Z q ( 2 Z +1+ 2 )(1+ 2 (n)) 1 1+ 2 (n) 1 0 1 p 2 Z +1+ 2 0 0 1 1 C C C C C C C C C C C C C A : (A.4.20) Observe that max i ( ii ) = 2 Z + 1 + 2 and min i ( ii ) = 1, where is the covariance matrix in (A.4.19). Lastly, Z > 1 + 2 (n). From (A.4.18) in Lemma A.4.8 and (A.4.20) in Lemma A.4.10 we have that 12 (n) = 1 1 + 2 (n) > 1 1 + 1 : Since we know from Lemma A.4.8 that(n)< 1=2, (A.3.9) is veried. Next, from (A.4.14) we have n logn > 1 c 2 9 4 2 Z + 1 + 2 2 () 1 2 Z + 1 + 2 > 9 ( 2 Z + 1 + 2 ) logn 4c 2 n = 9 2 (n) =) 3y > 3(n): 189 By (A.4.20) we have jy (n) = Z q 1+ 2 (n) 3y , j 2 [2]. Since from Lemma A.4.9 we know that Z > q 1+ 2 (n) 12(n) p 3+ 2 > q 1 + 2 (n), we have Z q 1+ 2 (n) > 1. 
Therefore 1y (n) = 2y (n) > 3y > 3(n) > 0, which is (A.3.10). Next we will show (A.3.11). From Lemma A.4.8 we have 1 2(n) p 3 + 2 > 0, so Z > q 1 + 2 (n) 1 2(n) p 3 + 2 q 1 + 2 (n) 1 2(n) q 2 + 2 (n) + 2 = q 1 + 2 (n) 1 + 2(n) q 2 + 2 (n) + 2 1 4 2 (n) 2 + 2 (n) + 2 q 1 + 2 (n) 1 + 2(n) r 1 + 2 (n) + (1 + 2 ) h 1 4 2 (n) 1 + 2 (n) i 1 4 2 (n) 1 + 2 (n) = 2 q 1 + 2 (n) + 4(n) r 1 + 2 (n) 1 + 2 (n) + (1 + 2 ) h 1 4 2 (n) 1 + 2 (n) i 2 h 1 4 2 (n) 1 + 2 (n) i = b + r 16 2 (n) 1 + 2 (n) 1 + 2 (n) + (1 + 2 ) h 1 4 2 (n) 1 + 2 (n) i 2a 190 wherea = 1 4 2 (n) 1 + 2 (n) andb =2 q 1 + 2 (n), and 16 2 (n) 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 4 2 (n) 1 + 2 (n) =16 2 (n) 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 + 2 4 2 (n) 1 + 2 (n) =4 1 + 2 (n) 4 2 (n) 1 + 2 (n) + 4 2 (n) 1 + 2 4 2 (n) 1 + 2 4 2 (n) 1 + 2 (n) =4 1 + 2 (n) 1 1 + 4 2 (n) 1 + 2 (n) + 4 2 (n) 1 + 2 1 4 2 (n) 1 + 2 (n) =4 1 + 2 (n) 1 1 4 2 (n) 1 + 2 (n) 1 4 2 (n) 1 + 2 =4 1 + 2 (n) 4 1 4 2 (n) 1 + 2 (n) 1 + 2 (n) 1 4 2 (n) 1 + 2 =b 2 4ac wherec = 1 + 2 (n) 1 4 2 (n) 1 + 2 . Since from Lemma A.4.9 we have Z > q 1+ 2 (n) 12(n) p 3+ 2 , it follows that Z > q 1+ 2 (n) 12(n) p 3+ 2 b+ p b 2 4ac 2a . Then we have Z b + p b 2 4ac 2a =) a 2 Z +b Z +c 0 () 2 Z 1 4 2 (n) 1 + 2 (n) 2 Z q 1 + 2 (n) + 1 + 2 (n) 1 4 2 (n) 1 + 2 0 () 2 Z + 1 + 2 (n) 2 Z q 1 + 2 (n) 4 2 (n) 2 Z + 1 + 2 1 + 2 (n) () Z q 1 + 2 (n) 2(n) r 2 Z + 1 + 2 1 + 2 (n) () Z r 2 Z + 1 + 2 1 + 2 (n) 1 q 2 Z + 1 + 2 2(n) () 1y (n) 3y 2(n); 191 yielding (A.3.11). Next we will show (A.3.12). From (A.4.16) in Lemma A.4.8 we have 1 19 5 (logn) 1=4 (n)> 0: Therefore we have Z < 5 5 19 (logn) 1=4 (n) < q 1 + 2 (n) 1 19 5 (logn) 1=4 (n) q 1 + 2 (n) = q 1 + 2 (n) 1 + 19 5 (logn) 1=4 (n) q 1 + 2 (n) h 1 19 5 (logn) 1=4 (n) q 1 + 2 (n) ih 1 + 19 5 (logn) 1=4 (n) q 1 + 2 (n) i q 1 + 2 (n) 1 + 19 5 (logn) 1=4 (n) 1 + 2 (n) + 1 + 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1=2 ! 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = 2 q 1 + 2 (n) + 2(n) 19 5 (logn) 1=4 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1=2 ! 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = ~ b + 15:2 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1=2 !, (2~ a) 192 where ~ a = 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) and ~ b =2 q 1 + 2 (n), and 15:2 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = 15:2 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1 + 2 (n) + 1 + 2 1 + 2 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = 4 1 + 2 (n) 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) + 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = 4 1 + 2 (n) 1 1 + 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) + 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) = 4 1 + 2 (n) 1 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 = 4 1 + 2 (n) 4 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 1 + 2 (n) 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 = ~ b 2 4~ a~ c 193 where ~ c = 1 + 2 (n) 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 . Since from Lemma A.4.9 we have Z < q 1+ 2 (n) 1 19 5 (logn) 1=4 (n) q 1+ 2 (n) , it follows that Z < ~ b+ p ~ b 2 4~ a~ c 2~ a . 
Then Z < ~ b + p ~ b 2 4~ a~ c 2~ a =) ~ a 2 Z + ~ b Z + ~ c< 0 () 2 Z 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 (n) 2 Z q 1 + 2 (n) + 1 + 2 (n) 1 3:8 h (logn) 1=4 i 2 2 (n) 1 + 2 < 0 () 2 Z + 1 + 2 (n) 2 Z q 1 + 2 (n)< 3:8 h (logn) 1=4 i 2 2 (n) 2 Z + 1 + 2 1 + 2 (n) (a) () Z q 1 + 2 (n)< 19 5 (logn) 1=4 (n) r 2 Z + 1 + 2 1 + 2 (n) (A.4.21) () Z r 2 Z + 1 + 2 1 + 2 (n) 1 q 2 Z + 1 + 2 < 19 5 (logn) 1=4 (n) () 1y (n) 3y < 19 5 (logn) 1=4 (n); where (a) follows because Z q 1 + 2 (n) from Lemma A.4.9, yielding (A.3.12). Proof of Lemma A.3.5. We will make use of the following concentration inequality: Lemma A.4.11 (Lemma D.3 from Sun et al. [168]). LetX = (X 1 ;X 2 ;:::;X d ) > be a zero-mean sub-Gaussian random vector with covariance with (i;j) th element ij . (That is, each X i = ii is sub- Gaussian with variance proxy 1.) Let n X (k) o n k=1 ben i.i.d. samples fromX. LetW 2 be a diagonal matrix with diagonal elements of , and letC :=W 1 W 1 be the cosine similarity matrix. Let ij be the (i;j) th element ofC . Consider the corresponding uncentered estimators: let ^ := n 1 P n k=1 X (k) X (k) > denote the sample covarianceand ^ C := ^ W 1 ^ ^ W 1 denotethecosinesimilarity(uncenteredsamplecorrelation)matrix,where 194 ^ W 2 is the diagonal matrix with diagonal elements of ^ . Let ^ ij be the (i;j) th element of ^ , and let ^ R ij be the (i;j) th element of ^ C. By Lemma D.1 in Sun et al. [168], there exist constantst 0 2 (0; 1] and ~ c 1 2 0; e 1 2e 2 max i f ii g such that for allt with 0tt 0 the sample covariance ^ with (i;j) th element ^ ij satises P ^ ij ij t 8 exp ~ c 1 nt 2 : Dene ~ c 2 := min 1 4 ~ c 1 min ( ii ) 2 ; 1 6 : Then for any2 0; min 1 2 ;t 0 max i ( ii ) and for anyi;j2 [d];i6=j, P ^ R ij ij > 6 exp ~ c 2 n 2 : From (A.4.16) we have 0<(n)< min ( 1 p 20 (1 + 2 ) ;t 0 ( 2 Z + 1 + 2 ) ) min 1 2 ;t 0 max i ( ii ) ; 195 so we can apply Lemma A.4.11 using =(n). Then P n ^ R 1y 1y (n) (n) o \ n ^ R 2y 2y (n) (n) o \ n ^ R 3y 3y (n) o \ n ^ R 12 12 (n) (n) o \ n ^ R 13 (n) o p \ j=4 n ^ R jy (n) o p \ j=4 n ^ R j1 (n) o ! = 1P n ^ R 1y 1y (n) >(n) o [ n ^ R 2y 2y (n) >(n) o [ n ^ R 3y 3y >(n) o [ n ^ R 12 12 (n) >(n) o [ n ^ R 13 >(n) o p [ j=4 n ^ R jy (n) o p [ j=4 n ^ R j1 (n) o ! 1P ^ R 1y 1y (n) >(n) P ^ R 2y 2y (n) >(n) P ^ R 3y 3y >(n) P ^ R 12 12 (n) >(n) P ^ R 13 >(n) p X j=4 P ^ R jy >(n) p X j=4 P ^ R j1 >(n) 1 (5 + 2[p 3]) 6 exp ~ c 2 n 2 (n) 1 12p exp ~ c 2 n 2 (n) 1 12c 8 (logn) 3=2 exp ~ c 2 n 2 (n) ; (A.4.22) where the last step follows from our assumption thatpc 8 (logn) 3=2 . Consider the expression ~ c 2 n 2 (n). From Lemma A.4.10 we see that min i ( ii ) = 1 and max i ( ii ) = 2 Z + 1 + 2 , so ~ c 1 2 0; e 1 2e 2 ( 2 Z + 1 + 2 ) and ~ c 2 = min ~ c 1 4 ; 1 6 = ~ c 1 4 2 0; e 1 8e 2 2 Z + 1 + 2 # 196 for alln2 N. In general we will be interested in how changes in Z and 2 aect our results, so rather than treating these as constants, we will dene c 1 := ~ c 1 2 Z + 1 + 2 2 0; e 1 2e 2 and c 2 := c 1 4 2 0; e 1 8e 2 : (A.4.23) Then we have exp ~ c 2 n 2 (n) = exp c 2 2 Z + 1 + 2 n ( 2 Z + 1 + 2 ) logn 4c 2 n = 1 n 1=4 : The result follows from substituting this into (A.4.22). Proof of Lemma A.3.6. We will prove the results one at a time. • First we will show thatF n A 13 . Note that A 13 = n ^ R 1y ^ R 3y > 0 o = n ^ R 1y 1y (n) ( ^ R 3y 3y ) + 1y (n) 3y > 0 o (a) n ^ R 1y 1y (n) ( ^ R 3y 3y ) + 2(n)> 0 o n ^ R 1y 1y (n) + ^ R 3y 3y < 2(n) o n ^ R 1y 1y (n) <(n) o \ n ^ R 3y 3y <(n) o F n ; where (a) follows from (A.3.11). 197 • Next we will show thatF n S 1 . 
S 1 = n ^ R 1y > 0 o \ n ^ R 2y > 0 o \ n ^ R 3y > 0 o \ n ^ R 12 > 0 o n j ^ R 1y 1y (n)j< 1y (n) o \ n j ^ R 2y 2y (n)j< 2y (n) o \ n j ^ R 3y 3y j< 3y o \ n j ^ R 12 12 (n)j< 12 (n) o (b) n j ^ R 1y 1y (n)j<(n) o \ n j ^ R 2y 2y (n)j<(n) o \ n j ^ R 3y 3y j<(n) o \ n j ^ R 12 12 (n)j<(n) o F n ; where (b) follows from by (A.3.9) and (A.3.10). • Next we show thatF n A 3 . A 3 = n ^ R 13 ^ R 1y ^ R 3y 0 o n j ^ R 13 jj ^ R 1y j ( ^ R 3y 3y ) 3y o n j ^ R 13 j + ( 3y ^ R 3y ) 3y o n j ^ R 13 j 3y =2 o \ n j ^ R 3y 3y j 3y =2 o (c) n j ^ R 13 j<(n) o \ n j ^ R 3y 3y j<(n) o F n ; where (c) follows from (A.3.10). • Now we showF n T p j=4 A 3j . For anyj2f4;:::;pg, we have that onF n ^ R 3y = ^ R 3y 3y + 3y () >(n) + 3(n)>(n)> ^ R jy ; 198 where () uses (A.3.10) and the fact that we are onF n and the last step also uses the fact that we are onF n . • Next we will showF n ~ E (2) 1 (n). We want to show that F n ( (n) 1 ^ R 12 > ^ R 1y ^ R 3y 1 ^ R 13 ) ; or, equivalently, F n n ( ^ R 1y ^ R 3y )(1 ^ R 12 ) +(n) ^ R 13 1 < 0 o for(n) dened in (A.3.14). Dene ~ (n) := 1 12 (n) = 2 (n) 1 + 2 (n) = 10 p n logn + 10 ; (A.4.24) where 2 (n) = 10= p n logn as in (2.9) and the expression for 12 (n) is calculated in Lemma A.4.10. Observe that onF n we have ( ^ R 1y ^ R 3y )(1 ^ R 12 ) +(n) ^ R 13 1 = ( ^ R 1y ^ R 3y )(1 12 (n) [ ^ R 12 12 (n)]) +(n) ^ R 13 1 () ^ R 1y ^ R 3y h ~ (n) +(n) i +(n) [(n) 1] (d) ^ R 1y 1y (n) + ^ R 3y 3y + 1y (n) 3y h ~ (n) +(n) i [1(n)](n) using (A.3.12) () 2 + 19 5 (logn) 1=4 (n)[ ~ (n) +(n)] [1(n)](n) (e) < 2 + 19 5 (logn) 1=4 (n)[ ~ (n) +(n)] 1 2 (n) (f) = 0 199 where (d) follows from the triangle inequality, (e) follows from(n)< 1=2 from Lemma A.4.8, (f) follows because from the denition of(n) in (A.3.14) (n) = 2 2 + 19 5 (logn) 1=4 (n)[(n) + ~ (n)] () 2 + 19 5 (logn) 1=4 (n)[(n) + ~ (n)] 1 2 (n) = 0; and the steps labeled with () use the fact that we are onF n . • Next we will showF n T p j=4 E (j) 1 . For anyj2f4;:::;pg, since E (j) 1 = ( ^ R 1y ^ R jy 1 ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) ; it is enough to show F n n ^ R 1y ^ R 3y 1 ^ R 1j + ^ R 1y ^ R jy ^ R 13 1 < 0 o : 200 Observe that onF n , ^ R 1y ^ R 3y 1 ^ R 1j + ^ R 1y ^ R jy ^ R 13 1 = ^ R 1y 1y (n) + 1y (n) h ^ R 3y 3y i 3y 1 ^ R 1j + ^ R 1y 1y (n) + 1y (n) ^ R jy ^ R 13 1 (a) < ((n) + 1y (n) +(n) 3y ) (1 +(n)) + ((n) + 1y (n) +(n)) ((n) 1) (b) (2(n) + 1 3(n)) (1 +(n)) + (2(n) + 1) ((n) 1) = (1(n)) (1 +(n)) (1(n)) (1 + 2(n)) = (1(n)) ((n)) < 0; where in (a) we used that we are onF n , in (b) we used (A.3.10) and 1y (n) 1, and in the last step we used 0<(n)< 1=2. • Now we will showF n E (2) 2 . We want to show that F n ( ^ R 1y + ^ R 2y 1 + ^ R 12 > ^ R 1y ^ R 3y 1 ^ R 13 ) or, equivalently, F n n ^ R 1y ^ R 3y 1 + ^ R 12 ^ R 1y + ^ R 2y 1 ^ R 13 < 0 o : 201 OnF n , ^ R 1y ^ R 3y 1 + ^ R 12 ^ R 1y + ^ R 2y 1 ^ R 13 = ^ R 3y 3y 3y + ^ R 1y 1y (n) ^ R 12 ^ R 3y 3y ^ R 12 + ( 1y (n) 3y ) ^ R 12 ^ R 2y 2y (n) 2y (n) + ^ R 1y + ^ R 2y ^ R 13 () (n) 3y + 2(n) ^ R 12 + ( 1y (n) 3y ) ^ R 12 +(n) 2y (n) + 2 ^ R 13 () 2(n) 3y + 2(n) + 1y (n) 3y 2y (n) + 2(n) = 6(n) 2 3y (g) < 0; where we used 1y (n) = 2y (n), (g) follows from (A.3.10), and the steps where we used the fact that we are onF n are labeled with (). • Finally we will showF n T p j=4 E (j) 2 . 
For anyj2f4;:::;pg, E (j) 1 = ( ^ R 1y + ^ R jy 1 + ^ R 1j > ^ R 1y ^ R 3y 1 ^ R 13 ) ; so it is enough to show that onF n ^ R 1y ^ R 3y 1 + ^ R 1j + ^ R 1y + ^ R jy ^ R 13 1 < 0: 202 Similarly to the above, onF n we have ^ R 1y ^ R 3y 1 + ^ R 1j + ^ R 1y + ^ R jy ^ R 13 1 = ^ R 1y + 1y (n) 1y (n) h ^ R 3y 3y i 3y 1 + ^ R 1j + ^ R 1y 1y (n) + 1y (n) + ^ R jy ^ R 13 1 (a) ((n) + 1y (n) +(n) 3y ) (1 +(n)) + ((n) + 1y (n) +(n)) ((n) 1) (b) (2(n) + 1 3(n)) (1 +(n)) + (2(n) + 1) ((n) 1) = (1(n)) (1 +(n)) (1 + 2(n)) (1(n)) = (1(n)) ((n)) < 0; where in (a) we used that we are onF n , in (b) we used (A.3.10) and 1y (n) 1, and in the last step we used 0<(n)< 1=2. Proof of Lemma A.3.7. We will apply Proposition A.3.3. We see from Lemma A.4.10 that all of the required assumptions on the covariance matrix for Proposition A.3.3 are satised. Note that ~ as dened in (A.4.10) varies with n in the setting of Theorem 2.3.1; in particular, substituting in the quantities from Lemma A.4.10 into (A.4.10) yields ~ (n) := r (1 12 (n)) 3 2 1y (n) + 2 1y (n) 12 (n) + 2 : (A.4.25) It only remains to substitute this expression,(n) from (A.3.14), and the cosine similarity and covariance expressions from Lemma A.4.10 into the conclusion of Proposition A.3.3. Some of these manipulations are tedious and we defer them to Appendix A.6. 203 LemmaA.4.12. Ifn 100, there exists a nite constantc 7 > 0 such that p n(n) ~ (n) < 1 2 +c 7 2 Z + 1 + 2 3=2 (1 + 2 ) 1=2 (logn) 3=2 n 1=4 (A.4.26) and 1 ~ (n) s 2 Z + 1 + 2 10(1 + 2 ) (n logn) 1=4 ; (A.4.27) where ~ (n) is dened in (A.4.25). Using this, we have (n) p n ~ (n) + 1463 ~ (n) + 14 3 yy n 1=2 < 1 2 +c 7 2 Z + 1 + 2 3=2 (1 + 2 ) 1=2 (logn) 3=2 n 1=4 + 0 @ 1463 p 10 s 2 Z + 1 + 2 1 + 2 (n logn) 1=4 + 14 1 A 2 Z + 1 + 2 3 n 1=2 < 1 2 +c 6 2 Z + 1 + 2 7=2 (logn) 3=2 n 1=4 forc 6 :=c 7 =2 2 + 1463= p 10 + 14= p 2. A.5 OtherResults In Section A.5.1 we prove Corollary 2.3.2, Theorem 2.5.1, and Theorem 2.5.3. We provide the statement of Theorem A.5.2 and outlines of proofs of Theorems 2.5.5 and A.5.2 in Section A.5.2. Finally, in Section A.5.3 we provide the proofs of Proposition A.1.2, Corollary A.1.3, and Proposition A.1.5. A.5.1 ProofsofCorollary2.3.2,Theorem2.5.1,andTheorem2.5.3 Proof of Corollary 2.3.2. We begin by presenting a lemma we will make use of. 204 LemmaA.5.1. Forn 200 and (n) dened in (2.14), (bn=2c)< 2(n). Proof. Provided later in this section. By linearity of expectation, in the Meinshausen and Bühlmann [127] stability selection algorithm E 2 6 4 n bn=2c 1 ( n bn=2c ) X b=1 1 n j2 ^ S (A b ) o 3 7 5 =P j2 ^ S bn=2c (A b ) 8j2 [3]: The Shah and Samworth [159] estimator has the same expectation. Therefore we only need to bound these selection probabilities, which we can do with Theorem 2.3.1. First we will upper-bound the probability of the eventE j := n j2 ^ S bn=2c (A b ) o forj2 [2] on any one lasso t from a subsample of stability selection for any between the second and third knots of the lasso path. (Lemma A.3.1 assures us that there will be two features in the selected set at this point.) We have 1 =P(fX 1 is selected rst, thenX 3 g) +P(fX 2 is selected rst, thenX 3 g) +P(felseg) so it follows from Theorem 2.3.1 that P(E 1 ) 1P(fX 2 is selected rst, thenX 3 g) 1 2 + 2 Z + 1 + 2 7=2 (bn=2c) < 1 2 + 2 2 Z + 1 + 2 7=2 (n) where in the last step we used Lemma A.5.1. By exchangeability, the same is true ofX 2 . 205 Next we will lower-bound the probability that 32 ^ S (A b ). 
Between the second and third knots of the lasso path, by Theorem 2.3.1 and Lemma A.5.1 we have P(32 ^ S bn=2c (A b ))P(fX 1 is selected rst, thenX 3 g) +P(fX 2 is selected rst, thenX 3 g) 2 1 2 2 Z + 1 + 2 7=2 (bn=2c) > 1 4 2 Z + 1 + 2 7=2 (n): Proof of Theorem 2.5.1. For a subsampleA of sizebn=2c, let E k := n C k \ ^ S(1)6=; o ; k2 [2]: By linearity of expectation, we see from (2.16) that E h ^ B (C k ) i = 1 2B B X b=1 E h 1 n C k \ ^ S (A b )6=; oi +E h 1 n C k \ ^ S A b 6=; oi = 1 2B B X b=1 P C k \ ^ S (A b )6=; +P C k \ ^ S A b 6=; =P (E k ): So we will need to examine the selection probabilities for features entering the lasso path rst or second, which we can do with Theorem 2.3.1. Dene the event A :=fX 1 orX 2 enters the lasso path rst andX 3 enters secondg: 206 Note that if the base feature selection procedure selects the rst feature to enter the lasso path, then E h ^ B (C 1 ) i P (A) andE h ^ B (C k ) i 1P (A) fork2f2;:::;p 1g. If the base feature selection procedure selects the rst two features to enter the lasso path, thenE h ^ B (C 1 ) i P (A),E h ^ B (C 2 ) i P (A), andE h ^ B (C k ) i 1P (A) fork2f3;:::;p 1g. From Theorem 2.3.1 we have P (A) 1 2c 3 2 Z + 1 + 2 7=2 (bn=2c)> 1 4 2 Z + 1 + 2 7=2 (n); where the last step follows from Lemma A.5.1, yielding the result. Proof of Theorem 2.5.3. Dene the event where the clusters are all correctly identied as S := n ( ^ C 1 ; ^ C 2 ; ^ C 3 ;:::; ^ C p1 ) = (C 1 ;C 2 ;C 3 ;:::;C p1 ) o : Since ^ B ^ C k is bounded between 0 and 1 for allk, we have ^ B ^ C k = ^ B (C k ) 1fSg + ^ B ^ C k 1fS c g ^ B 1fSg =) E h ^ B ^ C k i E h ^ B (C k ) i P (S); (A.5.1) 207 where the last step uses the independence of the data sets used for estimating the clusters and estimating cluster stability selection. Similarly, ^ B ^ C k = ^ B (C k ) 1fSg + ^ B ^ C k 1fS c g ^ B (C k ) + 1fS c g =) E h ^ B ^ C k i E h ^ B (C k ) i +P (S c ): (A.5.2) The result will therefore follow from Theorem 2.5.1 if we can bound P (S). It turns out that forS to hold it will be enough for the cosine similarities between the observed features to concentrate near their population values, and we can bound the probability of this happening using Lemma A.4.11. For > 0, consider the event ~ F n () := n ^ R 12 12 (n) o 2 \ j=1 p \ j 0 =3 n ^ R j;j 0 o : We will use Lemma A.4.11 to show that ~ F n 1 ( ) holds with high probability for < 1=4, and then we will show that ~ F n 1 ( )S to boundP (S c ). Using a similar argument as the one used in the proof of Lemma A.3.5, we have P ~ F n 1 () = 1P 0 @ n ^ R 12 12 (n 1 ) > o 2 [ j=1 p [ j 0 =3 n ^ R j;j 0 > o 1 A 1P ^ R 12 12 (n 1 ) > 2 X j=1 p X j 0 =3 P ^ R j;j 0 > (a) 1 6p 2 exp ~ c 2 n 1 2 = 1 6 exp 2 logp ~ c 2 2 n 1 (b) 1 6 exp 2 logc 8 + 3 2 log logn 2 ~ c 2 2 n 1 ; 208 for any minf1=2;t 0 Z g, where ~ c 2 = 1=6,t 0 2 (0; 1] is dened in the statement of Lemma A.4.11, in (a) we applied Lemma A.4.11 and in (b) we usedp c 8 (logbn=2c) 3=2 c 8 (logn) 3=2 . 
Therefore for 2 (0; minf1=4;t 0 Z g) we have P ~ F n 1 ( ) 1 6 exp 2 logc 8 + 3 log logn 2 ~ c 2 ( ) 2 n 1 1 6 exp 1 2 ~ c 2 ( ) 2 n 1 ~ c 2 ( ) 2 n 1 = 1 6 expfc 5 n 1 g; wherec 5 := 1 2 ~ c 2 ( ) 2 2 (0; 1=192) and in the second step we used that from (2.20) we have n 1 2 ~ c 2 ( ) 2 (2 logc 8 + 3 log logn 2 ) () 1 2 ~ c 2 ( ) 2 n 1 2 logc 8 + 3 log logn 2 : We also know from (A.3.9) that 12 (n 1 )> 1=2, so on ~ F n 1 ( ) it holds that ^ R 12 > 1 2 1 4 2 \ j=1 p \ j 0 =3 ^ R j;j 0 < 1 4 : 209 Therefore ~ F n 1 ( )S because under our clustering scheme,X 1 andX 2 are clustered together on ~ F n 1 ( ) andX 3 ;:::;X p remain in their own clusters. Applying these facts and Theorem 2.5.1 to (A.5.1) and (A.5.2) yields that if the base feature selection method selects the rst feature to enter the lasso path, E h ^ B (C 1 ) i E h ^ B (C 1 ) i P (S) E h ^ B (C 1 ) i P ~ F n 1 ( ) > 1 4 2 Z + 1 + 2 7=2 (n 2 ) (1 6 expfc 5 n 1 g) 1 4 2 Z + 1 + 2 7=2 (n 2 ) 6 expfc 5 n 1 g; and fork2f2;:::;p 1g, E h ^ B (C k ) i E h ^ B (C k ) i +P (S c )< 4 2 Z + 1 + 2 7=2 (n 2 ) + 6 expfc 5 n 1 g: The case where the base feature selection method selects the rst two features to enter the lasso path is similar. Proof of Lemma A.5.1. (bn=2c) =c 3 2 Z + 1 + 2 7=2 (logbn=2c) 3=2 (bn=2c) 1=4 <c 3 2 Z + 1 + 2 7=2 (logn) 3=2 (1=2)n 1=4 = 2c 3 2 Z + 1 + 2 7=2 (logn) 3=2 n 1=4 210 where in the second step we used that forn 200, (bn=2c) 1=4 = bn=2c n n 1=4 b201=2c 201 1=4 n 1=4 = 100 201 1=4 n 1=4 > 1 2 n 1=4 : A.5.2 StatementofTheoremA.5.2andOutlineofProofsofTheorems2.5.5andA.5.2 We begin this section by stating our theorem generalizing Theorem 2, Equation 7, and Equation 8 of Shah and Samworth [159] to our setting with clusters of features. TheoremA.5.2. Recall the setup of Theorem 2.5.5. 1. (Generalization of Theorem 2 of [159].) Dene ^ S ;C (A) for any set A [n] as ^ S ;C jAj applied to the observations inA. For anyC2C, dene the simultaneous selection proportion ~ B (C) := 1 B B X b=1 1 n C2 ^ S ;C (A b ) o 1 n C2 ^ S ;C A b o : Suppose ~ B (C k )hasaunimodaldistributionforeachC k 2L . Thenforany2f1=2 + 1=B; 1=2 + 3=(2B);:::; 1g there exists a functionC(;) such that E ^ S CSS;;C n; \L C(;B)E ^ S ;C bn=2c \L ; where, when 1= p 3, C(;B) = 8 > > > > < > > > > : 1 2[211=(2B)] ; 2 min 3 4 2 + 1 4B + 1 2 ; 2 + 1 2 ; 3 4 ; 4[1+1=(2B)] 1+1=B ; 2 3 4 ; 1 : 211 2. (Generalization of Equation 7 of [159].) Recall the denition of r-concavity from Denitions 3 and 4 Shah and Samworth [159]. Suppose ~ B (C k ) has an r-concave distribution for each C k 2 L . If 2f1=2 + 1=B; 1=2 + 3=(2B);:::; 1g, then there exists a functionD such that E ^ S CSS;;C n; \L D 2 ; 2 1;B;r jL j: 3. (GeneralizationofEquation8of[159].) Suppose ~ B (C k )hasanr 1 -concavedistributionforeachC k 2 L , and likewise every ^ B (C k ) isr 2 -concave. If2f1=2 + 1=B; 1=2 + 3=(2B);:::; 1g, then E ^ S CSS;;C n; \L min D 2 ; 2 1;B;r 1 ;D (;; 2B;r 2 ) jL j forall2 (; 1],whereD isdenedinthesamewayasinpart(ii)withtheconvention ~ D(;t;;) = 1 fort 0. Theorems 2.5.5 and A.5.2 follow from identical proofs to the corresponding results in Shah and Sam- worth [159] after some minor swaps, so we omit these proofs rather than duplicating them. 
For instance, to arrive at a proof for our Theorem 2.5.5, take the proofs for Shah and Samworth’s Lemma 1 and Theorem 1 and replace everywherek withC k (for someC k 2C), replacep withK, replacep k;bn=2c withp C k ;bn=2c; , replace ^ B (k) with our ^ B (C k ), replace ~ B (k) with our ~ B (C k ), replace ^ S bn=2c with ^ S ;C bn=2c , and replace ^ S CPSS n; with ^ S CSS;;C n; . The results in Theorem A.5.2 follow in the same way. Remark A.5.3. Shah and Samworth demonstrate empirically that the assumptions of Theorem A.5.2(iii) with r 1 =1=2 and r 2 =1=4 are reasonable when (C 1 ;:::;C p ) = (f1g;:::;fpg), but we do not investigate the reasonableness of this assumption in our more general setting. 212 A.5.3 ProofsofPropositionA.1.2,CorollaryA.1.3,andPropositionA.1.5 Before proving these results, we state some lemmas that we will need. (The proofs are provided in Ap- pendix A.6.) Lemma A.5.4. Assume the setup of (A.1.1), (A.1.2), and (A.1.3) with only one weak signal featureX q+1 (that is,p =q + 1). Then the following identities hold: E h y > y i =n( 2 Z + 2 q+1 + 2 ); (A.5.3) ^ j jX j N Z 1 + 2 j ; 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 X > j X j ! ; j2 [q]; (A.5.4) ^ q+1 jX q+1 N q+1 ; 2 Z + 2 X > q+1 X q+1 ! ; (A.5.5) ^ Z jZN Z ; 2 q+1 + 2 Z > Z ! ; (A.5.6) E h X > j X j i =n(1 + 2 j ); j2 [q]; (A.5.7) E h X > j y i = Z n; j2 [q]; (A.5.8) E h X > q+1 y i = q+1 n; (A.5.9) E " 1 X > j X j # = 1 (n 2)(1 + 2 j ) ; j2 [q]; (A.5.10) E " 1 X > q+1 X q+1 # = 1 n 2 ; (A.5.11) E h ^ 2 j i = 1 1 + 2 j " 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 n 2 + 2 Z 1 + 2 j # ; j2 [q]; (A.5.12) E h ^ 2 q+1 i = 2 Z + 2 n 2 + 2 q+1 ; and (A.5.13) E h ^ 2 Z i = 2 q+1 + 2 n 2 + 2 Z : (A.5.14) LemmaA.5.5. Assume the setup of (A.1.1), (A.1.2), and (A.1.3). 213 1. (Prediction risk of selectingX j ;j2 [q].) R(j) = n 1 n 2 0 @ 2 Z 2 j 1 + 2 j + p X j 0 =q+1 2 j 0 + 2 1 A ; j2 [q]: 2. (Prediction risk of selectingX j ;j2fq + 1;:::;pg.) R(j) = n 1 n 2 0 @ 2 Z + X j 0 2fq+1;:::;pgnj 2 j 0 + 2 1 A ; j2fq + 1;:::;pg: 3. If for somej2 [q]X j =Z (that is, if 2 j = 0), R(j) =E ideal := n 1 n 2 0 @ p X j 0 =q+1 2 j 0 + 2 1 A : (A.5.15) LemmaA.5.6. Assume the setup of Proposition A.1.5. The prediction risk for arbitrary weights (w 1 ;:::;w q )2 q1 ; as dened in (A.1.4), is n 1 n 2 0 @ 2 Z P q j=1 w 2 j 2 j 1 + P q j=1 w 2 j 2 j + p X j=q+1 2 j + 2 1 A : (A.5.16) We are now prepared to prove our main results from Appendix A.1. Proof of Proposition A.1.2. By Lemma A.5.5(i) the risk of an arbitraryX j ,j2 [q] is R(j) = n 1 n 2 0 @ 2 Z 2 j 1 + 2 j + p X k 0 =q+1 2 k 0 + 2 1 A 8j2 [q]; 214 and by Lemma A.5.5(ii), the risk for an arbitraryk2fq + 1;:::;pg is R(k) = n 1 n 2 0 @ 2 Z + X k 0 2fq+1;:::;pgnk 2 k 0 + 2 1 A : Then for anyj2 [q] and anyk2fq + 1;:::;pg, R(j) = n 1 n 2 0 @ 2 Z 2 j 1 + 2 j + p X k 0 =q+1 2 k 0 + 2 1 A = n 1 n 2 0 @ 2 Z 2 Z 1 + 2 j + X k 0 2fq+1;:::;pgnk 2 k 0 + 2 k + 2 1 A =R(k) n 1 n 2 2 Z 1 + 2 j 2 k ! : Proof of Corollary A.1.3. If we can show that under the assumptions of Theorem 2.3.1 2 Z > 1 + 2 (n), thenR(1)<R(3) is immediate by Proposition A.1.2 (andR(2) =R(1) comes from exchangeability). By Lemma A.4.9, 2 Z > 1 + 2 (n) 1 2(n) p 3 + 2 2 : Since we have 1 2(n) p 3 + 2 > 0 from Lemma A.4.8 and clearly 1 2(n) p 3 + 2 < 1 (because (n)> 0), the result is proven. 215 Proof of Proposition A.1.5. 1. Let 2 := ( 2 1 ;:::; 2 q ) > , and denotew 2 := (w 2 1 ;:::;w 2 q ) > . 
Consider the optimization problem to nd the optimal weights for the expression of prediction risk from (A.5.16) (with no restriction on whether some weights equal 0): w := arg min w 2 > 2 1 + (w 2 ) > 2 subject to 1 > w 1 = 0; w j 0 8j2 [q]: This can be simplied becauset7! t 1+t is monotonically increasing fort 0, so it is sucient to minimize w 2 > 2 . For any feasible choice ofw, let K 2 := w 2 > 2 = w 2 1 1= 2 1 +::: + w 2 q 1= 2 q : Considered as a function ofw, this is the equation for an ellipsoid with semi-axes of length K= 1 ;:::;K= q centered at the origin. Our optimization problem is to nd the smallestK 2 such that this ellipsoid intersects with the hyperplane described by 1 > w = 1 in the orthant withw j 0 for allj. We can nd this optimalw using the method of Lagrange multipliers. The Lagrangian is L(w;;) := w 2 > 2 1 > w: We have r w j L(w ;;) = 2w j 2 j = 0; j2 [q]; 216 which yieldsw j ==(2 2 j ). Finally, the constraint P j w j = 1 leads to X j 1 2 2 j = 1 () = 2 0 @ X j 1 2 j 1 A 1 =) w j = 1 2 j , X j 0 1 2 j 0 ; j2 [q]: 2. By Lemma A.5.6, the prediction risk of these weights is n 1 n 2 0 B B B @ 2 Z P q j=1 1 2 j P j 0 1 2 j 0 2 2 j 1 + P q j=1 1 2 j P j 0 1 2 j 0 2 2 j + p X j=q+1 2 j + 2 1 C C C A = n 1 n 2 0 @ 2 Z 1 + P q j=1 1 2 j + p X j=q+1 2 j + 2 1 A : 3. By Lemma A.5.5(ii), the prediction risk for an arbitraryk2fq + 1;:::;pg is R(k) = n 1 n 2 0 @ 2 Z + X j2fq+1;:::;pgnk 2 j + 2 1 A : Then for anyk2fq + 1;:::;pg, using the result from part (ii) we have R([q];w ) = n 1 n 2 0 @ 2 Z 1 + P q j=1 1 2 j + p X j=q+1 2 j + 2 1 A = n 1 n 2 0 @ 2 Z 2 Z P q j=1 1 2 j 1 + P q j=1 1 2 j + X j2fq+1;:::;pgnk 2 j + 2 k + 2 1 A =R(k) n 1 n 2 0 @ 2 Z P q j=1 1 2 j 1 + P q j=1 1 2 j 2 k 1 A : 217 A.6 ProofsofTechnicalLemmas Proof of Lemma A.4.3. Using thatX 11 andy 1 are jointly Gaussian, we have thatE X 2 11 y 2 1 = 11 yy + 2 2 1y . Then EkV 1 k 2 =E X 2 11 11 ;X 2 12 11 ;y 2 1 yy ;X 11 y 1 1y ;X 12 y 1 1y 2 =E h X 2 11 11 2 + X 2 12 11 2 + y 2 1 yy 2 + (X 11 y 1 1y ) 2 + (X 12 y 1 1y ) 2 i = Var X 2 11 + Var X 2 12 + Var y 2 1 +E X 2 11 y 2 1 1y X 11 y 1 + 2 1y +X 2 12 y 2 1 1y X 12 y 1 + 2 1y = 2 11 Var X 2 11 11 + 2 11 Var X 2 12 11 + 2 yy Var y 2 1 yy + 2E X 2 11 y 2 1 2 1y E [X 11 y 1 ] + 2 2 1y = 2 2 11 + 2 2 11 + 2 2 yy + 2 11 yy + 2 2 1y 2 2 1y + 2 2 1y = 4 2 11 + 2 2 yy + 2 11 yy + 4 2 1y ; 218 where we used thatX 2 11 = 11 ,X 2 12 = 11 , andy 2 1 = yy are 2 1 random variables and the exchangeability of X 11 andX 12 . Finally, using that yy > maxf 11 ; 1y g, we haveEkV 1 k 2 < 12 2 yy , which is (A.4.5). 
Further, EkV 1 k 3 =E X 2 11 1;X 2 12 1;y 2 1 1;X 11 y 1 1y ;X 12 y 1 1y 3 =E h X 2 11 1 3 + X 2 12 1 3 + y 2 1 1 3 +jX 11 y 1 1y j 3 +jX 12 y 1 1y j 3 i E h X 2 11 + 1 3 + X 2 12 + 1 3 + y 2 1 + 1 3 + (jX 11 y 1 j + 1y ) 3 + (jX 12 y 1 j + 1y ) 3 i =2E h X 2 11 + 1 3 i +E y 2 1 + 1 3 + 2E h (jX 11 y 1 j + 1y ) 3 i =2E X 6 11 + 3X 4 11 + 3X 2 11 + 1 +E y 6 1 + 3y 4 1 + 3y 2 1 + 1 + 2E h jX 11 y 1 j 3 + 3 1y jX 11 y 1 j 2 + 3 2 1y jX 11 y 1 j + 3 1y i 2E X 6 11 + 3X 4 11 + 3X 2 11 + 1 +E y 6 1 + 3y 4 1 + 3y 2 1 + 1 + 2 q EjX 11 j 6 Ejy 1 j 6 + 6 1y q EjX 11 j 4 Ejy 1 j 4 + 6 2 1y q EjX 11 j 2 Ejy 1 j 2 + 2 3 1y =E 2X 6 11 + 6X 4 11 + 6X 2 11 +y 6 1 + 3y 4 1 + 3y 2 1 + 3 + 2 q EX 6 11 Ey 6 1 + 6 1y q EX 4 11 Ey 4 1 + 6 2 1y q EX 2 11 Ey 2 1 + 2 3 1y =2 15 3 11 + 6 3 2 11 + 6 11 + 15 3 yy + 3 3 2 yy + 3 yy + 3 + 2 q 15 3 11 15 3 yy + 6 1y q 3 2 11 3 2 yy + 6 2 1y p 11 Var(y 1 ) + 2 3 1y =30 3 11 + 18 2 11 + 6 11 + 15 3 yy + 9 2 yy + 3 yy + 3 + 30 3=2 11 3=2 yy + 18 1y 11 yy + 6 2 1y p 11 yy + 2 3 1y ; where we used the triangle inequality, the Cauchy-Schwarz inequality and the exchangeability ofX 11 and X 12 . Again, since yy > maxf 11 ; 1y g, this yieldsEkV 1 k 3 < 140 3 yy which is (A.4.6). 219 Proof of Lemma A.4.4. Note that L(V 1 ) = 1 2 1y p yy 3 11 (X 2 12 X 2 11 ) + 1 p 11 yy (X 11 y 1 X 12 y 1 ) = (X 12 X 11 ) 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! ; so ~ 2 :=EjL(V 1 )j 2 =E 2 4 (X 12 X 11 ) 2 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! 2 3 5 : We will use the fact thatX 12 X 11 and 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 are mean zero jointly Gaussian random variables. Note that they are independent: Cov X 12 X 11 ; 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! = Cov X 12 ; 1 2 1y p yy 3 11 (X 12 +X 11 ) ! Cov X 12 ; 1 p 11 yy y 1 ! Cov X 11 ; 1 2 1y p yy 3 11 (X 12 +X 11 ) ! + Cov X 11 ; 1 p 11 yy y 1 ! = 1 2 1y p yy 3 11 [Cov (X 12 ;X 12 +X 11 ) Cov (X 11 ;X 12 +X 11 )] 1 p 11 yy 1y + 1 p 11 yy 1y = 0 220 (where we used the exchangeability ofX 11 andX 12 ). Therefore ~ 2 =E 2 4 (X 12 X 11 ) 2 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! 2 3 5 =E h (X 12 X 11 ) 2 i E 2 4 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! 2 3 5 = Var [X 12 X 11 ] Var " 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 # : (A.6.1) We have Var(X 12 X 11 ) = 11 + 11 2 12 = 2( 11 12 ); Var 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! = 1 4 2 1y yy 3 11 Var(X 12 +X 11 ) + 1 11 yy Var(y 1 ) 2 1 2 1y yy 2 11 Cov(X 12 +X 11 ;y 1 ) = 1 4 2 1y yy 3 11 ( 11 + 11 + 2 12 ) + 1 11 1y yy 2 11 ( 1y + 1y ) = 3 2 2 1y yy 2 11 + 1 2 2 1y 12 yy 3 11 + 1 11 ; so ~ 2 = 2( 11 12 ) 3 2 2 1y yy 2 11 + 1 2 2 1y 12 yy 3 11 + 1 11 ! = 2 11 11 12 11 0 @ 3 2 1y p yy 11 ! 2 + 1 2 1y p yy 11 ! 2 12 11 + 1 1 A = (1 12 ) 3 2 1y + 2 1y 12 + 2 ; 221 which yields the expression in (A.4.10). Finally, to see that ~ 2 2, note that ~ 2 = (1 12 ) 3 2 1y + 2 1y 12 + 2 = 2 (1 12 ) 2 1y 3 12 2 + 1 2 1 (0 + 1) = 2: (A.6.2) This veries (A.4.10). Next we will considerEjL(V 1 )j 3 . We will begin by showing that (A.4.11) holds. First, observe that (again using the independence ofX 12 X 11 and 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ) EjL(V 1 )j 3 =E (X 12 X 11 ) 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 ! 3 =EjX 12 X 11 j 3 E 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 3 = r 8 [Var (X 12 X 11 )] 3=2 r 8 " Var 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 !# 3=2 : (A.6.3) Using (A.6.3) and (A.6.1), we have EjL(V 1 )j 3 1=3 EjL(V 1 )j 2 1=2 8 [Var (X 12 X 11 )] 3=2 Var 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 3=2 ! 
1=3 Var [X 12 X 11 ] Var 1 2 1y p yy 3 11 (X 12 +X 11 ) 1 p 11 yy y 1 1=2 = 8 1=3 : This veries (A.4.11). Since from (A.6.2) we have EjL(V 1 )j 2 1=2 = ~ p 2, it follows that EjL(V 1 )j 3 1=3 p 2 8 1=3 =) EjL(V 1 )j 3 8 2 3=2 = 16 p 2 ; 222 verifying thatEjL(V 1 )j 3 is nite. Proof of Lemma A.4.5. For notational ease, let ~ u j :=u j + 11 forj2f1; 2g, let ~ u 3 :=u 3 + yy , and let ~ u j :=u j + 1y forj2f4; 5g. Then dierentiating (A.4.8) yields r 2 g(u) = 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 3~ u 4 4 p ~ u 3 ~ u 5 1 0 ~ u 4 4 p ~ u 3 1 ~ u 3 3 1 2 p ~ u 3 ~ u 3 1 0 0 3~ u 5 4 p ~ u 3 ~ u 5 2 ~ u 5 4 p ~ u 3 2 ~ u 3 3 0 1 2 p ~ u 3 ~ u 3 2 ~ u 4 4 p ~ u 3 1 ~ u 3 3 ~ u 5 4 p ~ u 3 2 ~ u 3 3 3 4 p ~ u 5 3 ~ u 5 p ~ u 2 ~ u 4 p ~ u 1 1 2 p ~ u 1 ~ u 3 3 1 2 p ~ u 2 ~ u 3 3 1 2 p ~ u 3 ~ u 3 1 0 1 2 p ~ u 1 ~ u 3 3 0 0 0 1 2 p ~ u 3 ~ u 3 2 1 2 p ~ u 2 ~ u 3 3 0 0 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 : (A.6.4) Let := 11 =2. Then (A.4.1) is satised if for allu2X where X := u2 ( 11 ;1) ( 22 ;1) ( yy ;1)R 2 :kuk 2 ; kr 2 g(u)k op M for someM 2 (0;1). Since for any real-valued square matrixA2R kk it holds that kAk op kAk F p ~ ` max i;j jA ij j (where ~ ` is the number of nonzero entries inA), it suces to bound max u2X ( max i;j r 2 g(u) ij ) . Note that inX we have ~ u j =u j + 11 2 ( 11 =2; 3 11 =2); j2f1; 2g; ~ u 3 =u 3 + yy 2 ( yy 11 =2; yy + 11 =2); and ~ u j =u j + 1y 2 ( 1y 11 =2; 1y + 11 =2); j2f4; 5g: 223 Since (using 11 < 1y < yy ) 1=2 minf 11 =2; yy =2; 1y =2g = minf 11 =2; yy yy =2; 1y 1y =2g minf 11 =2; yy 11 =2; 1y 11 =2g and similarly yy + 11 =2> maxf 1y + 11 =2; 3 11 =2g, we can bound all of the ~ u j inX using 1=2 ~ u j yy + 11 =2 8j2 [5]: Therefore inX the absolute values of the (1; 1) and (2; 2) terms of (A.6.4) can be upper-bounded by 3 4 yy + 11 =2 q (1=2) 6 = 3 8 4 ( yy + 11 =2) = 6 yy + 3 11 < 9 yy ; the absolute values of the (1; 3) and (2; 3) terms of (A.6.4) (and their symmetric counterparts, the (3; 1) and (3; 2) terms) can be upper-bounded by 1 4 yy + 11 =2 q (1=2) 6 = 8 4 ( yy + 11 =2) = 2 yy + 11 < 3 yy < 9 yy ; and the absolute values of the (1; 4), (2; 5), (3; 4), and (3; 5) terms of (A.6.4) and their symmetric counter- parts can be upper-bounded by 1 2 1 q (1=2) 4 = 4 2 < 9 yy : 224 Finally, the absolute value of the center (3; 3) term can be upper-bounded by 3 4 p ~ u 5 3 ~ u 5 p ~ u 2 ~ u 4 p ~ u 1 3 4 q (1=2) 5 yy + 11 =2 p 1=2 1=2 p yy + 11 =2 ! 3 ( yy + 11 =2) 4 q (1=2) 6 = 3 8 4 ( yy + 11 =2) = 6 yy + 3 11 < 9 yy : Therefore for the 15 non-zero terms inr 2 g(u) we have max i;j r 2 g(u) ij < 9 yy , so for allu2X , kr 2 g(u)k op kr 2 g(u)k F p 15 max i;j r 2 g(u) ij < 9 p 15 yy < 36 yy =M : Proof of Lemma A.4.6. Substituting (A.4.5), (A.4.6), and = 11 =2 into the expression for k in (A.4.3) yields k < min ( 12 2 yy 2 11 n 1=2 =4 ; 2 12 2 yy 3=2 + 140 3 yy =n 1=2 3 11 n=8 ) = min ( 12 2 yy 2 11 n 1=2 =4 ; 384 p 3 3 yy + 1120 3 yy =n 1=2 3 11 n ) 384 p 3 3 yy + 1120 3 yy =n 1=2 3 11 n : 225 Substituting this along with the other expressions from (A.4.3) into (A.4.2) (also using (A.4.5), (A.4.6), and (A.4.11)) yields Ck 0 +k 1 3 r 8 ! 3 + k 20 +k 21 3 r 8 ! 12 2 yy + k 30 +k 31 3 r 8 ! 140 3 yy 2=3 +k < 0:13925 + 8 2:33554 + M 2~ 2 2 1=6 + " 2 + 2 2=3 n 1=6 # 8 1=3 ! 12 2 yy + M 2~ (8=) 1=6 n 1=3 + 2 n 1=2 8 1=3 ! 
140 3 yy 2=3 + 384 p 3 3 yy + 1120 3 yy =n 1=2 3 11 n < 6:087 + M ~ 2 yy 12 2 1=6 + 12 1 + 1 2 1=3 n 1=6 8 1=3 + (140) 2=3 n 1=3 1=6 p 2 + (140) 2=3 n 1=2 8 1=3 + 384 p 3 + 1120=n 1=2 n 3 yy 3 11 < 6:087 + M ~ 2 yy 27:517 + 13:007 n 1=6 + 15:754 n 1=3 + 36:819 n 1=2 + 3 yy 3 11 384 p 3 n + 1120 n 3=2 ! : Also, note that forn 100, 27:517 + 13:007 n 1=6 + 15:754 n 1=3 + 36:819 n 1=2 27:517 + 13:007 100 1=6 + 15:754 100 1=3 + 36:819 100 1=2 < 40:631; and 384 p 3 n + 1120 n 3=2 384 p 3 n + 1120 n 100 1=2 < 777:108 n : 226 Using this along withM = 36 yy from Lemma A.4.5 and 11 1 yields C < 6:087 + 36 yy ~ 2 yy 40:631 + 3 yy 777:108 n < 6:087 + 1462:717 3 yy ~ + 3 yy 777:108 n (a) < 6:087 + 1462:717 3 yy ~ + 7:772 3 yy (b) < 1462:717 3 yy ~ + 13:859 3 yy ; where (a) follows because sincen 100 we have 3 yy 777:108 n 3 yy 777:108 100 and (b) follows because yy > 1. Proof of Lemma A.4.8. First we will show the inequalities from (A.4.16). By (A.4.15), n (logn) 3=2 > 361 25 ( 2 Z + 1 + 2 ) 4c 2 () 25 361 (logn) 1=2 > ( 2 Z + 1 + 2 ) logn 4c 2 n () s ( 2 Z + 1 + 2 ) logn 4c 2 n < 5 19 (logn) 1=4 () (n)< 5 19 (logn) 1=4 : 227 Next, from (A.4.14) we have n logn > 2 Z + 1 + 2 c 2 5 1 + 2 () 1 20 (1 + 2 ) > logn 4c 2 n ( 2 Z + 1 + 2 ) () 1 p 20 (1 + 2 ) >(n): Note that this yields (n)< 1 p 20 (1 + 2 ) < 2 5 p 3 + 2 () 2(n) p 3 + 2 < 4 5 : Lastly, we have by (A.4.14) n logn > 1 c 2 2 Z + 1 + 2 4t 2 0 (2 + 2 ) 2 () s ( 2 Z + 1 + 2 ) logn 4c 2 n <t 0 (2 + 2 ) () (n)<t 0 (2 + 2 ) =) (n)<t 0 2 Z + 1 + 2 ; where we used that Z > 1, so (A.4.16) is veried. Now we will show that the remaining inequalities hold. From the denition of(n) in (A.3.13) we have that (n) = log(n) 2 (n) 20 s 2 Z + 1 + 2 c 2 () 2 (n) = 20 r c 2 2 Z + 1 + 2 (n) logn : 228 Since 1 + 2 (n)> 1 for alln2N andx> p x for allx> 1, q 1 + 2 (n) 1< 1 + 2 (n) 1 = 2 (n) = 20 r c 2 2 Z + 1 + 2 (n) logn ; which is (A.4.17). Finally, usingn 100, 2 (n) = 10 p n logn 10 p 100 log(100) < 1: This veries (A.4.18). Proof of Lemma A.4.9. First, note that 1< q 1 + 2 (n) 1 2(n) p 3 + 2 229 because 12(n) p 3 + 2 > 0 due to (A.4.16), 12(n) p 3 + 2 < 1 since(n)> 0, and q 1 + 2 (n)> 1 since 2 (n)> 0. Next we will show the inequality on the right. Due to the assumption that Z 2I(n) from (2.13) (and using, in particular, Z > 1) we have Z < 1 + 19 10 s 2 + 2 c 2 (logn) 3=4 n 1=2 =) n 1=2 (logn) 3=4 < 19 10 s 2 + 2 c 2 1 Z 1 < 19 10 s 2 Z + 1 + 2 c 2 Z Z 1 =) 10 Z 1 Z < 19 (logn) 3=4 n 1=2 s 2 Z + 1 + 2 c 2 () 5 1 1 Z < 19 (logn) 3=4 n 1=2 s 2 Z + 1 + 2 4c 2 = 19 (logn) 1=4 (n) () Z < 5 5 19 (logn) 1=4 (n) ; 230 where the last step is permissible because 5 19 (logn) 1=4 (n)> 0 due to (A.4.16). Finally we will show the middle inequality. From Lemma A.4.8 we have 2(n) p 3 + 2 < 4 5 . Using the inequality 1=(1t) 1 + 5t valid fort2 [0; 4=5], we have q 1 + 2 (n) 1 2(n) p 3 + 2 q 1 + 2 (n) 1 + 5 2(n) p 3 + 2 = q 1 + 2 (n) 0 @ 1 + 5 2 s ( 2 Z + 1 + 2 ) logn 4c 2 n p 3 + 2 1 A (a) 1 + 1 2 2 (n) 0 @ 1 + 5 s 2 Z + 1 + 2 logn c 2 n p 3 + 2 1 A = 1 + 5 p n logn 1 +h( Z ; 2 ) r logn n ! = 1 + 5 p n logn +h( Z ; 2 ) r logn n + 5h( Z ; 2 ) n ; (A.6.5) where in (a) we used the inequality p 1 +t 1 + 1 2 t, valid fort 0 and h( Z ; 2 ) := 5 s 2 Z + 1 + 2 (3 + 2 ) c 2 : (A.6.6) The following lemma allows us to bound this expression. LemmaA.6.1. Under the assumptions of Theorem 2.3.1, 5 p n logn +h( Z ; 2 ) r logn n + 5h( Z ; 2 ) n < 100 p n logn forh( Z ; 2 ) dened in (A.6.6). Proof. Provided later in Appendix A.6. 
231 Finally, the assumption that Z 2I(n) from (2.13) yields Z > 1 + 100 p n logn (b) > 1 + 5 p n logn +h( Z ; 2 ) r logn n + 5h( Z ; 2 ) n ; (c) q 1 + 2 (n) 1 2(n) p 3 + 2 ; where in (b) we used Lemma A.6.1 and in (c) we used (A.6.5). Proof of Lemma A.4.10. The calculation of the covariance and cosine similarity matrices is trivial. Note that 2 Z + 1 + 2 > 1 + 1 + 2 > 1 + 2 (n) since Z > 1 and by (A.4.18) 2 (n) < 1. So max ( ii ) = 2 Z + 1 + 2 = yy . By inspection we see min ( ii ) = 1. Finally, we will show that Z 1 + 2 (n) > 0. We have Z 1 + 2 (n) (a) > q 1 + 2 (n) 1 2(n) p 3 + 2 1 + 2 (n) = q 1 + 2 (n) 1 1 2(n) p 3 + 2 q 1 + 2 (n) ! (b) > 1 1 2(n) p 3 + 2 1 + 20 r c 2 2 Z + 1 + 2 (n) logn = 2(n) p 3 + 2 1 2(n) p 3 + 2 10 r c 2 2 Z + 1 + 2 1 logn ! (c) > 0; 232 where (a) uses Lemma A.4.9, (b) follows from (A.4.17) and 2 (n)> 0, and (c) comes from n 100 =) n> exp ( 10 r e 1 48e 2 ) =) logn> 10 s e 1 8e 2 ( 2 Z + 1 + 2 )(3 + 2 ) (d) =) logn> 1 2(n) p 3 + 2 p 3 + 2 10 r c 2 2 Z + 1 + 2 () p 3 + 2 1 2(n) p 3 + 2 > 10 r c 2 2 Z + 1 + 2 1 logn where (d) follows fromc 2 < (e 1)=(8e 2 ) and we used that Z > 1. Proof of Lemma A.4.12. To prove the rst result we will use one more lemma. LemmaA.6.2. Under the assumptions of Theorem 2.3.1, 2 ~ (n) 1 + 2 2 Z + 1 + 2 < ~ 2 (n) 2; where ~ (n) is dened in (A.4.24) and ~ (n) is dened in (A.4.25). Proof. Provided later in Appendix A.6. 233 First we will show that (A.4.26) holds. Using the denitions of(n) from (A.3.13),(n) from (A.3.14), and ~ (n) from (A.4.24), we have that the argument of () in (A.4.26) is p n(n) ~ (n) = 2 2 + 19 5 (logn) 1=4 p n(n)[(n) + ~ (n)] ~ (n) (a) < 2 2 + 19 5 (logn) 1=4 q 2 Z + 1 + 2 p n(n)[(n) + ~ (n)] q 2 (1 + 2 ) ~ (n) = 2 2 + 19 5 (logn) 1=4 q 2 Z + 1 + 2 q ( 2 Z +1+ 2 ) logn 4c 2 q ( 2 Z +1+ 2 ) logn 4c 2 n + 10 p n logn+10 q 2 (1 + 2 ) 10 p n logn+10 = 2 + 19 5 (logn) 1=4 2 Z + 1 + 2 s 4 80c 2 (1 + 2 ) p logn q ( 2 Z +1+ 2 ) logn 4c 2 n + 1 p n 10 p logn+10= p n q 1 p n logn+10 (b) < 2 19 5 (logn) 1=4 2 Z + 1 + 2 p 20c 2 (1 + 2 ) p logn q ( 2 Z +1+ 2 ) logn 4c 2 n + 2 q logn n q 1 p n logn+10 = 38 5 (logn) 1=4 2 Z + 1 + 2 2 p 5c 2 (1 + 2 ) logn s (n logn) 1=2 + 10 n 2 4 q 2 Z + 1 + 2 2 p c 2 + 2 3 5 (c) < 19 5 (logn) 1=4 2 Z + 1 + 2 p 5c 2 (1 + 2 ) logn s 2 (n logn) 1=2 n q 2 Z + 1 + 2 p c 2 = 19 p 2 5 (logn) 1=4 2 Z + 1 + 2 3=2 c 2 p 5(1 + 2 ) (logn) 5=4 n 1=4 =k( z ; 2 ) (logn) 3=2 n 1=4 ; where (a) follows from Lemma A.6.2, (b) follows from 10 p logn + 10= p n < 2 p logn 234 and 2< 19 5 (logn) 1=4 for alln 100, (c) follows from 10< p n logn forn 100 and q 2 Z + 1 + 2 2 p c 2 > p 2 2 r 8e 2 e 1 > 2; and k( z ; 2 ) := 19 p 2 5c 2 p 5 2 Z + 1 + 2 3=2 (1 + 2 ) 1=2 : So we have that k( z ; 2 ) (logn) 3=2 n 1=4 ! is a valid upper bound for ((n) p n=~ (n)). The rst order Taylor expansion of (k( z ; 2 )x) centered at 0 is (0) + d dx k( z ; 2 )x x=0 x +R(x) = 1 2 +k( z ; 2 ) (0)x +R(x) = 1 2 + 1 p 2 k( z ; 2 )x +R(x) 235 where (x) = 1 p 2 e x 2 2 is the pdf of a standard Gaussian random variable and R(x) = (x 2 ). Since 00 (x) = 0 (x) 0 for allx 0, (k( z ; 2 )x) is concave for allx 0, so the rst order Taylor series is an upper bound for (k( z ; 2 )x) whenx 0. ThereforeR(x) 0 for allx 0, so for alln 100, (n) p n ~ (n) < k( z ; 2 ) (logn) 3=2 n 1=4 ! 1 2 + 1 p 2 k( z ; 2 ) (logn) 3=2 n 1=4 < 1 2 +c 7 2 Z + 1 + 2 3=2 (1 + 2 ) 1=2 (logn) 3=2 n 1=4 for c 7 := 19 5c 2 p 5 ; which is (A.4.26). We conclude by showing that (A.4.27) holds. 
Using the denition of ~ (n) in (A.4.24), we have 1 ~ (n) (d) < q 2 Z + 1 + 2 p 2(1 + 2 ) 1 q ~ (n) = s 2 Z + 1 + 2 2(1 + 2 ) p n logn + 10 1=2 p 10 (e) < s 2 Z + 1 + 2 2(1 + 2 ) p 2 (n logn) 1=4 p 10 where (d) follows from Lemma A.6.2 and (e) follows from 10< p n logn forn 100. This yields (A.4.27). Proof of Lemma A.5.4. We establish the identities one at at time. 236 • Note that 1 q 2 Z + 2 q+1 + 2 yN (0;I n ) =) E 0 B @ 2 4 1 q 2 Z + 2 q+1 + 2 y 3 5 > 2 4 1 q 2 Z + 2 q+1 + 2 y 3 5 1 C A =n: This establishes (A.5.3). • We have forj2 [q] 0 B B @ Z X j 1 C C A = 0 B B @ Z Z 1 C C A + 0 B B @ 0 j 1 C C A N 0 B B @ 0 B B @ 0 0 1 C C A ; 0 B B @ I n I n I n I n + 2 j I n 1 C C A 1 C C A : Then ZjX j N I n I n + 2 j I n 1 X j ;I n I n (I n + 2 j I n ) 1 I n =N X j 1 + 2 j ; " 1 1 1 + 2 j # I n ! =N X j 1 + 2 j ; 2 j 1 + 2 j I n ! : (A.6.7) Next, 237 E h ^ j jX j i =E h (X > j X j ) 1 X > j ( Z Z + q+1 X q+1 +)jX j i = (X > j X j ) 1 X > j ( Z E [ZjX j ] + q+1 E [X q+1 jX j ] +E [jX j ]) = Z 1 + 2 j (X > j X j ) 1 X > j X j + 0 = Z 1 + 2 j where we used (A.6.7) andE [jX j ] = 0 becauseE [] = 0 and andX j are independent by assumption, and similarlyE [X q+1 jX j ] = 0. Next, Var h ^ j jX j i = Var h (X > j X j ) 1 X > j ( Z Z + q+1 X q+1 +)jX j i = 1 (X > j X j ) 2 X > j Var [ Z Z + q+1 X q+1 +jX j ]X j = 1 (X > j X j ) 2 X > j ( 2 Z Var [ZjX j ] + 2 q+1 Var [X q+1 jX j ] + Var [jX j ])X j = 1 (X > j X j ) 2 X > j 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! I n X j = 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! X > j X j (X > j X j ) 2 = 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 X > j X j 238 where we used (A.6.7). Note also that ^ conditioned onX j is a Gaussian random variable because ZjX j andjX j are independent Gaussian random variables. Putting this together we have ^ j jX j N Z 1 + 2 j ; 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 X > j X j ! : This proves (A.5.4). • E h ^ q+1 jX q+1 i =E h (X > q+1 X q+1 ) 1 X > q+1 ( Z Z + q+1 X q+1 +)jX q+1 i = q+1 (X > q+1 X q+1 ) 1 X > q+1 X q+1 = q+1 ; where we used the fact thatE [jX q+1 ] = 0 becauseE [] = 0 and andX q+1 are independent by assumption, and similarlyE [ZjX q+1 ] = 0. Next, Var h ^ q+1 jX q+1 i = Var h (X > q+1 X q+1 ) 1 X > q+1 ( Z Z + q+1 X q+1 +)jX q+1 i = 1 (X > q+1 X q+1 ) 2 X > q+1 Var [ Z Z + q+1 X q+1 +jX q+1 ]X q+1 = 1 (X > q+1 X q+1 ) 2 X > q+1 ( 2 Z Var [ZjX q+1 ] + 2 q+1 Var [X q+1 jX q+1 ] + Var [jX q+1 ])X q+1 = 1 (X > q+1 X q+1 ) 2 X > q+1 2 Z + 2 I n X q+1 = 2 Z + 2 X > q+1 X q+1 (X > q+1 X q+1 ) 2 = 2 Z + 2 X > q+1 X q+1 : 239 Note also that ^ q+1 conditioned onX q+1 is a Gaussian random variable becauseZjX q+1 ,X q+1 j X q+1 , andjX q+1 are independent Gaussian random variables. Putting this together we have ^ q+1 jX q+1 N q+1 ; 2 Z + 2 X > q+1 X q+1 ! : This shows (A.5.5). • E h ^ Z jZ i =E h (Z > Z) 1 Z > ( Z Z + q+1 X q+1 +)jZ i = Z (Z > Z) 1 Z > Z = Z ; where we used the fact that E [jZ] = 0 because E [] = 0 and andZ are independent by assumption, and similarlyE [X q+1 jZ] = 0. Next, Var h ^ Z jZ i = Var h (Z > Z) 1 Z > ( Z Z + q+1 X q+1 +)jZ i = 1 (Z > Z) 2 Z > Var [ Z Z + q+1 X q+1 +jZ]Z = 1 (Z > Z) 2 Z > ( 2 Z Var [ZjZ] + 2 q+1 Var [X q+1 jZ] + Var [jZ])Z = 1 (Z > Z) 2 Z > 2 q+1 + 2 I n Z = 2 q+1 + 2 Z > Z (Z > Z) 2 = 2 q+1 + 2 Z > Z : 240 Note also that ^ Z conditioned onZ is a Gaussian random variable becauseZjZ,X q+1 jZ, and jZ are independent Gaussian random variables. Putting this together we have ^ Z jZN Z ; 2 q+1 + 2 Z > Z ! : This proves (A.5.6). • X j N 0; (1 + 2 j )I n =) 1 q 1 + 2 j X j N (0; 1) This veries (A.5.7). 
• For anyj2 [q], E h X > j y i =E h Z + j > ( Z Z + q+1 X q+1 +) i =E h Z Z > Z i = Z n: This establishes (A.5.8). • E h X > q+1 y i =E h X > q+1 ( Z Z + q+1 X q+1 +) i = q+1 n: This proves (A.5.9). • Since 1 q 1+ 2 j X j N (0;I n ), for anyj2 [q] the random variable 1 + 2 j X > j X j has an inverse 2 distribution withn degrees of freedom. Therefore 241 E " 1 + 2 j X > j X j # = 1 n 2 =) E " 1 X > j X j # = 1 (n 2)(1 + 2 j ) ; verifying (A.5.10). • SinceX q+1 N (0;I n ), the random variable 1 X > q+1j X q+1 has an inverse 2 distribution withn degrees of freedom, so E " 1 X > q+1 X q+1 # = 1 n 2 : This shows (A.5.11). • To calculateE h ^ 2 j i for anyj2 [q], we will use (A.5.4) and (A.5.10): E h ^ 2 j i =E h E h ^ 2 j jX j ii =E Var h ^ j jX j i +E h ^ j jX j i 2 =E 2 4 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 X > j X j + Z 1 + 2 j ! 2 3 5 = 1 1 + 2 j " 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 n 2 + 2 Z 1 + 2 j # : This proves (A.5.12). 242 • Using (A.5.5) and (A.5.11) we have E h ^ 2 q+1 i =E h E h ^ 2 q+1 jX q+1 ii =E Var h ^ q+1 jX q+1 i +E h ^ q+1 jX q+1 i 2 =E " 2 Z + 2 X > q+1 X q+1 + 2 q+1 # = 2 Z + 2 n 2 + 2 q+1 This veries (A.5.13). • Using (A.5.6) we have E h ^ 2 Z i =E Var h ^ Z jZ i +E h ^ Z jZ i 2 =E " 2 q+1 + 2 Z > Z + 2 Z # = 2 q+1 + 2 n 2 + 2 Z ; which is (A.5.14). Proof of Lemma A.5.5. 1. First consider the case where there is only one directly observed feature (that is,p =q + 1). For anyj2 [q], 243 E 1 n ~ y ^ j ~ X j 2 2 = 1 n E ~ y ^ j ~ X j > ~ y ^ j ~ X j (a) = 1 n E h ~ y > ~ y i 2 n E h ^ j i E h ~ X > j ~ y i + 1 n E h ^ 2 j i E h ~ X > j ~ X j i (b) = 2 Z + 2 q+1 + 2 2 n E h E ^ j jX j i Z n + 1 n 1 1 + 2 j " 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 n 2 + 2 Z 1 + 2 j # n(1 + 2 j ) (c) = 2 Z + 2 q+1 + 2 2 2 Z 1 + 2 j + 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! 1 n 2 + 2 Z 1 + 2 j = n 1 n 2 2 q+1 + 2 + 2 Z 2 Z 1 + 2 j + 2 Z 2 j 1 + 2 j 1 n 2 = n 1 n 2 2 q+1 + 2 + 2 Z 2 j 1 + 2 j + 2 Z 2 j 1 + 2 j 1 n 2 = n 1 n 2 2 Z 2 j 1 + 2 j + 2 q+1 + 2 ! ; (A.6.8) where (a) follows because ~ X j and ~ y are independent fromX j andy, so ^ j = (X > j X j ) 1 X > j y is independent from ~ X j and ~ y, (b) follows from (A.5.3), (A.5.7), (A.5.8), and (A.5.12), and (c) follows from (A.5.4). Now suppose we have an arbitrary number of directly observed features (that is, an arbitraryp>q). Then for anyj2 [q], y = Z Z + p X j=q+1 j X j + = Z Z + q+1 X q+1 + ~ where ~ := p X j 0 =q+2 j 0X j 0 + 244 is independent ofZ andX q+1 . So we can use (A.6.8) and we have E 1 n ~ y ^ j ~ X j 2 2 = n 1 n 2 2 Z 2 j 1 + 2 j + 2 q+1 + Var (~ ) ! = n 1 n 2 0 @ 2 Z 2 j 1 + 2 j + 2 q+1 + p X j 0 =q+2 2 j 0 + 2 1 A : 2. Again, we start by considering the case wherep =q + 1. E 1 n ~ y ^ q+1 ~ X q+1 2 2 = 1 n E ~ y ^ q+1 ~ X q+1 > ~ y ^ q+1 ~ X q+1 (a) = 1 n E h ~ y > ~ y i 2 n E h ^ q+1 i E h ~ X > q+1 ~ y i + 1 n E h ^ 2 q+1 i E h ~ X > q+1 ~ X q+1 i (b) = 2 Z + 2 q+1 + 2 2 n E h E ^ q+1 jX q+1 i q+1 n + 1 n 2 Z + 2 n 2 + 2 q+1 n (c) = 2 Z + 2 q+1 + 2 2 2 q+1 + 2 Z + 2 n 2 + 2 q+1 = n 1 n 2 2 Z + 2 ; (A.6.9) where (a) follows because ~ X q+1 and ~ y are independent from ^ q+1 , (b) follows from (A.5.3), (A.5.9), (A.5.13), and the fact thatE h ~ X > q+1 ~ X q+1 i = 1, and (c) follows from (A.5.5). Now for an arbitraryp>q, for anyj2fq + 1;:::;pg we have y = Z Z + p X j=q+1 j X j + = Z Z + j X j + ~ where ~ := X j 0 2fq+1;:::;pgnj j 0X j 0 + 245 is independent ofZ andX j . So we can use (A.6.9) and we have E 1 n ~ y ^ j ~ X q+1 2 2 = n 1 n 2 2 Z + Var (~ ) = n 1 n 2 0 @ 2 Z + X j 0 2fq+1;:::;pgnj 2 j 0 + 2 1 A : 3. 
First we handle the case with only one directly observed featureX q+1 , as in the previous parts. E 1 n ~ y ^ Z ~ Z 2 2 = 1 n E h ~ y > ~ y 2 ^ Z ~ Z > ~ y + ^ 2 Z ~ Z > ~ Z i (a) = 1 n n( 2 Z + 2 q+1 + 2 ) 2E h ^ Z i E h ~ Z > E ~ yj ~ Z i +E h ^ 2 Z i E h ~ Z > ~ Z i ! (b) = 1 n n( 2 Z + 2 q+1 + 2 ) 2E h E ^ Z jZ i Z E h ~ Z > ~ Z i +n 2 q+1 + 2 n 2 + 2 Z !! (c) = 1 n n( 2 Z + 2 q+1 + 2 ) 2 2 Z n +n 2 q+1 + 2 n 2 + 2 Z !! = 2 Z + 2 q+1 + 2 2 2 Z + 2 q+1 + 2 n 2 + 2 Z = n 1 n 2 2 q+1 + 2 ; (A.6.10) where in (a) we used (A.5.3) and the independence of ^ Z from ~ Z and ~ y, in (b) we usedE Z > Z =n and (A.5.14), and in (c) we used (A.5.6) andE Z > Z = n. Now we make this more general. We have y = Z Z + p X j=q+1 j X j + = Z Z + q+1 X q+1 + ~ where ~ := p X j=q+2 j X j +N 0 @ 0; 0 @ p X j=q+2 2 j + 2 1 A I n 1 A : 246 Substituting into (A.6.10), we see that the prediction risk ofZ is n 1 n 2 0 @ 2 q+1 + 0 @ p X j=2 2 j + 2 1 A 1 A = n 1 n 2 0 @ p X j=q+1 2 j + 2 1 A : Proof of Lemma A.5.6. Note that q X j=1 w q X j =Z + q X j=1 w q j d =Z + ~ ; where d = denotes equality in distribution and ~ N 0; P q j=1 w 2 j 2 j I n is independent ofZ. Then the result follows from Lemma A.5.5(i). Proof of Lemma A.6.1. Using logn n < c 2 2 Z + 1 + 2 1 2 (12 + 2 ) from (A.4.14), we have h( Z ; 2 ) r logn n <h( Z ; 2 ) s c 2 2 Z + 1 + 2 1 2 (12 + 2 ) = 5 s 3 + 2 2 (12 + 2 ) < 5 p 2 2 : Similarly, 5h( Z ; 2 ) n = 5h( Z ; 2 ) r logn n 1 p n logn < 25 p 2 2 1 p n logn : 247 This yields 5 p n logn +h( Z ; 2 ) r logn n + 5h( Z ; 2 ) n < 5 p n logn + 5 p 2 2 + 25 p 2 2 p n logn < 5 p n logn 1 + p 2 2 p 100 log 100 + 5 p 2 2 ! < 100 p n logn : Proof of Lemma A.6.2. Note that by Lemma A.4.10 the assumed covariance matrix structure in Proposition A.3.3 holds. Also, the denition of ~ in (A.4.10) matches the denition in (A.4.25). Therefore the assump- tions of Lemma A.4.4 are satised, and the right side of the inequality follows since in Lemma A.4.4 we show that ~ 2 2. To see that the left side of the inequality holds, note that from Lemma A.4.10 we have 2 1y (n) = 2 Z 2 Z + 1 + 2 1 + 2 (n) =r 12 (n) where r := 2 Z 2 Z + 1 + 2 2 (0; 1): 248 So ~ 2 = 2(1 12 (n)) 3 2 2 1y (n) + 1 2 2 1y (n) 12 (n) + 1 = 2(1 12 (n)) 2 1y (n) 3 12 (n) 2 + 1 = 2(1 12 (n)) r 12 (n) 1 12 (n) 2 + 1 + 1 = 2(1 12 (n)) r 1 12 (n) 2 + 1 r 12 (n) 1 12 (n) 2 + 1 r 1 12 (n) 2 + 1r = 2(1 12 (n)) r 1 12 (n) 2 + 1 [1 12 (n)]r 1 12 (n) 2 + 1r = 2(1 12 (n)) r [1 12 (n)] 1 12 (n) 2 + 1 1 2 + 1r > 2(1 12 (n))(1r) = 2(1 12 (n)) 1 + 2 2 Z + 1 + 2 : 249 AppendixB PredictingRareEventsbyShrinkingTowardsProportionalOdds Appendix In Section B.1, we display summary statistics and additional gures for the observed mean squared errors (MSEs) for each method from the synthetic data experiments from Sections 3.4.1 and 3.4.2. We also briey investigate the eect of implementing PRESTO with a squared` 2 (ridge) penalty rather than an` 1 penalty in Section B.1.1. We provide the proofs of Theorems 3.2.3 and 3.2.2 in Section B.2. In Section B.3, we present synthetic data experiments and analysis justifying the validity of one of the assumptions of Theorem 3.2.4 in Sections B.3.1 and B.3.2, and we then prove Theorem 3.2.4. Theorems 3.2.3, 3.2.2, and 3.2.4 depend on Lemma B.2.1, which is stated at the beginning of Section B.2 and proven in Section B.4. We prove Theorem 3.3.1 in Section B.5. Finally, in Section B.6 we provide implementation details for estimatingPRESTO. 
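For reference when reading the simulation summaries in Section B.1 below, the following is a minimal base-R sketch of the kind of cumulative-logit data-generating process used in the synthetic experiments: K = 4 ordered categories (three intercepts), n = 2500 observations, and intercepts large enough that the highest category is rare, with a sparse set of non-proportional coefficient shifts. The coefficient values, the shifted features, and the shift size are illustrative placeholders, not the exact specification of Sections 3.4.1 and 3.4.2.

```r
## Minimal sketch of a cumulative-logit DGP with a rare highest category and
## sparse non-proportional shifts. Numeric values below are placeholders.
set.seed(1)
n <- 2500; p <- 10; K <- 4
alpha <- c(0, 3, 5)                      # intercepts for the K - 1 cumulative logits
X <- matrix(rnorm(n * p), n, p)
beta <- matrix(rnorm(p, sd = 0.3), p, K - 1)  # proportional-odds part: same column repeated
shifted <- 1:floor(p / 3)                # a sparse subset of features whose effect
beta[shifted, K - 1] <- beta[shifted, K - 1] + 0.5  # differs for the last cumulative logit

## P(y <= k | x) for k = 1, ..., K - 1; monotonicity is enforced here for
## simplicity (just one way to keep class probabilities valid in a sketch).
cumprob <- plogis(sweep(X %*% beta, 2, alpha, "+"))
cumprob <- cbind(t(apply(cumprob, 1, cummax)), 1)
classprob <- cbind(cumprob[, 1], t(apply(cumprob, 1, diff)))
y <- apply(classprob, 1, function(pr) sample.int(K, 1, prob = pr))
mean(y == K)                             # proportion of observations in the rare class
```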
B.1 More Simulation Results

For more results from the synthetic experiments, see Tables B.1, B.2, and B.3, along with Figures B.1, B.2, B.3, B.4, and B.5.

Table B.1: Similar to Table 3.1; calculated p-values for one-tailed paired t-tests for sparse differences simulation setting of Section 3.4.1 (statistically significant p-values indicate better performance for PRESTO).

Rare Prop.   Sparsity   Logit p-value   PO p-value
1%           1/3        1.69e-33        6.42e-41
1.17%        1/2        1.61e-15        2.78e-66
0.61%        1/3        5.19e-74        4.21e-19
0.71%        1/2        8.68e-48        3.38e-35
0.37%        1/3        3.08e-61        0.00165
0.43%        1/2        3.75e-64        2.57e-11

Table B.2: Means and standard errors of empirical MSEs for each method in each of three intercept settings in the sparse differences synthetic experiment setting of Section 3.4.1.

Rare Class Proportion   Sparsity   PRESTO               Logistic Regression   Proportional Odds
1%                      1/3        6.05e-05 (2.1e-06)   9.38e-05 (2.5e-06)    8.62e-05 (3.1e-06)
1.17%                   1/2        9.87e-05 (2.9e-06)   1.25e-04 (3.3e-06)    1.66e-04 (5.5e-06)
0.61%                   1/3        3.03e-05 (1.1e-06)   6.90e-05 (2.1e-06)    3.64e-05 (1.4e-06)
0.71%                   1/2        5.22e-05 (1.9e-06)   8.89e-05 (2.5e-06)    7.50e-05 (3e-06)
0.37%                   1/3        1.40e-05 (6e-07)     5.66e-05 (2.4e-06)    1.49e-05 (6.1e-07)
0.43%                   1/2        2.63e-05 (1e-06)     7.21e-05 (2.7e-06)    3.17e-05 (1.4e-06)

[Figure B.1: boxplots of rare probability MSE for PRESTO, Logit, and PO; panels for rare proportions 1.1664% and 0.4348%.]
Figure B.1: MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/3. (These plots are for the two intercept settings that weren't shown in the main text for the sparsity setting of 1/3.)

Table B.3: Means and standard errors of empirical MSEs for each method in each of four intercept settings in the uniform differences synthetic experiment setting of Section 3.4.2.

Rare Class Proportion   PRESTO               Logistic Regression   Proportional Odds
1.6%                    1.13e-04 (2.7e-06)   1.35e-04 (2.9e-06)    1.85e-04 (5.2e-06)
0.99%                   5.82e-05 (1.7e-06)   9.25e-05 (2.3e-06)    8.53e-05 (2.6e-06)
0.62%                   2.86e-05 (8.3e-07)   6.51e-05 (1.7e-06)    3.43e-05 (9.8e-07)
0.36%                   1.33e-05 (4.4e-07)   5.36e-05 (2.2e-06)    1.43e-05 (5e-07)

[Figure B.2: rare probability MSE (rare proportion 0.614%) and MSE ratio (PRESTO/Other) panels for rare proportions 0.614%, 0.37%, and 0.996%.]
Figure B.2: Same as Figure 3.1, but for the simulations with sparsity 1/2.

[Figure B.3: boxplots of rare probability MSE for PRESTO, Logit, and PO; panels for rare proportions 0.996% and 0.37%.]
Figure B.3: Same as Figure B.1, but for the simulations with sparsity 1/2.

[Figure B.4: boxplots of rare probability MSE for PRESTO, Logit, and PO; panels for rare proportions 0.994% and 0.364%.]
Figure B.4: MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in uniform differences synthetic experiment setting of Section 3.4.2. (These plots are for two of the intercept settings that weren't shown in the main text.)

[Figure B.5: rare probability MSE and MSE ratio (PRESTO/Other) panels for rare proportion 1.598%.]
Figure B.5: MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in uniform differences synthetic experiment setting of Section 3.4.2 for intercept setting of (0, 2.5, 4.5).
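The quantities in the tables of this appendix are per-replicate mean squared errors of the estimated rare-class probabilities, summarized across replicates, together with one-tailed paired t-tests comparing PRESTO against each competitor. As a rough sketch of that evaluation in base R (with hypothetical inputs: `prob_hat` and `true_prob` are length-n vectors of estimated and true rare-class probabilities for one replicate, and `mse_presto`, `mse_other` are vectors of per-replicate MSEs collected across the 700 replicates):

```r
## Per-replicate MSE of the estimated rare-class probabilities (hypothetical inputs).
mse_one_rep <- function(prob_hat, true_prob) mean((prob_hat - true_prob)^2)

## Mean and standard error of the per-replicate MSEs, as summarized in the tables.
summarize_mse <- function(mse) c(mean = mean(mse), se = sd(mse) / sqrt(length(mse)))

## One-tailed paired t-test of the alternative that PRESTO has the lower MSE,
## as reported in the p-value tables.
paired_pvalue <- function(mse_presto, mse_other) {
  t.test(mse_presto, mse_other, paired = TRUE, alternative = "less")$p.value
}
```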
B.1.1 Ridge PRESTO

We briefly investigate the effect of implementing PRESTO with a ridge penalty instead of an $\ell_1$ penalty, similarly to proposals by Tutz and Gertheiss [175, Section 4.2.2] and Ugba, Mörlein, and Gertheiss [176, Equation 8],
$$\lambda_n \left( \sum_{j=1}^{p} \beta_{j1}^2 + \sum_{j=1}^{p} \sum_{k=2}^{K-1} \left( \beta_{jk} - \beta_{j,k-1} \right)^2 \right).$$
We implement this method ("PRESTO_L2") in the sparse differences synthetic data experiment of Section 3.4.1 on the same simulated data that was used for the other methods in the intercept setting (0, 3, 5) for both sparsity levels. The implementation is identical to PRESTO in every way except for the ridge penalty—the method is implemented using our modification of the ordinalNet R package and the tuning parameter is selected in the same way.

Figures B.6 and B.7 display the results. (The results for all methods but PRESTO_L2 are identical to previous plots and are only displayed for reference.) We also present the means and standard deviations of the MSEs for each method in these settings in Table B.4, and p-values for one-tailed paired t-tests of the alternative hypothesis that PRESTO has a lower MSE than the competitor methods in Table B.5. We see that in practice, PRESTO and PRESTO_L2 seem to perform similarly in our setting, though the t-tests show that PRESTO does outperform PRESTO_L2 at a 5% significance level in both settings. We might expect PRESTO to better outperform PRESTO_L2 in settings where p/n is larger and where the sparsity is lower. We also note it is not clear if PRESTO_L2 enjoys a high-dimensional consistency guarantee similar to Theorem 3.2.4.

[Figure B.6: rare probability MSE for PRESTO, PRESTO_L2, Logit, and PO, and MSE ratio (PRESTO/Other) panels, for rare proportion 0.996%.]
Figure B.6: MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/3 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.

B.2 Statement of Lemma B.2.1 and Proofs of Theorems 3.2.2 and 3.2.3

In Section B.2.1 we state Lemma B.2.1, and we prove Theorems 3.2.2 and 3.2.3 in Section B.2.2.

[Figure B.7: rare probability MSE for PRESTO, PRESTO_L2, Logit, and PO, and MSE ratio (PRESTO/Other) panels, for rare proportion 1.166%.]
Figure B.7: MSE of predicted rare class probabilities for each method across all n = 2500 observations, across 700 simulations, in sparse differences synthetic experiment setting of Section 3.4.1 with sparsity 1/2 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.

Table B.4: Means and standard errors of empirical MSEs for each method in each of three intercept settings in the sparse differences synthetic experiment setting of Section 3.4.1 for intercept setting of (0, 3, 5), with PRESTO_L2 implemented as well, as described in Section B.1.1.
Prop.
Sparsity PRESTO Logit PO PRESTO_L2 1% 1=3 6.05e-05 (2.1e-06) 9.38e-05 (2.5e-06) 8.62e-05 (3.1e-06) 6.25e-05 (2.4e-06) 1.17% 1=2 9.87e-05 (2.9e-06) 1.25e-04 (3.3e-06) 1.66e-04 (5.5e-06) 1.17e-04 (4.2e-06) Table B.5: Calculated p-values for one-tailed paired t-tests for sparse dierences simulation setting of Section 3.4.1 for intercept setting of (0; 3; 5), with PRESTO_L2 implemented as well, as described in Section B.1.1. (Statistically signicantp-values indicate better performance for PRESTO). Rare Class Proportion Sparsity Logistic Regression Proportional Odds PRESTO_L2 1% 1=3 1.69e-33 6.42e-41 0.0358 1.17% 1=2 1.61e-15 2.78e-66 1.51e-14 256 B.2.1 StatementofLemmaB.2.1 Theorems 3.2.2, 3.2.3, and 3.2.4 relate to the asymptotic covariance matrices of the maximum likelihood estimators of the parameters of the proportional odds and logistic regression models. Under mild regularity conditions, the asymptotic covariance matrix of any maximum likelihood estimator (when scaled by p n) is known to be the inverse of the Fisher information matrix E @ 2 @ > L() ; where are the parameters estimated by the model andL() is the log likelihood [156, Section 4.2.2]. In the proof of Lemma B.2.1, we calculate these Fisher information matrices for the proportional odds and logistic regression models and verify the needed regularity conditions. Lemma B.2.1. Assume that no class has probability 0 for anyx2S (equivalently, assume that all of the intercepts in the proportional odds model (3.1) are not equal, so 1 <:::< K1 ). Assume thatdF (x) has bounded support. 1. The Fisher information matrix for the maximum likelihood estimator of the proportional odds model (3.1) is I prop. odds (;) = 0 B B @ I prop. odds I prop. odds > I prop. odds I prop. odds 1 C C A 2R (K1+p)(K1+p) 257 where I prop. odds (;) = 0 B B B B B B B B B B B B B B B B B B B @ M 1 ~ M 2 0 0 0 ~ M 2 M 2 ~ M 3 0 0 0 ~ M 3 M 3 0 0 . . . . . . . . . . . . . . . . . . 0 0 0 M K2 ~ M K1 0 0 0 ~ M K1 M K1 1 C C C C C C C C C C C C C C C C C C C A ; (B.2.1) I prop. odds (;) = 0 B B B B B B B B B B @ J x 1 + ~ J x 2 J x 2 + ~ J x 3 . . . J x K1 + ~ J x K 1 C C C C C C C C C C A ; and (B.2.2) I prop. odds (;) = K X k=1 J xx > k + ~ J xx > k ; (B.2.3) where M k := Z [p k (x)(1p k (x))] 2 1 k (x) + 1 k+1 (x) dF (x); k2f1;:::;K 1g; (B.2.4) ~ M k := Z p k (x)(1p k (x))p k1 (x)(1p k1 (x)) 1 k (x) dF (x); k2f2;:::;K 1g (B.2.5) 258 and J k := Z k (x)p k (x)[1p k (x)]dF (x)2R; J x k := Z x k (x)p k (x)[1p k (x)]dF (x)2R p ; J xx > k := Z xx > k (x)p k (x)[1p k (x)]dF (x)2R pp ; ~ J k := Z k (x)p k1 (x)[1p k1 (x)]dF (x)2R; ~ J x k := Z x k (x)p k1 (x)[1p k1 (x)]dF (x)2R p ; and ~ J xx > k := Z xx > k (x)p k1 (x)[1p k1 (x)]dF (x)2R pp for allk2 [K]. 2. 
The Fisher information matrix for the maximum likelihood estimator of the logistic regression model predicting whether or not each observation is in class 1 is I logistic ( 1 ;) = 0 B B @ I logistic I logistic > I logistic I logistic 1 C C A =E h 1 (X)[1 1 (X)] ~ X ~ X > i 2R (p+1)(p+1) (B.2.6) where ~ X := 1 X (ann-vector of all ones followed byX) and I logistic ( 1 ;) =M logistic 1 ; (B.2.7) I logistic ( 1 ;) =J x; logistic 1 + ~ J x; logistic 2 ; and (B.2.8) I logistic ( 1 ;) =J xx > ; logistic 1 + ~ J xx > ; logistic 2 ; (B.2.9) 259 where we dene M logistic 1 := Z 1 (x)(1 1 (x))dF (x) (B.2.10) and J x; logistic 1 := Z x 1 (x) 2 [1 1 (x)]dF (x)2R p =J x 1 ; J xx > ; logistic 1 := Z xx > 1 (x) 2 [1 1 (x)]dF (x)2R pp =J xx > 1 ; ~ J x; logistic 2 := Z x 1 (x)[1 1 (x)] 2 dF (x)2R p ; and ~ J xx > ; logistic 2 := Z xx > 1 (x)[1 1 (x)] 2 dF (x)2R pp : 3. The information matricesI prop. odds (;) andI logistic ( 1 ;) are nite and positive denite, and the following convergences hold: p n ^ prop. odds 1 1 d !N 0; I prop. odds (;) 1 and p n ^ logistic 1 1 d !N 0; I logistic ( 1 ;) 1 : 260 Further, because these information matrices are symmetric and positive denite, by Observation 7.1.2 in Horn and Johnson [88] the principal submatrices I prop. odds :=I prop. odds (;); I prop. odds :=I prop. odds (;); and I logistic :=I logistic ( 1 ;) are all positive denite. Finally, we can characterize the nite-sample bias: for anyv2R p+1 , [(^ ; ^ > ) (;) > ]v =O(1=n): Proof. Provided in Section B.4. Remark B.2.2. Note thatJ K = 0 because 1p K (x) = 1P(y (x)Kjx) = 0 for allx, and similarly forJ x K andJ xx > K . Likewise, ~ J 1 = 0 becausep 0 (x) =P(y (x) 0jx) = 0 for allx, and similarly for ~ J x 1 and ~ J xx > 1 . We also take a moment to briey establish some identities we will use later. For anyk2f1;:::;K1g, J x k + ~ J x k+1 = Z x k (x)p k (x)[1p k (x)]dF (x) + Z x k+1 (x)p k (x)[1p k (x)]dF (x) = Z x [ k (x) + k+1 (x)]p k (x)[1p k (x)]dF (x): (B.2.11) Similarly, J xx > k + ~ J xx > k+1 = Z xx > [ k (x) + k+1 (x)]p k (x)[1p k (x)]dF (x); 261 so from (B.2.3) we have I prop. odds (;) = K X k=1 J xx > k + ~ J xx > k = ~ J xx > 1 + K1 X k=1 J xx > k + ~ J xx > k+1 +J xx > K = Z xx > 1 (x) p 0 (x) |{z} =P(y0jx)=0 [1p 0 (x)]dF (x) + K1 X k=1 J xx > k + ~ J xx > k+1 + Z xx > K (x)p K (x) 2 6 41 p K (x) | {z } =P(yKjx)=1 3 7 5 dF (x) = K1 X k=1 Z xx > [ k (x) + k+1 (x)]p k (x)[1p k (x)]dF (x): (B.2.12) B.2.2 ProofsofTheorems3.2.2and3.2.3 Equipped with the results of Lemma B.2.1, we proceed to prove Theorems 3.2.2 and 3.2.3. (Recall that for square matricesA andB of equal dimensionp, we sayAB ifBA is positive semidenite.) Proof of Theorem 3.2.2. 1. Note that the assumptions of Lemma B.2.1 are satised. Since the inverse of the asymptotic covariance matrix is E h (X)[1(X)] ~ X ~ X > i rare (1 rare )E h ~ X ~ X > i (where the second step is valid becauset7! t(1t) is monotone increasing int fort2 [0; 1=2]), by Corollary 7.7.4 in Horn and Johnson [88] the largest eigenvalue of the inverse of the asymptotic covariance matrix is no larger than rare (1 rare ) max . Therefore Asym:Cov ( p n ^ ; p n ^ ) = E h (X)[1(X)] ~ X ~ X > i 1 rare (1 rare )E h ~ X ~ X > i 1 262 has smallest eigenvalue at least 1=[ max rare (1 rare )], which is larger than 1=[ max rare ], again using Corollary 7.7.4 in Horn and Johnson [88]. 
So for anyv2R p+1 , by Theorem 5.1.8 in Lehmann [111] Asym:Var (^ ; ^ > )v =v > Asym:Cov ( p n ^ ; p n ^ ) v v > v max rare : Finally, since we have already shown that (^ ; ^ > ) is asymptotically unbiased, the asymptotic MSE is equal to this asymptotic variance: Asym:MSE((^ ; ^ > )v)) =E lim n!1 p n h (^ ; ^ > )v (; > )v i 2 =E lim n!1 p n h E h (^ ; ^ > )v i (^ ; ^ > )v i 2 + lim n!1 p n h (; > )vE[(^ ; ^ > )v i 2 = Asym:Var p n h (; > )v (^ ; ^ > )v i + 0 v > v max rare ; where in the second-to-last step we used [(^ ; ^ > ) (;) > ]v =O(1=n) from Lemma B.2.1. 2. Because (^ ; ^ )7! ^ (z) is dierentiable for allz2R p , by the delta method (Theorem 3.1 in [177]) p n [^ (z)(z)] d !N 0;(z) 2 [1(z)] 2 1;z > I logistic (;) 1 1;z > > 263 for anyz2R p . Therefore Asym:Var p n ^ (z) =(z) 2 [1(z)] 2 1;z > I logistic (;) 1 1;z > > (z) 2 [1(z)] 2 1;z > 2 2 min I logistic (;) 1 = (z) 2 [1(z)] 2 1;z > 2 2 kI logistic (;)k op (z) 2 [1(z)] 2 1;z > 2 2 rare (1 rare ) E h ~ X ~ X > i op () (z) 2 [1 rare ] 2 rare (1 rare ) max = (z) 2 [1 rare ] rare max ; where min () denotes the minimum eigenvalue of and () uses 1;z > 2 2 1 and(z) rare for allz2S. This yields Asym:Var p n (z) ^ n (z) (z) = 1 (z) 2 Asym:Var p n ^ (z) 1 rare rare max : Similarly to the previous result, ^ (z) is asymptotically unbiased, its nite sample bias (z) E[^ n (z)] isO(1=n) by standard maximum likelihood theory [45] since it is a maximum likelihood 264 estimator by the functional equivariance of maximum likelihood esimators, and its asymptotic MSE is equal to its asymptotic variance: Asym:MSE(^ (z)) =E " lim n!1 p n (z) ^ n (z) (z) 2 # =E " lim n!1 p n E[^ n (z)] ^ n (z) (z) 2 + lim n!1 p n (z)E[^ n (z)] (z) 2 # = Asym:Var p n (z) ^ n (z) (z) + 0 1 rare rare max : Proof of Theorem 3.2.3. 1. Again, the assumptions of Lemma B.2.1 are satised. Lemma B.2.1 shows that the asymptotic covariance matrix of the scaled maximum likelihood estimates of the parameters of logistic regression (a special case of the proportional odds model withK = 2 categories) is Asym:Cov p n ^ ; ^ > = I logistic (;) 1 = 0 B B @ I logistic I logistic > I logistic I logistic 1 C C A 1 ; so in the case that is known, we have Asym:Var p n ^ q = I logistic 1 = 1 I logistic : If is not known, then ifI logistic is positive denite (and therefore invertible) the formula for block matrix inversion yields Asym:Var p n ^ = 1 I logistic I logistic > I logistic 1 I logistic : (B.2.13) 265 We know thatI logistic is positive denite becauseI logistic (;) is nite and positive denite from Lemma B.2.1, so the principal submatrixI logistic is positive denite by Observation 7.1.2 in Horn and Johnson [88]. Further, since we know from Lemma B.2.1 that the covariance matrix of (^ ; ^ ) is nite and positive denite under our conditions, this also implies that 0 I logistic 1 I logistic <1: (B.2.14) Now we seek a lower bound for Asym:Var ( p n ^ ). We see from (B.2.13) that we can get such a bound by lower-bounding I logistic > I logistic 1 I logistic . 
Becauset7!t(1t) is upper-bounded by 1=4 for allt2 [0; 1], I logistic = Z xx > 2 (x)[1 2 (x)]dF (x) Z xx > 1 4 dF (x) = 1 4 E h XX > i : Then I logistic > I logistic 1 I logistic I logistic > 1 4 E h XX > i 1 I logistic 4 min E h XX > i 1 I logistic 2 2 = 4 I logistic 2 2 E XX > op 4 2 min (1 min ) 2 kE [X]k 2 2 max =: ; (B.2.15) 266 where min () denotes the minimum eigenvalue of and the last step follows because I logistic 2 = Z x 2 (x)[1 2 (x)]dF (x) 2 Z x min (1 min )dF (x) 2 = min (1 min )kE [X]k 2 (where we used the fact thatX has support only over nonnegative numbers). Therefore (B.2.13) and (B.2.15) yield Asym:Var p n ^ 1 Asym:Var ( p n ^ q ) 1 : (B.2.16) The remainder of the argument is similar to the end of the proof of Theorem 3.2.2: the asymptotic unbiasedness andO(1=n) nite-sample bias of these estimators yields Asym:MSE(^ ) = Asym:Var p n [ ^ ] and Asym:MSE(^ q ) = Asym:Var p n [ ^ q ] : Then from (B.2.14) we know that I logistic > I logistic > I logistic 1 I logistic 4 2 min (1 min ) 2 kE [X]k 2 2 max : (B.2.17) Making the appropriate substitutions into (B.2.17) yields 1 Asym:MSE(^ q ) > 0; 267 and then substituting into (B.2.16) yields Asym:MSE(^ ) 1 Asym:MSE(^ q ) 1 = Asym:MSE(^ q ) 1 Asym:MSE(^ q ) () Asym:MSE(^ q ) (1 + Asym:MSE(^ q )) () Asym:MSE(^ ) Asym:MSE(^ q ) [Asym:MSE(^ q )] 2 ; where in () we used the inequalityc=(1ct)c(1 +ct) for anyc> 0,t< 1 c . 2. Because (^ ; ^ )7! ^ (z) is dierentiable for allz2R p , by the delta method (Theorem 3.1 in [177]) p n [^ (z)(z)] d !N 0;(z) 2 [1(z)] 2 1;z > I logistic (;) 1 1;z > > for anyz2R p , and similarly p n [^ q (z)(z)] d !N 0; (z) 2 [1(z)] 2 I logistic ! : We can nd I logistic (;) 1 using the formula for block matrix inversion if D :=I logistic I logistic I logistic > I logistic is positive denite (and therefore invertible; note thatD is symmetric) andI logistic > 0. We have I logistic = Z 2 (x)[1 2 (x)]dF (x) Z min [1 min ]dF (x) = min [1 min ]> 0; 268 and by Theorem 1.12 in Zhang [202], we then knowD is positive denite (and invertible) since I logistic (;) is by Lemma B.2.1 andI logistic > 0. Let D max :=kDk op be the largest eigenvalue of D; then 1= D max is the smallest eigenvalue ofD 1 . Then for anyz2R p , we have 1 z > I logistic (;) 1 0 B B @ 1 z 1 C C A = 1 z > 0 B B B @ 1 I logistic + 1 I logistic 2 I logistic > D 1 I logistic 1 I logistic I logistic > D 1 1 I logistic D 1 I logistic D 1 1 C C C A 0 B B @ 1 z 1 C C A = 1 I logistic + 1 I logistic 2 I logistic > D 1 I logistic +z > D 1 z 2 1 I logistic I logistic D 1 z (a) = 1 I logistic + 1 I logistic 2 D 1=2 I logistic I logistic z 2 2 (b) 1 I logistic + 1 I logistic 2 D max I logistic I logistic z 2 2 (B.2.18) 1 I logistic ; where (a) follows from 1 I logistic 2 D 1=2 I logistic I logistic z 2 2 = 1 I logistic 2 I logistic > D 1 I logistic + I logistic 2 z > D 1 z 2I logistic I logistic D 1 z 269 and (b) uses the fact that 1= p D max is the smallest eigenvalue of D 1=2 . If we can show that I logistic I logistic z 2 6= 0, then (B.2.18) is enough to establish the strict inequality in the result. Using (B.2.6), note that 0 =I logistic I logistic z = Z x(x)[1(x)]dF (x)z Z (x)[1(x)]dF (x) () z = R x(x)[1(x)]dF (x) R (x)[1(x)]dF (x) ; so for allz6=E [X(X)[1(X)]]=E [(X)[1(X)]], we have I logistic I logistic z 2 2 > 0. 
So for any $z \in \mathbb{R}^p \setminus \{ \mathbb{E}[X\pi(X)[1-\pi(X)]] / \mathbb{E}[\pi(X)[1-\pi(X)]] \}$, we have
\[
\pi(z)^2[1-\pi(z)]^2
\begin{pmatrix} 1 \\ z \end{pmatrix}^{\!\top}
\begin{pmatrix}
I^{\text{logistic}}_{\alpha\alpha} & I^{\text{logistic}}_{\alpha\beta} \\[2pt]
I^{\text{logistic}\,\top}_{\alpha\beta} & I^{\text{logistic}}_{\beta\beta}
\end{pmatrix}^{-1}
\begin{pmatrix} 1 \\ z \end{pmatrix}
>
\frac{\pi(z)^2[1-\pi(z)]^2}{I^{\text{logistic}}_{\alpha\alpha}}
\iff
\operatorname{Asym.Var}\!\left(\sqrt{n}\,\hat\pi_q(z)\right) < \operatorname{Asym.Var}\!\left(\sqrt{n}\,\hat\pi(z)\right).
\]

B.3 Investigation of Assumption (3.5) and Proof of Theorem 3.2.4

Before we prove Theorem 3.2.4, we begin by investigating the plausibility of Assumption (3.5). We start by investigating whether $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ is bounded away from 0 in Section B.3.1. Then in Section B.3.2 we directly investigate whether Assumption (3.5) seems to hold in a variety of contexts. We prove Theorem 3.2.4 in Section B.3.3, and we prove the supporting result Lemma B.3.1 in Section B.3.4.

B.3.1 Investigating Whether $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ is Bounded Away From Zero

To investigate the plausibility of the assumption that $I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}}$ is positive definite (and that the upper bound for $\rho_{\text{rare}}$ in (3.5) can hold) empirically, we perform two simulation studies, with setups similar to those of our synthetic data experiments in Section 3.4 of the paper.

We repeat Simulation Study A 25 times. Using $n = 10^6$, $p = 10$, and $K = 3$, we generate $X \in \mathbb{R}^{n \times p}$ with $X_{ij} \sim \text{Uniform}(-1, 1)$ iid for all $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$. We then generate $y_i \in \{1, 2, 3\}$ from $x_i$ for each $i \in \{1, \dots, n\}$ according to the proportional odds model (3.1), using $\beta = (1, \dots, 1)^\top$ and $\alpha = (0, 20)$ (so that class 3 is very rare, with $\rho_{\text{rare}} \approx 4.54 \times 10^{-5}$). We estimate $I_{\beta\beta}$, $I_{\alpha_1\beta}$, and $I_{\alpha_1\alpha_1}$ using empirical ("plug-in") estimates of the expressions in (B.2.1), (B.2.2), and (B.2.3); for example, using (B.2.12) we estimated $I_{\beta\beta}$ by
\[
\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K-1} x_i x_i^\top \left[ \pi_k(x_i) + \pi_{k+1}(x_i) \right] p_k(x_i) \left[ 1 - p_k(x_i) \right].
\]
(Note that we used the true $\pi_k(\cdot)$ and $p_k(\cdot)$ functions in these estimates.) Finally, we use these estimated quantities to estimate $I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}}$, and we calculate the minimum eigenvalue of this estimated matrix.

Across all 25 simulations, the sample mean of this minimum eigenvalue is 0.02361, and the minimum is 0.02359. The standard error is $2.94 \times 10^{-6}$, and the 95% confidence interval for the mean of the minimum eigenvalue is $(0.02360, 0.02362)$. See Figure B.8 for a boxplot of the 25 estimated minimum eigenvalues. These results seem to suggest that the assumption that $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right) > 0$ is reasonable under the assumptions of Theorem 3.2.4 for $X$ with this iid uniform distribution.

Next, we conduct another simulation study to investigate whether $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ is bounded away from 0 if the features in $X$ are correlated, and over a wider range of $\rho_{\text{rare}}$.

[Figure B.8: Boxplot of the estimated minimum eigenvalues of $I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1}$ in Simulation Study A as described in Section B.3.1.]

In Simulation Study B, we generate matrices of standard multivariate Gaussian data truncated entrywise between $-3$ and 3, with $n = 10^6$ and $p = 10$. We generate matrices with covariance matrices
\[
\begin{pmatrix}
1 & \rho & \cdots & \rho \\
\rho & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \rho \\
\rho & \cdots & \rho & 1
\end{pmatrix}
\]
for $\rho \in \{0, 0.25, 0.5, 0.75\}$. We then generate ordinal responses from the proportional odds model with $K = 3$ and $\beta = (1, 1, \dots, 1)^\top$. The intercept for the first separating hyperplane is 0 (so that the expected class probabilities for classes 1 and 2 are both close to $1/2$), and we choose three different values for the second intercept by analytically solving for intercepts such that $\rho_{\text{rare}} = \sup_{x \in [-3, 3]^p} \{\pi_3(x)\}$ equals $10^{-5}$, $10^{-6}$, or $10^{-7}$. To see the behavior as $\rho_{\text{rare}}$ vanishes, we also investigate a fourth setting where all of the $\pi_3(x)$ terms in the Fisher information are simply set to 0 (and $\pi_2(x)$ is coerced to equal $1 - \pi_1(x)$ pointwise), as if the intercept for the second separating hyperplane were infinite. In each setting, we generate 7 random matrices and estimate $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ in the same way as in Simulation Study A.

The results are displayed in Figures B.9, B.10, B.11, and B.12 for correlations of 0, 0.25, 0.5, and 0.75, respectively. In each plot the rare class probability is displayed on the horizontal axis on a log scale (with a line break after the far left result for the asymptotic case with $\rho_{\text{rare}} = 0$), and the average minimum eigenvalue for each of the seven random matrices is displayed on the vertical axis. We see that the means of the minimum eigenvalues are well above 0; we also note that in every generated matrix in every setting, the minimum eigenvalue was strictly positive.

[Figure B.9: Plot of the average estimated minimum eigenvalues of $I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1}$ in Simulation Study B as described in Section B.3.1 with each column of $X$ having correlation 0 with all of the others.]

[Figure B.10: Plot of the average estimated minimum eigenvalues of $I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1}$ in Simulation Study B as described in Section B.3.1 with each column of $X$ having correlation 0.25 with all of the others.]

[Figure B.11: Plot of the average estimated minimum eigenvalues of $I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1}$ in Simulation Study B as described in Section B.3.1 with each column of $X$ having correlation 0.5 with all of the others.]

[Figure B.12: Plot of the average estimated minimum eigenvalues of $I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1}$ in Simulation Study B as described in Section B.3.1 with each column of $X$ having correlation 0.75 with all of the others.]

Finally, we also investigate analytically whether $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ is bounded away from 0 as $\rho_{\text{rare}}$ becomes arbitrarily small. To see that this seems reasonable, note that from (B.2.12) we have
\begin{align*}
I_{\beta\beta} &= \sum_{k=1}^{2} \int x x^\top [\pi_k(x) + \pi_{k+1}(x)]\, p_k(x)[1 - p_k(x)]\, dF(x) \\
&= \int x x^\top \Big( [\pi_1(x) + \pi_2(x)]\, \pi_1(x)[1 - \pi_1(x)] + [\pi_2(x) + \pi_3(x)]\, \pi_3(x)[1 - \pi_3(x)] \Big)\, dF(x) \\
&= \int x x^\top \Big( [1 - \pi_3(x)]\, \pi_1(x)[1 - \pi_1(x)] + [1 - \pi_1(x)]\, \pi_3(x)[1 - \pi_3(x)] \Big)\, dF(x) \\
&= \int x x^\top [1 - \pi_1(x)][1 - \pi_3(x)] \big( \pi_1(x) + \pi_3(x) \big)\, dF(x) \\
&= \int x x^\top [\pi_2(x) + \pi_3(x)][1 - \pi_3(x)][\pi_1(x) + \pi_3(x)]\, dF(x),
\end{align*}
which is non-vanishing in $\rho_{\text{rare}}$. (In particular, there seems to be no reason to suspect that the eigenvalues of $I_{\beta\beta}$ change drastically for, say, $\pi_3(x) \approx 10^{-7}$ for all $x \in \mathcal{S}$, as in Simulation Study B, versus $\pi_3(x) \approx 10^{-20}$ for all $x \in \mathcal{S}$.) Further, (B.3.7) below shows that
\[
\left\| \frac{I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right\|_{\text{op}} = \frac{\| I_{\alpha_1\beta} \|_2^2}{I_{\alpha_1\alpha_1}}
\]
(where $\|\cdot\|_{\text{op}}$ is the operator norm) is bounded from above by a constant not depending on $\rho_{\text{rare}}$. This also suggests that $\lambda_{\min}\!\left( I_{\beta\beta} - \frac{2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top}{I_{\alpha_1\alpha_1}} \right)$ should be insensitive to $\rho_{\text{rare}}$ becoming vanishingly small. Taken together, this suggests that Assumption (B.3.1) does not become more implausible as $\rho_{\text{rare}}$ becomes arbitrarily small.
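To make the plug-in calculation above concrete, here is a minimal R sketch (our own illustration, not code from the dissertation's replication repository; all object names are ours) that estimates $\lambda_{\min}(I_{\beta\beta} - 2 I_{\alpha_1\beta} I_{\alpha_1\beta}^\top / I_{\alpha_1\alpha_1})$ under the Simulation Study A design. It uses $n = 10^5$ draws rather than $10^6$ so that it runs quickly, so the result should only be close to (not identical to) the values reported above.

```r
# Plug-in estimate of lambda_min(I_bb - 2 I_a1b I_a1b^T / I_a1a1) for the
# Simulation Study A setup (K = 3, p = 10, beta = 1_p, alpha = (0, 20)).
set.seed(1)
n <- 1e5; p <- 10
beta <- rep(1, p); alpha <- c(0, 20)

X   <- matrix(runif(n * p, -1, 1), n, p)
eta <- as.vector(X %*% beta)
p1  <- plogis(alpha[1] + eta)                      # p_1(x) = P(y <= 1 | x)
p2  <- plogis(alpha[2] + eta)                      # p_2(x) = P(y <= 2 | x)
pi1 <- p1
pi2 <- p2 - p1
pi3 <- plogis(alpha[2] + eta, lower.tail = FALSE)  # pi_3(x) = 1 - p_2(x), computed stably

# I_{alpha_1 alpha_1} = M_1, plug-in version of (B.2.4)
M1 <- mean((p1 * (1 - p1))^2 * (1 / pi1 + 1 / pi2))
# I_{alpha_1 beta}, plug-in version of (B.2.2) via the identity (B.2.11)
I_a1b <- colMeans(X * ((pi1 + pi2) * p1 * (1 - p1)))
# I_{beta beta}, plug-in version of (B.2.12)
w    <- (pi1 + pi2) * p1 * (1 - p1) + (pi2 + pi3) * p2 * pi3
I_bb <- crossprod(X, X * w) / n

target <- I_bb - 2 * outer(I_a1b, I_a1b) / M1
min(eigen(target, symmetric = TRUE)$values)
```

Run with the full $n = 10^6$ and repeated 25 times, this is essentially the computation summarized above (mean minimum eigenvalue $\approx 0.0236$); the same skeleton extends to Simulation Study B by drawing correlated, truncated-Gaussian columns for $X$ instead of iid uniform ones.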
276 B.3.2 InvestigatingWhetherAssumption (3.5)isPlausible Having developed evidence that min I 2 I 1 I > 1 I 1 1 is bounded away from 0, we now directly inves- tigate the plausibility of the assumption rare min I 2 I 1 I > 1 I 1 1 3M 2 (2 +M) : (B.3.1) First we examine this in the context of Simulation Study A. Note thatM =k(1;:::; 1)k 2 = p p,S = [1; 1] p and sup x2S f 3 (x)g = 3 (x ) = 1 1 1 + expf(20 + P p j=1 x j )g ; forx = (1;:::;1). So all of the quantities in (B.3.1) are known except the minimum eigenvalue, which we were able to estimate with seemingly high precision. It turns out that (B.3.1) is satised even when we use the minimum value of min across all 25 simulations as our estimate; in this case, the left side of (B.3.1) is 4:54 10 5 and the right side is 1:52 10 4 . Next we investigate (B.3.1) in the context of Simulation Study B. In each of the four correlation settings, we calculate the proportion of generated random matrices in which (B.3.1) is satised at each rarity level (according to our minimum eigenvalue estimates). The results are displayed in Figures B.13, B.14, B.15, and B.16. We see that (B.3.1) is not satised for any of the random matrices where rare = 10 5 , but it is satised for all of the matrices with all of the remaining values of rare regardless of correlation of the features. This suggests that assumption (3.5) is indeed reasonable for rare small enough. B.3.3 ProofofTheorem3.2.4 Before we proceed with the proof of Theorem 3.2.4, we state a lemma with inequalities we will use. 277 0.00 0.25 0.50 0.75 1.00 1e−07 1e−06 1e−05 Rare Class Probability Mean Required Condition Correlation = 0 Figure B.13: Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satised in the setting where each column ofX has correlation 0 with all of the others. 0.00 0.25 0.50 0.75 1.00 1e−07 1e−06 1e−05 Rare Class Probability Mean Required Condition Correlation = 0.25 Figure B.14: Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satised in the setting where each column ofX has correlation 0.25 with all of the others. 278 0.00 0.25 0.50 0.75 1.00 1e−07 1e−06 1e−05 Rare Class Probability Mean Required Condition Correlation = 0.5 Figure B.15: Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satised in the setting where each column ofX has correlation 0.5 with all of the others. 0.00 0.25 0.50 0.75 1.00 1e−07 1e−06 1e−05 Rare Class Probability Mean Required Condition Correlation = 0.75 Figure B.16: Plot of the proportion of random matrices in Simulation Study B (as described in Section B.3.1) in which (B.3.1) is satised in the setting where each column ofX has correlation 0.75 with all of the others. 279 LemmaB.3.1. The following inequalities hold under the assumptions of Theorem 3.2.4: I 2 2 rare I 1 1 1 1=4 2 ; (B.3.2) I 2 2 rare 1 + rare 1=2 ; (B.3.3) jI 1 2 jI 2 2 ; (B.3.4) kI 2 k 2 MI 2 2 ; (B.3.5) kI 2 k 2 2 I 2 2 M 2 rare 1 + rare 1=2 ; and (B.3.6) kI 1 k 2 2 I 1 1 M 2 4 : (B.3.7) Proof. Provided immediately after the proof of Theorem 3.2.4, in Section B.3.4. By an argument analogous to the one used at the end of the proof of Theorem 3.2.2, it is enough to upper-bound Cov lim n!1 p n h ^ prop. odds i op = Asym:Cov p n ^ prop. odds op : Using Lemma B.2.1 and the block matrix inversion formula, Asym:Cov p n ^ prop. odds = I I > I 1 I 1 =) Asym:Cov p n ^ prop. 
odds op = 1 min I I > I 1 I ; (B.3.8) where min () is the minimum eigenvalue of andkk op is the operator norm.I is a 2 2 matrix, so I 1 = 1 det (I ) 0 B B @ I 2 2 I 1 2 I 2 1 I 1 1 1 C C A ; 280 and I > I 1 I = 1 det (I ) I 1 I 2 0 B B @ I 2 2 I 1 2 I 2 1 I 1 1 1 C C A 0 B B @ I 1 I 2 1 C C A = 1 det (I ) I 2 2 I 1 I > 1 +I 1 1 I 2 I > 2 I 1 2 I 1 I > 2 I 1 2 I 2 I > 1 = 1 det (I ) I 2 2 I 1 I > 1 +I 1 1 I 2 I > 2 +jI 1 2 jI 1 I > 2 +jI 1 2 jI 2 I > 1 ; (B.3.9) where in the last step we used thatI 1 2 = ~ M 2 < 0, which is clear from (B.2.2) and (B.2.5). Because we know from Lemma B.2.1 thatI (and therefore alsoI 1 ) is positive denite, by Observation 7.1.6 in Horn and Johnson [88]I > I 1 I is positive denite as well. Therefore we can use an upper bound on 1= det (I ) to upper boundI > I 1 I . We have det (I ) =I 1 1 I 2 2 I 2 1 2 (a) I 1 1 I 2 2 I 2 2 2 =I 2 2 (I 1 1 I 2 2 ) (b) I 2 2 I 1 1 rare I 1 1 1 1=4 2 = 1 rare 1=4 2 I 1 1 I 2 2 (c) 1 2 I 1 1 I 2 2 ; 281 where in (a) we used (B.3.4), in (b) we used (B.3.2), and (c) uses that from the upper bound (3.5) for rare we have rare 1 2 1 2 1 2 + = 1 2 1 4 2 () rare 1=4 2 1 2 () 1 rare 1=4 2 1 2 : Now we can boundI > I 1 I using (B.3.9): I > I 1 I 2 I 1 1 I 2 2 I 2 2 I 1 I > 1 +I 1 1 I 2 I > 2 +jI 1 2 jI 1 I > 2 +jI 1 2 jI 2 I > 1 = 2 I 1 I > 1 I 1 1 + I 2 I > 2 I 2 2 + jI 1 2 j I 1 1 I 2 2 I 1 I > 2 +I 2 I > 1 ! ; and I I > I 1 I I 2 I 1 I > 1 I 1 1 2 I 2 I > 2 I 2 2 + jI 1 2 j I 1 1 I 2 2 I 1 I > 2 +I 2 I > 1 ! Note thatI ,I 1 I > 1 ,I 2 I > 2 , andI 1 I > 2 +I 2 I > 1 are all symmetric. By Weyl’s theorem, it holds that for symmetric matrices with matching dimensionsA andB, min (AB) min (A)kBk op ; because for anyv (AB)v =AvBv min (A)vkBk op v = min (A)kBk op v: 282 So min I I > I 1 I min I 2 I 1 I > 1 I 1 1 ! 2 I 2 2 I 2 I > 2 op 2 jI 1 2 j I 1 1 I 2 2 I 1 I > 2 +I 2 I > 1 op (a) min I 2 I 1 I > 1 I 1 1 ! 2 I 2 2 I 2 I > 2 op 2 jI 1 2 j I 1 1 I 2 2 I 1 I > 2 op + I 2 I > 1 op (b) min I 2 I 1 I > 1 I 1 1 ! 2 kI 2 k 2 2 I 2 2 2 jI 1 2 j I 1 1 I 2 2 2kI 1 k 2 kI 2 k 2 (c) min I 2 I 1 I > 1 I 1 1 ! 2M 2 rare 1 + rare 1=2 2 I 2 2 I 1 1 I 2 2 2kI 1 k 2 MI 2 2 = min I 2 I 1 I > 1 I 1 1 ! 2M M rare 1 + rare 1=2 + 2 I 2 2 I 1 1 kI 1 k 2 (d) min I 2 I 1 I > 1 I 1 1 ! 2M M rare 1 + rare 1=2 + 2 1 I 1 1 kI 1 k 2 rare 1 + rare 1=2 = min I 2 I 1 I > 1 I 1 1 ! 2M rare 1 + rare 1=2 M + 2 kI 1 k 2 I 1 1 (e) min I 2 I 1 I > 1 I 1 1 ! 2M rare 1 + 1 2 M + M 2 2 (f) 1 2 min I 2 I 1 I > 1 I 1 1 ! ; where in (a) we used the triangle inequality, in (b) we used that for anya;b2R n it holds thatkab > k op = jb > ajkak 2 kbk 2 (note thatab > is rank one with eigenvectora) as well as the triangle inequality, in (c) we used (B.3.4), (B.3.5), and (B.3.6), in (d) we used (B.3.3), (e) follows from (B.3.7) and rare 1=2 1=2(1=2 )(1=2 + ) 1=2 = 1 2 1 2 + 1 2 1 2 + 1 2 1 2 283 (since 1=2), and in (f) we used our assumptions that min I 2 I 1 I > 1 I 1 1 ! > 0 and rare min I 2 I 1 I > 1 I 1 1 3M 2 (2 +M) () 3M rare M + M 2 2 1 2 min I 2 I 1 I > 1 I 1 1 ! : Substituting this into (B.3.8) shows that Cov lim n!1 p n h ^ prop. odds i op 2 min I 2 I 1 I > 1 I 1 1 : B.3.4 ProofofLemmaB.3.1 We omit x in integrals, when it is clear (e.g., 1 stands for 1 (x)). 
Proofof (B.3.2): From (B.2.2) and (B.2.4), using 1 (x) + 2 (x) + 3 (x) = 1 we have 284 I 2 2 =M 2 = Z [ 3 (x)(1 3 (x))] 2 1 2 (x) + 1 3 (x) dF (x) = Z [ 3 (x)(1 3 (x))] 2 2 (x) + 3 (x) 2 (x) 3 (x) dF (x) = Z 3 (x)[1 3 (x)] 2 1 1 (x) 2 (x) dF (x) (B.3.10) rare Z (1 3 ) 2 (1 1 )= 2 dF = rare Z 1 (1 1 ) 2 (1 3 ) 2 2 1 1 (1 1 ) dF (a) rare Z 1 (1 1 ) 2 1 3 2 1 (1=2 )(1=2 + ) dF = rare Z 2 1 (1 1 ) 2 1 + 2 1 2 1 1=4 2 = rare Z 2 1 (1 1 ) 2 1 1 + 1 2 1 1=4 2 = rare M 1 1 1=4 2 = rare I 1 1 1 1=4 2 ; where in (a) we used the fact thatt7! 1=[t(1t)] is nonincreasing int fort2 (0; 1=2], so it is largest whent is small as possible, and by assumption inf x2S f 1 ^ 1 1 g 1=2 . Proofof (B.3.3): Using (B.3.10) we have I 2 2 = Z 3 (x)[1 3 (x)] 2 2 (x) + 3 (x) 2 (x) dF (x) = Z 3 (1 3 ) 2 (1 + 3 = 2 )dF rare Z (1 0) 2 1 + rare 1=2 dF: 285 Proofof (B.3.4): Using (B.3.10) along with (B.2.1) and (B.2.5) we have I 2 2 = Z 3 (x)[1 3 (x)] 2 1 1 (x) 2 (x) dF (x) = Z 3 (x)[1 3 (x)][1 1 (x)] 1 (x) + 2 (x) 2 (x) dF (x) = Z 3 (1 3 )(1 1 ) 1 2 dF + Z 3 (1 3 )(1 1 )dF = ~ M 2 + Z 3 (1 3 )(1 1 )dF =jI 1 2 j + Z 3 (1 3 )(1 1 )dF jI 1 2 j: Proofof (B.3.5): From (B.2.2) and (B.2.11) we have 286 kI 2 k 2 = Z x [ 2 (x) + 3 (x)] 3 (x)[1 3 (x)]dF (x) 2 = Z x [1 1 (x)] 3 (x)[1 3 (x)]dF (x) 2 Z kxk 2 3 (x)[1 3 (x)][1 1 (x)]dF M Z 3 (x)[1 3 (x)][1 1 (x)]dF M Z 3 (x)[1 3 (x)][1 1 (x)] 1 + 3 (x) 2 (x) dF (x) =M Z 3 (x)[1 3 (x)][1 1 (x)] 2 (x) + 3 (x) 2 (x) dF (x) =M Z 3 (x)[1 3 (x)][1 1 (x)] 1 1 (x) 2 (x) dF (x) M Z 3 (x)[1 3 (x)] 2 1 1 (x) 2 (x) dF (x) =MI 2 2 ; where in the last inequality we used 3 (x) rare < 1=2 1 (x) for allx2S and in the last step we used (B.3.10). Proofof (B.3.6): This follows from (B.3.3) and (B.3.5). Proofof (B.3.7): Using from (B.2.2) kI 1 k 2 = Z x 2 1 (x)[1 1 (x)]dF (x) + Z x 2 (x) 1 (x)[1 1 (x)]dF (x) 2 = Z x[ 1 (x) + 2 (x)] 1 (x)[1 1 (x)]dF (x) 2 Z kxk 2 [ 1 (x) + 2 (x)] 1 (x)[1 1 (x)]dF (x) M Z [ 1 (x) + 2 (x)] 1 (x)[1 1 (x)]dF (x) 287 and from (B.2.1) I 1 1 =M 1 = Z ( 1 (x)[1 1 (x)]) 2 1 1 (x) + 1 2 (x) dF (x) = Z ( 1 (x)[1 1 (x)]) 2 1 (x) + 2 (x) 1 (x) 2 (x) dF (x) = Z 1 (x)[1 1 (x)][ 1 (x) + 2 (x)] 1 1 (x) 2 (x) dF (x) Z 1 (x)[1 1 (x)][ 1 (x) + 2 (x)]dF (x); we have kI 1 k 2 2 I 1 1 M 2 Z [ 1 (x) + 2 (x)] 1 (x)[1 1 (x)]dF (x)M 2 Z 1 4 [ 1 (x) + 2 (x)]dF (x) M 2 4 ; where in the second-to-last step we used thatt7!t(1t) 1=4 for allt2 [0; 1]. B.4 ProofofLemmaB.2.1 First we will calculate the Fisher information matrices, then we will show the convergence results. Since the logistic regression model can be considered a special case of the proportional odds model withK = 2 categories, we will mostly focus our calculations on the proportional odds model. B.4.1 CalculatingtheLogLikelihoodandGradients In the proportional odds model, the likelihood can be expressed as n Y i=1 K Y k=1 1 1 + exp ( k + > x i ) 1 1 + exp ( k1 + > x i ) ! 1fy i =kg ; (B.4.1) 288 so the log likelihood can be expressed as L prop. odds (;) = n X i=1 K X k=1 1fy i =kg log 1 1 + exp ( k + > x i ) 1 1 + exp ( k1 + > x i ) ! = n X i=1 K X k=1 1fy i =kg log (p ik p i;k1 ) = n X i=1 K X k=1 1fy i =kg log ( ik ); where p ik :=p k (x i ) =P(y i kjx i ); ik :=P(y i =kjx i ) =p ik p i;k1 =p k (x i )p k1 (x i ); 0 :=1, and K :=1 (while := ( 1 ;:::; K1 ) > are parameters to be estimated). Using @ @t 1 1 + expftg = 1 1 + expftg 1 1 1 + expftg ; the gradient has entries corresponding to equal to r L prop. 
odds (;) = n X i=1 K X k=1 1fy i =kgx i p ik (1p ik )p i;k1 (1p i;k1 ) p ik p i;k1 = n X i=1 K X k=1 1fy i =kgx i 0 @ p ik p i;k1 p 2 ik p 2 i;k1 p ik p i;k1 1 A = n X i=1 K X k=1 1fy i =kgx i (1p ik p i;k1 ); 289 and using @ @ k ik = @ @ k (p ik p i;k1 ) =p ik (1p ik ) and @ @ k i;k+1 = @ @ k (p i;k+1 p ik ) =p ik (1p ik ); the entries corresponding to equal @ @ k L prop. odds (;) = n X i=1 @ @ k (1fy i =kg log ( ik ) + 1fy i =k + 1g log ( i;k+1 )) = n X i=1 1fy i =kg p ik (1p ik ) ik 1fy i =k + 1g p ik (1p ik ) i;k+1 =) r L prop. odds (;) = n X i=1 e k p ik (1p ik ) 1fy i =kg ik 1fy i =k + 1g i;k+1 ; wheree k 2f0; 1g K1 has thek th entry equal to 1 and the rest equal to 0. (Note that since 0 =1, p i0 = 0, and similarlyp iK = 1 as expected.) B.4.2 CalculatingtheHessianMatrices The entries of the Hessian corresponding to,H prop. odds are r 2 L prop. odds (;) = n X i=1 K X k=1 1fy i =kgx i x > i [p ik (1p ik ) +p i;k1 (1p i;k1 )] = n X i=1 x i x > i K X k=1 (1fy i =k + 1g + 1fy i =kg) Var (1fy i kg): Using @ @ k p ik (1p ik ) =p ik (1p ik ) 2p 2 ik (1p ik ) =p ik (1p ik ) (1 2p ik ); 290 the entries corresponding to the block of the HessianH prop. odds are as follows: @ 2 @ 2 k L prop. odds (;) = n X i=1 @ @ k p ik (1p ik ) 1fy i =kg ik 1fy i =k + 1g i;k+1 = n X i=1 1fy i =kg ik p ik (1p ik ) (1 2p ik )p 2 ik (1p ik ) 2 2 ik 1fy i =k + 1g i;k+1 p ik (1p ik ) (1 2p ik ) +p 2 ik (1p ik ) 2 2 i;k+1 = n X i=1 p ik (1p ik ) 1fy i =kg ik (1 2p ik )p ik (1p ik ) 2 ik 1fy i =k + 1g i;k+1 (1 2p ik ) +p ik (1p ik ) 2 i;k+1 ; @ 2 @ k @ k1 L prop. odds (;) = n X i=1 p ik (1p ik ) @ @ k1 1fy i =kg ik 1fy i =k + 1g i;k+1 = n X i=1 1fy i =kgp ik (1p ik ) p i;k1 (1p i;k1 ) 2 ik = n X i=1 1fy i =kg p ik (1p ik )p i;k1 (1p i;k1 ) 2 ik ; and @ 2 @ k @ k 0 L prop. odds (;) = 0; all otherk6=k 0 ; where i;K+1 = 0. (Note that @ 2 @ k @ k+1 L prop. odds (;) is nonzero as well, but it matches the expression for @ 2 @ ~ k @ ~ k1 L prop. odds (;) with ~ k :=k + 1.) Finally, the entries corresponding to the and mixed blocksH prop. odds of the Hessian are @ @ k r L prop. odds (;) = n X i=1 1fy i =kgx i @ @ k (1p ik p i;k1 ) + 1fy i =k + 1gx i @ @ k (1p i;k+1 p i;k ) = n X i=1 x i p ik (1p ik ) (1fy i =kg + 1fy i =k + 1g); k2 [K 1]: 291 B.4.3 CalculationoftheFisherInformationMatrices Now we nd the Fisher information matrices I prop. odds (;) = 0 B B @ I prop. odds I prop. odds I prop. odds > I prop. odds 1 C C A and I logistic (;) = 0 B B @ I logistic I logistic > I logistic I logistic 1 C C A by taking the negative expectation of each block (using a single observation). For the block, we have E @ 2 @ 2 k L prop. odds (;) =E p ik (1p ik ) 1fy i =kg ik (1 2p ik )p ik (1p ik ) 2 ik 1fy i =k + 1g i;k+1 (1 2p ik ) +p ik (1p ik ) 2 i;k+1 =E p ik (1p ik ) ik (1 2p ik )p ik (1p ik ) ik i;k+1 (1 2p ik ) +p ik (1p ik ) i;k+1 =E p 2 ik (1p ik ) 2 1 ik + 1 i;k+1 =M k ; k2 [K 1]; E @ 2 @ k @ k1 L prop. odds (;) =E 1fy i =kg p ik (1p ik )p i;k1 (1p i;k1 ) 2 ik =E p ik (1p ik )p i;k1 (1p i;k1 ) ik = ~ M k ; k2f2;:::;K 1g; and E @ 2 @ k @ k 0 L prop. odds (;) = 0; all otherk6=k 0 ; 292 where we used the denitions ofM k and ~ M k in (B.2.4) and (B.2.5). Therefore I prop. odds (;)2R (K1)(K1) has tridiagonal form I prop. odds (;) = 0 B B B B B B B B B B B B B B B B B B B @ M 1 ~ M 2 0 0 0 ~ M 2 M 2 ~ M 3 0 0 0 ~ M 3 M 3 0 0 . . . . . . . . . . . . . . . . . . 0 0 0 M K2 ~ M K1 0 0 0 ~ M K1 M K1 1 C C C C C C C C C C C C C C C C C C C A ; verifying (B.2.1). 
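As an informal numerical check on the $\alpha$-block just derived, the following self-contained R sketch (our own; it is not part of the dissertation's code, and all names are ours) simulates a small $K = 3$ proportional odds data set, numerically differentiates the average log likelihood, and compares the resulting $\alpha$-block of the average observed information with the plug-in version of (B.2.1) built from (B.2.4) and (B.2.5). The two matrices should agree up to Monte Carlo and finite-difference error.

```r
# Check that the average observed information for (alpha_1, alpha_2) matches
# the tridiagonal form (B.2.1) with M_k from (B.2.4) and M-tilde_2 from (B.2.5).
set.seed(1)
n <- 2e4; p <- 2
beta <- c(1, -1); alpha <- c(-1, 1)
X   <- matrix(runif(n * p, -1, 1), n, p)
eta <- as.vector(X %*% beta)
P   <- cbind(plogis(alpha[1] + eta), plogis(alpha[2] + eta), 1)  # p_k(x_i)
Pi  <- cbind(P[, 1], P[, 2] - P[, 1], 1 - P[, 2])                # pi_k(x_i)
y   <- apply(Pi, 1, function(pr) sample.int(3, 1, prob = pr))

# average proportional odds log likelihood as a function of theta = (alpha, beta)
avg_loglik <- function(theta) {
  Pth <- cbind(0, plogis(theta[1] + X %*% theta[3:4]),
               plogis(theta[2] + X %*% theta[3:4]), 1)
  mean(log(Pth[cbind(1:n, y + 1)] - Pth[cbind(1:n, y)]))
}

# central-difference Hessian (kept dependency-free on purpose)
num_hessian <- function(f, x, h = 1e-4) {
  d <- length(x); H <- matrix(0, d, d)
  for (j in 1:d) for (k in 1:d) {
    ej <- ek <- numeric(d); ej[j] <- h; ek[k] <- h
    H[j, k] <- (f(x + ej + ek) - f(x + ej - ek) -
                f(x - ej + ek) + f(x - ej - ek)) / (4 * h^2)
  }
  H
}

# alpha-block of the average observed information at the true parameters
H_alpha <- -num_hessian(avg_loglik, c(alpha, beta))[1:2, 1:2]

# plug-in Fisher information entries from (B.2.4) and (B.2.5)
M1    <- mean((P[, 1] * (1 - P[, 1]))^2 * (1 / Pi[, 1] + 1 / Pi[, 2]))
M2    <- mean((P[, 2] * (1 - P[, 2]))^2 * (1 / Pi[, 2] + 1 / Pi[, 3]))
M2til <- -mean(P[, 2] * (1 - P[, 2]) * P[, 1] * (1 - P[, 1]) / Pi[, 2])

round(H_alpha, 3)
round(matrix(c(M1, M2til, M2til, M2), 2, 2), 3)
```

On a typical run the two $2 \times 2$ matrices agree to roughly two or three decimal places; the residual discrepancy is Monte Carlo noise in $y$ together with finite-difference error, both of which shrink as $n$ grows.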
Observe that in the case of logistic regression (K = 2), we have I logistic ( 1 ;) =M 1 = Z [p 1 (x)(1p 1 (x))] 2 1 1 (x) + 1 2 (x) dF (x) = Z [ 1 (x)(1 1 (x))] 2 1 1 (x) + 1 1 1 (x) dF (x) = Z [ 1 (x)(1 1 (x))] 2 1 1 (x) (1 1 (x)) dF (x) = Z 1 (x)(1 1 (x))dF (x) =M logistic 1 ; which is (B.2.7). (This is equivalent to a logistic regression predicting whethery i = 1 regardless ofK.) 293 B.4.4 MixedBlock For the- mixed block, we have for allk2f1;:::;K 1g E @ 2 @ k @ L prop. odds (;) =E [X 1 p k (X 1 )(1p k (X 1 ))( k (X 1 ) + k+1 (X 1 ))]: Then E @ 2 @ k @ L prop. odds (;) =J x k + ~ J x k+1 ; k2f1;:::;K 1g; soI prop. odds (;)2R (K1)p has the form I prop. odds (;) = 0 B B B B B B B B B B @ J x 1 + ~ J x 2 J x 2 + ~ J x 3 . . . J x K1 + ~ J x K 1 C C C C C C C C C C A ; as in (B.2.2). In the case of logistic regression predicting whethery i = 1, I logistic ( 1 ;) =J x; logistic 1 + ~ J x; logistic 2 = Z x 1 (x)[1 1 (x)]dF (x); verifying (B.2.8). 294 B.4.5 BetaBlock Finally, for the block, we have I prop. odds (;) =E r 2 L prop. odds (;) =E " E " X 1 X > 1 K X k=1 (1fy i =k + 1g + 1fy i =kg)p k (X 1 )(1p k (X 1 ))jX ## =E " X 1 X > 1 K X k=1 ( k+1 (X 1 ) + k (X 1 ))p k (X 1 )(1p k (X 1 )) # = K X k=1 J xx > k + ~ J xx > k ; as in (B.2.3) (recall thatJ xx > K = 0 and ~ J xx > 1 = 0 for allx). In the case of logistic regression predicting whethery i = 1 (for anyK), I logistic ( 1 ;) =J xx > ; logistic 1 + ~ J xx > ; logistic 2 = Z xx > 1 (x)[1 1 (x)]dF (x); matching (B.2.9). B.4.6 VerifyingtheAsymptoticDistributionofEachEstimator By standard maximum likelihood theory (for example, the theorem on p. 145 of Sering [156, Section 4.2.2, multidimensional generalization on p. 148], or Theorem 13.1 in [193]), the result holds if we can verify three regularity conditions. Before we do, we will dene the set of feasible parameters (;). Since 1 < 2 <:::< K1 ; where the strict inequality follows from our assumption that no class has probability 0 for anyx2 [1; 1], dene the setAR K1 of points satisfying this constraint. Then dene :=AR p , so that (;)2 . 295 Now we state and verify the needed regularity conditions. 1. (R1)Thethirdderivativesoftheloglikelihoodwithrespecttoeachparameter (;)existforallx2S. This condition holds for both the proportional odds model and logistic regression because every entry of the Hessian matrices in Section (B.4.2) are dierentiable in every parameter for anyK 2. 2. (R2) For each ( 0 ; 0 )2 , for all (;) in a neighborhood of ( 0 ; 0 ) it holds that (i) the element- wiseabsolutevaluesofthegradientsandHessiansofthelikelihoodareboundedbyfunctionsofxwith nite integral overx2S, and (ii) the element-wise absolute values of the third derivatives of the log likelihood is bounded by a function ofx with nite expectation with respect toX. BecauseX has bounded support, for these integrals and expectations to be nite it is enough for the bounding functions to overS to be nite constants—that is, it is enough to nd upper bounds on the absolute values of the gradients and Hessians of the likelihoods and the third derivatives of the log likelihoods. The logistic regression likelihood n Y i=1 exp 1fy i = 1g ( 1 + > x i ) 1 + expf( 1 + > x i )g has continuous second derivatives and therefore its gradient and Hessian both have nite element- wise absolute value on the bounded support. The same is true of the proportional odds likelihood (B.4.1) when all outcomes have positive probability for allx2S, that is, 1 <:::< K1 . 
Finally, examining again the Hessian matrices in Section B.4.2, we see that they have continuous derivatives in every parameter for anyK 2, so the third derivatives of the log likelihoods are bounded for any ( 0 ; 0 )2 for allx2S. 3. (R3)TheFisherinformationmatricesexistandareniteandpositivedenite. One can see that both of the Fisher information matrices are nite entrywise for all (;)2 by examining the matrices 296 and noting that the probabilities for all of the classes are strictly greater than 0 overS by assumption. To verify the positive deniteness of the Fisher information matrices E @ 2 @ > L() ; it is enough to show that the log likelihood for each model is strictly concave, which implies that @ 2 @ > L() is almost surely negative denite (since the log likelihood is twice dierentiable). Strict concavity of the logistic regression log likelihood n X i=1 h 1fy i = 1g ( 1 + > x i ) log 1 +e ( 1 + > x i ) i is easily seen, and Pratt [142] provides a proof that the log likelihood for the proportional odds model is strictly concave when the intercepts 1 ;:::; K1 are not equal. Lastly, the asymptoticO(1=n) bias is a consequence of standard maximum likelihood theory; see, for example, Cordeiro and McCullagh [45]. B.5 ProofofTheorem3.3.1 We prove Theorem 3.3.1 in Section B.5.1, and Section B.5.2 contains proofs of the supporting lemmas. B.5.1 ProofofTheorem3.3.1 The proof will proceed as follows. First we will show that PRESTO is a member of a class of models described by Ekvall and Bottai [56], which means we can bound the estimation error of the parameters of PRESTO in a nite sample under their Theorem 3 once we show that its assumptions are satised. Their result depends on the probability of a particular eventC ;n;pn(K1) , and in Proposition B.5.1 we prove a 297 lower bound onP C ;n;pn(K1) that tends towards 1 as n!1. This establishes the consistency of PRESTO. In the notation of Ekvall and Bottai [56], we can express the objective function for thePRESTO estima- tor from (3.7) asRfb(y i ;x i ;)gRfa(y i ;x i ;)g + n kk 1 , whereR() =F (), the logistic cumulative distribution function; (a(y i ;x i ;);b(y i ;x i ;)) > =Z > i +m i where Z i := 0 B B B B B B B B B B @ 1fy i 2gx i 1fy i K 1gx i 1fy i 3gx i 1f2y i K 1gx i . . . . . . 1fy i =Kgx i 1fy i =K 1gx i 1 C C C C C C C C C C A 2R pn(K1)2 ; := 0 B B B B B B B B B B @ 1 2 . . . K1 1 C C C C C C C C C C A 2R pn(K1) ; and m i := 0 B B B B B B B B B B B B B B @ 8 > > > > < > > > > : 1; y i = 1; k1 ; y i =k (k 2) 8 > > > > < > > > > : k ; y i =k (k<K 1) 1; y i =K 1 C C C C C C C C C C C C C C A 2R 2 ; whereR :=R[f1;1g denotes the extended real number system; and, as elsewhere in the paper, k = k k1 fork2f2;:::;K1g. Observe that this is a special case of model (1) of Ekvall and Bottai [56]. Now, since =R pn(K1) is open and the standard logistic densityr(t) = expftg= (1 + expftg) 2 is strictly log-concave, strictly positive, and continuously dierentiable onR, assumptions (a) and (b) of 298 Theorem 3 in Ekvall and Bottai [56] are satised, assumptionS(C 4 ) is sucient for assumption (c), and assumption (e) is satised forC 3 =c 2 . We now discuss Assumption 1 from Ekvall and Bottai [56]. Note that a decision boundary crosses at somex2S if and only if for somek2f2;:::;K 1g it holds that F k + > k x F k1 + > k1 x 0 () b(k;x;)a(k;x;) 0: Observe that fork = 1 it holds thatm i1 =1 and fork =K it holds thatm i2 =1; that is, an element ofm i is innite. 
So Assumption T (C 4 ) is sucient for it to either hold thatZ > i +m i 2 E where Eft2R 2 :t 1 <t 2 g or an element ofm i is innite. To satisfy Assumption 1, it only remains to show that there exists such anE that is compact. Note thatk k 1 is bounded under AssumptionS(s;C 4 ) and k k 0 s under our sparsity assumption, sok k 1 is bounded due to Hölder’s inequality. Lastlym i is bounded because max k2f1;:::;K1g j k j C 4 under AssumptionT (C 4 ), so we have shown that we can nd such anE that is bounded. Choose one that is closed and we haveE compact by Bolzano-Weierstrass. We have shown that the assumptions of Theorem 3 of Ekvall and Bottai [56] are satised, so we have that forn large enough thatp n K 1, with probability at leastP C ;n;pn(K1) [p n (K 1)] c 3 P C ;n;pn(K1) (p 2 n ) c 3 ^ n 1 c 5 r log(p n (K 1)) n c 5 r log(p 2 n ) n =c 5 p 2 r logp n n ; (B.5.1) 299 wherec 3 andc 5 are constants from Ekvall and Bottai [56], andC ;n;pn(K1) is dened as follows. For a set Af1;:::;p n (K 1)g, dene A 2R pn(K1) to havej th entry ( A ) j := 8 > > > > < > > > > : j ; j2A; 0; j = 2A; and dene A c := A . Then for ans-sparse (in the sense of AssumptionS(s;c)) with support set S, dene C(S) := n 2R pn(K1) :k S ck 1 3k S k 1 o : We interpretC(S) to be a set of approximatelys-sparse vectors (the vectors would be exactlys-sparse if k S ck 1 = 0). Then for> 0 andn2N, dene C ;n;pn(K1) := ( (X;y) : inf f2C(S):kk 2 =1g ( > 1 n n X i=1 Z i Z > i ! ) ) : If the set of over which this condition must hold wereR pn(K1) , this would be a minimum eigenvalue condition on 1 n P n i=1 Z i Z > i . This condition is sometimes called a restricted eigenvalue condition [18]. We boundP C ;n;pn(K1) in Proposition B.5.1. PropositionB.5.1. Suppose the assumptions of Theorem 3.3.1 hold. Let rare,min := inf x2S;k2f1;:::;Kg fP(y i =kjx)g; and observe that rare,min > 0 under AssumptionsX(R pn ),S(s;C 4 ), andT (C 4 ). Assumen is large enough so that n rare, min > max ( 2 C p p n + r a min 2 ; 2 ) (B.5.2) 300 for somea> 0 (recall from the statement of Theorem 3.3.1 that we assumed min := min k2f1;:::;Kg min E h x i x > i jy i =k i >b for a xedb> 0). ThenP C arare,min=4;n;pn(K1) is at least 1 2K exp 1 2 2 rare,min n 2K exp ( c r n rare,min 2 C p p n r a min 2 ) : Proof. Provided in Section B.5.2. Observe that sincep n C 1 n C 2 for someC 1 > 0;C 2 2 (0; 1) this probability tends to 1 asn!1. Lemma B.5.2, below, along with (B.5.1) then shows that with probability at least 1p C 5 n 2K exp 1 2 2 rare,min n 2K exp ( c r n rare,min 2 C p p n r a min 2 ) (B.5.3) it holds that ^ n 2 ^ n 1 C 6 r logp n n ; whereC 5 := 2c 3 andC 6 :=c 5 p 2(K 1) (wherec 5 depends on the sparsity levels). LemmaB.5.2. ^ n 2 ^ n 1 (K 1) ^ n 1 . Proof. Provided in Section B.5.2. 301 Finally, we can now show consistency by showing that the random variable ^ n 2 converges in probability to 0. For any> 0, lim n!1 P ^ n 2 < () lim n!1 P ^ n 2 <C 6 r logp n n ! lim n!1 1p C 5 n 2K exp 1 2 2 rare,min n 2K exp ( c r n rare,min 2 C p p n r a min 2 )! = 1; where () follows because for large enoughn, > C 6 p log(p n )=n. This establishes Theorem 3.3.1. All that remains is to provide proofs for the supporting lemmas and proposition, which we do in the following section. Remark B.5.3. 
Observe that for niten we have that forn large enough, with probability at least equal to the expression in (B.5.3) it holds that p n ^ n 2 p n ^ n 1 C 6 p logp n : This establishes a high-probability nite sample relationship between our theoretical guarantee and the asymptotic mean squared error from Denition 3.2.1. Also note that for any> 0, there exists ann such that the probability in (B.5.3) is less than for all nn . Since there also exists ann 2 large enough so that for allnn 2 it holds that C 6 r logp n n ; 302 we have for any> 0, for alln maxfn ;n 2 g P 0 B @ ^ n 1 p log(p n )=n C 6 1 C A 1p C 5 n 2K exp 1 2 2 rare,min n 2K exp ( c r n rare,min 2 C p p n r a min 2 ) <; that is, ^ n 1 2O p ( p log(p n )=n), and likewise for ^ n 2 . Again, we emphasize that C 6 depends on the sparsitys (though the above holds since we have assumed thats is xed). B.5.2 SupportingResultsforProofofTheorem3.3.1 Proof of Proposition B.5.1. We will show that there is a high probability event such that fora> 0 it holds that inf f2C(S):kk 2 =1g ( > 1 n n X i=1 Z i Z > i ! ) > 1 2 a rare,min : Let A (1) = 0 K1 e 1 2R (K1)2 ; A (k) = 0 B B B B B B @ 1 k1 1 k1 0 1 0 Kk1 0 Kk1 1 C C C C C C A = P k1 k 0 =1 e k 0 P k k 0 =1 e k 0 2R (K1)2 ; k2f2;:::;K 1g; and A (K) = 1 K1 0 K1 = P K1 k 0 =1 e k 0 0 K1 2R (K1)2 ; 303 where 0 n and 1 n aren-vectors of zeroes and ones (respectively) ande k is the standard basis vector in R K1 with a 1 in thek th entry and zeroes elsewhere, andA (k) 2R (K1)2 for allk. Note that Z i =A (y i ) x i : (B.5.4) Let B = 0 B B B B B B B B B B B B B B @ 1 1 1 1 0 1 1 1 0 0 1 1 . . . . . . . . . . . . . . . 0 0 0 1 1 C C C C C C C C C C C C C C A = e 1 P 2 k 0 =1 e k 0 P K1 k 0 =1 e k 0 2R (K1)(K1) ; that is, the columns ofB areB k = P k `=1 e ` . We will make use of the following lemmas, which we prove later in this section: LemmaB.5.4. K X k=1 n k n A (k) A (k) > =BDB > ; where D := diag n 1 +n 2 n ;:::; n K1 +n K n 2R (K1)(K1) : LemmaB.5.5. 2 min (B) 1=2. LemmaB.5.6. For arbitrary matricesA,B, andC, ifBC andA 0 thenA BA C. Lemma B.5.7. With probability at least 1 K exp 1 2 2 rare,min n it holds that minfn 1 ;:::;n K g > rare,min n=2, wheren k := P n i=1 1fy i =kg is the number of observations in classk. 304 LemmaB.5.8. Under the assumptions of Proposition B.5.1, there exist constantsc;C > 0 such that min 0 @ 1 n k X i:y i =k x i x > i 1 A a for allk2f1;:::;Kg with probability at least 1 2K exp ( c r n rare,min 2 C p p n r a min 2 ) K exp 1 2 2 rare,min n ; wheren k is the number of observations in classk and min := min k2f1;:::;Kg min E X > Xjy =k . Using a union bound, there exists an event with probability at least 1 2K exp 1 2 2 rare,min n 2K exp ( c r n rare,min 2 C p p n r a min 2 ) 305 on which the conclusions of all of the above lemmas hold forn large enough. On this event we have forn suciently large inf f2C(S):kk 2 =1g ( > 1 n n X i=1 Z i Z > i ! ) inf f2R pn(K1) :kk 2 =1g ( > 1 n n X i=1 Z i Z > i ! ) = min 1 n n X i=1 Z i Z > i ! (a) = min 1 n n X i=1 A (y i ) x i A (y i ) x i > ! = min 1 n n X i=1 A (y i ) A (y i ) > x i x > i ! = min 0 @ 1 n K X k=1 X i:y i =k A (y i ) A (y i ) > x i x > i 1 A = min 0 @ 1 n K X k=1 A (k) A (k) > 0 @ X i:y i =k x i x > i 1 A 1 A = min 0 @ K X k=1 n k n A (k) A (k) > 0 @ 1 n k X i:y i =k x i x > i 1 A 1 A (b) min K X k=1 n k n A (k) A (k) > (aI p ) ! (c) = min K X k=1 n k n A (k) A (k) > ! 
min (aI p ) (d) = min BDB > min (aI p ) (e) a min k2f1;:::;K1g n k +n k+1 n min BB > (f) a 4 min k2f1;:::;K1g n k +n k+1 n (g) > a rare,min 4 ; where in (a) we used (B.5.4), (b) holds on the high probability event by Lemmas B.5.6 and B.5.8 because min 1 n k P i:y i =k x i x > i a implies 1 n k P i:y i =k x i x > i aI p for allk, in (c) we used the fact that for matricesM;N the eigenvalues ofM N are the products of the eigenvalues ofM andN and both of 306 the factor matrices are positive semidenite, in (d) we applied Lemma B.5.4, (e) follows fromBDB > min k2f1;:::;K1g n n k +n k+1 n o BB > , in (f) we applied Lemma B.5.5, and (g) follows from Lemma B.5.7. That is, on this high probability event it holds that inf f2C(S):kk 2 =1g ( > 1 n n X i=1 Z i Z > i ! ) > a rare,min 4 ; and the conclusion follows. Proof of Lemma B.5.2. For convenience, denote 1 = 1 and k = k ,k2f2;:::;K 1g. Let ^ n = ^ n ;1 > ; ^ n ;2 > ;:::; ^ n ;K1 > > := ^ n 1 1 > ; ^ n 2 2 > ;:::; ^ n K1 K1 > > : Note that k = P k 0 k k 0. Let := ( > 1 ; > 2 ;:::; > K1 ) > , and denote by ^ n k := X k 0 k ^ k 0; k2f1;:::;K 1g the estimates of yielded from the estimates of. Let ^ n = ^ n ;1 > ; ^ n ;2 > ;:::; ^ n ;K1 > > := ^ n 1 1 > ; ^ n 2 2 > ;:::; ^ n K1 K1 > > ; 307 and observe that ^ n = ^ n 1 1 > ; ^ n 2 2 > ;:::; ^ n K1 K1 > > = 0 B @ ^ n 1 1 > ; 0 @ X k 0 2 ^ k 0 X k 0 2 k 0 1 A > ;:::; 0 @ X k 0 K1 ^ k 0 X k 0 K1 k 0 1 A > 1 C A > = 0 B @ ^ n ;1 > ; 0 @ X k 0 2 ^ n ;k 0 1 A > ;:::; 0 @ X k 0 K1 ^ n ;k 0 1 A > 1 C A > : (B.5.5) Then ^ n ;1 1 = K1 X k=1 ^ n ;k 1 (a) = K1 X k=1 X k 0 k ^ n ;k 0 1 (b) K1 X k=1 X k 0 k ^ n ;k 0 1 = K1 X k=1 (Kk) ^ n ;k 1 (K 1) K1 X k=1 ^ n ;k 1 = (K 1) ^ n 1 : where in (a) we used (B.5.5) and in (b) we used the triangle inequality. Lastly, ^ n 2 ^ n 1 is a property of the` 1 and` 2 norms. 308 Proof of Lemma B.5.4. K X k=1 n k n A (k) A (k) > = n 1 n A (1) A (1) > + K1 X k=2 n k n A (k) A (k) > + n K n A (K) A (K) > = n 1 n 0 K1 e 1 0 B B @ 0 > K1 e > 1 1 C C A + K1 X k=2 n k n P k1 k 0 =1 e k 0 P k k 0 =1 e k 0 0 B B @ P k1 k 0 =1 e > k 0 P k k 0 =1 e > k 0 1 C C A + n K n P K1 k 0 =1 e k 0 0 K1 0 B B @ P K1 k 0 =1 e > k 0 0 > K1 1 C C A = n 1 n 0 K1 0 > K1 +e 1 e > 1 + K1 X k=2 n k n k1 X k 0 =1 e k 0 k1 X k 0 =1 e > k 0 + k X k 0 =1 e k 0 k X k 0 =1 e > k 0 ! + n K n K1 X k 0 =1 e k 0 K1 X k 0 =1 e > k 0 + 0 K1 0 > K1 ! = n 1 n B 1 B > 1 + K1 X k=2 n k n B k1 B > k1 +B k B > k + n K n B K1 B > K1 = K1 X k=1 n k +n k+1 n B k B > k =BDB > : Proof of Lemma B.5.5. B is full rank with inverse B 1 = 0 B B B B B B B B B B B B B B @ 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 . . . . . . . . . . . . . . . . . . 0 0 0 0 1 1 C C C C C C C C C C C C C C A : 309 We have kB 1 k op q kB 1 k 1 kB 1 k 1 = p 2 2 = 2; where we have used that the` 1 norm of each column and row is at most 2. Then the result follows from min (B) = 1=kB 1 k op . Proof of Lemma B.5.6. The matrixA (BC) is the Kronecker product of two positive semidenite matrices and is therefore positive semidenite. Proof of Lemma B.5.7. First we state a lemma we will use. LemmaB.5.9. Suppose that for allx2S andk2f1;:::;Kg it holds thatP(y i =kjx) rare,min . Let n k be the number of observations in classk from a data set of sizen. Then for anyq2f1;:::;bn rare,min cg, P K \ k=1 fn k >qg ! 1K exp 2n rare,min q n 2 : Proof. Provided later in this section. Note that n rare,min =2bn rare,min c because x=2bxc for all x 2 and n rare,min 2 due to assumption (B.5.2). 
So the assumptions of Lemma B.5.9 are satised forq = n rare,min =2, and there are more thann rare,min =2 observations in each class with probability at least 1K exp n 2n ( rare,min rare,min =2g) 2 o = 1K exp 1 2 2 rare,min n : Proof of Lemma B.5.8. We will prove the result by using a concentration inequality on the minimum sin- gular value of a random matrix (which will correspond to the square root of the minimum eigenvalue of 310 1 n k P i:y i =k x i x > i ; recall that the eigenvalues ofA > A are the squares of the singular values ofA). How- ever, the result we use applies only to random matrices with isotropic second moment matrices, so we need to standardize 1 n k P i:y i =k x i x > i by its inverse square root second moment matrix. We can relate this quantity to the minimum eigenvalue of 1 n k P i:y i =k x i x > i by our claim (that we will verify later) that for a symmetric positive semidenite matrixS and symmetric positive denite it holds that min (S) min 1=2 S 1=2 min (): (B.5.6) Then substitutingS = 1 n k P i:y i =k x i x > i and = k = E x i x > i jy i =k (which we assumed has a strictly postive minimum eigenvalue and is therefore invertible) into (B.5.6) yield that on an event where n k = P n i=1 1fy i =kg 1, we have that almost surely min 0 @ 1 n k X i:y i =k x i x > i 1 A min 0 @ 1 n k 1=2 k X i:y i =k x i x > i 1=2 k 1 A min ( k ) min 0 @ 1 n k 1=2 k X i:y i =k x i x > i 1=2 k 1 A min ; (B.5.7) where the second line uses our assumption from the statement of Theorem 3.3.1 and the fact that 1 n k 1=2 k X i:y i =k x i x > i 1=2 k is almost surely positive semidenite. Next we lower-bound the minimum eigenvalue of the random matrix 1 n k 1=2 k P i:y i =k x i x > i 1=2 k using a concentration inequality from Vershynin [181]. Observe that (as 311 [180] points out in Example 2.5.8(c)) the 2 norm of a bounded random vectorT 2 R p can be upper- bounded as follows: kTk 2 = sup v:v2R p ;kvk 2 =1 n kT > vk 2 o = sup v:v2R p ;kvk 2 =1 ( inf ( t> 0 :E exp T > v 2 t 2 ! 2 )) inf t> 0 : exp kTk 2 1 t 2 2 = kTk 1 p log 2 : Therefore under our assumptionsx i j y is bounded and therefore subgaussian, with 2 norm at most kx i k 2 kxk 1 = p log 2 =C 4 = p log 2 for allk. So for anyk2f1;:::;Kg, we have from Theorem 5.39 in Vershynin [181] that for anyt 0 there existsc> 0 such that the event P 0 @ min 0 @ 1 n k X i:y i =k 1=2 k x i 1 A p n k C p p n tjy 1 A 1 2 expfct 2 g (B.5.8) holds almost surely forC =C 2 4 =(log 2) p log(9)=c 1 , whereC 4 is as dened in the statement of Theorem 3.3.1, c 1 is a constant from Vershynin [181], min () denotes the minimum singular value, and if the set fi :y i =kg is empty we dene 1 n k P i:y i =k 1=2 k x i to equal 0 p (and note that the inequality then trivially holds with probability one in this case, becausen k = 0 so the right side is nonpositive). Fork2f1;:::;Kg andt 0, dene the eventE k (t) by E k (t) := 8 < : min 0 @ 1 n k X i:y i =k 1=2 k x i 1 A p n k C p p n t 9 = ; ; and observe that P(E k (t)) 1 2 expfct 2 g (B.5.9) 312 by marginalizing (B.5.8) overy. Now we consider a particular choice oft. 
From our assumption (B.5.2) we have n rare, min > 2 C p p n + r a min 2 () r n rare,min 2 C p p n r a min > 0; so we can chooset = p n rare,min =2C p p n q a min , yielding that on E k r n rare,min 2 C p p n r a min \fn k 1g we have min 0 @ 1 n k X i:y i =k 1=2 k x i 1 A p n k C p p n r n rare,min 2 +C p p n + r a min = p n k r n rare,min 2 + r a min : (B.5.10) That is, on this event we can lower-bound the minimum eigenvalue of 1 n k 1=2 k P i:y i =k x i x > i 1=2 k pro- vided thatn k is at least 1 and large enough that the right side of (B.5.10) is nonnegative. Next we work on lower-boundingn k with high probability. Consider the event N := n n k n rare,min 2 8k2f1;:::;Kg o : Notice that onN we have that the lower bound in (B.5.10) is at least p a= min andn k 1 for allk due to (B.5.2). That is, for anyk, onN\E k p n rare,min =2C p p n q a min inequality (B.5.10) holds and 313 yields min 1 n k P i:y i =k 1=2 k x i p a= min . Since the eigenvalues of 1 n k P i:y i =k 1=2 k x i x > i 1=2 k are the squares of the singular values of 1 n k P i:y i =k 1=2 k x i , on N\E k p n rare,min =2C p p n r a min we have min 0 @ 1 n k X i:y i =k 1=2 k x i x > i 1=2 k 1 A a min > 0: Finally, substituting this into (B.5.7) we have that on T K k=1 E k p n rare,min =2C p p n q a min \N min 0 @ 1 n k X i:y i =k x i x > i 1 A a min min =a 8k2f1;:::;Kg: Using a union bound, this holds with probability at least P K \ k=1 E k r n rare,min 2 C p p n r a min ! \N ! 1 K X k=1 P E c k r n rare,min 2 C p p n r a min P (N c ) 1 2K exp ( c r n rare,min 2 C p p n r a min 2 ) K exp 1 2 2 rare,min n ; where in the last step we applied (B.5.9) and Lemma B.5.7. 314 Finally, we show (B.5.6). Observe that for anyv withkvk 2 = 1 v > Sv = 1=2 v > 1=2 S 1=2 1=2 v min 1=2 S 1=2 k 1=2 vk 2 2 = min 1=2 S 1=2 v > v min 1=2 S 1=2 min (); proving the claim. Proof of Lemma B.5.9. By Hoeding’s inequality we have that for any k 2 f1;:::;Kg and any q 2 f1;:::;bn rare,min cg,P (n k q) exp n 2n rare,min q n 2 o . Then using a union bound, we have P K \ k=1 fn k >qg ! = 1P K [ k=1 fn k qg ! 1 K X k=1 P (n k q) 1K exp 2n rare,min q n 2 : B.6 EstimatingPRESTO Code generating all plots and tables that appear in the paper is available in the public GitHub repo at https://github.com/gregfaletto/presto. Executing the code in that repo generates the exact plots that appear in this version of the paper. 315 In all data experiments, we implemented PRESTO by slightly modifying the code for the RordinalNet package (version 2.12), as we discuss below. The real data experiments from Section 3.4.3 and Appendix 3.4.4 were conducted in R Version 4.3.0 running on macOS Ventura 13.3.1 on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i5 processor and 16 GB or RAM. We used the R packages MASS [179, version 7.3.58.1], simulator [19, version 0.2.4], ggplot2 [188, version 3.3.6],cowplot [190, version 1.1.1], andstargazer [84, version 5.2.3], all available for download on CRAN, as well as the base parallel package (version 4.3.0). As mentioned earlier, the data for the real data experiment from Section 3.4.3 is the soup data set from the R ordinal package (version 2022.11.16), and the data for the real data experiment from Appendix 3.4.4 is the PreDiabetes data set from the RMLDataR package (version 1.0.1) The synthetic data experiments from Sections 3.4.1 and 3.4.2, as well as Simulation Studies A and B in Appendix B.3, were conducted in R Version 4.2.2 running on macOS 10.15.7 on an iMac with a 3.5 GHz Quad-Core Intel Core i7 processor and 32 GB or RAM. 
We used the R packages MASS (version 7.3.58.1), simulator (version 0.2.4),ggplot2 (version 3.3.6),cowplot (version 1.1.1), andstargazer (version 5.2.3), all available for download on CRAN, as well as the baseparallel package (version 4.2.2). In the remainder of this section, we describe our implementation of PRESTO. We will use the notation and terminology of Wurm, Rathouz, and Hanlon [195], with the exception of continuing to use our conven- tion ofK total categories. We tPRESTO by reparameterizing the ecient coordinate descent algorithm used to estimate` 1 -penalized ordinal regression models in the RordinalNet [195] package, in much the same way that the generalized lasso can be implemented using reparameterization; see Tibshirani and Taylor [172, Section 3] or Section 4.5.1.1 of Hastie, Tibshirani, and Wainwright [79] for a textbook-level discussion. The ordinalNet package implements elastic net penalized ordinal regression, including an` 1 penal- ized (or ` 2 -penalized, as we explored in Section B.1.1) relaxation of the proportional odds model. The 316 parameters we seek to model are i 2 R K1 , which model the probabilities of theK outcomes by the relation i = g(p i ), wherep i 2 S K1 (whereS K1 := fp : p 2 (0; 1) K1 ;kpk 1 < 1g are the probabilities of outcomesf1;:::;K 1g; in particular, p ik = P(y i = k) fork2f1;:::;K 1g and P(y i = K) = 1 P K1 k=1 p ik ) and g :S K1 ! R K1 is an invertible function. For our model, the forwards cumulative probability model, wherep ik =P(y i k), g(p i ) k = log P k k 0 =1 p ik 0 1 P k k 0 =1 p ik 0 ! ; k2f1;:::;Kg: We choose the nonparallel model i =c +B > x i , wherex i is thei th row ofX,c2R K1 is a vector of intercepts, and B = B 1 B K1 is ap (K 1) matrix of coecients. Observe that we can simply write i = ~ X i , for2R (K1)(p+1) dened by = 0 B B B B B B B B B B @ c B 1 . . . B K1 1 C C C C C C C C C C A and ~ X i 2R (K1)(K1)(p+1) dened by ~ X i = 0 B B B B B B B B B B B B B B @ x > i 0 > p 0 > p 0 > p 0 > p x > i 0 > p 0 > p I K1 . . . . . . . . . . . . . . . 0 > p 0 > p x > i 0 > p 0 > p 0 > p 0 > p x > i 1 C C C C C C C C C C C C C C A = I K1 I K1 x > i : 317 The RordinalNet package solves the convex [196, 142] optimization problem arg min 8 < : 1 n n X i=1 ` i g 1 ~ X > i + (K1)(p+1) X q=K j q j 9 = ; where ` i (p i ) = K1 X k=1 1fy i =kg logp ik + 1fy i =Kg log 1 K1 X k=1 p ik ! : We would like to place an` 1 penalty on the rst dierencesB j;k+1 B jk for allk2f1;:::;K 2g. We can do this through the parameterization 2R p(K1) dened by jk = 8 > > > > < > > > > : B j1 ; j2 [p];k = 1; B jk B j;k1 ; j2 [p];k2f2;:::;K 1g: Observe that these matrices are related by B jk = k X k 0 =1 jk 0; so ik =c 0k +B > k x i =c 0k + k X k 0 =1 > k 0x i ; k2f1;:::;K 1g: Therefore we can simply write i = ~ X 0 i 0 ; 318 for 0 2R (K1)(p+1) dened by 0 = 0 B B B B B B B B B B @ c 1 . . . K1 1 C C C C C C C C C C A and ~ X 0 i 2R (K1)(K1)(p+1) dened by ~ X 0 i = 0 B B B B B B B B B B B B B B @ x > i 0 > p 0 > p 0 > p x > i x > i 0 > p 0 > p I K1 . . . . . . . . . . . . . . . 
x > i x > i x > i 0 > p x > i x > i x > i x > i 1 C C C C C C C C C C C C C C A : (B.6.1) We then seek to solve the slightly modied optimization problem (modication highlighted) arg min 0 8 > > < > > : 1 n n X i=1 ` i 0 B B @ g 1 0 B B @ ~ X 0 i > | {z } () 0 1 C C A 1 C C A + (K1)(p+1) X q=K 0 q : 9 > > = > > ; Though this can not be implemented within the framework of the existing ordinalNet package, the above modication only requires changing a handful of lines of the publicly available source code in the ordinalNet package. Though our implementation is simple, using the modied design matrix (the change above) could cause convergence of parameter estimation to be slow, because the resulting lasso problem eectively has highly correlated features, which slows the convergence of coordinate descent. See Section 4.5.1.1 of Hastie, Tibshirani, and Wainwright [79] for a textbook-level discussion of this point. Though this is a convenient way to estimate PRESTO, it is not the best approach for estimating PRESTO in practice, and 319 the limitations of our estimation approach are not limitations of PRESTO itself. Known ecient strate- gies for solving optimization problems like PRESTO are discussed in Hastie, Tibshirani, and Wainwright [79, Section 4.5] and Arnold and Tibshirani [8]. Though PRESTO does not strictly lie within the class of optimization problems considered by Xin et al. [198], some of their ideas might also be helpful for better estimating PRESTO. An ADMM approach like the one proposed by [206] for generalized lasso problems may also be useful for estimating PRESTO. Ko, Yu, and Won [105] propose their own methods for solving similar optimization problems. 320 AppendixC FusedExtendedTwo-WayFixedEectsforDierence-in-Dierences Appendix In Appendix C.1 we present additional results from the synthetic experiment in Section 4.6. We state Theorem C.2.1, which our main results all depend on, in Appendix C.2, and we prove it in Appendix C.4. We prove Theorems 4.5.1, 4.5.2, and 4.5.3 in Appendix C.3. In Appendix C.5 we prove supporting results required to prove Theorems 4.5.1, 4.5.2, and 4.5.3, and in Appendix C.6 we prove supporting results for Theorem C.2.1. Finally, we prove some technical lemmas in Appendix C.7. C.1 AdditionalResultsForSyntheticExperiments Table C.1 contains results for the cohort-specic squared estimation errors for the cohort-specic average treatment eects ATT (r) as dened in (4.18). We see that FETWFE outperforms every method for every cohort. Table C.2 containsp-values for paired one-tailedt-tests of the alternative hypothesis that FETWFE has lower squared estimation error than the competitor methods, showing that the dierences are statisti- cally signicant at the 0:05 level in all settings except the comparison to ETWFE for the cohort 5 averages, where the dierence is just shy of signicant at the 0:05 level. Figures C.1, C.2, C.3, C.4, and C.5 display boxplots of these squared errors, similar to Figure 4.1 for the average treatment eect. These plots also indicate that FETWFE typically outperforms the other methods. 321 1e−06 1e−03 1e+00 bridge_etwfe etwfe fetwfe twfe_covs Method Cohort 2 ATT Squared Error Figure C.1: Boxplots of squared errors for each method’s estimate of ATT (2) across all 280 simulations. Vertical axis is on a log scale. 322 1e−06 1e−03 1e+00 bridge_etwfe etwfe fetwfe twfe_covs Method Cohort 3 ATT Squared Error Figure C.2: Boxplots of squared errors for each method’s estimate of ATT (3) across all 280 simulations. Vertical axis is on a log scale. 
[Figure C.3: Boxplots of squared errors for each method's estimate of $\tau_{\text{ATT}}(4)$ across all 280 simulations (methods: bridge_etwfe, etwfe, fetwfe, twfe_covs; vertical axis on a log scale).]

[Figure C.4: Boxplots of squared errors for each method's estimate of $\tau_{\text{ATT}}(5)$ across all 280 simulations (methods: bridge_etwfe, etwfe, fetwfe, twfe_covs; vertical axis on a log scale).]

[Figure C.5: Boxplots of squared errors for each method's estimate of $\tau_{\text{ATT}}(6)$ across all 280 simulations (methods: bridge_etwfe, etwfe, fetwfe, twfe_covs; vertical axis on a log scale).]

Table C.1: Means and standard errors for the squared errors of estimates of $\tau_{\text{ATT}}(r)$ by each method for each cohort in the simulation study from Section 4.6.

            FETWFE          ETWFE           BETWFE          TWFE_COVS
Cohort 2    0.002 (0.001)   0.134 (0.013)   0.018 (0.005)   1.548 (0.095)
Cohort 3    0.004 (0.000)   0.203 (0.086)   0.432 (0.037)   1.225 (0.096)
Cohort 4    0.014 (0.001)   0.126 (0.030)   1.598 (0.088)   4.594 (0.266)
Cohort 5    0.010 (0.001)   0.306 (0.198)   1.412 (0.085)   1.705 (0.147)
Cohort 6    0.009 (0.001)   0.167 (0.071)   2.170 (0.097)   2.145 (0.168)

Table C.2: $p$-values for paired one-tailed $t$-tests of the alternative hypothesis that the FETWFE estimate of $\tau_{\text{ATT}}(r)$ has lower squared error than the competitor methods for each cohort in the simulation study from Section 4.6.

            ETWFE       BETWFE      TWFE_COVS
Cohort 2    3.59e-21    0.000504    2.23e-42
Cohort 3    0.0107      2.41e-25    9.98e-30
Cohort 4    0.000131    7.93e-49    5.17e-46
Cohort 5    0.0676      1.11e-43    1.03e-25
Cohort 6    0.0136      4.71e-64    6.34e-30

C.2 Theorem C.2.1

Theorem C.2.1 connects the bridge regression theory of Kock [106] to the difference-in-differences theory of Wooldridge [194]. It also makes use of some extensions of the theory of Kock [106] (Proposition C.4.4, Proposition C.4.6). Equipped with Theorem C.2.1, Theorems 4.5.1, 4.5.2, and 4.5.3 are relatively straightforward to prove.

Theorem C.2.1.

1. (Correct specification.) Suppose Assumptions (CNAS), (CCTS), and (LINS) hold. Then for all $i \in [N]$, $t \in [T]$, $r \in \mathcal{R}$,
$$\mathbb{E}[y_{(it)} \mid W_i = 0, X_i] = \nu + \eta_t + X_i^\top(\kappa + \xi_t)$$
and
$$\mathbb{E}[y_{(it)} \mid W_i = r, X_i] = \nu + \eta_t + X_i^\top(\kappa + \xi_t + \delta_r) + \gamma_r + \mathbb{1}\{t \ge r\}\left(\tau_{rt} + \dot{X}_{(ir)}^\top \rho_{rt}\right),$$
where $\dot{X}_{(ir)} = X_i - \mathbb{E}[X_i \mid W_i = r]$, for identifiability we define the time-dependent parameters to equal 0 at $t = 1$, and $\nu$, $\eta_t$, etc. denote the regression estimands defined in (4.22) (the intercept, the time effects, the covariate main effects and their time and cohort interactions, the cohort effects, the base treatment effects, and the interactions of the treatment effects with the centered covariates). This implies that
$$\tau_{\text{CATT}}(r, x, t) = \tau_{rt} + \dot{x}_r^\top \rho_{rt}, \qquad r \in \mathcal{R}, \; t \ge r. \tag{C.2.1}$$

2. (Estimation consistency.) In addition to the assumptions from part 1, suppose Assumptions (F1), (F2), S($s_N$), and (R1)-(R3) hold. Then for any $q > 0$, $\|\hat\beta^{(q)} - \beta_N\|_2 = O_P(\min\{h_N, h'_N\})$, where $h_N$ is defined in (4.31) and $h'_N$ is defined in (4.30).

3. (Selection consistency.) Define
$$\theta_N := D_N \beta_N \tag{C.2.2}$$
and $\hat\theta^{(q)} := D_N \hat\beta^{(q)}$. In addition to the previous assumptions, assume that $q \in (0, 1)$ and Assumptions (R4)-(R5) hold. Then $\lim_{N \to \infty} \mathbb{P}\big(\hat\theta^{(q)}_{S^c} = 0\big) = 1$. (This also implies that $\hat\theta^{(q)}_{S^c} \xrightarrow{p} 0$, $\lim_{N \to \infty} \mathbb{P}\big(\sqrt{NT}\,\hat\theta^{(q)}_{S^c} = 0\big) = 1$, and $\lim_{N \to \infty} \mathbb{P}\big(\sqrt{NT}\,v_N^\top \hat\theta^{(q)}_{S^c} = 0\big) = 1$ for any sequence of finite vectors $v_N \in \mathbb{R}^{p_N}$.)

Further, let $\hat{S}_N := \{j : \hat\theta^{(q)}_j \neq 0\}$, and similarly let $S_N \subseteq [p_N]$ be the set of components of $\theta_N$ that are nonzero. Then
$$\lim_{N \to \infty} \mathbb{P}\big(\hat{S}_N = S_N\big) = 1.$$

4. (Asymptotic normality.) In addition to the previous assumptions, assume Assumption (R6) holds and Assumption S($s$) holds for a fixed $s_N = s$.
Letf N g 1 N=1 be a sequence of real-valued vectors where for each N 2R p N , ( N ) S =2R s is xed, nite, and entrywise nonzero and ( N ) [p N ]nS =b N , wherefb N g 1 N=1 isanysequenceofconstantswhereeachb N 2R p N s containsallniteentries. Then p NT > N ( ^ (q) N ) d !N 0; 2 > Cov G N ZD 1 N ()S 1 ; 328 where Cov G N ZD 1 N ()S is the columns and rows of the covariance matrix corresponding to the nonzero components of N . 5. (Asymptotic normality whenS is estimated.) Maintain the assumptions from part (d). For a random matrixA2R NTp N , dene d Cov(A) := 1 NT (AEA) > (AEA) (C.2.3) to be the estimated covariance matrix of A. For anyA N [p N ] such thatA N 6= ;, consider d Cov (G N ZD 1 N ) ()A N ,theestimatedcovariancematrixofthecolumnsofG N ZD 1 N corresponding toA N . For sets (A N ) 1 N=1 , dene(A N )2R jA N j to be a xed, nite, nonzero vector for each unique A N . Dene the sequence of vectorsf N (A N )g 1 N=1 to have components inA N equal to(A N ) and componentsb N (A N )inthecomponentscorrespondingto [p N ]nA N thatareallniteforallN. Dene the random variableU(A N ) to equal 1 v u u t NT (A N ) > d Cov (G N ZD 1 N ) ()A N 1 (A N ) N (A N ) > ( ^ N ) ifA N 6=; and d Cov (G N ZD 1 N ) ()A N is invertible and 0 otherwise. Then if at least one of the components of N (S) corresponds to a nonzero component of N , U( ^ S N ) d !N (0; 1): Proof. Provided in Section C.4. 329 C.3 ProofsofMainResults Proof of Theorem 4.5.1. We will rst prove consistency of the average treatment eects, then the condi- tional average treatment eects. Observe that under our assumptions our model is correctly specied due to Theorem C.2.1(1), so using (4.17) and (C.2.1) we have ATT (r;t) =E [ CATT (r;X i ;t)] = rt + ( rt ) > E h X (r) (it) E X (it) jW i =r i = rt + ( rt ) > E E X (it) X (it) jW i =r = rt 8i2 [N]; (C.3.1) where we used that _ X (r) (it) = E X (it) jW i =r . Examining our estimator ^ ATT (r;t) for the average treatment eect for a unit in cohortr at timet from (4.26), we see that we can then apply Theorem C.2.1(2) to immediately obtain that for everyr2R andtr, j^ ATT (r;t) ATT (r;t)j = ^ (q) rt rt k ^ (q) N k 2 =O P minfh N ;h 0 N g : (C.3.2) Similarly, for the average treatment eects within cohorts (4.18), using (4.27), (C.2.1), and (C.3.1) we have for eachr2R j^ ATT (r) ATT (r)j = 1 Tr + 1 T X t=r ^ (q) rt rt 1 Tr + 1 T X t=r ^ (q) rt rt =O P minfh N ;h 0 N g ; 330 where in the last step we used (C.3.2) and the fact thatT is xed. Finally, for our estimator (4.28) of ATT , (4.16), we have j^ ATT ATT j = X r2R 1 Tr + 1 N r N ^ ATT (r) ~ r ATT (r) = X r2R 1 Tr + 1 N r N (^ ATT (r) ATT (r)) ~ r N r N ATT (r) X r2R 1 Tr + 1 N r N j^ ATT (r) ATT (r)j + X r2R ~ r N r N j ATT (r)j =O P (minfh N ;h 0 N g) + X r2R ~ r N r N 1 Tr + 1 T X t=r rt ; where we used that N r =N 2 [0; 1] for all r and the result from the previous part. (Recall that N r = P N i=1 1fW i =rg the total number of units in cohortr andN =N 1 P N i=1 1fW i 6= 0g = P r 0 2R N r 0 is the total number of treated units.) We will show that the second term in this sum converges in probability to 0. By the weak law of large numbers, N r 1 N ; N r 2 N ;:::; N r R N p ! ( r 1 ; r 2 ;:::; r R ); where r =E[ r (X)] are the marginal probabilities of treatment assignments, as dened in (4.34). By the continuous mapping theorem, this also implies that N N = 1 N R X k=1 N r k p ! R X k=1 r k =: : Therefore by the continuous mapping theorem and using the fact that under Assumption (F2) r > 0 for allr, for allr2R ~ r N r N = ~ r N r =N N =N p ! 
~ r r =j~ r ~ r j = 0: 331 This fact is enough to show that P r2R ~ r Nr N j rt j p ! 0 since 1 Tr + 1 T X t=r rt is nite under our assumptions due to Lemma C.3.1: LemmaC.3.1. Under Assumption (R3), max j2p N fj( N ) j jg (T 1) 2 b 1 . Proof. Provided in Section C.5. Now we consider the conditional average treatment eects. Again, using (C.2.1) (and see also Equation 4.12), CATT (x;r;t) = rt + ( rt ) > xE X (1t) jW i =r = rt + ( rt ) > (x r ); where r :=E X (it) jW i =r . Using this and (4.23) we have j^ CATT (r;x;t) CATT (r;x;t)j = ^ (q) rt rt + (xX r ) > ^ rt (x r ) > rt = ^ (q) rt rt + (x r ) > (^ rt rt ) + r X r > (^ rt rt ) X r r > rt ^ (q) rt rt + (x r ) > (^ rt rt ) + r X r > (^ rt rt ) + X r r > rt (a) ^ (q) rt rt +kx r k 2 k^ rt rt k 2 + r X r 2 k^ rt rt k 2 + X r r 2 k rt k 2 (b) =O P (minfh N ;h 0 N g) +O P (minfh N ;h 0 N g) + r X r 2 O P minfh N ;h 0 N g + X r r 2 k rt k 2 (c) =O P (minfh N ;h 0 N g) +O P (minfh N ;h 0 N g) +O P minfh N ;h 0 N g p N +O P 1 p N =O P minfh N ;h 0 N g_ 1 p N (d) =O P minfh N ;h 0 N g ; 332 where in (a) we applied Cauchy-Schwarz; (b) follows from Theorem C.2.1, Proposition 1.8(v) from García- Portugués [69], the fact thatk rt k 2 is xed and bounded using Lemma C.3.1, and the fact thatd is xed; (c) follows because N r converges to innity at a rate proportional to N since r > 0 is xed, and if p NX N d ! ~ ZN (0; ) thenk p NX N ~ Zk 2 = o P (1), so p NkX N k 2 k ~ Zk 2 +o P (1) =O P (1), so kXk 2 =O P (1= p N); and (d) follows because 1= p N =O(h N ) sincee 1N is upper-bounded by a constant andp N is nondecreasing. Next, using (4.24) and (4.13) we have j^ CATT (r;x) CATT (r;x)j = 1 Tr + 1 T X t=r ^ CATT (r;x;t) CATT (r;x;t) 1 Tr + 1 T X t=r j^ CATT (r;x;t) CATT (r;x;t)j =O P minfh N ;h 0 N g : Finally, from (4.14) and (4.25), j^ CATT (x) CATT (x)j = X r2R ^ r (x) P r 0 2R ^ r 0(x) ^ CATT (r;x) ~ r (x) CATT (r;x) = X r2R ^ r (x) P r 0 2R ^ r 0(x) (^ CATT (r;x) CATT (r;x)) ~ r (x) ^ r (x) P r 0 2R ^ r 0(x) CATT (r;x) X r2R ^ r (x) P r 0 2R ^ r 0(x) j^ CATT (r;x) CATT (r;x)j + X r2R ~ r (x) ^ r (x) P r 0 2R ^ r 0(x) j CATT (r;x)j X r2R j^ CATT (r;x) CATT (r;x)j + X r2R ~ r (x) ^ r (x) P r 0 2R ^ r 0(x) j CATT (r;x)j =O P minfh N ;h 0 N g + X r2R ~ r (x) ^ r (x) P r 0 2R ^ r 0(x) j CATT (r;x)j: 333 Similarly to the average treatment eect case above, since ^ r (x) p ! r (x) for any xedx for all r2 f0g[R and r (x)> 0 for allr2f0g[R by Assumption (F2), the continuous mapping theorem implies ^ r (x) P r 0 2R ^ r 0(x) p ! r (x) P r 0 2R r 0(x) = ~ r (x) for allr2R. Then the fact that CATT (r;x) = 1 Tr+1 P T t=r ( rt +x > rt ) is bounded for any xedx for allr2R due to Lemma C.3.1 implies that X r2R ~ r (x) ^ r (x) P r 0 2R ^ r 0(x) j CATT (r;x)j p ! 0; proving the result. Proof of Theorem 4.5.2. This is immediate from Theorem C.2.1(3). Proof of Theorem 4.5.3. The result will follow by applying Theorem C.2.1(5) with particular choices of vec- tors N . The oracle properties in the case where the treatment eects are nonzero follow from the same arguments that we present below, but applying Theorem C.2.1(4) rather than (5). The oracle properties persist even in the case that the treatment eects equal 0 because, as noted in Theorem 4.5.2, in this case the random variables scaled by p NT converge in probability (and distribution) to 0. Again, by the same reasoning used in the proof of Theorem 4.5.1, the average treatment eect ATT (r;t) is equal to the regression estimand rt . 
Leti(r;t) be the index of the row ofD N corresponding to ^ (q) rt (that is, the column ofZ corresponding to the indicator variable for this regression coecient). Next we will show that N = (D 1 N ) i(r;t); > , thei(r;t) th row ofD 1 N , satises the requirements of N imposed by Theorem C.2.1(5). To do so, we will make use of the following lemma: 334 LemmaC.3.2. After re-ordering the columns ofZ, we can writeD N as diag D (1) (R);D (1) (T1);I d N ;I d N D (1) (R);I d N D (1) (T1);D (2) (R);I d N D (2) (R) ; (C.3.3) whereD (1) (t)2 R tt is dened in (C.5.1) andD (2) (R)2 R WW is dened in (C.5.2). Further,D N is invertible, andD 1 N can be written as diag D (1) (R) 1 ; ((D (1) (T 1)) 1 ;I d N ;I d N D (1) (R) 1 ;I d N ((D (1) (T 1)) 1 ; (D (2) (R)) 1 ;I d N (D (2) (R)) 1 : (C.3.4) Finally, bothD N andD 1 N consists of blocks of size at most (T 1) 2 (T 1) 2 . ForD N the entries of these blocks are either1, 0, or 1; andD 1 N the entries of these blocks are either 0 or 1. Both matrices equal 0 elsewhere. Proof. Provided in Appendix C.6. Since each row ofD 1 N contains at most (T 1) 2 nonzero entries, all of the nonzero entries equal 1, and the entries corresponding to the s relevant features will stay xed even if p N !1, the vector N = (D 1 N ) i(r;t); > is a valid choice for Theorem C.2.1(5). We therefore have that ^ (q) rt is asymptotically normal when appropriately scaled if at least one of the nonzero components of (D 1 N ) i corresponds to one of thes features inS: 1 v u u t NT (D 1 N ) i(r;t); ^ S d Cov (G N ZD 1 N ) () ^ S 1 ((D 1 N ) i(r;t); ^ S ) > (D 1 N ) i(r;t); ( ^ (q) N ) d !N (0; 1) () s NT ^ v ATT;r;t N (^ ATT (r;t) ATT (r;t)) d !N (0; 1); 335 where (D 1 N ) i(r;t); ^ S 2 R j ^ Sj is a row vector that contains thej ^ Sj components of thei(r;t) row ofD 1 N corresponding to the selected features and ^ v ATT;r;t N := 2 (D 1 N ) i(r;t); ^ S d Cov (G N ZD 1 N ) () ^ S 1 ((D 1 N ) i(r;t); ^ S ) > ; (C.3.5) where d Cov() is dened in (C.2.3). Similarly, since ATT (r) = 1 Tr + 1 T X t=r rt = 1 Tr + 1 T X t=r (D 1 N ) i(r;t); N ; we can choose (ATT;r) N := 1 Tr+1 P T t=r (D 1 N ) i(r;t); ^ S (which has no more than (Tr+1)(T1) 2 nonzero entries) to establish that 1 v u u t NT ( (ATT;r) N ) > d Cov (G N ZD 1 N ) () ^ S 1 (ATT;r) N ( (ATT;r) N ) > ( ^ (q) N ) d !N (0; 1) () s NT ^ v ATT;r N (^ ATT (r) ATT (r)) d !N (0; 1); where ^ v ATT;r N := 2 (ATT;r) N > d Cov (G N ZD 1 N ) () ^ S 1 (ATT;r) N : (C.3.6) C.4 ProofofTheoremC.2.1 Before we prove the individual statements, we outline our general approach. Notice that the assumptions leading to equation (6.33) in Wooldridge [194, Section 6.3] are satised, so we have for anyi2 [N] E h y (i) jW i ;X i i =Z (i) N ; (C.4.1) 336 where the subscript (i) refers to the T observations corresponding to unit i. (This veries part (1) of Theorem C.2.1.) We propose estimating the regression (4.21), ^ (q) = arg min 2R p N kG N _ yG N Zk 2 2 + N kD N k q q : We proved in Lemma C.3.2 that the dierences matrixD N is invertible. We can therefore solve (4.21) through the reparameterization ^ (q) =D N ^ (q) , solving for ^ (q) = arg min 2R p N kG N _ yG N ZD 1 N k 2 2 + N kk q q using standard bridge regression. Then ^ (q) =D 1 N ^ (q) is identical to the solution to (4.21) [172, Section 3]. 
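The reparameterization just described can be sketched in code. The snippet below is a minimal illustration, not our actual implementation: it solves a fusion-penalized least-squares problem by transforming the design to $ZD^{-1}$, fitting a standard penalized regression for $\hat\theta$, and mapping back via $\hat\beta = D^{-1}\hat\theta$. For simplicity it uses the lasso special case ($q = 1$) via the glmnet package, whereas our estimator uses a bridge penalty with $q \in (0,1)$, and the matrices Z and D and the response y are synthetic placeholders rather than the FETWFE design and fusion matrix.

```r
# Minimal sketch (not the FETWFE implementation): solve
#   min_beta ||y - Z beta||_2^2 + lambda * ||D beta||_1
# via theta = D beta, a standard lasso on Z D^{-1}, and beta = D^{-1} theta.
library(glmnet)

set.seed(1)
n <- 200; p <- 10
Z <- matrix(rnorm(n * p), n, p)                    # placeholder design matrix
D <- diag(p); D[cbind(2:p, 1:(p - 1))] <- -1       # invertible first-difference matrix
beta_true <- c(rep(1, 5), rep(0, 5))               # piecewise-constant coefficients
y <- drop(Z %*% beta_true) + rnorm(n)

Z_tilde <- Z %*% solve(D)                          # transformed design Z D^{-1}
fit <- glmnet(Z_tilde, y, alpha = 1, intercept = FALSE)
theta_hat <- as.numeric(as.matrix(coef(fit, s = 0.05)))[-1]  # drop intercept row
beta_hat <- solve(D, theta_hat)                    # map back to original coordinates
```

The same logic underlies our PRESTO implementation in Appendix B; only the penalty and the differencing matrix change.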
337 Having cast (4.21) as a standard bridge regression optimization problem, our results will follow from Theorem 1, Lemma 3, and Theorem 2(ii) from Kock [106] if we show the assumptions are met in our setting where ~ Y iN =y (i) ; ~ X iN =Z (i) D 1 N ; 0 = N ; c iN =c i 1 T ; ~ iN =u (i) ; Y iN = 1=2 y (i) ; X iN = 1=2 Z (i) D 1 N ; X N =G N ZD 1 N ; W N = (G N ZD 1 N ) ()S ; 1N = ^ ((G N ZD 1 N ) ()S ); iN = 1=2 (i) ; =q; and k N =s N ; where the left side of each equation uses the notation from Kock [106] and the right side uses the notation in our setting. We focus on the random eects setting of [106] because we cannot perform the forward orthogonal deviations transform sinceZ (andZD 1 N ) contain (linear combinations of) time-invariant covariates and cohort indicators. Now we are prepared to prove each statement. 338 1. As mentioned earlier, under our assumptions this is immediate from Equation (6.33) in Wooldridge [194, Section 6.3]. We also provide a derivation here in our notation. Observe that for anyi2 [N], E y (i1) jW i ;X i =x = 1fW i = 0gE y (i1) (0)jW i = 0;X i =x + X r2R 1fW i =rgE y (i1) (r)jW i =r;X i =x (a) = 1fW i = 0gE y (i1) (0)jW i = 0;X i =x + X r2R 1fW i =rgE y (i1) (0)jW i =r;X i =x (b) = 1fW i = 0g +x > + X r2R 1fW i =rg + r +x > ( + r ) = +x > + X r2R 1fW i =rg r +x > r ; (C.4.2) where in (a) we used Assumption (CNAS) and in (b) we used (4.7) and (4.8) from (LINS). Next, for anyt2f2;:::;Tg andi such thatW i = 0, E y (it) jW i = 0;X i =E y (it) (0)y (i1) (0)jW i = 0;X i =x +E y (i1) (0)jW i = 0;X i =x (c) =E y (it) (0)y (i1) (0)jX i =x +E y (i1) (0)jW i = 0;X i =x (d) = + t +x > ( t + ); 339 where in (c) we used (CCTS) and in (d) we used (LINS) and (C.4.2). Then for anyi such thatW i = r2Rnf2g andt2f2;:::;r 1g, E y (it) jW i =r;X i =x =E y (i1) jW i =r;X i =x +E y (it) (r)y (i1) (r)jW i =r;X i =x (e) =E y (i1) jW i =r;X i =x +E y (it) (r)y (i1) (r)jX i =x (f) = E y (i1) jW i =r;X i =x +E y (it) (0)y (i1) (0)jX i =x (g) = + r +x > ( + r + t ) + t ; (C.4.3) where in (e) we used (CCTS), in (f) we used (CNAS), and in (g) we used (C.4.2) and (4.7) and (4.8) from (LINS). Finally, for anyi such thatW i =r2R andt2fr;:::;Tg, E y (it) jW i =r;X i =x =E y (i1) jW i =r;X i =x +E y (it) (0)y (i1) (0)jW i = 0;X i =x +E y (it) (r)y (i1) (r)jW i =r;X i =x E y (it) (0)y (i1) (0)jW i = 0;X i =x (h) = + r +x > ( + r + t ) + t + rt + _ x r rt ; where in (h) we used (4.8), (4.9), and (4.10) from (LINS), (C.4.3), and (CCTS). Therefore regression (4.4) is correctly specied and (C.4.1) holds. 2. First we will show that Assumptions (FE1) - (FE3), (RE4), (A1), (A3), (A5), and (A6) from Kock [106] are satised. Then we will show that the conclusion of Theorem 1 in Kock [106] about N leads to our statement about N . We will use three lemmas: 340 LemmaC.4.1. Under Assumption (R1) it holds that E h (ZD 1 N ) 4 (it)j i 4 8i2 [N];j2 [p N ];t2 [T ] for some nite 4 > 0. Further, there exists a nite 2 > 0 such that E h (G N ZD 1 N ) 2 (it)j i 2 8i2 [N];j2 [p N ];t2 [T ]: LemmaC.4.2. ThenumberoftreatmenteectsWisatmost (T1) 2 . Further,thesmallestandlargest singular values ofD N are bounded as follows: max (D N ) 3 and min (D N ) 1 T p 2T . LemmaC.4.3. Under the assumptions of Theorem C.2.1(2), 1N = min ^ G N ZD 1 N e 1N 2 9(T 2 c + 2 ) and 2N = max ^ G N ZD 1 N e 2N 8T 3 : The proofs of all three of these lemmas are provided in Appendix C.6. Now we will verify one assumption at a time. • In our notation, Assumption (FE1) from Kock [106] requires that (Z (i) D 1 N ;c i ;u (i) ) N i=1 are iid. 
This is satised by our Assumption (F2) sinceZ (i) is a deterministic, invertible function of (W i ;X i ), andD 1 N is also deterministic and invertible conditional on (W i ;X i ). • Assumption (FE2) requires that E[(ZD 1 N ) 4 (it)j ] and E[u 4 (it) ] are nite for all i 2 [N];j 2 [d N ];t2 [T ]. We assumed the latter in Assumption (R1). We proved that the former is satised in the rst part of Lemma C.4.1. 341 • Next, Assumption (FE3) requires that E u (i) jZ (i) D 1 N ;c i = 0 and Var u (i) jZ (i) D 1 N ;c i = 2 I T ; i2 [N]; and Assumption (RE4) requires E c i jZ (i) D 1 N = 0; Var c i jZ (i) D 1 N = 2 c ; and that and 2 c are known and nite. Our Assumption (F1) is sucient for this because conditioning onW i ;X i is the same as conditioning onZ (i) D 1 N by the following argument. BecauseD 1 N is invertible, the-algebra generated byZ (i) D 1 N ,(Z (i) D 1 N ), contains the same information as(Z (i) ). Similarly, for each (W i ;X i ) there is only one validZ (i) , and every validZ (i) corresponds to a unique (W i ;X i )—that is, the mapping from (W i ;X i ) to Z (i) is also invertible, so(W i ;X i ) =(Z (i) ). • Our Assumptions (R1), (R2), and (R3) exactly match assumptions (A1), (A3), and (A5), respec- tively, from Kock [106] (in our setting where we assumeT is xed). • Finally, Assumption (A6) in Kock [106] requires that p N + N s N N 1N a:s: ! 0; where 1N is the smallest eigenvalue of the random matrix ^ G N ZD 1 N . Under Lemma C.4.3 the smallest eigenvalue of ^ (Z),e 1N , is within a constant factor of 1N , so our Assump- tion (R2) suces to satisfy this condition. 342 We have shown that the assumptions of Theorem 1 in Kock [106] are satised, so we havek ^ (q) N k 2 =O P (minfh N ;h 0 N g). Finally, part (2) is proven because ^ (q) N 2 is within a constant factor ofk ^ (q) N k 2 : k ^ (q) N k 2 = D 1 N ^ (q) N 2 D 1 N op ^ (q) N 2 T p 2T ^ (q) N 2 ; where in the last step we applied Lemma C.4.2. 3. In light of the results from part (1), part (3) follows from Proposition C.4.4 below, which extends Lemma 3 from Kock [106], under our assumptions. In particular, we will show that our Assumptions (R4) and (R5) suce for Kock [106]’s Assumptions (A4) and (A7), as well as the needed part of Assumption (A2) for Lemma 3, in Kock [106]. PropositionC.4.4 (Extension of Lemma 3 from [106]). UndertheassumptionsofLemma3fromKock [106], lim N!1 P ^ S =S = 1: Proof. Provided in Appendix C.6. Remark C.4.5. It appears that the strategy for the proof of of Proposition C.4.4 could be used to analogously extend Lemma 2, and Theorem 2(i), in [90]. Again we verify one assumption at a time. • Although Lemma 3 from Kock [106] requires their Assumption (A2) in the statement of the theorem, examining the proof we see that a lower bound on the minimum eigenvalue is not needed—the only thing that is needed is an almost sure upper bound on the maximum eigen- value of ^ (G N ZD 1 N ) ()S , the empirical Gram matrix of the columns ofG N ZD 1 N cor- responding to the relevant features in N . (In the proof, note the use of 2 on p. 140, and 343 note that 1 is not needed.) Our Assumption (R4) is sucient for this, since we assume the maximum eigenvalue of the empirical Gram matrix forZ, e 2N is nite, which ensures that the maximum eigenvalue of the Gram matrix for the fullG N ZD 1 N is nite due to Lemma C.4.3. This maximum eigenvalue upper-bounds the maximum eigenvalue of the submatrix ^ (G N ZD 1 N ) ()S . • Our Assumption (R5) matches Assumption (A4) in Kock [106] in our setting whereT N =T is xed. 
• Finally, we show that Assumption (A7) in Kock [106] is satised under our Assumption (R4). We need to show that e 1N r e 2N p N =O p (1): But sincee 1N e 2N , almost surely e 1N r e 2N p N s e 3 2N p N s e 3 max p N =O 1 p p N ; soe 1N p e 2N =p N =O p (1). So the rst part of the result is now immediate. It only remains to show that lim N!1 P p NT ^ (q) S c = 0 = 1: From part (3), we have that lim n!1 P ^ (q) S C = 0 = 1, so usingf p NT ^ (q) S c = 0g =f ^ (q) S c = 0g, lim N!1 P p NT ^ (q) S c = 0 = lim N!1 P ^ (q) S c = 0 = 1: 344 4. To prove parts (4) and (5), we will present an extension of Theorem 2(ii) from Kock [106] that yields our result when in applied in our setting. Then we will show that the needed assumptions are satised to conclude the proof. In particular, our extension (a) obtains the asymptotic convergence of linear combinations of the coecients of both relevant and irrelevant features and (2) establishes the asymptotic convergence of a statistic formed using the estimated Gram matrix on the selected set of coecients, not the population covariance matrix of the true active set. PropositionC.4.6 (Extension of Theorem 2(ii) from [106]). SupposetheassumptionsofTheorem2(ii) from Kock [106] are satised. (a) Letf N g 1 N=1 beasequenceofreal-valuedvectorswhere N 2R p N hasthestructure ( > ;b > N ) > where2R k is a xed, nite, nonzero vector andfb N g 1 N=1 is any sequence of constants where eachb N 2R p N k contains all nite entries. Then 1 s NT > 1 1N > N ( ^ N 0 ) d !N (0; 1): (b) Dene a function that maps anyA 2 P(N) to a xed vector(A) 2 R jAj . Dene the sequenceofvectorsf N (A N )g 1 N=1 tohavecomponentsinA N equalto(A N )andcomponents fb N (A N )g 1 N=1 in the components corresponding to [p N ]nA N that are all nite for allN (but are otherwise arbitrary). Dene the random variable U(A) := 8 > > > > < > > > > : 1 r NT (A) > ( ^ (X ()A )) 1 (A) N (A) > ( ^ 0 ); A6=;; ^ (X ()A ) invertible, 0; otherwise, 345 where ^ isthebridgeestimatorofKock[106,Section3]. Let ^ S N :=fj : ^ Nj 6= 0gbetheestimated selectedsetonadatasetofsizeN. ThenU( ^ S N )convergesindistributiontoastandardGaussian random variable. Proof. Provided in Appendix C.6. Remark C.4.7. It appears that the strategy for the proof of of Proposition C.4.6 could be used to analogously extend Theorem 2(ii) in [90]. It only remains to show that the assumptions of Proposition C.4.6 are satised in our setting. First, we note that Kock [106] states on p. 123 that a xedT is enough to satisfy the last two assumptions of Theorem 2(ii). We have already shown that Assumptions (FE1), (FE2), (FE3), (RE4), (A1), (A3), (A4), (A5), (A6), and (A7) in Kock [106] are satised. Due to Theorem 3(i) from Kock [106], our xedT setting ensures that the uniform integrability assumption of Theorem 2(ii) from Kock [106] is satised. We have assumed xed sparsity in Assumption S(s), and on p. 122 Kock [106] explains why our xedT setting satises all of the remaining assumptions except for (A2). We conclude by showing that the conditions implied by Assumption (A2) in Kock [106] are satised. Kock [106] makes an assumption that the minimum eigenvalue of the empirical Gram matrix of the columns ofG N ZD 1 N corresponding to the nonzero features in N , 1 NT (G N ZD 1 N ) ()S > (G N ZD 1 N ) ()S ; is bounded away from 0 almost surely. Call this minimum eigenvaluee 1SN . This assumption is used in two places, both in the proof of Theorem 2(ii). 346 (a) On p. 142, this assumption is used to establish that q 2 1 e 1SN b 0 2 q1 N r s N NT p ! 
0 () 1 e 1SN N p N p ! 0 () 1 e 1SN 2o P p N N ! ; where the rst equivalence follows because the removed quantities are constant under our assumptions (specically, thatT ands N are xed). Note that our assumption (4.33) ensures that this condition holds for the full empirical Gram matrix due to Lemma C.4.3. The minimum eigenvalue of the full empirical Gram matrix lower boundse 1SN , so (4.33) is sucient for this condition to hold. (b) The assumption is used again on p. 144 of Kock [106], where it is sucient for the minimum eigenvalue of the (population) matrixE h ^ G N ZD 1 N ()S i to be bounded away from 0. This is a submatrix of (4.32), so the minimum eigenvalue of (4.32) is a lower bound for the minimum eigenvalue ofE h ^ G N ZD 1 N ()S i . Therefore our assumption in (R6) that the minimum eigenvalue of (4.32) is bounded away from 0 is enough to satisfy this assumption. So our Assumption (R6), in combination with all of the previous assumptions, is enough to apply Proposition C.4.6, and the proof is complete. 347 C.5 ProofsofLemmasforMainTheorems Proof of Lemma C.3.1. By Lemma C.3.2,D 1 N has a block diagonal structure with blocks of size at most (T1) 2 (T1) 2 containing entries that equal either 0 or 1, sokD 1 N k 1 (T1) 2 , where for a matrix kk 1 denotes the maximum absolute row sum. Therefore k N k 1 = D 1 N N 1 D 1 N 1 k N k 1 (T 1) 2 b 1 ; where in the last step we used Assumption (R3). Proof of Lemma C.3.2. For anyt2N, dene D (1) (t) := 0 B B B B B B B B B B B B B B B B B B B @ 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 . . . . . . . . . . . . . . . . . . . . . 0 0 0 0 1 1 0 0 0 0 0 1 1 C C C C C C C C C C C C C C C C C C C A 2R tt : (C.5.1) Note that this is the linear transformation we apply to several groups of coecients in. For example, for the vector of cohort xed eects if2R R , D (1) (R) = 0 B B B B B B B B B B @ 2 1 . . . R R1 R 1 C C C C C C C C C C A ; 348 which are the transformed coecients to which we apply an` q penalty. We similarly penalize as well as j , the coecients for the interactions of the cohort xed eects and covariatej (separately for each of thed covariates) and j , the coecients for the interactions of the time xed eects and covariatej. So if we rearrange the columns ofZ corresponding to the cohort-covariate interactions so that all of the cohort interactions with the base eect of featurej = 1 come rst, then the cohort interactions with feature 2, and so on, we can write thisRd N Rd N block diagonal part ofD N asI d N D (1) (R), where denotes the Kronecker product. We can likewise rearrange the columns ofZ and write the part ofD N corresponding to the time-covariate interactions asI d N D (1) (T 1). We penalize the coecients directly, so this part ofD can be represented simply asI d N . Next we consider the matrixD (2) (R) we use to transform the base treatment eects for` 1 penaliza- tion. The matrixD (2) (R) has a similar structure toD (1) (t), except that we directly penalize each cohort’s rst treatment eect, not the last one, and we penalize each cohort’s rst treatment eect towards the previous cohort’s rst treatment eect. So denotingR = (r 1 ;r 2 ;:::;r R ),D (2) (R) has a nearly block diagonal structure D (2) (R) := 0 B B B B B B B B B B B B B B @ D (1) (r 1 ) > 0 0 0 0 U(r 1 ) D (1) (r 2 ) > 0 0 0 0 U(r 2 ) D (1) (r 3 ) > 0 0 . . . . . . . . . . . . . . . . . . 
0 0 0 U(r R ) D (1) (r R ) > 1 C C C C C C C C C C C C C C A 2R WW ; (C.5.2) where U(w) := 0 B B @ 1 0 1(Tw) 0 (Tw)1 0 (Tw)(Tw) 1 C C A 2R (Tw+1)(Tw+1) :x 349 ThenD (2) (R) yields the transformed treatment eects to which we apply an` q penalty, and similarly, for everyj2 [d N ],D (2) (R) j yields the transformed coecients of the interactions between thej th centered covariate and the treatment eects. Putting this all together, after re-ordering the columns appropriately we can writeD N as (C.3.3). Finally, using the properties of block diagonal matrices and the fact thatD (1) (t) andD (2) (R) are invertible (which one can see by multiplying these matrices by their inverses that we provide in the proof of Lemma C.4.2 later), we have from (C.3.3) thatD 1 N is diag D (1) (R) 1 ; ((D (1) (T 1)) 1 ;I d N ;I d N D (1) (R) 1 ;I d N ((D (1) (T 1)) 1 ; (D (2) (R)) 1 ;I d N (D (2) (R)) 1 ; matching the expression from (C.3.4). Finally, we show the claimed block diagonal structure ofD N andD 1 N . Examining the expressions forD N andD 1 N , it is clear that they are both block diagonal with blocks of sizeRT 1,T 1, and W (T 1) 2 (from Lemma C.4.2). We have shown in this proof that all of the blocks contain entries equal to 0, -1, or 1. C.6 ProofsofLemmasFromProofofTheoremC.2.1 Proof of Lemma C.4.1. We will rst show thatX having nite fourth moments under Assumption (R1) is enough forZ to have nite fourth moments as well. We will show that each entry ofZD 1 N is a bounded linear combination of a nite number of elements ofZ, and likewise forG N ZD 1 N . Then the nite moments will follow using our Assumption (R1), Cauchy-Schwarz, and Hölder’s inequality. 350 BesidesX, the entries ofZ consist of indicator variables and interactions of the indicator variables with themselves andX. All of these terms are upper-bounded by the greater of 1 orX. Using this, our assumption thatE[X 4 (1)j ] is nite for allj2 [p N ], and the fact thatX is time-invariant, we have that for any xedN, max j2[d N ];t2[T ] fE[Z 4 (1t)j ]g is nite. (Because the units are identically distributed, it is enough to consider only the rst unit.) Now we consider the entries ofZD 1 N 2R NTp N . The columns ofZD 1 N are linear combinations of the columns ofZ; in particular, thej th column ofZD 1 N isZ(D 1 N ) j 2R NT , where (D 1 N ) j is thej th column ofD 1 N . From (C.3.4) and the expressions in the proof of Lemma C.4.2 we see thatD 1 N contains only entries that equal 1 or 0, so for anyt2 [T ] andj2 [p N ], (ZD 1 N ) 4 (1t)j = h Z(D 1 N ) j (1t) i 4 = 0 B @ X fj 0 :(D 1 N ) j 0 j 6=0g Z (1t)j 0 1 C A 4 : From (C.3.4) and the expressions in the proof of Lemma C.4.2 we have thatD 1 N has a block diagonal structure with blocks of size at mostW (T 1) 2 (an upper bound we established in Lemma C.4.2), so this sum contains at most (T 1) 2 terms before expansion. By the multinomial theorem, the number of terms after expansion is therefore at most 4 + (T 1) 2 1 (T 1) 2 1 : T is xed and nite, so the number of terms after expansion is nite as well. Therefore if we can show that each summand after expansion has nite 4 th moment, we have shown that each entry ofZD 1 N has nite 4 th moment. This sum when expanded contains terms of the formZ 4 (1t)j 0 ,Z 2 (1t)j 0 Z 2 (1t)j 00 forj 0 6=j 00 ,Z 3 (1t)j 0 Z (1t)j 00, Z 2 (1t)j 0 Z (1t)j 00Z (1t)j 000 and Z (1t)j 0Z (1t)j 00Z (1t)j 000Z (1t)j 0000 (disregarding constant factors). 
We demonstrate that each of these terms has nite fourth moment below using Cauchy-Schwarz and Hölder’s inequality: 351 • We have already shown thatE[Z 4 (1t)j 0 ] is nite. • E[Z 2 (1t)j 0 Z 2 (1t)j 00 ] q E[Z 4 (1t)j 0 ]E[Z 4 (1t)j 00 ], and terms of this form are nite using the nite fourth moments ofZ. • Recall that by Hölder’s inequality, for anyp;q> 1 such that 1=p + 1=q = 1, ifX andY are random variables with (EjXj p ) 1=p and (EjYj q ) 1=q nite then EjXYj (EjXj p ) 1=p (EjYj q ) 1=q : Choosingp = 4=3 andq = 4, we have E[Z 3 (1t)j 0Z (1t)j 00]E[jZ (1t)j 0j 3 jZ (1t)j 00j] E h (jZ (1t)j 0j 3 ) 4=3 i 3=4 E jZ (1t)j 00j 4 1=4 = EjZ (1t)j 0j 4 3=4 EjZ (1t)j 00j 4 1=4 ; which is nite. • E[Z 2 (1t)j 0 Z (1t)j 00Z (1t)j 000] q E[Z 4 (1t)j 0 ] q E[Z 2 (1t)j 00 Z 2 (1t)j 000 ], and we already showed that terms of this form are nite. • E[Z (1t)j 0Z (1t)j 00Z (1t)j 000Z (1t)j 0000] q E[Z 2 (1t)j 0 Z 2 (1t)j 00 ]E[Z 2 (1t)j 000 Z 2 (1t)j 0000 ], and we already showed that terms of this form are nite. 352 Finally, we consider the columns ofG N ZD 1 N = [I N 1=2 ]ZD 1 N . Observe that for any j 2 [p N ], t2 [T ], by writing the matrix multiplication operation in summation notation similarly to before we can write each entry of 1=2 ZD 1 N (1) as (G N ZD 1 N ) 2 (1t)j = 2 T X t 0 =1 1=2 t;t 0 (ZD 1 N ) (1t 0 )j ! 2 = 2 0 B @ T X t 0 =1 1=2 t;t 0 X fj 0 :(D 1 N ) j 0 j 6=0g Z (1t 0 )j 0 1 C A 2 = 2 0 B @ X fj 0 :(D 1 N ) j 0 j 6=0g T X t 0 =1 1=2 t;t 0 Z (1t 0 )j 0 ! 1 C A 2 : Again, this is a sum over at mostWT (T 1) 2 T terms before expansion, and at most 2 +T (T 1) 2 1 T (T 1) 2 1 : terms after expansion. Each 1=2 t;t 0 is nite under Assumption (F1), so it is enough to show that each individual summand after expansion has nite second moment. We see that expanding will yield terms of the formZ 2 (1t 0 )j 0 ,Z (1t 0 )j 0Z (1t 0 )j 00,Z (1t 0 )j 0Z (1t 00 )j 0, andZ (1t 0 )j 0Z (1t 00 )j 00 forj 0 6=j 00 ,t 0 6=t 00 (again disregarding constant factors). We have already shown thatE[Z 2 (1t)j 0 ] is nite for allj 0 2 [p N ], and for anyj 0 6=j 00 , by Cauchy-Schwarz E Z (1t)j 0Z (1t)j 00 r E h Z 2 (1t)j 0 i E h Z 2 (1t)j 00 i which is also nite. The expectations of the terms with dieringt 0 andt 00 are nite by the same argument. Proof of Lemma C.4.2. Notice from (C.3.3) and the expressions in the proof of Lemma C.3.2 that the` 1 norm of each row or column ofD N is at most 3, so we have max (D N ) p kD N k 1 kD N k 1 p 3 3 = 3; 353 where for matriceskk 1 denotes the maximum absolute column sum of a matrix andkk 1 denotes the maximum absolute row sum. Now we lower bound min (D N ). We have that for any matrixA min (A) 1 q kA 1 k 1 kA 1 k 1 : (C.6.1) Recall our expression forD 1 N from (C.3.4). The singular values of a block diagonal matrix are the union of the singular values of the individual matrices on the diagonal, so it is enough to lower bound the singular values of each of these individual matrices using (C.6.1) and then take the minimum of all of the lower bounds. First we will nd the inverses. One can show that D (1) (t) 1 = 0 B B B B B B B B B B B B B B @ 1 1 1 1 0 1 1 1 0 0 1 1 . . . . . . . . . . . . . . . 0 0 0 1 1 C C C C C C C C C C C C C C A 2R tt ; so using (C.6.1), kD (1) (t) 1 k 1 kD (1) (t) 1 k 1 t 2 =) min D (1) (t) 1 1=t: (C.6.2) 354 Next, by multiplication one can verify that (D (2) (R)) 1 = 0 B B B B B B B B B B @ ~ D(1) > 0 0 0 ~ U(r 1 ;r 2 ) ~ D(2) > 0 0 . . . . . . . . . . . . . . . 
~ U(r 1 ;r R ) ~ U(r 2 ;r R ) ~ U(r R1 ;r R ) ~ D(R) > 1 C C C C C C C C C C A ; where ~ D(x) :=D (1) (Tr x + 1) 1 and ~ U(w 1 ;w 2 ) = 0 B B B B B B B B B B @ 1 0 0 0 1 0 0 0 . . . . . . . . . . . . . . . 1 0 0 0 1 C C C C C C C C C C A 2R (Tw 2 +1)(Tw 1 +1) : Therefore the maximum column` 1 norm of (D (2) (R)) 1 isW, the` 1 norm of the rst column of (D (2) (R)) 1 : The maximum row` 1 norm, in the bottom block of (D (2) (R)) 1 , is at mostR 1 +Tr R + 1R + 1, because each of theR 1 blocks of ~ U matrices has a single 1 in each row and D (1) (Tr R + 1) 1 > has a row ofTr R + 1 ones. So k(D (2) (R)) 1 k 1 k(D (2) (R)) 1 k 1 W(T +R) =) min (D (2) (R)) 1 1 p W(T +R) : 355 Since treatment eects are interactions between theR T 1 cohort indicators and (at most)T 1 treatment times,W (T 1) 2 and we have min (D (2) (R)) 1 1 (T 1) p T +R 1 (T 1) p 2T 1 : (C.6.3) Finally, using (C.6.2) and (C.6.3),T 2, and the fact that the smallest singular value of the identity matrix is 1, we have min (D N ) min 1 R ; 1 T 1 ; 1; 1 (T 1) p 2T 1 1 (T 1) p 2T 1 1 T p 2T : Proof of Lemma C.4.3. Recall that for any matrixA, the eigenvalues ofA > A are the squares of the singular values ofA. So min 1 NT G N ZD 1 N > G N ZD 1 N = min G N ZD 1 N 2 =(NT ) and max 1 NT G N ZD 1 N > G N ZD 1 N = max G N ZD 1 N 2 =(NT ); and it is enough to bound the singular values ofG N ZD 1 N , which we can do with Lemma C.6.1: LemmaC.6.1. LetA andB be conformable real matrices whereB has full column rank. 1. (Theorem 2 in [74].) min (AB) min (A) min (B): 2. (Theorem 3.3.4 in Horn and Johnson [89].) max (AB) max (A) max (B): 356 We bound the singular values of the deterministicG N andD 1 N in Lemmas C.6.2 and C.4.2: Lemma C.6.2. The smallest and largest singular values ofG N are bounded as follows: max (G N ) p 2 and min (G N )= p T 2 c + 2 . Proof. Provided in Section C.7. Putting this together (and using that 1 NT min (Z) 2 = min 1 NT Z > Z =e 1N , we have min 1 NT G N ZD 1 N > G N ZD 1 N 1 NT min (G N ) 2 min (Z) 2 min D 1 N 2 p T 2 c + 2 ! 2 min 1 NT Z > Z 1 3 2 e 1N 2 9(T 2 c + 2 ) ; where we used that from Lemma C.4.2, max (D N ) 3 () min (D 1 N ) 1 3 : Similarly, max 1 NT G N ZD 1 N > G N ZD 1 N p 2 2 max 1 NT Z > Z T p 2T 2 e 2N 8T 3 ; where we used that from Lemma C.4.2, min (D N ) 1 T p 2T () max (D 1 N )T p 2T: 357 Proof of Proposition C.4.4. For clarity, we use the notation of Kock [106]. To extend Lemma 3 from Kock [106], we use a similar strategy to the one used to prove Theorem 11.3(d) in Hastie, Tibshirani, and Wain- wright [79]. Since under Assumption (A5) from Kock [106] we have min 1jk N fj 10jj gb 0 , it is enough to show that with high probability ^ 2N = 0 andk ^ N N k 1 b 0 =2. That is, n ^ 2N = 0 o \ ^ N N 1 b 0 2 n ^ S =S o =) n ^ S6=S o n ^ 2N 6= 0 o [ ^ N N 1 > b 0 2 (a) =) P ^ S6=S P n ^ 2N 6= 0 o [ ^ N N 2 > b 0 2 P ^ 2N 6= 0 +P ^ N N 2 > b 0 2 ; where in (a) we usedkk 1 kk 2 . Lemma 3 from Kock [106] establishes that lim N!1 P ^ 2N 6= 0 = 0; and Theorem 1 in Kock [106] (the assumptions of which are satised under the assumptions of Lemma 3 in [106]) establishes the consistency of ^ N , which is sucient for lim N!1 P ^ N N 2 > b 0 2 = 0: Proof of Proposition C.4.6. 1. Again, for clarity we use the notation of Kock [106]. Since 1 s NT N > 1 1N > N ( ^ N 0 ) = 1 s NT N > 1 1N > ( ^ 1N 10 ) + 1 s NT N > 1 1N b > N ( ^ 2N 20 ) = 1 s NT N > 1 1N > ( ^ 1N 10 ) + 1 s NT N > 1 1N b > N ^ 2N ; 358 if we can show that 1 s NT N > 1 1N b > N ^ 2N p ! 
0 then the result follows from Theorem 2(ii) from Kock [106] and Slutsky’s theorem. Theorem 2(i) from Kock [106] establishes that lim N!1 P ^ 2N = 0 = 1: Since for anyN n ^ 2N = 0 o 8 < : 1 s NT N > 1 1N p N k X j=1 (b N ) j ( ^ 2N ) j = 0 9 = ; ; we have P ^ 2N = 0 P 0 @ 1 s NT N > 1 1N p N k X j=1 (b N ) j ( ^ 2N ) j = 0 1 A ; so 1 lim N!1 P 0 @ 1 s NT N > 1 1N p N k X j=1 (b N ) j ^ 2N = 0 1 A lim N!1 P ^ 2N = 0 = 1; establishing lim N!1 P 0 @ 1 s NT N > 1 1N p N k X j=1 (b N ) j ( ^ 2N ) j = 0 1 A = 1: Since this notion of convergence is stronger than convergence in probability, it follows that 1 s NT N > 1 1N p N k X j=1 (b N ) j ( ^ 2N ) j p ! 0; proving the result. 359 2. Using the denition of convergence in distribution (see, for example, Denition 5.5.10 in [39]), it is enough to show that for anyt2R, lim N!1 P U( ^ S N )t = (t); where () is the standard Gaussian cumulative distribution function. By part (1), we have that U(S) converges in distribution to a standard Gaussian random variable: for everyt2R, lim N!1 P (U(S)t) = (t): (C.6.4) We have for anyN and anyt2R, P U( ^ S N )>t P fU(S)>tg\ n ^ S N =S o () P U( ^ S N )t P fU(S)tg[ n ^ S N 6=S o P (U(S)t) +P ^ S N 6=S ; where in the last step we used the union bound. Similarly, P U( ^ S N )t P fU(S)tg\ n ^ S N =S o () P U( ^ S N )>t P fU(S)>tg[ n ^ S N 6=S o P (U(S)>t) +P ^ S N 6=S () 1P U( ^ S N )t 1P (U(S)t) +P ^ S N 6=S () P U( ^ S N )t P (U(S)t)P ^ S N 6=S : 360 So P (U(S)t)P ^ S N 6=S P U( ^ S N )t P (U(S)t) +P ^ S N 6=S : Then using (C.6.4) and Proposition C.4.4, we have that for anyt2R, lim N!1 P U( ^ S N )t = (t); proving the result. C.7 ProofsofTechnicalLemmas LemmaC.7.1. Suppose one treated cohort has less thand N + 1 units. ThenZ is rank-decient. Proof of Lemma C.7.1. The treatment eect dummies can be seen as interaction eects between the cohort dummies and (a subset of) the time dummies. In particular, the treatment eect for cohortr at timet is equal to the column for ther th cohort dummy, but with entries for all times exceptt set equal to 0. The Tr + 1 treatment dummies for cohortr are orthogonal and span a (Tr + 1)-dimensional subspace, and they occupy anN r (Tr + 1)-dimensional subspace, whereN r is the number of units in cohortr. Because the covariates (like the cohort indicators) are time-invariant, the time-covariate interactions are closely related to the treatment eect dummies. The covariate-time interactions for cohort r lie in the sameN r (Tr + 1)-dimensional subspace as the treatment eects for cohortr and are either linearly dependent (in which caseZ is rank decient) or span a subspace of dimensiond N (Tr+1). Therefore in total, the treatment dummies and covariate-time interactions for cohortr span a subspace with dimension 361 at most (d N + 1)(Tr + 1) within thisN r (Tr + 1)-dimensional subspace. So ifN r <d N + 1, these columns must be linearly dependent, andZ is rank-decient. Lemma C.7.2. Supposee 1N is bounded away from 0 with high probability: for some niteN ;c > 0, for allN >N it holds thatP(e 1N <c) 0 there exists a niteM (1) > 0 andN (1) > 0 such that for allN >N (1) , P e 1N X N p p N =N >M (1) ! <: 362 Now for allN > maxfN ;N (1) =2 g, P 0 @ X N p p N =N > M (1) =2 c 1 A =P 0 @ X N p p N =N > M (1) =2 c je 1N c 1 A P(e 1N c) +P 0 @ X N p p N =N > M (1) =2 c je 1N <c 1 A P(e 1N <c) P 0 @ X N p p N =N > M (1) =2 c je 1N c 1 A P(e 1N c) +P(e 1N <c) P 0 @ X N p p N =N > M (1) =2 e 1N je 1N c 1 A P(e 1N c) +P(e 1N <c) =P e 1N X N p p N =N >M (1) =2 je 1N c ! 
P(e 1N c) +P(e 1N <c) P e 1N X N p p N =N >M (1) =2 je 1N c ! P(e 1N c) +P e 1N X N p p N =N >M (1) =2 je 1N <c ! P(e 1N <c) +P(e 1N <c) =P e 1N X N p p N =N >M (1) =2 ! +P(e 1N <c) < 2 +a N : Since a N is vanishing, there exists a nite N (2) so that a N =2 for all N > N (2) . Then we have established that for any > 0, forM := M (1) =2 =c andN := maxfN ;N (1) =2 ;N (2) g, for allN > N it holds that P X N p p N =N >M ! <: 363 Proof of Lemma C.6.2. We will prove the result in a similar way to the proof of Lemma C.4.2. By properties of the Kronecker product, it is enough to bound the singular values of . We have max ( ) p k k 1 k k 1 p [T 2 c + 2 ] 2 =T 2 c + 2 ; where for matriceskk 1 denotes the maximum absolute column sum of a matrix andkk 1 denotes the maximum absolute row sum. We will use (C.6.1) to lower-bound min ( ). First we will nd 1 . Observe that we can express as = 2 I T +vv > wherev = ( c ;:::; c )2R T . Therefore we can invert using the Sherman-Morrison formula: 1 = 1 2 I T 1 2 I T vv > 1 2 I T 1 +v > 1 2 I T v = 1 2 I T 1 2 vv > 1 + 1 2 v > v ! = 1 2 I T vv > 2 +T 2 c = 1 2 0 B B B B B B B B B B @ 1 2 c 2 +T 2 c 2 c 2 +T 2 c 2 c 2 +T 2 c 2 c 2 +T 2 c 1 2 c 2 +T 2 c 2 c 2 +T 2 c . . . . . . . . . . . . 2 c 2 +T 2 c 2 c 2 +T 2 c 1 2 c 2 +T 2 c 1 C C C C C C C C C C A : Therefore min ( ) 1 q k 1 k 1 k 1 k 1 2 1 +T 2 c 2 +T 2 c 1 = 2 2 + 2T 2 c 2 +T 2 c 1 = 2 2 +T 2 c 2 + 2T 2 c 2 2 +T 2 c 2 ( 2 +T 2 c ) = 2 2 : 364 Finally, using these results we have min 1=2 1= p T 2 c + 2 and max 1=2 p 2=. By properties of the Kronecker product, these bounds hold forG N =I T 1=2 as well after multiplying by. 365
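As a quick numerical check of the Sherman-Morrison inversion used in the proof of Lemma C.6.2 above, the sketch below compares the closed-form expression for the inverse of the within-unit covariance matrix with a direct matrix inverse; the values of $T$ and of the two variance parameters are arbitrary choices for illustration only.

```r
# Numerical check of the Sherman-Morrison inversion from the proof of Lemma C.6.2,
# with arbitrary illustrative values for T and the variance parameters.
T_len <- 5; sigma2 <- 1.3; sigma_c2 <- 0.7
J <- matrix(1, T_len, T_len)                        # all-ones matrix, so vv^T = sigma_c2 * J
Omega <- sigma2 * diag(T_len) + sigma_c2 * J
Omega_inv_closed <- (1 / sigma2) *
  (diag(T_len) - (sigma_c2 / (sigma2 + T_len * sigma_c2)) * J)
max(abs(Omega_inv_closed - solve(Omega)))           # should be zero up to rounding error
```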
Abstract
This dissertation presents three novel contributions to the fields of machine learning and causal inference. The unifying theme is leveraging sparsity to improve the interpretability and predictive accuracy of machine learning models, with applications to feature selection, classification with class imbalance, and causal inference. Cluster stability selection generalizes stability selection to overcome the "vote-splitting" problem that arises in data with highly correlated features. We prove for the first time the long-noted tendency of the lasso to select a single feature from a highly correlated cluster and provide theoretical guarantees for cluster stability selection. We also demonstrate cluster stability selection's performance in simulation studies and a GWAS data application. The second contribution, PRESTO, is a relaxation of the proportional odds model that flexibly leverages more common intermediate outcomes to estimate probabilities of rare events better than logistic regression or the proportional odds model. We justify PRESTO theoretically, prove that PRESTO consistently estimates decision boundary weights under a sparsity assumption, and show in synthetic and real data experiments that PRESTO estimates rare event probabilities better than logistic regression and the proportional odds model. Finally, the third study addresses the bias/variance tradeoff in treatment effect estimation using the difference-in-differences method. We propose a bridge fusion-penalized method to obtain data-driven estimates of treatment effects, allowing for model selection, estimation, and valid inference in a single step without data splitting. We prove the asymptotic normality of our estimated treatment effects under sparsity and demonstrate our method through simulation studies.