On Information Captured by Neural Networks: Connections with Memorization and Generalization

by

Hrayr Harutyunyan

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2023

Copyright 2023 Hrayr Harutyunyan

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisors, Aram Galstyan and Greg Ver Steeg. This work would not exist without their invaluable guidance and unwavering support. I am immensely thankful to them for cultivating an environment with complete academic freedom. I would also like to thank Bistra Dilkina, Haipeng Luo, and Mahdi Soltanolkotabi for their insightful comments as members of various committees during my Ph.D. journey.

I extend my heartfelt appreciation to my fellow former or present Ph.D. students Sami Abu-El-Haija, Shushan Arakelyan, Rob Brekelmans, Aaron Ferber, Palash Goyal, Umang Gupta, David Kale, Neal Lawton, Myrl Marmarelis, Daniel Moyer, and Kyle Reing for their friendship, stimulating discussions, and collaborative spirit. I would like to thank my friends and collaborators at YerevaNN, who warmly welcomed me in their office whenever I was in Armenia. I would like to express my sincere appreciation to my collaborators Alessandro Achille, Rahul Bhotika, Sanjiv Kumar, Orchid Majumder, Aditya Krishna Menon, Giovanni Paolini, Maxim Raginsky, Avinash Ravichandran, Ankit Singh Rawat, Stefano Soatto, and Seungyeon Kim for their invaluable contributions and support throughout our collaborative endeavors. I am thankful to Alessandro Achille and Avinash Ravichandran for hosting two fruitful summer internships at Amazon Web Services; and to Ankit Singh Rawat and Aditya Krishna Menon for the productive internship at Google Research.

I am indebted to the family of Avanesyans, who treated me like a family member and provided much-needed family support when I was away from my home country. Finally, I would like to express my deepest gratitude to my parents and brother, whose unconditional love and sacrifices have been the foundation of my academic pursuits.

I acknowledge support from the USC Annenberg Fellowship. Works described in Chapters 2 and 4 were based partly on research sponsored by the Air Force Research Laboratory under agreement number FA8750-19-1-1000.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Memorization in deep learning
  1.2 The generalization puzzle
  1.3 Our contributions
  1.4 Notation and preliminaries
Chapter 2: Improving Generalization by Controlling Label Noise Information in Neural Network Weights
  2.1 Introduction
  2.2 Label noise information in weights
    2.2.1 Decreasing label noise information reduces memorization
    2.2.2 Decreasing label noise information improves generalization
  2.3 Methods limiting label information
    2.3.1 Penalizing information in gradients
    2.3.2 Variational bounds on gradient information
    2.3.3 Predicting gradients without label information
    2.3.4 Reducing overfitting in gradient prediction
  2.4 Experiments
    2.4.1 MNIST with uniform label noise
    2.4.2 CIFAR-10 with uniform and pair noise
    2.4.3 CIFAR-100 with uniform label noise
    2.4.4 Clothing1M
    2.4.5 Detecting mislabeled examples
  2.5 Related work
  2.6 Conclusion
Chapter 3: Estimating Informativeness of Samples with Smooth Unique Information
  3.1 Introduction
  3.2 Unique information of a sample in the weights
    3.2.1 Approximating unique information with leave-one-out KL divergence
    3.2.2 Smoothed sample information
  3.3 Unique information in the predictions
  3.4 Exact solution for linearized networks
  3.5 Experiments
    3.5.1 Accuracy of the linearized network approximation
    3.5.2 Which examples are informative?
    3.5.3 Data summarization
    3.5.4 How much does sample information depend on algorithm?
  3.6 Discussion and future work
  3.7 Related work
  3.8 Conclusion
Chapter 4: Information-theoretic generalization bounds for black-box learning algorithms
  4.1 Introduction
  4.2 Weight-based generalization bounds
    4.2.1 Generalization bounds with input-output mutual information
    4.2.2 Generalization bounds with conditional mutual information
  4.3 Functional conditional mutual information
  4.4 Applications
    4.4.1 Ensembling algorithms
    4.4.2 Binary classification with finite VC dimension
    4.4.3 Stable deterministic or stochastic algorithms
  4.5 Experiments
    4.5.1 Experimental setup
    4.5.2 Results and discussion
  4.6 Related work
  4.7 Conclusion
Chapter 5: Formal limitations of sample-wise information-theoretic generalization bounds
  5.1 Introduction
    5.1.1 Preliminaries
    5.1.2 Our contributions
  5.2 A useful lemma
  5.3 A counterexample
  5.4 Implications for other bounds
  5.5 The case of m = 2
  5.6 Conclusion
Chapter 6: Supervision Complexity and its Role in Knowledge Distillation
  6.1 Introduction
  6.2 Supervision complexity and generalization
    6.2.1 Supervision complexity controls kernel machine generalization
    6.2.2 Extensions: multiclass classification and neural networks
  6.3 Knowledge distillation: a supervision complexity lens
    6.3.1 Trade-off between teacher accuracy, margin, and complexity
    6.3.2 From offline to online knowledge distillation
  6.4 Experimental results
    6.4.1 Experimental setup
    6.4.2 Results and discussion
  6.5 Related work
  6.6 Conclusion and future work
Bibliography
Appendices
  G Proofs
    G.1 Proof of Theorem 2.2.1
    G.2 Proof of Proposition 2.3.1
    G.3 Proof of Proposition 3.2.1
    G.4 Proof of Proposition 3.4.1
    G.5 Proof of Lemma 4.2.1
    G.6 Proof of Theorem 4.2.3
    G.7 Proof of Proposition 4.2.4
    G.8 Proof of Theorem 4.2.5
    G.9 Proof of Theorem 4.2.8
    G.10 Proof of Proposition 4.2.9
    G.11 Proof of Theorem 4.2.10
    G.12 Proof of Theorem 4.3.1
    G.13 Proof of Proposition 4.3.4
    G.14 Proof of Theorem 4.3.5
    G.15 Proof of Theorem 4.4.1
    G.16 Proof of Proposition 4.4.2
    G.17 Proof of Theorem 4.4.3
    G.18 Proof of Theorem 5.4.1
    G.19 Proof of Theorem 5.4.2
    G.20 Proof of Theorem 5.5.3
    G.21 Proof of Proposition 6.2.1
    G.22 Proof of Theorem 6.2.2
    G.23 Proof of Theorem 6.2.3
    G.24 Proof of Proposition 6.3.1
  H Derivations
    H.1 SGD noise covariance
    H.2 Steady-state covariance of SGD
    H.3 Fast update of NTK inverse after data removal
    H.4 Adding weight decay to linearized neural network training
  I Additional experimental details
  J Additional results

List of Tables

1.1 A mapping from chapters to papers.
2.1 Architecture of a convolutional neural network with 5 hidden layers.
2.2 Test accuracy comparison on multiple versions of MNIST corrupted with uniform label noise. The error bars are standard deviations computed over 5 random training/validation splits.
2.3 Test accuracy comparison on CIFAR-10, corrupted with uniform label noise (top) and pair label noise (bottom). The error bars are standard deviations computed by bootstrapping the test set 1000 times.
2.4 Test accuracy comparison on CIFAR-100 with 40% uniform label noise and on the Clothing1M dataset. The error bars are standard deviations computed by bootstrapping the test set 1000 times.
3.1 Pearson correlations of the weight change ∥w − w_{-i}∥₂² and test prediction change ∥f_w(x_test) − f_{w_{-i}}(x_test)∥₂² norms computed with influence functions and linearized neural networks with their corresponding measures computed for standard neural networks with retraining.
3.2 Details of experiments presented in Table 3.1. For influence functions, we add a damping term with magnitude 0.01 whenever ℓ₂ regularization is not used (λ = 0).
6.1 Knowledge distillation results on CIFAR-10. Every second line is an MSE student.
6.2 Knowledge distillation results on CIFAR-100.
6.3 Knowledge distillation results on Tiny ImageNet.
6.4 Initial learning rates for different dataset and model pairs.
6.5 Knowledge distillation results on binary CIFAR-100. Every second line is an MSE student.
6.6 Knowledge distillation results on CIFAR-100 with varying loss mixture coefficient α.
6.7 Experimental details for MNIST 4 vs 9 classification in the case of standard training.
6.8 Experimental details for MNIST 4 vs 9 classification in the case of SGLD training.
6.9 Experimental details for CIFAR-10 classification using fine-tuned ResNet-50 networks.
6.10 Experimental details for the CIFAR-10 classification experiment where a pretrained ResNet-50 is fine-tuned using SGLD.

List of Figures

2.1 Neural networks tend to memorize labels when trained with noisy labels (80% noise in this case), even when dropout or weight decay are applied. Our training approach limits label noise information in neural network weights, avoiding memorization of labels and improving generalization. See Section 2.2.1 for more details.
2.2 The lower bound r_0 on the rate of training errors r that Theorem 2.2.1 establishes for varying values of I(W;Y|X), in the case when label noise is uniform and the probability of a label being incorrect is p.
2.3 Smoothed training and testing accuracy plots of various approaches on MNIST with 80% uniform noise.
2.4 Training and testing accuracies of "LIMIT_G - S" and "LIMIT_L - S" instances with varying values of β on MNIST with 80% uniform label noise. The curves are smoothed for better presentation.
2.5 Histograms of the norm of predicted gradients for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT.
2.6 Histograms of the distance between predicted and actual gradients for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT.
2.7 Most mislabeled examples in the MNIST, CIFAR-10, and Clothing1M datasets, according to the distance between predicted and cross-entropy gradients.
3.1 A toy dataset and key distributions involved in upper bounding the unique sample information with leave-one-out KL divergence.
3.2 Functional sample information of samples in the MNIST 4 vs 9 (top), Kaggle cats vs dogs (middle), and iCassava (bottom) classification tasks. A: histogram of sample informations; B: 10 least informative samples; C: 10 most informative samples.
3.3 Comparison of functional sample information of examples with correct and incorrect labels.
3.4 Comparison of functional sample information of MNIST and SVHN examples in the context of a joint digit classification task with an equal number of examples per dataset.
3.5 Histogram of the functional sample information of samples of the Kaggle cats vs dogs classification task, where 10% of examples are adversarial.
3.6 Histogram of the functional sample information of examples of the 3 subpopulations of the pets vs deer dataset. Since cats are under-represented, cat images tend to be more informative on average compared to dog images.
3.7 Dataset summarization without training. Test accuracy as a function of the ratio of removed training examples for different strategies.
3.8 Correlations between functional sample information scores computed for different architectures and training lengths. On the left: correlations between F-SI scores of the 4 pretrained networks, all computed with setting t = 1000 and η = 0.001. On the right: correlations between F-SI scores computed for pretrained ResNet-18s, with learning rate η = 0.001, but varying training lengths t. All reported correlations are averages over 10 different runs. The training dataset consists of 1000 examples from the Kaggle cats vs dogs classification task.
3.9 Top 10 most informative examples from the Kaggle cats vs dogs classification task for 3 pretrained networks: ResNet-18, ResNet-50, and DenseNet-121.
3.10 Testing how much F-SI scores computed for different networks are qualitatively different. On the left: the MNIST vs SVHN experiment with a pretrained DenseNet-121 instead of a pretrained ResNet-18. On the right: data summarization for the MNIST 4 vs 9 classification task, where the F-SI scores are computed for a one-hidden-layer network, but a two-hidden-layer network is trained to produce the test accuracies.
4.1 Comparison of expected generalization gap and f-CMI bound for MNIST 4 vs 9 classification with a 5-layer CNN. Panel (a) shows the results for the fixed-seed deterministic algorithm. Panel (b) repeats the experiment of panel (a) while modifying the network to have 4 times more neurons at each hidden layer. Panel (c) repeats the experiment of panel (a) while making the training algorithm stochastic by randomizing the seed.
4.2 Comparison of expected generalization gap and f-CMI bound for a pretrained ResNet-50 fine-tuned on CIFAR-10 in a standard fashion.
4.3 Comparison of expected generalization gap, the Negrea et al. [2019] SGLD bound, and the f-CMI bound in the case of a pretrained ResNet-50 fine-tuned with SGLD on a subset of CIFAR-10 of size n = 20000. The figure on the right is the zoomed-in version of the figure in the middle.
6.1 Offline vs. online distillation. Figures (a) and (b) illustrate possible teacher and student function trajectories in offline and online knowledge distillation, respectively. The yellow dotted lines indicate knowledge distillation.
6.2 Adjusted supervision complexity of various targets with respect to NTKs at different stages of training. The underlying dataset is binary CIFAR-100. Panels (b) and (d) plot adjusted supervision complexity normalized by the norm of the targets. Note that the y-axes of the ResNet-20 plots are in logarithmic scale.
6.3 Condition number of the NTK matrix of LeNet-5x8 (a) and ResNet-20 (b) students trained with MSE loss on binary CIFAR-100. The NTK matrices are computed on 2^12 test examples.
6.4 Adjusted supervision complexities* of various targets with respect to a LeNet-5x8 network at different stages of its training. The experimental setup of the left and right plots matches that of Figure 6.2a and Figure 6.2b, respectively.
6.5 Adjusted supervision complexity* of dataset labels measured on a subset of 2^12 training examples of binary CIFAR-100. Complexities are measured with respect to either LeNet-5x8 (on the left) or ResNet-20 (on the right) models trained with MSE loss and without knowledge distillation. Note that the plot on the right is in logarithmic scale.
6.6 Adjusted supervision complexity for various targets. On the left: the effect of temperature on the supervision complexity of an offline teacher for a LeNet-5x8 after training for 25 epochs. On the right: the effect of averaging teacher predictions.
6.7 Relationship between test accuracy, train NTK similarity, and train fidelity for CIFAR-100 students trained with either a ResNet-56 teacher (panels (a) and (c)) or a ResNet-110 teacher (panel (b)).
6.8 Online KD results for a LeNet-5x8 student on CIFAR-100 with varying frequency of ResNet-56 teacher checkpoints.
6.9 Most confusing 8 labels per class in the MNIST (on the left) and CIFAR-10 (on the right) datasets, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instances of LIMIT.
6.10 Most confusing 16 labels per class in the Clothing1M dataset, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instance of LIMIT.
6.11 Relationship between test accuracy, train NTK similarity, and train fidelity for various teacher, student, and dataset configurations.
6.12 Relationship between test accuracy, train NTK similarity, and train fidelity for various teacher, student, and dataset configurations.

Abstract

Despite the popularity and success of deep learning, there is limited understanding of when, how, and why neural networks generalize to unseen examples. Since learning can be seen as extracting information from data, we formally study information captured by neural networks during training. Specifically, we start with viewing learning in the presence of noisy labels from an information-theoretic perspective and derive a learning algorithm that limits label noise information in weights. We then define a notion of unique information that an individual sample provides to the training of a deep network, shedding some light on the behavior of neural networks on examples that are atypical, ambiguous, or belong to underrepresented subpopulations. We relate example informativeness to generalization by deriving nonvacuous generalization gap bounds. Finally, by studying knowledge distillation, we highlight the important role of data and label complexity in generalization. Overall, our findings contribute to a deeper understanding of the mechanisms underlying neural network generalization.

Chapter 1: Introduction

Over the past decade, deep learning has achieved remarkable success in a wide range of applications, including computer vision, natural language processing, speech recognition, robotics, and generative modeling. Large neural networks trained with variants of stochastic gradient descent demonstrate excellent generalization capability, despite having enough capacity to memorize their training set [Zhang et al., 2017]. Although some progress has been made toward understanding deep learning, a comprehensive understanding of when, why, and how neural networks generalize remains elusive.

1.1 Memorization in deep learning

One aspect of deep learning that needs to be understood better is memorization. In a broad sense, it is unclear what information neural networks memorize; what types of memorization occur during training; which are harmful and which are helpful to generalization; and how to measure and control various types of memorization. The fact that the term "memorization" has many definitions and interpretations indicates that memorization can come in different flavors.

The simplest form of memorization is memorizing incorrect labels or label noise, which has been the subject of many studies [Frenay and Verleysen, 2014, Song et al., 2022]. In specific learning scenarios, memorizing noisy labels does not significantly affect test error [Liang and Rakhlin, 2020, Bartlett et al., 2020, Hastie et al., 2022, Frei et al., 2022, Cao et al., 2022]. However, label noise memorization significantly degrades test performance in typical deep learning settings [Zhang et al., 2017, Chen et al., 2019, Mallinar et al., 2022]. Given that real-world labeled datasets often have mislabeled examples due to ambiguities, labeling errors, or measurement errors, there is a great need for methods of measuring memorization and for training algorithms that are robust to label noise. Memorization of noisy labels is also important because it provides insights into understanding the behavior of neural networks.
It has been observed that neural networks learn simple and generalized patterns first [Arpit et al., 2017] and generalize well in the early training epochs [Li et al., 2020]. Even when trained on a dataset with noisy labels, neural networks still learn useful representations, especially in early layers [Dosovitskiy et al., 2014, Pondenkandath et al., 2018, Maennel et al., 2020, Anagnostidis et al., 2023].

Besides label noise, neural networks can memorize other information about their training set. For example, an image classifier can memorize individual training examples [Feldman and Zhang, 2020]. Given white-box or black-box access to a neural network, it is possible to extract membership information, as shown by Shokri et al. [2017] and Nasr et al. [2019]. Neural networks trained for language modeling memorize sensitive personal information [Carlini et al., 2019], factual knowledge [Petroni et al., 2019], long sequences of words from training data verbatim [Carlini et al., 2021, Tirumala et al., 2022, Carlini et al., 2023b], and idioms [Haviv et al., 2023]. A diffusion model can memorize specific training examples [Carlini et al., 2023a]. Some of these instances of memorization are undesirable due to privacy concerns. Nevertheless, from the generalization perspective, some types of memorization can be beneficial. Indeed, memorization is an optimal strategy in some settings [Feldman, 2020, Brown et al., 2021]. There is currently limited knowledge regarding when neural networks transition from generalizing to memorizing [Cohen et al., 2018, Zhang et al., 2020].

1.2 The generalization puzzle

Arguably the most important question in deep learning is that of generalization. As demonstrated by Zhang et al. [2017], overparameterized neural networks generalize well despite being able to memorize the training set mechanically. Classical learning theory results based on various notions of hypothesis set complexity do not explain this phenomenon: the same class of neural networks can generalize well for one training set but fail for another. Furthermore, the modern practice of training neural networks goes against the conventional bias-variance trade-off wisdom in that neural networks are often trained to interpolation without explicit regularization [Belkin et al., 2019]. These observations have sparked a search for effective notions of complexity, implicit biases, and data- and algorithm-dependent generalization bounds [Bartlett et al., 2021, Jiang* et al., 2020, Dziugaite et al., 2020]. It has been well established that the number of parameters does not serve as a good notion of complexity, and other alternatives have been proposed, such as size-based measures [Bartlett, 1998, Neyshabur et al., 2015, Bartlett et al., 2017, Golowich et al., 2018]. The good generalization of stochastic gradient descent has been attributed to implicit biases such as finding flat minima [Keskar et al., 2017], finding minimum norm solutions [Soudry et al., 2018], spectral bias [Rahaman et al., 2019], simplicity bias [Kalimeris et al., 2019], representation compression [Shwartz-Ziv and Tishby, 2017], and stability [Hardt et al., 2016], among others. Nevertheless, it has been challenging to find generalization bounds that produce good quantitative results [Neyshabur et al., 2017, Nagarajan and Kolter, 2019, Dziugaite et al., 2020]. Exceptions include some PAC-Bayes bounds [Dziugaite and Roy, 2017, Zhou et al., 2019] that apply to modified learning algorithms.
While the role of learning algorithms and architectures has been studied extensively, the same cannot be said for the role of the data distribution. Evidently, a good learning algorithm and neural network architecture alone do not necessarily result in good generalization. Therefore, an adequate generalization theory should also make some assumptions about the data distribution. For most data-dependent generalization bounds, this is done by assuming access to training data, which limits their explanatory power. Ideally, a theory should predict good generalization without making strong assumptions about the data distribution. For example, Yang et al. [2022] show that the eigenspectrum of the input correlation matrix of typical datasets has a certain property that effectively induces a capacity control for a neural network. Arora et al. [2019] and Ortiz-Jiménez et al. [2021] show that a good alignment between the neural tangent kernel and labels ensures fast learning and good generalization. Finding sufficient data properties that are characteristic of real-world datasets and enable good generalization is thus a key direction in explaining the success of deep learning.

Overall, besides satisfying scientific curiosity and providing generalization guarantees, understanding generalization in deep learning is also a practically fruitful research direction. We expect that an adequate generalization theory will lead to improved training algorithms, neural network architectures, and better scaling with respect to data and model size. Some instances confirming this expectation already exist, such as the training algorithms Entropy-SGD [Chaudhari et al., 2019] and sharpness-aware minimization [Foret et al., 2020]. We present another such development in this dissertation (Chapter 6).

1.3 Our contributions

The results presented in this dissertation can be seen as contributions to the problems of memorization and generalization in deep learning. We view training a neural network as extracting information from samples in a dataset and storing it in the weights of the network so that it may be used in future inference or prediction. Consequently, we formally study information captured by a neural network during training to understand the complex interplay between data, learning algorithm, and hypothesis class. As we shall see, this allows us to define principled notions of memorization, shed some light on the behavior of neural networks, derive nonvacuous generalization gap bounds, and guide the training algorithm design process. Table 1.1 presents the mapping between the chapters and papers.

Table 1.1: A mapping from chapters to papers.

  Chapter     Paper
  Chapter 2   Harutyunyan et al. [2020]
  Chapter 3   Harutyunyan et al. [2021a]
  Chapter 4   Harutyunyan et al. [2021b]
  Chapter 5   Harutyunyan et al. [2022]
  Chapter 6   Harutyunyan et al. [2023]

Label noise memorization. In Chapter 2, we start by studying the conceptually simplest form of memorization: memorizing label noise. As a fundamental measure of label noise memorization, we consider the Shannon mutual information I(W; Y | X) between the learned weights W and the training labels Y, given the training inputs X. We show that reducing the training error beyond the noise level results in a large value of I(W; Y | X). Furthermore, we show that learning with constrained label noise information provides a certain level of label noise robustness.
Starting with this constrained problem, we derive a new learning algorithm in which the main classifier is trained with gradient updates predicted by another neural network that does not access the dataset labels. In addition to good results on standard benchmark tasks, the proposed algorithm provides a partial justification for the well-known co-teaching approach [Han et al., 2018].

A more general notion of example memorization. In Chapter 3, we go beyond label noise memorization and aim to quantify how much information about a particular example is captured during the training of a neural network. Due to combinatorial challenges related to redundancies, synergies, and high-order dependencies, we focus on measuring unique information, which can be seen as a measure of example memorization (not necessarily harmful to generalization). We define, both in weight space and function space, a notion of unique information that an example provides to the training of a neural network. While rooted in information theory, these quantities capture some aspects of stability theory and influence functions. The proposed unique information measures apply even to deterministic training algorithms and can be approximated efficiently for wide or pretrained neural networks using a linearization of the model. Apart from having important applications, such as data valuation, active learning, data summarization, and guiding the data collection process, measuring unique information provides insights into how neural networks generalize. We find that typically only a small portion of examples are informative. These are usually atypical, hard, ambiguous, mislabeled, or underrepresented examples. In some cases, one can remove up to 90% of uninformative examples without degrading test set performance. Conversely, removing highly memorized examples decreases test accuracy substantially, indicating that memorization is sometimes needed for good generalization. Furthermore, we find that some uninformative examples can become highly informative when some other examples are removed from the training set. Our findings add to the increasing amount of research focused on revealing the function of example memorization in generalization [Koh and Liang, 2017, Feldman, 2020, Feldman and Zhang, 2020, Paul et al., 2021, Sorscher et al., 2022, Carlini et al., 2022].

Information-theoretic generalization bounds. Information about the training set captured by a neural network is also helpful for studying generalization. In their seminal work, Xu and Raginsky [2017] introduce a generalization gap bound depending on the Shannon information I(W; S) between the learned weights W and the training set S. This result confirms the intuition that a learner will generalize well if it does not memorize the training set. Unfortunately, the bound is vacuous in realistic settings and gives only qualitative insights. Many better bounds were introduced subsequently, but two problems remained: (a) the bounds were vacuous in practical settings, and (b) they were hard to estimate due to challenges in estimating mutual information between high-dimensional variables. In Chapter 4, building on the conditional mutual information bounds of Steinke and Zakynthinou [2020] and the sample-wise bounds of Bu et al. [2020], we derive expected generalization bounds for supervised learning algorithms based on information contained in predictions rather than in the output of the training algorithm.
These bounds improve over the existing information-theoretic bounds, apply to a wider range of algorithms, give meaningful results for deterministic algorithms, and are significantly easier to estimate. Furthermore, some classical learning theory results, such as expected generalization gap bounds based on Vapnik–Chervonenkis (VC) dimension [Vapnik, 1998] and algorithmic stability [Bousquet and Elisseeff, 2002], follow directly from our results. More importantly, the bounds are nonvacuous in practical scenarios for deep learning. In one case, for a neural network with 3M parameters that reaches 9% test error with just 75 MNIST training examples, the estimated test error bound is 22%.

An essential ingredient in recent improvements of information-theoretic generalization bounds (including in our work presented in Chapter 4) is the introduction of sample-wise information bounds by Bu et al. [2020], which depend on the average amount of information the learned hypothesis has about a single training example. In particular, for a training set of n examples S = (Z_1, ..., Z_n), their bound depends on (1/n) Σ_{i=1}^n I(W; Z_i) rather than I(W; S)/n. However, these sample-wise bounds were derived only for the expected generalization gap, where the expectation is taken over both the training set and the stochasticity of the training algorithm. While PAC-Bayes and information-theoretic bounds are intimately related, the same technique did not work for deriving sample-wise single-draw or PAC-Bayes generalization gap bounds. In Chapter 5, we show that sample-wise bounds are generally possible only for the expected generalization gap (the weakest form of generalization guarantee). In other words, sample-wise single-draw and PAC-Bayes generalization bounds are impossible unless additional assumptions are made. Surprisingly, we also find that single-draw and PAC-Bayes bounds with information captured in pairs of examples, (1/(n(n−1))) Σ_{i≠j} I(W; Z_i, Z_j), are possible without additional assumptions.

The role of supervision complexity in generalization. One drawback of information-theoretic generalization bounds is that they depend too much on the training data and data distribution. While these bounds can give concrete generalization guarantees and help design better learning algorithms, they do not specify what data properties are desirable for good generalization. Their strong data-dependent nature enables differentiating cases like learning with ground truth labels and learning with random labels, but comes at the cost of reduced explanatory power. There are cases when subtle differences in data result in significantly different generalization performances. One can argue that even the tightest information-theoretic bounds do not explain why these subtle differences have such effects.

One such prominent case arises in knowledge distillation [Buciluǎ et al., 2006, Hinton et al., 2015], which is a popular method of transferring knowledge from a large "teacher" model to a more compact "student" model. In the most basic form of knowledge distillation, the student is trained to fit the teacher's predicted label distribution (also called soft labels) for each training example. It has been well established that distilled students usually perform better than students trained on raw dataset labels [Hinton et al., 2015, Furlanello et al., 2018, Stanton et al., 2021, Gou et al., 2021]. Notably, the teacher itself is usually trained on the same inputs but with the original hard labels.
From a purely information-theoretic perspective, this phenomenon is quite surprising because, by the data processing inequality, the teacher predictions do not add any new information that was not present in the original training dataset. Clearly, the distillation dataset must satisfy some desirable property that enables better generalization.

Several works have attempted to uncover why knowledge distillation can improve student performance. Some prominent observations are that (self-)distillation induces certain favorable optimization biases in the training objective [Phuong and Lampert, 2019, Ji and Zhu, 2020], lowers the variance of the objective [Menon et al., 2021, Dao et al., 2021, Ren et al., 2022], increases regularization towards learning "simpler" functions [Mobahi et al., 2020], transfers information from different data views [Allen-Zhu and Li, 2023], and scales per-example gradients based on the teacher's confidence [Furlanello et al., 2018, Tang et al., 2020]. Nevertheless, there are still no compelling answers to why knowledge distillation works, what the exact role of temperature scaling is, what effects the teacher-student capacity gap has, and what ultimately makes a good teacher.

In Chapter 6, we provide a new perspective on knowledge distillation through the lens of supervision complexity. To put it concisely, supervision complexity quantifies why certain targets (e.g., temperature-scaled teacher probabilities) may be "easier" for a student model to learn compared to others (e.g., raw one-hot labels), owing to better alignment with the student's neural tangent kernel (NTK) [Jacot et al., 2018, Lee et al., 2019]. We derive a new generalization bound for distillation that highlights how student generalization is controlled by a balance of the teacher generalization, the student's margin with respect to the soft labels, and the supervision complexity of the soft labels. We show that both temperature scaling and early stopping reduce the supervision complexity at the expense of lowering the classification margin. Based on our analysis, we advocate using a simple online distillation algorithm, wherein the student receives progressively more complex soft labels corresponding to teacher predictions at various checkpoints during its training. Online distillation improves significantly over standard distillation and is especially successful for students with weak inductive biases, for which the final teacher predictions are often as complex as the dataset labels, particularly during the early stages of training.

1.4 Notation and preliminaries

Before proceeding to our contributions, we first introduce some basic notation, describe an abstract learning setting, and provide an overview of some information-theoretic quantities used in this dissertation.

Notation. We use capital letters (X, Y, Z, etc.) for random variables, corresponding lowercase letters (x, y, z, etc.) for their values, and calligraphic letters for their domains (𝒳, 𝒴, 𝒵, etc.). We use E_{P_X}[f(X)] = ∫ f dP_X to denote expectations. Whenever the distribution over which the expectation is taken is clear from the context, we simply write E_X[f(X)] or E[f(X)]. A random variable X is called σ-subgaussian if E exp(t(X − EX)) ≤ exp(σ²t²/2) for all t ∈ ℝ. For example, a random variable that takes values in [a, b] almost surely is (b − a)/2-subgaussian.
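The bounded-variable example is an instance of the standard Hoeffding's lemma, which we spell out here for concreteness: if X takes values in [a, b] almost surely, then

\mathbb{E}\exp\big(t(X - \mathbb{E}X)\big) \;\le\; \exp\!\left(\frac{t^2 (b-a)^2}{8}\right), \qquad \forall t \in \mathbb{R},

which is exactly the σ-subgaussian condition above with σ = (b − a)/2, since σ²t²/2 = t²(b − a)²/8.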
Throughout this dissertation, [n] denotes the set {1, 2, ..., n}. If A = (a_1, ..., a_n) is a collection, then A_{-i} ≜ (a_1, ..., a_{i−1}, a_{i+1}, ..., a_n) and A_{1:k} ≜ {a_1, ..., a_k}. For J ∈ {0,1}^n, J̄ ≜ (1 − J_1, ..., 1 − J_n) is the negation of J. When there can be confusion about whether a variable refers to a collection of items or a single item, we use bold symbols to denote the former. For a pair of integers n ≥ m ≥ 0, \binom{n}{m} = n!/(m!(n−m)!) denotes the binomial coefficient. If y(x) ∈ ℝ^m and x ∈ ℝ^n, then the Jacobian ∂y/∂x is an m × n matrix. The gradient ∇_x y is an n × m matrix denoting the transpose of the Jacobian. This convention is convenient when working with gradient-descent-like algorithms. In particular, when w ∈ ℝ^d is a column vector and L(w) is a scalar, ∇_w L(w) is also a column vector.

Learning setup. In the subsequent chapters, we consider the following standard learning setting or an instance of it. There is an unknown data distribution P_Z on an input space 𝒵. The learner observes a collection of n i.i.d. examples S = (Z_1, ..., Z_n) sampled from P_Z and outputs a hypothesis (possibly random) belonging to a hypothesis space 𝒲. We treat the learning algorithm as a probability kernel Q_{W|S}, which given a training set s outputs a hypothesis W sampled from the distribution Q_{W|S=s}. For deterministic algorithms, Q_{W|S=s} is a point mass distribution. Together with P_S, the algorithm Q_{W|S} induces a joint probability distribution P_{W,S} = P_S Q_{W|S} on 𝒲 × 𝒵^n. The performance of a hypothesis w ∈ 𝒲 on an example z ∈ 𝒵 is measured with a loss function ℓ : 𝒲 × 𝒵 → ℝ. For a hypothesis w ∈ 𝒲, the population risk R(w) is defined as E_{Z′∼P_Z}[ℓ(w, Z′)], while the empirical risk is defined as r_S(w) = (1/n) Σ_{i=1}^n ℓ(w, Z_i). Note that the empirical risk is a random variable depending on S. We will often study the difference between population and empirical risks, R(W) − r_S(W), which is called the generalization gap or generalization error. Note that the generalization gap is a random variable depending on both the training set S and the randomness of the training algorithm Q. Often we will be interested in supervised learning problems, where 𝒵 = 𝒳 × 𝒴 and Z_i = (X_i, Y_i). In such cases, we define X ≜ (X_1, ..., X_n) and Y ≜ (Y_1, ..., Y_n).

Information-theoretic concepts. The entropy of a discrete random variable X with probability mass function p(x) is defined as H(X) = −Σ_{x∈𝒳} p(x) log p(x). Analogously, the differential entropy of a continuous random variable X with probability density p(x) is defined as H(X) = −∫_𝒳 p(x) log p(x) dx. Given two probability measures P and Q defined on the same measurable space, such that P is absolutely continuous with respect to Q, the Kullback–Leibler (KL) divergence from P to Q is defined as KL(P ∥ Q) = ∫ log(dP/dQ) dP, where dP/dQ is the Radon–Nikodym derivative of P with respect to Q. When X and Y are random variables defined on the same probability space, we sometimes shorten KL(P_X ∥ P_Y) to KL(X ∥ Y). The Shannon mutual information between random variables X and Y is I(X;Y) = KL(P_{X,Y} ∥ P_X ⊗ P_Y). The conditional variants of entropy, differential entropy, KL divergence, and Shannon mutual information entail an expectation over the random variable in the condition. For example, I(X;Y | Z) = ∫_𝒵 KL(P_{X,Y|Z} ∥ P_{X|Z} ⊗ P_{Y|Z}) dP_Z. The disintegrated variants of these quantities are denoted with a superscript indicating the condition. For example, I^Z(X;Y) ≜ KL(P_{X,Y|Z} ∥ P_{X|Z} ⊗ P_{Y|Z}) denotes the disintegrated mutual information [Negrea et al., 2019]. Note that the disintegrated mutual information is a random variable depending on Z. In this dissertation, all information-theoretic quantities are measured in nats, unless specified otherwise. Please refer to Cover and Thomas [2006] for a more detailed description of these concepts.
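To make these definitions concrete in the discrete case, here is a minimal sketch in plain Python (the helper names are ours, not the dissertation's); it uses natural logarithms, so all quantities come out in nats.

```python
import math

def entropy(p):
    """Entropy H(X) of a discrete distribution given as a dict {x: p(x)}, in nats."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions over the same support, in nats."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def mutual_information(pxy):
    """I(X;Y) = KL(P_{X,Y} || P_X (x) P_Y) for a joint distribution {(x, y): p(x, y)}."""
    px, py = {}, {}
    for (x, y), pxy_val in pxy.items():
        px[x] = px.get(x, 0.0) + pxy_val
        py[y] = py.get(y, 0.0) + pxy_val
    product = {(x, y): px[x] * py[y] for (x, y) in pxy}   # product of marginals
    return kl_divergence(pxy, product)

# Example: a noisy binary channel where Y equals X with probability 0.9.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(entropy({0: 0.5, 1: 0.5}))   # log 2, roughly 0.693 nats
print(mutual_information(joint))   # roughly 0.368 nats (about 0.53 bits)
```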
Chapter 2: Improving Generalization by Controlling Label Noise Information in Neural Network Weights

2.1 Introduction

Despite having millions of parameters, modern neural networks generalize surprisingly well. However, their training is particularly susceptible to noisy labels, as shown by Zhang et al. [2017] in their analysis of generalization error. In the presence of noisy or incorrect labels, networks start to memorize the training labels, which degrades the generalization performance [Chen et al., 2019]. At the extreme, standard architectures have the capacity to achieve 100% classification accuracy on training data, even when labels are assigned at random [Zhang et al., 2017]. Furthermore, standard explicit or implicit regularization techniques such as dropout, weight decay, or data augmentation do not directly address nor completely prevent label memorization [Zhang et al., 2017, Arpit et al., 2017].

Poor generalization due to label memorization is a significant problem because many large, real-world datasets are imperfectly labeled. Label noise may be introduced when building datasets from unreliable sources of information or using crowd-sourcing resources like Amazon Mechanical Turk. A practical solution to the memorization problem is likely to be algorithmic, as sanitizing labels in large datasets is costly and time-consuming. Existing approaches for addressing the problem of label noise and generalization performance include deriving robust loss functions [Natarajan et al., 2013, Ghosh et al., 2017, Zhang and Sabuncu, 2018, Xu et al., 2019], loss correction techniques [Sukhbaatar et al., 2014, Xiao et al., 2015, Goldberger and Ben-Reuven, 2017, Patrini et al., 2017], re-weighting samples [Jiang et al., 2018, Ren et al., 2018], detecting incorrect samples and relabeling them [Reed et al., 2014, Tanaka et al., 2018, Ma et al., 2018], and employing two networks that select training examples for each other [Han et al., 2018, Yu et al., 2019].

We propose an information-theoretic approach that directly addresses the root of the problem. If a classifier is able to correctly predict a training label that is actually random, it must have somehow stored information about this label in the parameters of the model. To quantify this information, Achille and Soatto [2018] consider the weights as a random variable, W, that depends on stochasticity in training data and parameter initialization. The entire training dataset is considered a random variable consisting of a vector of inputs, X, and a vector of labels for each input, Y. The amount of label memorization is then given by the Shannon mutual information between weights and labels conditioned on inputs, I(W; Y | X). Achille and Soatto [2018] show that this term appears in a decomposition of the commonly used expected cross-entropy loss, along with three other individually meaningful terms. Surprisingly, cross-entropy rewards large values of I(W; Y | X), which may promote memorization if labels contain information beyond what can be inferred from X. Such a result highlights that, in addition to the network's representational capabilities, the loss function (or, more generally, the learning algorithm) plays an important role in memorization. To this end, we wish to study the utility of limiting I(W; Y | X), and how it can be used to modify training algorithms to reduce memorization.
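As a concrete reference for the uniform label noise model used in the experiments of this chapter (e.g., the 80% noise setting of Figure 2.1), here is a minimal sketch in plain Python; the helper name and interface are ours, not the dissertation's. With probability p, a label is replaced by a uniformly chosen incorrect label, matching the description given in Section 2.2.1.

```python
import random

def corrupt_labels_uniform(labels, num_classes, p, seed=0):
    """Return a copy of `labels` where each label is replaced, with probability p,
    by a uniformly chosen *incorrect* label (uniform label noise)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            wrong = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(y)
    return noisy

# Example: corrupt 80% of MNIST-style labels (10 classes).
clean = [3, 7, 1, 0, 9, 9, 4, 2]
print(corrupt_labels_uniform(clean, num_classes=10, p=0.8))
```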
Figure 2.1: Neural networks tend to memorize labels when trained with noisy labels (80% noise in this case), even when dropout or weight decay are applied. Our training approach limits label noise information in neural network weights, avoiding memorization of labels and improving generalization. See Section 2.2.1 for more details.

Our main contributions towards this goal are as follows:

1) We show that low values of I(W; Y | X) correspond to reduction in memorization of label noise, and lead to better generalization gap bounds.

2) We propose training methods that control memorization by regularizing label noise information in weights. When the training algorithm is a variant of stochastic gradient descent, one can achieve this by controlling label noise information in gradients. A promising way of doing this is through an additional network that tries to predict the classifier gradients without using label information. We experiment with two training procedures that incorporate gradient prediction in different ways: one which uses the auxiliary network to penalize the classifier, and another which uses predicted gradients to train it. In both approaches, we employ a regularization that penalizes the L2 norm of predicted gradients to control their capacity.

3) Finally, we show that the auxiliary network can be used to detect incorrect or misleading labels.

To illustrate the effectiveness of the proposed approaches, we apply them on corrupted versions of MNIST, CIFAR-10, and CIFAR-100 with various label noise models, and on the Clothing1M dataset, which already contains noisy labels. We show that methods based on gradient prediction yield drastic improvements over standard training algorithms (like the cross-entropy loss), and outperform competitive approaches designed for learning with noisy labels.

2.2 Label noise information in weights

We begin by formally introducing a measure of label noise information in weights, and discuss its connections to memorization and generalization. Consider a labeled training set S = (Z_1, ..., Z_n) consisting of n i.i.d. examples from P_Z, with Z_i = (X_i, Y_i). Let X ≜ (X_1, ..., X_n) and Y ≜ (Y_1, ..., Y_n). Consider a neural network f_w(y | x) with parameters w that models the conditional distribution of labels. In practice, such neural networks are trained by minimizing the empirical negative log-likelihood loss:

r_S(w) = -\frac{1}{n} \sum_{i=1}^{n} \log f_w(Y_i \mid X_i).   (2.1)

For a given training algorithm Q_{W|S}, the expected value of this empirical risk can be decomposed as follows [Achille and Soatto, 2018]:

\mathbb{E}_{P_S} \mathbb{E}_{W \sim Q_{W|S}}[r_S(W)] = H(Y \mid X) + \mathbb{E}_{P_{X,W}}\big[\mathrm{KL}\big(p(Y \mid X) \,\|\, f_W(Y \mid X)\big)\big] - I(W; Y \mid X).   (2.2)

The first term measures the inherent uncertainty of labels, which is independent of the model. The second term measures how well the learned model f_W(y | x) approximates the ground truth conditional distribution p(y | x) on training inputs. The third term is the label noise information, and it enters the equation with a negative sign. The problem of minimizing this expected cross-entropy is equivalent to selecting an appropriate training algorithm.
If the labels contain information beyond what can be inferred from inputs (i.e., H(Y | X) > 0), such an algorithm may do well by memorizing the labels through the third term of Eq. (2.2). Indeed, minimizing the empirical cross-entropy loss, Q^{ERM}_{W|S} = δ(w^*(S)) with w^*(S) ∈ argmin_w r_S(w), does exactly that [Zhang et al., 2017].

2.2.1 Decreasing label noise information reduces memorization

To demonstrate that I(W; Y | X) is directly linked to memorization, we prove that any algorithm with small I(W; Y | X) overfits less to label noise in the training set.

Theorem 2.2.1. Consider a dataset S = (Z_1, ..., Z_n) of n i.i.d. examples, with Z_i = (X_i, Y_i). Assume that the domain of labels, 𝒴, is a finite set with at least two elements. Let Q_{W|S} be a possibly stochastic learning algorithm producing weights for a classifier. Let Ŷ_i denote the prediction of the classifier on the i-th example and let E_i ≜ 1{Ŷ_i ≠ Y_i} be a random variable corresponding to predicting Y_i incorrectly. Then, the following inequality holds:

\mathbb{E}\left[\sum_{i=1}^{n} E_i\right] \ge \frac{H(Y \mid X) - I(W; Y \mid X) - \sum_{i=1}^{n} H(E_i)}{\log(|\mathcal{Y}| - 1)}.   (2.3)

This result establishes a lower bound on the expected number of prediction errors on the training set, which increases as I(W; Y | X) decreases. For example, consider a corrupted version of the MNIST dataset where each label is changed with probability 0.8 to a uniformly random incorrect label. By the above bound, every algorithm for which I(W; Y | X) = 0 will make at least 80% prediction errors on the training set in expectation. In contrast, if the weights retain 1 bit of label noise information per example, the classifier will make at least 40.5% errors in expectation. Below we discuss the dependence of the error probability on I(W; Y | X).

Remark 2.2.1. Let k = |𝒴|, let r = (1/n) E[Σ_{i=1}^n E_i] denote the expected training error rate, and let h(r) = −r log r − (1 − r) log(1 − r) be the binary entropy function. Then we can simplify the result of Theorem 2.2.1 as follows:

r \ge \frac{H(Y_1 \mid X_1) - I(W; Y \mid X)/n - \frac{1}{n}\sum_{i=1}^{n} h(P(E_i = 1))}{\log(k-1)}   (2.4)
  \ge \frac{H(Y_1 \mid X_1) - I(W; Y \mid X)/n - h\!\left(\frac{1}{n}\sum_{i=1}^{n} P(E_i = 1)\right)}{\log(k-1)} \quad \text{(by Jensen's inequality)}   (2.5)
  = \frac{H(Y_1 \mid X_1) - I(W; Y \mid X)/n - h(r)}{\log(k-1)}.   (2.6)

Solving this inequality for r is challenging. One can simplify the right-hand side further by bounding H(E_1) ≤ 1 (assuming that entropies are measured in bits); however, this will loosen the bound. Alternatively, we can find the smallest r_0 for which Eq. (2.6) holds and claim that r ≥ r_0.

Figure 2.2: The lower bound r_0 on the rate of training errors r that Theorem 2.2.1 establishes for varying values of I(W; Y | X), in the case when label noise is uniform and the probability of a label being incorrect is p.

Remark 2.2.2. If |𝒴| = 2, then log(|𝒴| − 1) = 0; substituting this into Eq. (G.2) of Appendix G gives:

h(r) \ge H(Y_1 \mid X_1) - I(W; Y \mid X)/n.   (2.7)

Remark 2.2.3. When we have uniform label noise where a label is incorrect with probability p (0 ≤ p < (k−1)/k) and I(W; Y | X) = 0, the bound of Eq. (2.6) is tight, i.e., it implies that r ≥ p. To see this, we note that H(Y_1 | X_1) = h(p) + p log(k − 1), substituting which into Eq. (2.6) gives us:

r \ge \frac{h(p) + p\log(k-1) - h(r)}{\log(k-1)} = p + \frac{h(p) - h(r)}{\log(k-1)}.   (2.8)

Therefore, when r = p, the inequality holds, implying that r_0 ≤ p. To show that r_0 = p, we need to show that Eq. (2.8) is violated for any 0 ≤ r < p. If Eq. (2.8) held for some r < p, rearranging it would require (h(p) − h(r))/(p − r) ≤ −log(k − 1). However, by the mean value theorem, (h(p) − h(r))/(p − r) = h′(ξ) = log((1 − ξ)/ξ) for some ξ ∈ (r, p), and this quantity is strictly greater than −log(k − 1) when 0 ≤ p < (k − 1)/k (Eq. (2.11)). Eq. (2.11) thus directly contradicts the assumption r < p. Therefore, Eq. (2.8) cannot hold for any r < p, which proves that r_0 = p. When I(W; Y | X) > 0, we can find the smallest r_0 by a numerical method, for instance with the simple search sketched below.
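A minimal sketch of such a numerical search in plain Python (our own code, not from the dissertation); it works in nats, consistent with the rest of the chapter, and reproduces the 80% and 40.5% figures quoted after Theorem 2.2.1.

```python
import math

def h(r):
    """Binary entropy in nats."""
    if r <= 0.0 or r >= 1.0:
        return 0.0
    return -r * math.log(r) - (1 - r) * math.log(1 - r)

def smallest_r0(h_y_given_x, info_per_example, k, grid=100000):
    """Smallest r in [0, 1] satisfying Eq. (2.6):
       r >= (H(Y_1|X_1) - I(W;Y|X)/n - h(r)) / log(k - 1),
    found by a simple scan. All quantities are in nats; k >= 3 is the number of classes."""
    log_k1 = math.log(k - 1)
    for i in range(grid + 1):
        r = i / grid
        if r >= (h_y_given_x - info_per_example - h(r)) / log_k1:
            return r
    return 1.0

# Uniform label noise with corruption probability p and k classes:
# H(Y_1 | X_1) = h(p) + p * log(k - 1).
p, k = 0.8, 10
h_yx = h(p) + p * math.log(k - 1)
print(smallest_r0(h_yx, 0.0, k))           # ~0.80: no label noise information in weights
print(smallest_r0(h_yx, math.log(2), k))   # ~0.405: one bit (= log 2 nats) per example
```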
(2.8) cannot hold for anyr 0, we can find the smallest r 0 by a numerical method. Figure 2.2 plotsr 0 vs I(W;Y |X) when the label noise is uniform. When the label noise is not uniform, the bound of Eq. (2.6) becomes loose as Fano’s inequality becomes loose. Theorem 2.2.1 provides theoretical guarantees that memorization of noisy labels is prevented when I(W;Y |X) is small, in contrast to standard regularization techniques – such as dropout, weight decay, and data augmentation – which only slow it down [Zhang et al., 2017, Arpit et al., 2017]. To demonstrate this empirically, we compare an algorithm that controlsI(W;Y |X) (presented in Section 2.3) against these 10 regularization techniques on the aforementioned corrupted MNIST setup. We see in Figure 2.1 that explicitly preventing memorization of label noise information leads to optimal training performance (20% training accuracy) and good generalization on a non-corrupted validation set. Other approaches quickly exceed 20% training accuracy by incorporating label noise information, and generalize poorly as a consequence. The classifier here is a fully connected neural network with 4 hidden layers each having 512 ReLU units. The rates of dropout and weight decay were selected according to the performance on a validation set. 2.2.2 Decreasinglabelnoiseinformationimprovesgeneralization The information that weights contain about a training datasetS has previously been linked to general- ization [Xu and Raginsky, 2017]. While we will explore connections between information in weights and generalization in detail in Chapters 4 and 5, it is instructive to consider here the seminal result of Xu and Raginsky [2017]. They prove that whenℓ(w,Z ′ ), withZ ′ ∼ P Z , isσ -subgaussian, we have that E P S E W∼ Q W|S |E[R(W)− r S (W)]| | {z } expected generalization gap ≤ r 2σ 2 n I(W;S). (2.12) For good test performance, learning algorithms need to have both a small generalization gap, and good training performance. The latter may require retaining more information about the training set, meaning there is a natural conflict between increasing training performance and decreasing the generalization gap bound of Eq. (2.12). Furthermore, information in weights can be decomposed as follows: I(W;S)= I(W;X)+I(W;Y |X). We claim that one needs to prioritize reducingI(W;Y |X) overI(W;X) for the following reason. When noise is present in the training labels, fitting this noise implies a non-zero value ofI(W;Y |X), which grows linearly with the number of samplesn. In such cases, the generalization gap bound of Eq. (2.12) does not improve asn increases. To get meaningful generalization bounds via Eq. (2.12) one needs to limit I(W;Y | X). We hypothesize that for efficient learning algorithms, this condition might be also sufficient. 2.3 Methodslimitinglabelinformation We now consider how to design training algorithms that controlI(W;Y |X). We assumef w (y|x)= Multinoulli(y;softmax(a)), wherea is the output of a neural networkh w (x). We consider the case when h w (x) is trained with a variant of stochastic gradient descent forT iterations. The inputs and labels of a mini-batch at iterationt are denoted byX t andY t respectively, and are selected using a deterministic procedure (such as cycling through the dataset or using pseudo-randomness). To keep the notation simple, we treat (X t ,Y t ) as a single example. The derivations below can be straightforwardly extended to the case whenX t andY t are a mini-batch of inputs and labels. 
LetW 0 denote the random weights at initialization, andW t denote the weights after iterationt. Let ℓ(w,(x,y)) be some classification loss function (e.g, the cross-entropy loss). Let G ℓ t ≜∇ w ℓ(W t− 1 ,(X t ,Y t )) be the gradient at iterationt. Let the update rule beW t =Ψ( W 0 ,G 1:t ), whereG t denote the gradients used to update the weights, possibly different from G ℓ t . The final weights W T are the output of the algorithm (i.e.,W =W T ). In the simplest case,W T =W 0 − P T t=1 η t G t , with some learning rate scheduleη t . To limitI(W;Y |X) the following sections will discuss two approximations which relax the computa- tional difficulty, while still providing meaningful bounds: 1) first, we show that the information in weights can be replaced by information in the gradients; 2) we introduce a variational bound on the information in gradients. The bound employs an auxiliary network that predicts gradients of the original loss without 11 label information. We then explore two ways of incorporating predicted gradients: (a) using them in a regularization term for gradients of the original loss, and (b) using them to train the classifier. 2.3.1 Penalizinginformationingradients Looking at Eq. (2.2) it is tempting to addI(W;Y |X) as a regularization toE P S,W [r S (W)] and minimize over all training algorithms: min Q W|S E P S,W [r S (W)]+I(W;Y |X). (2.13) This will become equivalent to minimizingE P X,W KL(p(Y |X)∥f W (Y |X)). Unfortunately, the opti- mization problem Eq. (2.13) is hard to solve for two major reasons. First, the optimization is over training algorithms (rather than over the weights of a classifier, as in the standard supervised learning setting). Second, the penaltyI(W;Y |X) is hard to compute. To simplify the problem of Eq. (2.13), we relate information in weights to information in gradients as follows: I(W;Y |X)≤ I(G 1:T ;Y |X)= T X t=1 I(G t ;Y |X,G <t ), (2.14) where G <t is a shorthand for the set{G 1 ,...,G t− 1 }. Hereafter, we focus on constraining I(G t ;Y | X,G <t ) at each iteration. Our task becomes choosing a loss function to optimize such that I(G t ;Y | X,G <t ) is small andf Wt (y|x) is a good classifier. One key observation is that if our task is to minimize label noise information in gradients it may be helpful to consider gradients with respect to the last layer only and compute the remaining gradients using back-propagation. As these steps of back-propagation do not use labels, by data processing inequality, subsequent gradients would have at most as much label information as the last layer gradient. To simplify information-theoretic quantities, we add a small independent Gaussian noise to the gradients of the original loss: ˜ G ℓ t ≜ G ℓ t +ξ t , whereξ t ∼N (0,σ 2 ξ I) andσ ξ is small enough to have no significant effect on training (e.g., less than 10 − 9 ). With this convention, we formulate the following regularized objective function: min w ℓ(W t− 1 ,(X t ,Y t ))+λI ( ˜ G ℓ t ;Y |X,G <t ), (2.15) whereλ > 0 is a regularization coefficient. The term I( ˜ G ℓ t ;Y | X,G <t ) is a functionΦ( W t− 1 ;X t ) of W t− 1 andX t . Computing this function would allow the optimization of Eq. (2.15) through gradient descent: G t =G ℓ t +ξ t +∇ w Φ( W t− 1 ;X t ). Importantly, label noise information is equal in bothG t and ˜ G ℓ t , as the gradient from the regularization is constant givenX andG <t : I(G t ;Y |X,G <t )=I(G ℓ t +ξ t +∇ w Φ( W t− 1 ;X t );Y |X,G <t ) (2.16) =I(G ℓ t +ξ t ;Y |X,G <t ) (2.17) =I( ˜ G ℓ t ;Y |X,G <t ). 
(2.18) Therefore, by minimizingI( ˜ G ℓ t ;Y |X,G <t ) in Eq. (2.15) we minimizeI(G t ;Y |X,G <t ), which is used to upper boundI(W;Y |X) in Eq. (2.14). We rewrite this regularization in terms of entropy and discard the constant term,H(ξ t ): I( ˜ G ℓ t ;Y |X,G <t )=H( ˜ G ℓ t |X,G <t )− H( ˜ G ℓ t |X,Y,G <t ) =H( ˜ G ℓ t |X,G <t )− H(ξ t ). (2.19) 12 2.3.2 Variationalboundsongradientinformation The first term in Eq. (2.19) is still challenging to compute, as we typically only have one sample from the unknown distribution p(y t | x t ). Nevertheless, we can upper bound it with the cross-entropy H p,q = − E ˜ G ℓ t h logq ϕ ( ˜ G ℓ t |X,G <t ) i , whereq ϕ (·| x,g <t ) is a variational approximation forp(˜ g ℓ t |x,g <t ): H( ˜ G ℓ t |X,G <t )≤− E ˜ G ℓ t h logq ϕ ( ˜ G ℓ t |X,G <t ) i . (2.20) This bound is correct whenϕ is a constant or a random variable that depends only onX. With this upper bound, Eq. (2.15) reduces to: min w,ϕ ℓ(W t− 1 ,(X t ,Y t ))− λ E ˜ G ℓ t h logq ϕ ( ˜ G ℓ t |X,G <t ) i . (2.21) This formalization introduces a soft constraint on the classifier by attempting to make its gradients pre- dictable without labelsY , effectively reducing I(G t ;Y |X,G <t ). Assumingb y =softmax(h w (x)) denotes the predicted class probabilities of the classifier and ℓ is the cross-entropy loss, the gradient with respect to logits a = h w (x) isb y− y (assuming y has a one-hot encoding). We have thatI( b Y t − Y t +ξ t ;Y | X,G <t ) =I(Y t +ξ t ;Y | X,G <t ). Therefore, if we only consider gradients with respect to logits in the penalty of Eq. (2.21), the resulting penalty would not serve as a meaningful regularizer, as it has no dependent onW t− 1 . Instead, we descend an additional level to consider gradients of the final layer parameters. When the final layer of h w (x) is fully connected with inputsz and weightsU (i.e.,a=Uz), the gradients with respect to its parameters is equal to(b y− y)z T . If we only consider this gradient in Eq. (2.21), we have thatI( ˜ G ℓ t ;Y |X,G <t )=I(( b Y t − Y t )Z T t +ξ t ;Y | X,G <t ) = I(Y t Z T t +ξ t ;Y | X,G <t ). There is now dependence on W t− 1 through Z t . We choose to parametrizeq ϕ (·| x,g <t ) as a Gaussian distribution with meanµ t =(softmax(a t )− softmax(r ϕ (x t )))z T t and fixed covariance σ q I, wherer ϕ (·) is another neural network. Under this assumption,H p,q becomes proportional to: E ( b Y t − Y t )Z T t +ξ t − µ t 2 2 =E ξ 2 t +E h ∥Z t ∥ 2 2 ∥Y t − softmax(r ϕ (X t ))∥ 2 2 i . (2.22) Ignoring constants and approximating the expectation above with one Monte Carlo sample computed using the labelY t , the objective of Eq. (2.21) becomes: min w ℓ(W t− 1 ,(X t ,Y t ))+λ ∥Z t ∥ 2 2 ∥Y t − softmax(r ϕ (X t ))∥ 2 2 . (2.23) While this may work in principle, in practice the dependence onw is only through the norm ofZ t , making it weak to have much effect on the overall objective. We confirm this experimentally in Section 2.4. To introduce more complex dependencies onw, one would need to model the gradients of deeper layers. 2.3.3 Predictinggradientswithoutlabelinformation An alternative approach is to use gradients predicted byq ϕ (·| X,G <t ) to update classifier weights, i.e., sampleG t ∼ q ϕ (·| X,G <t ). This is a much stricter condition, as it impliesI(G t ;Y |X,G <t )=0 (again assumingϕ is a constant or a random variable that depends only onX). Note that minimizingH p,q makes 13 the predicted gradientG t a good substitute for the cross-entropy gradients ˜ G ℓ t . 
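To make this concrete, a condensed PyTorch-style sketch of one such training step is given below, as a preview of the procedure derived in the remainder of this section. It predicts only the gradient with respect to the logits, uses the Gaussian parameterization with mean softmax(a_t) - softmax(r_phi(x_t)), disables sampling of G_t, and includes the gradient-norm penalty discussed in Section 2.3.4 below. The objects classifier, grad_predictor, opt_w, opt_phi, and beta are placeholders, and several details are simplified, so this should be read as an illustration rather than as the implementation linked in Algorithm 1.

import torch
import torch.nn.functional as F

def limit_step(classifier, grad_predictor, opt_w, opt_phi, x, y, beta):
    """One LIMIT-style step: Gaussian parameterization, sampling of G_t disabled."""
    a = classifier(x)                                  # logits a_t = h_w(x_t)
    p = F.softmax(a, dim=1).detach()
    ce_grad = p - F.one_hot(y, p.shape[1]).float()     # cross-entropy gradient w.r.t. logits

    # Predicted gradient mean mu_t = softmax(a_t) - softmax(r_phi(x_t)); no labels are used here.
    mu = p - F.softmax(grad_predictor(x), dim=1)

    # Update w by backpropagating the predicted gradient through the classifier only.
    opt_w.zero_grad()
    a.backward(mu.detach())
    opt_w.step()

    # Update phi so that mu_t matches the cross-entropy gradient (the Gaussian negative
    # log-likelihood reduces to a mean squared error), plus the norm penalty on mu_t.
    loss_phi = F.mse_loss(mu, ce_grad) + beta * mu.pow(2).sum(dim=1).mean()
    opt_phi.zero_grad()
    loss_phi.backward()
    opt_phi.step()

We now return to the derivation.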
Therefore, we write down the following objective function: min w,ϕ ˜ ℓ(W t− 1 ,ϕ,X t )− λ E ˜ G ℓ t h logq ϕ ( ˜ G ℓ t |X,G <t ) i , (2.24) where ˜ ℓ(w,ϕ,x ) is a probabilistic function defined implicitly such that ∇ w ˜ ℓ(w,ϕ,x )∼ q ϕ (·| x,w). We found that this approach performs significantly better than the penalizing approach of Eq. (2.21). To update w only with predicted gradients, we disable the dependence of the second term of Eq. (2.24) onw in the implementation. Additionally, one can setλ = 1 above as the first term depends only on w, while the second term depends only onϕ . We choose to predict the gradients with respect to logits only and compute the remaining gradients using backpropagation. We consider two distinct parameterizations forq ϕ –Gaussian:q ϕ (·| x,w)=N µ,σ 2 q I , and Laplace: q ϕ (· | x,w) = Q j Lap µ j ,σ q / √ 2) , with µ = softmax(a)− softmax(r ϕ (x)) and r ϕ (·) being an auxiliary neural network as before. Under these Gaussian and Laplace parameterizations,H p,q becomes proportional toE µ t − ˜ G ℓ t 2 2 andE µ t − ˜ G ℓ t 1 respectively. In the Gaussian caseϕ is updated with a mean square error loss (MSE) function, while in the Laplace case it is updated with a mean absolute error loss (MAE). The former is expected to be faster at learning, but less robust to noise [Ghosh et al., 2017]. 2.3.4 Reducingoverfittingingradientprediction In both approaches of (2.21) and (2.24), the classifier can still overfit if q ϕ (· | x,w) overfits. There are multiple ways to prevent this. One can chooseq ϕ to be parametrized with a small network, or pretrain and freeze some of its layers in an unsupervised fashion. In this work, we choose to control the L2 norm of the mean of predicted gradients,∥µ ∥ 2 2 , while keeping the varianceσ 2 q fixed. This can be viewed as limiting the capacity of gradientsG t . Proposition 2.3.1. IfG t = µ t +ϵ t , whereϵ t ∼N (0,σ 2 q I d ) is independent noise, andE µ T t µ t ≤ L 2 , then the following inequality holds: I(G t ;Y |X,G <t )≤ d 2 log 1+ L 2 dσ 2 q . (2.25) The same bound holds whenϵ t is sampled from a product ofd univariate zero-mean Laplace distributions with variance σ 2 q , since the proof relies only on ϵ t being zero-mean and having variance σ 2 q . The final objective of our main method becomes: min w,ϕ ˜ ℓ(W t− 1 ,ϕ,X t )− λ E ˜ G ℓ t h logq ϕ ( ˜ G ℓ t |X,G <t ) i +β ∥µ t ∥ 2 2 . (2.26) As before, to updatew only with predicted gradients, we disable the dependence of the second and third terms above onw in the implementation. We name this final approach LIMIT –limiting labelinformation memorization in training. We denote the variants with Gaussian and Laplace distributions as LIMIT G and LIMIT L respectively. The pseudocode of LIMIT is presented Algorithm 1. Note that in contrast to the previous approach of Eq. (2.15), this follows the spirit of Eq. (2.13), in the sense that the optimization overϕ can be seen as optimizing over training algorithms; namely, learning a loss function implicitly through gradients. With this interpretation, the gradient norm penalty can be viewed as a way to smooth the learned loss, which is a good inductive bias and facilitates learning. 14 Algorithm1 LIMIT: limiting label information memorization in training. Our implementation is available at https://github.com/hrayrhar/limit-label-memorization. 1: Input: Training datasetS. 2: Input: Gradient norm regularization coefficient β . {λ is set to1} 3: Initialize the classifier f w (y|x) and gradient predictorq ϕ (·| x,g <t ). 
4: fort=1..T do 5: Fetch the next batch(X t ,Y t ) and compute the predicted logitsa t . 6: Compute the cross-entropy gradient,G ℓ t ← softmax(a t )− Y t . 7: if sampling of gradients is enabledthen 8: G t ∼ q ϕ (·| X,G <t ). 9: else 10: G t ← µ t {the mean of predicted gradient} 11: endif 12: Starting withG t , backpropagate to compute the gradient with respect tow. 13: UpdateW t− 1 toW t . 14: Updateϕ using the gradient of the following loss:− logq ϕ ( ˜ G ℓ t |X,G <t )+β ∥µ t ∥ 2 2 . 15: endfor 2.4 Experiments We set up experiments with noisy datasets to see how well the proposed methods perform for different types and amounts of label noise. The simplest baselines in our comparison are standard cross-entropy (CE) and mean absolute error (MAE) loss functions. The next baseline is the forward correction approach (FW) proposed by Patrini et al. [2017], where the label noise transition matrix is estimated and used to correct the loss function. Finally, we include the determinant mutual information (DMI) loss, which is the log-determinant of the confusion matrix between predicted and given labels [Xu et al., 2019]. Both FW and DMI baselines require initialization with the best result of the CE baseline. To avoid small experimental differences, we implement all baselines, closely following the original implementations of FW and DMI. We train all baselines except DMI using the ADAM optimizer [Kingma and Ba, 2014] with learning rate α =10 − 3 andβ 1 =0.9. As DMI is very sensitive to the learning rate, we tune it by choosing the best from the following grid of values 10 − 3 ,10 − 4 ,10 − 5 ,10 − 6 . The soft regularization approach of Eq. (2.23) has two hyperparameters: λ andβ . We selectλ from{0.001,0.01,0.03,0.1} andβ from{0.0,0.01,0.1,1.0,10.0}. The objective of LIMIT instances has two terms:λH p,q andβ ∥µ t ∥ 2 2 . Consequently, we need only one hyper- parameter instead of two. We choose to setλ =1 and selectβ from{0.0,0.1,0.3,1.0,3.0,10.0,30.0,100.0}. When sampling is enabled, we selectσ q from{0.01,0.03,0.1,0.3}. For all baselines, model selection is done by choosing the model with highest accuracy on a validation set that follows the noise model of the corresponding train set. All scores are reported on a clean test set. The implementation of the proposed method and the code for replicating the experiments is available at https://github.com/hrayrhar/limit-label-memorization. 2.4.1 MNISTwithuniformlabelnoise To compare the variants of our approach discussed earlier and see which ones work well, we do experiments on the MNIST dataset with corrupted labels. In this experiment, we use a simple uniform label noise model, where each label is set to an incorrect value uniformly at random with probabilityp. In our experiments we try 4 values ofp – 0%, 50%, 80%, 89%. We split the 60K images of MNIST into training and validation sets, containing 48K and 12K samples respectively. For each noise amount we try 3 different training set sizes – 10 3 ,10 4 , and4.8· 10 4 . All classifiers and auxiliary networks are 5-layer convolutional neural networks 15 Table 2.1: Architecture of a convolutional neural network with 5 hidden layers. 
Layer type Parameters Conv 32 filters, 4× 4 kernels, stride 2, padding 1, batch normalization, ReLU Conv 32 filters, 4× 4 kernels, stride 2, padding 1, batch normalization, ReLU Conv 64 filters, 3× 3 kernels, stride 2, padding 0, batch normalization, ReLU Conv 256 filters, 3× 3 kernels, stride 1, padding 0, batch normalization, ReLU FC 128 units, ReLU FC 10 units, linear activation 0 50 100 150 200 250 300 350 400 Epochs 0.25 0.50 0.75 1.00 Train accuracy CE CE + GN CE + LN MAE FW DMI Soft reg. (8) LIMIT + S LIMIT + S LIMIT - S LIMIT - S (a) Training curves 0 50 100 150 200 250 300 350 400 Epochs 0.25 0.50 0.75 1.00 T est accuracy CE CE + GN CE + LN MAE FW DMI Soft reg. (8) LIMIT + S LIMIT + S LIMIT - S LIMIT - S (b) Testing curves Figure 2.3: Smoothed training and testing accuracy plots of various approaches on MNIST with 80% uniform noise. (CNN) described in Table 2.1. We train all models for 400 epochs and terminate the training early when the best validation accuracy is not improved in the previous 100 epochs. For this experiment we include two additional baselines where additive noise (Gaussian or Laplace) is added to the gradients with respect to logits. We denote these baselines with names “CE + GN” and “CE + LN”. The comparison with these two baselines demonstrates that the proposed method does more than simply reduce information in gradients via noise. We also consider a variant of LIMIT where instead of samplingG t fromq we use the predicted meanµ t . Table 2.2 shows the average test performances and standard deviations of different approaches over 5 training/validation splits. Additionally, Figure 2.3 shows the training and testing performances of the best methods during the training whenp=0.8 and all training samples are used. Overall, variants of LIMIT produce the best results and improve significantly over standard approaches. The variants with a Laplace distribution perform better than those with a Gaussian distribution. This is likely due to the robustness of MAE. Interestingly, LIMIT works well and trains faster when the sampling ofG t inq is disabled (rows with “-S”). Thus, hereafter we consider this as our primary approach. As expected, the soft regularization approach of Eq. (2.21) and cross-entropy variants with noisy gradients perform significantly worse than LIMIT. We exclude these baselines in our future experiments. Effectivenessofgradientnormpenalty. Additionally, we test the importance of penalizing norm of predicted gradients by comparing training and testing performances of LIMIT with varying regularization strengthβ . Figure 2.4 presents the training and testing accuracy curves of LIMIT with varying values ofβ . We see that increasingβ decreases overfitting on the training set and usually results in better generalization. 16 Table 2.2: Test accuracy comparison on multiple versions of MNIST corrupted with uniform label noise. The error bars are standard deviations computed over 5 random training/validation splits. Method p=0.0 p=0.5 n=10 3 n=10 4 All n=10 3 n=10 4 All CE 94.3± 0.5 98.4± 0.2 99.2± 0.0 71.8± 4.3 93.1± 0.6 97.2± 0.2 CE + GN 89.5± 0.8 95.4± 0.5 97.1± 0.5 70.5± 3.5 92.3± 0.7 97.4± 0.5 CE + LN 90.0± 0.5 95.3± 0.6 96.7± 0.7 66.8± 1.3 92.0± 1.5 97.6± 0.1 MAE 94.6± 0.5 98.3± 0.2 99.1± 0.1 75.6± 5.0 95.7± 0.5 98.1± 0.1 FW 93.6± 0.6 98.4± 0.1 99.2± 0.1 64.3± 9.1 91.6± 2.0 97.3± 0.3 DMI 94.5± 0.5 98.5± 0.1 99.2± 0.0 79.8± 2.9 95.7± 0.3 98.3± 0.1 Soft reg. 
(2.23) 95.7± 0.2 98.4± 0.1 99.2± 0.0 76.4± 2.4 95.7± 0.0 98.2± 0.1 LIMIT G + S 95.6± 0.3 98.6± 0.1 99.3± 0.0 82.8± 4.6 97.0± 0.1 98.7± 0.1 LIMIT L + S 94.8± 0.3 98.6± 0.2 99.3± 0.0 88.7± 3.8 97.6± 0.1 98.9± 0.0 LIMIT G - S 95.7± 0.2 98.7± 0.1 99.3± 0.1 83.3± 2.3 97.1± 0.2 98.6± 0.1 LIMIT L - S 95.0± 0.2 98.7± 0.1 99.3± 0.1 88.2± 2.9 97.7± 0.1 99.0± 0.1 Method p=0.8 p=0.89 n=10 3 n=10 4 All n=10 3 n=10 4 All CE 27.0± 3.8 69.9± 2.6 87.2± 1.0 10.3± 1.6 13.4± 3.3 13.2± 1.8 CE + GN 25.9± 4.6 51.9± 10.5 85.3± 8.3 10.4± 4.5 10.2± 3.3 11.1± 0.4 CE + LN 30.2± 4.8 53.1± 6.4 74.5± 19.1 11.9± 3.9 8.8± 5.4 14.1± 4.3 MAE 25.1± 3.3 74.6± 2.7 93.2± 1.1 10.9± 1.4 12.1± 3.9 17.6± 8.1 FW 19.0± 4.1 61.2± 5.0 89.1± 2.1 8.7± 2.8 11.4± 1.4 12.3± 1.8 DMI 30.3± 5.1 79.0± 1.5 88.8± 0.9 10.5± 1.2 14.1± 5.1 12.5± 1.5 Soft reg. (2.23) 28.8± 2.2 67.0± 1.9 89.3± 0.6 10.3± 1.6 10.5± 0.8 12.7± 2.6 LIMIT G + S 35.9± 6.3 80.6± 2.8 93.4± 0.5 10.0± 1.0 14.3± 5.4 13.1± 4.3 LIMIT L + S 35.6± 3.2 93.3± 0.3 97.6± 0.3 10.1± 0.7 12.5± 2.1 28.3± 8.1 LIMIT G - S 37.1± 5.4 82.0± 1.5 94.7± 0.6 9.9± 1.0 12.6± 0.3 16.0± 5.9 LIMIT L - S 35.9± 4.3 93.9± 0.8 97.7± 0.2 11.1± 0.7 11.8± 1.0 28.6± 4.0 2.4.2 CIFAR-10withuniformandpairnoise Next we consider a harder dataset, CIFAR-10 [Krizhevsky et al., 2009], with two label noise models: uniform noise and pair noise. For pair noise, certain classes are confused with some other similar class. Following the setup of Xu et al. [2019] we use the following four pairs: truck→ automobile, bird→ airplane, deer → horse, cat→ dog. Note in this type of noiseH(Y |X) is much smaller than in the case of uniform noise. We split the 50K images of CIFAR-10 into training and validation sets, containing 40K and 10K samples respectively. For the CIFAR experiments we use ResNet-34 networks [He et al., 2016] that differ from the standard ResNet-34 architecture (which is used for224× 224 images) in two ways: (a) the first convolutional layer has 3x3 kernels and stride 1, and (b) the max pooling layer after it is skipped. We use standard data augmentation consisting of random horizontal flips and random 28x28 crops padded back to 32x32. We train all models for 400 epochs and terminate the training early when the best validation accuracy is not improved in the previous 100 epochs. For our proposed methods, the auxiliary networkq is ResNet-34 as well. We noticed that for more difficult datasets, it may happen that while q still learns to produce good gradients, the updates with these less informative gradients may corrupt the initialization of the classifier. For this reason, we add an additional variant of LIMIT, which initializes theq network with the best CE baseline, similar to the DMI and FW baselines. 17 0 50 100 150 200 250 300 350 400 Epoch 0.25 0.50 0.75 1.00 Train accuracy = 0.0 = 0.1 = 0.3 = 1.0 = 3.0 = 10.0 = 30.0 = 100.0 (a) Training performance of LIMIT G - S 0 50 100 150 200 250 300 350 400 Epoch 0.2 0.4 0.6 Train accuracy = 0.0 = 0.1 = 0.3 = 1.0 = 3.0 = 10.0 = 30.0 = 100.0 (b) Training performance of LIMIT L - S 0 50 100 150 200 250 300 350 400 Epoch 0.25 0.50 0.75 Test accuracy (c) Testing performance of LIMIT G - S 0 50 100 150 200 250 300 350 400 Epoch 0.4 0.6 0.8 1.0 Test accuracy (d) Testing performance LIMIT L - S Figure 2.4: Training and testing accuracies of “LIMIT G - S” and “LIMIT L - S” instances with varying values ofβ on MNIST with 80% uniform label noise. The curves are smoothed for better presentation. Table 2.3 presents the results on CIFAR-10. 
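For reference, the CIFAR-specific modification of ResNet-34 described above can be written in a few lines. The sketch below assumes torchvision's resnet34 implementation and a padding of 1 for the new stem convolution; neither of these is specified in the text, so this is an illustration rather than the exact code used in the experiments.

import torch.nn as nn
from torchvision.models import resnet34

def cifar_resnet34(num_classes: int = 10) -> nn.Module:
    model = resnet34(num_classes=num_classes)
    # (a) replace the 7x7, stride-2 stem with a 3x3, stride-1 convolution
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    # (b) skip the max pooling layer that follows it
    model.maxpool = nn.Identity()
    return model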
Again, variants of LIMIT improve significantly over standard baselines, especially in the case of uniform label noise. As expected, whenq is initialized with the best CE model (similar to FW and DMI), the results are better. As in the case of MNIST, our approach helps even when the dataset is noiseless. 2.4.3 CIFAR-100withuniformlabelnoise To test proposed methods on a classification task with many classes, we apply them on CIFAR-100 with 40% uniform noise. We use the same networks and training strategies as in the case of CIFAR-10. Results presented in Table 2.4 indicate several interesting phenomena. First, training with the MAE loss fails, which was observed by other works as well [Zhang and Sabuncu, 2018]. The gradient of MAE with respect to logits isf(x) y (b y− y). Whenf(x) y is small, there is small signal to fix the mistake. In fact, in the case of CIFAR-100,f(x) y is approximately 0.01 in the beginning, slowing down the training. The performance of FW degrades as the approximation error of noise transition matrix become large. The DMI does not give significant improvement over CE due to numerical issues with computing a determinant of a 100x100 confusion matrix. LIMIT L performs worse than other variants, as trainingq with MAE becomes challenging. However, performance improves whenq is initialized with the CE model. LIMIT G does not suffer from the mentioned problem and works with or without initialization. 2.4.4 Clothing1M Finally, as in our last experiment, we consider the Clothing1M dataset [Xiao et al., 2015], which has 1M images labeled with one of 14 possible clothing labels. The dataset has very noisy training labels, with roughly 40% of examples incorrectly labeled. More importantly, the label noise in this dataset is realistic and instance dependent. For this dataset we use ResNet-50 networks [He et al., 2016] and employ standard data augmentation, consisting of random horizontal flips and random crops of size 224x224 after resizing images to size 256x256. We train all models for 30 epochs. The results shown in the last column of Table 2.4 demonstrate that DMI and LIMIT with initialization perform the best, producing similar results. 18 Table 2.3: Test accuracy comparison on CIFAR-10, corrupted with uniform label noise (top) and pair label noise (bottom). The error bars are standard deviations computed by bootstrapping the test set 1000 times. Method uniform label noise p=0.0 p=0.2 p=0.4 p=0.6 p=0.8 CE 92.7± 0.3 85.2± 0.4 81.0± 0.4 69.0± 0.5 38.8± 0.5 MAE 84.4± 0.4 85.4± 0.4 64.6± 0.5 15.4± 0.4 12.0± 0.3 FW 92.9± 0.3 86.2± 0.3 81.4± 0.4 69.7± 0.5 34.4± 0.5 DMI 93.0± 0.3 88.3± 0.3 85.0± 0.3 72.5± 0.4 38.9± 0.5 LIMIT G 93.5± 0.2 90.7± 0.3 86.6± 0.3 73.7± 0.4 38.7± 0.5 LIMIT L 93.1± 0.3 91.5± 0.3 88.2± 0.3 75.7± 0.4 35.8± 0.5 LIMIT G + init. 93.3± 0.3 92.4± 0.3 90.3± 0.3 81.9± 0.4 44.1± 0.5 LIMIT L + init. 93.3± 0.2 92.2± 0.3 90.2± 0.3 82.9± 0.4 44.3± 0.5 Method pair label noise p=0.0 p=0.2 p=0.4 p=0.6 p=0.8 CE 92.7± 0.3 90.0± 0.3 88.1± 0.3 87.2± 0.3 81.8± 0.4 MAE 84.4± 0.4 88.6± 0.3 83.2± 0.4 72.1± 0.4 61.1± 0.5 FW 92.9± 0.3 90.1± 0.3 88.0± 0.3 86.8± 0.3 84.6± 0.3 DMI 93.0± 0.3 91.4± 0.3 90.6± 0.3 90.4± 0.3 89.6± 0.3 LIMIT G 93.5± 0.2 92.8± 0.3 91.3± 0.3 89.2± 0.3 86.0± 0.3 LIMIT L 93.1± 0.3 91.9± 0.3 91.1± 0.3 88.8± 0.3 84.2± 0.4 LIMIT G + init. 93.3± 0.3 93.3± 0.3 92.9± 0.3 90.8± 0.3 88.3± 0.3 LIMIT L + init. 93.3± 0.2 93.0± 0.2 92.3± 0.3 91.1± 0.3 90.0± 0.3 Table 2.4: Test accuracy comparison on CIFAR-100 with 40% uniform label noise and on Clothing1M dataset. 
The error bars are standard deviations computed by bootstrapping the test set 1000 times. Method CIFAR-100 Clothing1M 40% uniform label noise original noisy train set CE 44.9± 0.5 68.91± 0.46 MAE 1.8± 0.1 6.52± 0.23 FW 23.3± 0.4 68.70± 0.45 DMI 46.1± 0.5 71.19± 0.43 LIMIT G 58.4± 0.4 70.32± 0.42 LIMIT L 49.5± 0.5 70.35± 0.45 LIMIT G + init. 59.2± 0.5 71.39± 0.44 LIMIT L + init. 60.8± 0.5 70.53± 0.44 2.4.5 Detectingmislabeledexamples In the proposed approach, the auxiliary networkq should not be able to distinguish correct and incorrect samples, unless it overfits. In fact, Figure 2.5 shows that if we look at the norm of predicted gradients, examples with correct and incorrect labels are indistinguishable in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and have large overlap in harder cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). Therefore, we hypothesize that the auxiliary network learns to utilize incorrect samples effectively by predicting “correct” gradients. Figure 2.6 confirms this intuition, demonstrating that this distance separates correct and incorrect samples perfectly in easy cases (MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise) and separates them well in harder 19 30 20 10 0 log|| || 2 2 0.00 0.02 0.04 0.06 0.08 Density incorrect correct (a) MNIST 80% uniform noise 40 20 0 log|| || 2 2 0.00 0.02 0.04 0.06 0.08 Density incorrect correct (b) CIFAR-10 40% uniform noise 40 20 0 log|| || 2 2 0.00 0.05 0.10 0.15 0.20 0.25 Density incorrect correct (c) CIFAR-10 40% pair noise 40 20 0 log|| || 2 2 0.00 0.02 0.04 0.06 0.08 Density incorrect correct (d) CIFAR-100 40% uniform noise Figure 2.5: Histograms of the norm of predicted gradients for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT. 0.0 0.5 1.0 1.5 2.0 || g || 2 2 0 5 10 15 20 25 Density ROC AUC score: 0.9987 incorrect correct (a) MNIST 80% uniform noise 0.0 0.5 1.0 1.5 2.0 || g || 2 2 0 5 10 15 20 25 Density ROC AUC score: 0.9913 incorrect correct (b) CIFAR-10 40% uniform noise 0.0 0.5 1.0 1.5 2.0 || g || 2 2 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 Density ROC AUC score: 0.9109 incorrect correct (c) CIFAR-10 40% pair noise 0.0 0.5 1.0 1.5 2.0 || g || 2 2 0 5 10 15 20 Density ROC AUC score: 0.9585 incorrect correct (d) CIFAR-100 40% uniform noise Figure 2.6: Histograms of the distance between predicted and actual gradient for examples with correct and incorrect labels. The gradient predictions are done using the best instances of LIMIT. cases (CIFAR-10 with 40% pair noise and CIFAR-100 with 40% uniform noise). If we interpret this distance as a score for classifying correctness of a label, we get 91.1% ROC AUC score in the hardest case: CIFAR-10 with 40% pair noise, and more than 99% score in the easier cases: MNIST with 80% uniform noise and CIFAR-10 with 40% uniform noise. Motivated by these results, we use the same technique to detect samples with incorrect or confusing labels in the original MNIST, CIFAR-10, and Clothing1M datasets. Figure 2.7 presents one incorrectly labeled or confusing example per class. More examples for each class are presented in Figures 6.9 and 6.10 of the Appendix J. 2.5 Relatedwork Our approach is related to many works that study memorization and learning with noisy labels. Our work also builds on theoretical results studying how generalization relates to information in neural network weights. 
In this section we present the related work and discuss the connections. Learningwithnoisylabels. Learning with noisy labels is a longstanding problem and has been studied extensively [Frenay and Verleysen, 2014]. Many works studied and proposed loss functions that are robust 20 0 1 2 3 4 5 6 7 8 9 (a) MNIST airplane car bird cat deer dog frog horse ship truck (b) CIFAR-10 T-Shirt Shirt Knitwear Chiffon Sweater Hoodie Windbreaker Jacket Downcoat Suit Shawl Dress Vest Underwear (c) Clothing1M Figure 2.7: Most mislabeled examples in MNIST, CIFAR-10, and Clothing1M datasets, according to the distance between predicted and cross-entropy gradients. to label noise. Natarajan et al. [2013] propose robust loss functions for binary classification with label- dependent noise. Ghosh et al. [2017] generalize this result for multiclass classification problem and show that the mean absolute error (MAE) loss function is tolerant to label-dependent noise. However, as seen in our experiments, training with MAE progresses slowly and performs poorly on challenging datasets. Zhang and Sabuncu [2018] propose a new loss function, called generalized cross-entropy (GCE), that interpolates between MAE and CE with a single parameter q ∈ [0,1]. Xu et al. [2019] propose a new loss function (DMI), which is equal to the log-determinant of the confusion matrix between predicted and given labels, and show that it is robust to label-dependent noise. These loss functions are robust in the sense that the best performing hypothesis on clean data and noisy data are the same in the regime of infinite data. When training on finite datasets, training with these loss functions may result in memorization of training labels. Another line of research seeks to estimate label noise and correct the loss function [Sukhbaatar et al., 2014, Xiao et al., 2015, Goldberger and Ben-Reuven, 2017, Patrini et al., 2017, Hendrycks et al., 2018, Yao et al., 2019]. Some works use meta-learning to treat the problem of noisy/incomplete labels as a decision problem in which one determines the reliability of a sample [Jiang et al., 2018, Ren et al., 2018, Shu et al., 2019]. Others seek to detect incorrect examples and relabel them [Reed et al., 2014, Tanaka et al., 2018, Ma et al., 2018, Han et al., 2019, Arazo et al., 2019]. Han et al. [2018], Yu et al. [2019] employ an approach where two networks select training examples for each other using the small-loss trick. While our approach also has a teaching component, the network uses all samples instead of filtering. Li et al. [2019] propose a meta- learning approach that optimizes a classification loss along with a consistency loss between predictions of a mean teacher and predictions of the model after a single gradient descent step on a synthetically labeled mini-batch. Some approaches assume particular label noise models, while our approach assumes thatH(Y |X)>0, which may happen because of any type of label noise or attribute noise (e.g., corrupted images or partially observed inputs). Additionally, the techniques used to derive our approach can be adopted for regression or multilabel classification tasks. Furthermore, some methods require access to small clean validation data, which is not required in our approach. Informationinweightsandgeneralization. Defining and quantifying information in neural network weights is an open challenge and has been studied by multiple authors. One approach is to relate information in weights to their description length. 
A simple way of measuring description length was proposed by Hinton and van Camp [1993] and reduces to the L2 norm of weights. Another way to measure it is through the intrinsic dimension of an objective landscape [Li et al., 2018, Blier and Ollivier, 2018]. Li et al. [2018] observed that the description length of neural network weights grows when they are trained with noisy labels [Li et al., 2018], indicating memorization of labels. 21 Achille and Soatto [2018] define information in weights as the KL divergence from the posterior of weights to the prior. In a subsequent study they provide generalization bounds involving the KL divergence term [Achille et al., 2019]. Similar bounds were derived in the PAC-Bayesian setup and have been shown to be nonvacuous [Dziugaite and Roy, 2017]. With an appropriate selection of prior on weights, the above KL divergence becomes the Shannon mutual information between the weights and training dataset, I(W;S). Xu and Raginsky [2017] derive generalization bounds that involve this latter quantity. Pensia et al. [2018] upper boundI(W;S) when the training algorithm consists of iterative noisy updates. They use the chain-rule of mutual information as we did in Eq. (2.14) and bound information in updates by adding independent noise. It has been observed that adding noise to gradients can help to improve generalization in certain cases [Neelakantan et al., 2015]. Another approach restricts information in gradients by clipping them [Menon et al., 2020] . Achille and Soatto [2018] also introduce the termI(W;Y |X) and show the decomposition of the cross-entropy described in Eq. (2.2). Yin et al. [2020] consider a similar term in the context of meta-learning and use it as a regularization to prevent memorization of meta-testing labels. Given a meta-learning dataset M, they consider the information in the meta-weightsθ about the labels of meta-testing tasks given the inputs of meta-testing tasks,I(θ ;Y|X). They bound this information with a variational upper bound KL(q(θ |M)∥r(θ )) and use multivariate Gaussian distributions for both. For isotropic Gaussians with equal covariances, the KL divergence reduces to∥θ − θ 0 ∥ 2 2 , which was studied by Hu et al. [2020] as a regularization to achieve robustness to label noise. Note that this bounds not onlyI(θ ;Y|X) but also I(θ ;X,Y). In contrast, we bound onlyI(W;Y |X) and work with information in gradients. 2.6 Conclusion Several theoretical works have highlighted the importance of the information about the training data that is memorized in the weights. We distinguished two components of it and demonstrated that the conditional mutual information of weights and labels given inputs is closely related to memorization of labels and generalization performance. By bounding this quantity in terms of information in gradients, we were able to derive the first practical schemes for controlling label information in the weights and demonstrated that this outperforms approaches for learning with noisy labels. 22 Chapter3 EstimatingInformativenessofSampleswithSmoothUniqueInformation 3.1 Introduction Training a deep neural network (DNN) entails extracting information from samples in a dataset and storing it in the weights of the network, so that it may be used in future inference or prediction. But how much information does a particular sample contribute to the trained model? 
The answer can be used to provide strong generalization bounds (if no information is used, the network is not memorizing the sample), privacy bounds (how much information the network can leak about a particular sample), and enable better interpretation of the training process and its outcome. To determine the information content of samples, we need to define and compute information. In the classical sense, information is a property of random variables, which may be degenerate for the deterministic process of computing the output of a trained DNN in response to a given input (inference). So, even posing the problem presents some technical challenges. But beyond technicalities, how can we know whether a given sample is memorized by the network and, if it is, whether it is used for inference? We propose a notion of unique sample information that, while rooted in information theory, captures some aspects of stability theory and influence functions. Unlike most information-theoretic measures, our notion of information can be approximated efficiently for large networks, especially in the case of transfer learning, which encompasses many real-world applications of deep learning. Our definition can be applied to either “weight space” or “function space.” This allows us to study the non-trivial difference between information the weights possess (weight space) and the information the network actually uses to make predictions on new samples (function space). Our method yields a valid notion of information without relying on the randomness of the training algorithm (e.g., stochastic gradient descent, SGD), and works even for deterministic training algorithms. Our main work-horse is a first-order approximation of the network. This approximation is accurate when the network is pretrained [Mu et al., 2020] — as is common in practical applications — or is randomly initialized but very wide [Lee et al., 2019], and can be used to obtain a closed-form expression of the per-sample information. In addition, our method has better scaling with respect to the number of parameters than most other information measures, which makes it applicable to massively overparametrized models such as DNNs. Our information measure can be computed without actually training the network, making it amenable to use in problems like dataset summarization. We apply our method to remove a large portion of uninformative examples from a training set with minimum impact on the accuracy of the resulting model (dataset summarization). We also apply our method to detect mislabeled samples, which we show carry more unique information. To summarize, our contributions are (1) We introduce a notion of unique information that a sample con- tributes to the training of a DNN, both in weight space and in function space, and relate it with the stability of the training algorithm; (2) We provide an efficient method to compute unique information even for large 23 −0.5 0.0 0.5 1.0 1.5 2.0 −0.5 0.0 0.5 1.0 1.5 2.0 (a) Data 0.32 0.34 0.36 −0.51 −0.50 −0.49 −0.48 −0.47 −0.46 (b)Q W|S=s 0.32 0.34 0.36 −0.51 −0.50 −0.49 −0.48 −0.47 −0.46 (c)Q W|S=s− i 0.32 0.34 0.36 −0.51 −0.50 −0.49 −0.48 −0.47 −0.46 (d)P W|S− i=s− i Figure 3.1: A toy dataset and key distributions involved in upper bounding the unique sample information with leave-one-out KL divergence. networks using a linear approximation of the DNN, and without having to train a network; (3) We show ap- plications to dataset summarization and analysis. 
The implementation of the proposed method and the code for reproducing the experiments is available at https://github.com/awslabs/aws-cv-unique-information. Notation. In this chapter we consider a particular instance of a labeled training datasets=(z 1 ,...,z n ), where z i = (x i ,y i ), x i ∈ X, y i ∈ R k . We consider a neural network f w : X → R k with parameters w∈R d . Throughout this chapters − i ={z 1 ,...,z i− 1 ,z i+1 ,...,z n } denotes the training set without the i-th sample;f wt is often shortened tof t ; the concatenation of all training examples is denoted byx; the concatenation of all training labels byy∈R nk ; and the concatenation of all outputs byf w (x)∈R nk . The loss on thei-th example is denoted byℓ i (w) and is equal to 1 2 ∥f w (x i )− y i ∥ 2 2 , unless specified otherwise. This choice is useful when dealing with linearized models and is justified by Hui and Belkin [2021], who show that the mean-squared error (MSE) loss is as effective as cross-entropy for classification tasks. The training loss isL(w)= P n i=1 ℓ i (w)+ λ 2 ∥w− w 0 ∥ 2 2 , whereλ ≥ 0 is a weight decay regularization coefficient andw 0 is the weight initialization point. Note that the regularization term differs from standard weight decay∥w∥ 2 2 and is more appropriate for linearized neural networks, as it allows us to derive the dynamics analytically (see Appendix H.4). 3.2 Uniqueinformationofasampleintheweights We start with defining a notion of unique information in the weight space. Consider a (possibly stochastic) training algorithmQ W|S that, given a training datasets, returns weightsW ∼ Q W|S=s for the classifier f w . From an information-theoretic point of view, the amount of unique information a samplez i =(x i ,y i ) provides about the weights is given by the conditional point-wise mutual information: I(W;Z i =z i |S − i =s − i )=KL Q W|S=s ∥P W|S − i =s − i , (3.1) whereP W|S i is derived from the joint distributionP S,W induced byP S andQ W|S (i.e., denotes the average distribution of the weights over all possible samplings ofZ i and fixed S − i =s − i ). 3.2.1 Approximatinguniqueinformationwithleave-one-outKLdivergence Computing the conditional distributionP W|S − i =s − i is challenging because of the high-dimensionality and the cost of training for multiple replacements ofz i . One can address this problem by using the following upper bound. 24 Proposition3.2.1. LetP W,S be the joint distribution induced byP S andQ W|S . Assume that∀S =s,i∈ [n], P W|S − i =s − i ≪ Q W|S=s − i . Then∀i∈[n] KL Q W|Z i ,S − i =s − i ∥P W|S − i =s − i ≤ KL Q W|Z i ,S − i =s − i ∥Q W|S=s − i . (3.2) This proposition shows that the expectation (overZ i ) of the unique information of Eq. (3.1) can be upper bounded by the expectation of the following the following quantity: SI(z i ,Q)≜KL Q W|S=s ∥Q W|S=s − i , (3.3) which we call sample information ofz i w.r.t. algorithmQ. Figure 3.1 illustrates this approximation step for a toy 2D dataset. The dataset has two classes, each with 40 examples, generated from a Gaussian distribution (see Figure 3.1a). We consider training a linear regression on this dataset using stochastic gradient descent for 200 epochs, with batch size equal to 5 and 0.1 learning rate. We are interested in approximating the unique information of thei-th example (denoted with a green cross in Figure 3.1a). Figure 3.1b plots the distribution of regression weights after training on the entire dataset: Q W|S=s . 
Figure 3.1d plots the distribution of regression weights averaging out the effect of the i-th example: P W|S − i =s − i =E Z ′ ∼ P Z Q W|S − i =s − i ,Z i =Z ′ . We see that compared toQ W|S=s this distribution is more complex. Precisely for this reason we replace it with the distribution of regression weights after training on the dataset that excludes thei-th example:Q W|S=s − i , shown in Figure 3.1c. In this case we have thatI(W;Z i =z i |S − i =s − i )≈ 1.3, whileSI(z i ,Q)=KL Q W|S=s ∥Q W|S=s − i ≈ 3.0. 3.2.2 Smoothedsampleinformation The formulation above is valid in theory but, in practice, even SGD is used in a deterministic fashion by fixing the random seed and, in the end, we obtain just one set of weights rather than a distribution of them. Under these circumstances, all the above KL divergences are degenerate, as they evaluate to infinity. It is common to address the problem by assuming thatQ W|S is a continuous stochastic optimization algorithm, such as stochastic gradient Langevin dynamics (SGLD) or a continuous approximation of SGD which adds Gaussian noise to the gradients. However, this creates a disconnect with the practice, where such approaches do not perform at the state-of-the-art. Our definitions below aim to overcome this disconnect. Definition3.2.1 (Smooth sample information). LetQ W|S be a possibly stochastic algorithm. Following Eq. (3.3), we define the smooth sample information with smoothingΣ as: SI Σ (z i ,Q)=KL Q Σ W|S=s ∥Q Σ W|S=s − i , (3.4) whereQ Σ W|S denotes the distribution ofW +ξ withW ∼ Q W|S andξ ∼N (0,Σ) . Note that if the algorithmQ W|S is continuous, we can pickΣ → 0, which will makeSI Σ (z i ,Q)→ SI(z i ,Q). The following proposition shows how to compute the value ofSI Σ whenQ W|S is deterministic (the most common case in practice). Proposition3.2.2. LetQ W|S be a deterministic training algorithm (i.e., a distribution that puts all the mass on a pointW =A(S)). We have that SI Σ (z i ,Q)= 1 2 (w− w − i ) T Σ − 1 (w− w − i ), (3.5) 25 wherew =A(s) andw − i =A(s − i ) are the weights obtained by training respectively with and without the training samplez i . That is, the value ofSI Σ (z i ,Q) depends on the distance between the solutions obtained training with and without the samplez i , rescaled byΣ . Proposition 3.2.2 follows from the fact that the KL divergence between two Gaussian distributions with with meansµ 1 andµ 2 and equal covariance matricesΣ 1 = Σ 2 = Σ is equal to 1 2 (µ 1 − µ 2 ) T Σ − 1 (µ 1 − µ 2 ). The smoothing of the weights by a matrixΣ can be seen as a form of soft-discretization. Rather than simply using an isotropic discretization Σ = σ 2 I – since different filters have different norms and/or importance for the final output of the network – it makes sense to discretize them differently. In Sections 3.3 and 3.4 we show two canonical choices forΣ . One is the inverse of the Fisher information matrix, which discounts weights not used for classification, and the other is the covariance of the steady-state distribution of SGD, which respects the level of SGD noise and flatness of the loss. 3.3 Uniqueinformationinthepredictions SI Σ (z i ,Q) measures how much information an examplez i provides to the weights. Alternatively, instead of working in weight-space, we can approach the problem in function-space, and measure the informativeness of a training example for the network predictions. LetX ∼ P X be an independent test example and let Q b Y|S,X denote the distribution of the prediction on X after training on S. 
Following the reasoning in Section 3.2.1, we define functional sample information as F-SI(z i ,Q)=E P X KL Q b Y|S=s,X ∥Q b Y|S=s − i ,X . (3.6) Again, when training with a discrete algorithm and/or when the output of the network is deterministic, the above quantity may be infinite. Similar to smooth sample information, we define: Definition 3.3.1 (Smooth functional sample information). Let Q W|S be a possibly stochastic training algorithm and letQ b Y|S,X denote the distribution of the prediction onX after training onS. We define the smooth functional sample information (F-SI) as: F-SI σ (z i ,Q)=E P X KL Q σ b Y|S=s,X ∥Q σ b Y|S=s − i ,X , (3.7) whereQ σ b Y|S,X is the distribution of b Y +ξ with b Y ∼ Q b Y|S,X andξ ∼N (0,σ 2 I). The following proposition shows how to compute the value ofF-SI Σ whenQ W|S is deterministic. Proposition3.3.1. LetQ W|S be a deterministic training algorithm (i.e., a distribution that puts all the mass on a pointW = A(S)). Letw = A(s) andw − i = A(s − i ) be the weights obtained training respectively with and without samplez i . Then, F-SI σ (z i ,Q)= 1 2σ 2 E P X ∥f w (X)− f w − i (X)∥ 2 2 . (3.8) As Proposition 3.2.2, this proposition also follows directly from formula of KL divergence between two Gaussians. 26 By using first-order Taylor approximation of f w (x) with respect tow and assuming that∇ w f w (x)≈ ∇ w − i f w − i (x), we can approximate smooth functional sample information as follows: F-SI σ (z i ,A)= 1 2σ 2 E P X ∥f w (X)− f w − i (X)∥ 2 2 (3.9) ≈ 1 2 (w− w − i ) T E P X ∇ w f w (X)∇ w f w (X) T (w− w − i ) (3.10) = 1 2σ 2 (w− w − i ) T F(w)(w− w − i ), (3.11) withF(w)=E P X ∇ w f w (X)∇ w f w (X) T being the Fisher information matrix ofQ σ =1 b Y|S,X . By comparing Eq. (3.5) and Eq. (3.11), we see that the functional sample information is approximated by using the inverse of the Fisher information matrix to smooth the weight space. However, this smoothing is not isotropic as it depends on the pointw. 3.4 Exactsolutionforlinearizednetworks In this section, we derive a close-form expression forSI Σ andF-SI σ using a linear approximation of the network around the initial weights. We show that this approximation can be computed efficiently and, as we validate empirically in Section 3.5, correlates well with the actual informativeness values. We also show that the covariance matrix of SGD’s steady-state distribution is a canonical choice for the smoothing matrix Σ ofSI Σ . LinearizedNetwork. Linearized neural networks are a class of neural networks obtained by taking the first-order Taylor expansion of a DNN around the initial weights [Jacot et al., 2018, Lee et al., 2019]: f lin w (x)≜f w 0 (x)+∇ w f w (x) T | w=w 0 (w− w 0 ). (3.12) These networks are linear with respect to their parametersw, but can be highly non-linear with respect to their inputx. Continuous-time gradient descent dynamics of linearized neural networks is ˙ w t =− η ∇ f lin t (x) L T ∇ w f 0 (x), (3.13) whereη > 0 is the learning rate. One of the advantages of linearized neural networks is that the dynamics of continuous-time or discrete-time gradient descent can be written analytically if the loss function is the mean squared error (MSE). In particular, for the continuous-time gradient descent of Eq. 
(3.13), we have that [Lee et al., 2019]: w t =∇ w f 0 (x)Θ − 1 0 I− e − η Θ 0 t (f 0 (x)− y), (3.14) f lin t (x)=f 0 (x)+Θ 0 (x,x)Θ − 1 0 I− e − η Θ 0 t (y− f 0 (x)), (3.15) whereΘ 0 =∇ w f 0 (x) T ∇ w f 0 (x)∈R nk× nk is the Neural Tangent Kernel (NTK) [Jacot et al., 2018, Lee et al., 2019] andΘ 0 (x,x)=∇ w f 0 (x) T ∇ w f 0 (x). Note that the NTK matrix will generally be invertible for overparametrized neural networks (d≫ nk). Analogs of equations (3.14) and (3.15) for discrete time can be derived by replacinge − ηt Θ 0 with(I− η Θ 0 ) t . The expressions for networks trained with weight decay is essentially the same (see Appendix H.4). Namely, when weight decay of form λ 2 ∥w− w 0 ∥ 2 2 is added, the only change that happens to (3.14) and (3.15) is thatΘ 0 gets replaced by(Θ 0 +λI ). To keep the notation simple, we will usef w (x) to indicatef lin w (x) from now on. 27 StochasticGradientDescent. As mentioned in Section 3.2, a popular alternative approach to make information quantities well-defined is to use continuous-time SGD [Li et al., 2017, Mandt et al., 2017], which is defined as dw t =− η ∇ w L w (w t )dt+η r 1 b Λ( w t )dn(t), (3.16) whereη is the learning rate,b is the batch size,n(t) is a Brownian motion, andΛ( w t ) is the covariance matrix of the per-sample gradients (see Appendix H.1 for details). LetQ SGD be the algorithm that returns a random sample from the steady-state distribution of Eq. (3.16), and letQ ERM be the deterministic algorithm that returns the global minimumw ∗ of the lossL(w) (for a regularized linearized networkL(w) is strictly convex). We now show that the non-smooth sample informationSI(z i ,Q SGD ) is the same as the smooth sample information using SGD’s steady-state covariance as the smoothing matrix andQ ERM as the training algorithm. Proposition3.4.1. Let the loss function be regularized MSE,w ∗ be the global minimum of it, and algorithms Q SGD andQ ERM be defined as above. Assuming Λ( w) is approximately constant aroundw ∗ and SGD’s steady-state covariance remains constant after removing an example, we have SI(z i ,Q SGD )=SI Σ (z i ,Q ERM )= 1 2 (w ∗ − w ∗ − i ) T Σ − 1 (w ∗ − w ∗ − i ), (3.17) whereΣ is the solution of HΣ+Σ H T = η b Λ( w ∗ ), (3.18) withH =(∇ w f 0 (x)∇ w f 0 (x) T +λI ) being the Hessian of the loss function. This proposition motivates the use of SGD’s steady-state covariance as a smoothing matrix. From (3.17) and (3.18) we see that SGD’s steady-state covariance is proportional to the flatness of the loss at the minimum, the learning rate, and to SGD’s noise, while inversely proportional to the batch size. WhenH is positive definite, as in our case when using weight decay, the continuous Lyapunov equation (3.18) has a unique solution, which can be found in O(d 3 ) time using the Bartels-Stewart algorithm [Bartels and Stewart, 1972]. One particular case when the solution can be found analytically is whenΛ( w ∗ ) andH commute, in which caseΣ = η 2b Λ H − 1 . For example, this is the case for Langevin dynamics, for which Λ( w)=σ 2 I in Eq. (3.16). In this case, we have SI(z i ,Q SGD )=SI Σ (z i ,Q ERM )= b ησ 2 (w ∗ − w ∗ − i ) T H(w ∗ − w ∗ − i ), (3.19) which was already suggested by Cook [1977] as a way to measure the importance of a datum in linear regression. FunctionalSampleInformation. The definition in Section 3.3 simplifies for linearized neural networks: The step from Eq. (3.8) to Eq. (3.11) becomes exact, and the Fisher information matrix becomes independent ofw and equal toF =E P X ∇ w f 0 (X)∇ w f 0 (X) T . 
This shows that functional sample information can be seen as weight sample information with smoothing covariance Σ = F − 1 . The functional sample 28 information depends on the data distributionP X , which is usually unknown. We can estimateF-SI using a test set: F-SI σ (z i ,Q)≈ 1 2σ 2 n test ntest X j=1 f w (x test j )− f w − i (x test j ) (3.20) = 1 2σ 2 n test (w− w − i ) T ∇f 0 (x test )∇f 0 (x test ) T (w− w − i ) (3.21) = 1 2σ 2 n test (w− w − i ) T (H test − λI )(w− w − i ). (3.22) It is instructive to compare the sample weight information of Eq. (3.19) and functional sample information of Eq. (3.22). Besides the constants, the former uses the Hessian of the training loss, while the latter uses the Hessian of the test loss (without theℓ 2 regularization term). One advantage of the latter is computational cost: As demonstrated in the next section, we can use Eq. (3.20) to compute the prediction information, entirely in the function space, without any costly operation on weights. For this reason, we focus on the linearized F-SI approximation in our experiments. Sinceσ − 2 is just a multiplicative factor in Eq. (3.22) we setσ =1. We also focus on the case where the training algorithmA is discrete gradient descent running fort epochs ((3.14) and (3.15)). EfficientImplementation. To compute the proposed sample information measures for linearized neural networks, we need to compute the change in weightsw− w − i (or change in predictionsf w (x)− f w i (x)) after excluding an example from the training set. This can be done without retraining using the analytical expressions of weight and prediction dynamics of linearized neural networks (3.14) and (3.15), which also work when the algorithm has not yet converged (t<∞). We now describe a series of measures to make the problem tractable. First, to compute the NTK matrix we would need to store the Jacobian∇f 0 (x i ) of all training points and compute∇ w f 0 (x) T ∇ w f 0 (x). This is prohibitively slow and memory consuming for large DNNs. Instead, similarly to Zancato et al. [2020], we use low-dimensional random projections of per-example Jacobians to obtain provably good approximations of dot products [Achlioptas, 2003, Li et al., 2006]. We found that just taking2000 random weights coordinates per layer provides a good enough approximation of the NTK matrix. Importantly, we consider each layer separately, as different layers may have different gradient magnitudes. With this method, computing the NTK matrix takes O(nkd+n 2 k 2 d 0 ) time, whered 0 ≈ 10 4 is the number of subsampled weight indices (d 0 ≪ d). We also need to recompute Θ − 1 0 after removing an example from the training set. This can be done in quadratic time by using rank-one updates of the inverse (see Appendix H.3). Finally, when t ̸= ∞ we need to recompute e − η Θ 0 t after removing an example. This can be done in O(n 2 k 2 ) time by downdating the eigendecomposition of Θ 0 [Gu and Eisenstat, 1995]. Overall, the complexity of computingw− w i for all training examples is O(n 2 k 2 d 0 +n(n 2 k 2 +C)),C is the complexity of a single pass over the training dataset. The complexity of computing functional sample information form test samples isO(C +nmk 2 d 0 +n(mnk 2 +n 2 k 2 )). This depends on the network size lightly, only throughC. 3.5 Experiments In this section, we test the validity of linearized network approximation in terms of estimating the effects of removing an example and show several applications of the proposed information measures. 
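As a concrete reference for the quantities evaluated below, a naive numpy sketch of the linearized estimate of F-SI (Eq. (3.20)) is given here for a scalar-output model (k = 1) trained with continuous-time gradient descent and sigma = 1. It recomputes each leave-one-out solution from scratch rather than using the rank-one and eigendecomposition updates described above, and the names jac_train, jac_test, f0_train, and y, as well as the default values of lam, lr, and t, are placeholders rather than objects defined in the text.

import numpy as np
from scipy.linalg import expm

def linearized_delta_w(jac_train, f0_train, y, lam, lr, t):
    """w_t - w_0 under continuous-time GD on the linearized model (Eq. (3.14),
    with Theta_0 replaced by Theta_0 + lam*I to account for the weight decay term)."""
    n = jac_train.shape[0]
    theta = jac_train @ jac_train.T + lam * np.eye(n)     # regularized NTK matrix
    decay = np.eye(n) - expm(-lr * t * theta)
    return jac_train.T @ np.linalg.solve(theta, decay @ (y - f0_train))

def functional_sample_information(jac_train, f0_train, y, jac_test,
                                  lam=1e-3, lr=1e-3, t=1000):
    """Naive leave-one-out estimate of F-SI (Eq. (3.20)) with sigma = 1 and k = 1."""
    n = jac_train.shape[0]
    dw_full = linearized_delta_w(jac_train, f0_train, y, lam, lr, t)
    fsi = np.zeros(n)
    for i in range(n):
        keep = np.delete(np.arange(n), i)                 # "retrain" analytically without example i
        dw_loo = linearized_delta_w(jac_train[keep], f0_train[keep], y[keep], lam, lr, t)
        diff = jac_test @ (dw_full - dw_loo)              # change in test predictions
        fsi[i] = 0.5 * np.mean(diff ** 2)
    return fsi

As t grows, the matrix exponential term vanishes and the weight change reduces to the ridge-regression solution of the linearized model.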
3.5 Experiments

In this section, we test the validity of the linearized network approximation in terms of estimating the effects of removing an example, and show several applications of the proposed information measures.

Table 3.1: Pearson correlations of the weight change ‖w − w_{−i}‖_2^2 and test prediction change ‖f_w(x^test) − f_{w_{−i}}(x^test)‖_2^2 norms, computed with influence functions and linearized neural networks, with the corresponding measures computed for standard neural networks with retraining.

Quantity      Reg.       Method            MNIST MLP   MNIST CNN   MNIST CNN     Cats and Dogs    Cats and Dogs
                                           (scratch)   (scratch)   (pretrained)  (pr. ResNet-18)  (pr. ResNet-50)
weights       λ = 0      Linearization     0.987       0.193       0.870         0.895            0.968
                         Infl. functions   0.935       0.319       0.736         0.675            0.897
              λ = 10^3   Linearization     0.977       -0.012      0.964         0.940            0.816
                         Infl. functions   0.978       0.069       0.979         0.858            0.912
predictions   λ = 0      Linearization     0.993       0.033       0.875         0.877            0.895
                         Infl. functions   0.920       0.647       0.770         0.530            0.715
              λ = 10^3   Linearization     0.993       0.070       0.974         0.931            0.519
                         Infl. functions   0.990       0.407       0.954         0.753            0.506

Table 3.2: Details of the experiments presented in Table 3.1. For influence functions, we add a damping term with magnitude 0.01 whenever ℓ2 regularization is not used (λ = 0).

Experiment                Method            Details
MNIST MLP (scratch)       Brute force       2000 epochs, learning rate 0.001, batch size 1000
                          Infl. functions   LiSSA algorithm, 1000 recursion steps, scale = 1000
                          Linearization     t = 2000, learning rate 0.001
MNIST CNN (scratch)       Brute force       1000 epochs, learning rate 0.01, batch size 1000
                          Infl. functions   LiSSA algorithm, 1000 recursion steps, scale = 1000
                          Linearization     t = 1000, learning rate 0.01
MNIST CNN (pretrained)    Brute force       1000 epochs, learning rate 0.002, batch size 1000
                          Infl. functions   LiSSA algorithm, 1000 recursion steps, scale = 1000
                          Linearization     t = 1000, learning rate 0.002
Cats and dogs             Brute force       500 epochs, learning rate 0.001, batch size 500
                          Infl. functions   LiSSA algorithm, 50 recursion steps, scale = 1000
                          Linearization     t = 1000, learning rate 0.001

3.5.1 Accuracy of the linearized network approximation

We measure ‖w − w_{−i}‖_2^2 and E‖f_w(X^test) − f_{w_{−i}}(X^test)‖_2^2 for each sample z_i by training with and without that example. Then, instead of retraining, we use the efficient linearized approximation to estimate the same quantities and measure their correlation with the ground-truth values (Table 3.1). For comparison, we also estimate these quantities using influence functions [Koh and Liang, 2017]. We consider two classification tasks: (a) a toy MNIST 4 vs 9 classification task and (b) the Kaggle cats vs dogs classification task [Kaggle, 2013], both with 1000 training examples. For MNIST we consider a fully connected network with a single hidden layer of 1024 ReLU units (MLP) and a small 5-layer convolutional neural network (the same as in Table 2.1 but with one output unit), either trained from scratch or pretrained on EMNIST letters [Cohen et al., 2017]. For cats vs dogs classification, we consider ResNet-18 and ResNet-50 networks [He et al., 2016] pretrained on ImageNet. In both tasks, we train both with and without weight decay (ℓ2 regularization). In all our experiments, when using pretrained ResNets, we disable the exponential averaging of batch statistics in batch norm layers. For computing functional sample information we use test sets consisting of 1000 examples for both MNIST and cats vs dogs. The exact details of running influence functions and linearized neural network predictions are presented in Table 3.2.
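As an illustration of this evaluation protocol, a brute-force version of the ground-truth leave-one-out computation and the correlation measurement could look like the following sketch; closed-form ridge regression is used here only as a stand-in for the actual networks, and all data are random placeholders.

import numpy as np
from scipy.stats import pearsonr

def train_ridge(X, y, lam=1e-3):
    """Stand-in 'training algorithm': closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_changes(X, y, X_test):
    """Leave-one-out retraining: squared weight change and mean squared
    test-prediction change for every training example."""
    w_full = train_ridge(X, y)
    preds_full = X_test @ w_full
    weight_changes, pred_changes = [], []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w_loo = train_ridge(X[keep], y[keep])
        weight_changes.append(np.sum((w_full - w_loo) ** 2))
        pred_changes.append(np.mean((preds_full - X_test @ w_loo) ** 2))
    return np.array(weight_changes), np.array(pred_changes)

# Table 3.1 reports Pearson correlations between such brute-force values and the cheap
# estimates (linearization or influence functions); a noisy copy stands in for an estimate here.
rng = np.random.default_rng(0)
X, y, X_test = rng.normal(size=(50, 10)), rng.normal(size=50), rng.normal(size=(20, 10))
gt_w, gt_p = loo_changes(X, y, X_test)
noisy_estimate = gt_w + 0.1 * rng.normal(size=gt_w.shape)
print(pearsonr(gt_w, noisy_estimate)[0])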
Figure 3.2: Functional sample information of samples in the MNIST 4 vs 9 (top), Kaggle cats vs dogs (middle), and iCassava (bottom) classification tasks. A: histogram of sample informations; B: 10 least informative samples; C: 10 most informative samples.

The results in Table 3.1 show that the linearized approximation correlates well with the ground truth when the network is wide enough (MLP) and/or pretraining is used (CNN with pretraining and pretrained ResNets). This is expected, as wider networks can be better approximated by linearized ones [Lee et al., 2019], and pretraining decreases the distance from initialization, making the Taylor approximation more accurate. Adding regularization also keeps the solution close to initialization, and generally increases the accuracy of the approximation. Furthermore, in most cases linearization gives better results compared to influence functions, while also being around 30 times faster in our settings.

3.5.2 Which examples are informative?

To understand which examples are informative, we start by analyzing the top 10 least and most informative examples in the MNIST 4 vs 9, Kaggle cats vs dogs, and iCassava plant disease classification [Mwebaze et al., 2019] tasks. For MNIST 4 vs 9, we use the MLP of the previous subsection, setting t = 2000 and η = 0.001. For cats vs dogs classification we use ResNet-18 and set t = 1000, η = 0.001. Finally, for iCassava we subsample a training set of 1000 examples, use an ImageNet-pretrained ResNet-18, and set t = 5000, η = 0.0003.

Figure 3.3: Comparison of functional sample information of examples with correct and incorrect labels. (a) Kaggle cats vs dogs; (b) iCassava.

The results presented in Figure 3.2 indicate that in all datasets the majority of the examples are not highly informative. Furthermore, especially in the cases of MNIST 4 vs 9 and Kaggle cats vs dogs, we see that the least informative samples look typical and easy, while the most informative ones look more challenging and atypical. In the case of iCassava, the informative samples are more zoomed in on features that are important for classification (e.g., the plant disease spots). We observe that most samples have small unique information, possibly because they are easier or because the dataset may have many similar-looking examples.
While in the cases of MNIST 4 vs 9 and cats vs dogs the two classes have on average similar information scores, in Figure 3.2 (bottom) we see that in iCassava examples from rare classes (such as 'healthy' and 'cbb') are on average more informative. For all tasks, it is true that most of the examples are relatively uninformative.

Mislabeled examples. We expect a mislabeled example to carry more unique information, since the network needs to memorize unique features of that particular example to classify it. To test this, we add 10% uniform label noise to the Kaggle cats vs dogs and iCassava classification tasks (both with 1000 examples in total), while keeping the test sets clean. For both datasets we use a ResNet-18 pretrained on ImageNet. We set t = 1000, η = 0.001 for cats vs dogs classification and t = 10000, η = 0.0001 for the iCassava task. Figure 3.3 presents the histograms of functional sample information for both correct and mislabeled examples for cats vs dogs and iCassava, respectively. The results indicate that mislabeled examples are indeed much more informative on average. This suggests that the proposed informativeness measure can be used to detect outliers or corrupted examples.

Harder examples. After examining the top 10 most and least informative examples in Figure 3.2, we hypothesize that "more challenging" examples are more informative. To test this hypothesis, we train a 10-way digit classifier on a dataset consisting of 500 randomly selected examples from MNIST and 500 randomly selected examples from SVHN. Examples from SVHN are colored and have more variation; therefore, for the purpose of this experiment we consider them harder examples. We use an ImageNet-pretrained ResNet-18 and set t = 20000, η = 0.0001. We compute functional sample information of the training examples with respect to a test set consisting of 500 MNIST and 500 SVHN unseen examples. The results presented in Figure 3.4 demonstrate that, as expected, SVHN examples are more informative than MNIST examples.

Figure 3.4: Comparison of functional sample information of MNIST and SVHN examples in the context of a joint digit classification task with an equal number of examples per dataset.

Figure 3.5: Histogram of the functional sample information of samples of the Kaggle cats vs dogs classification task, where 10% of examples are adversarial.

Adversarial examples. It is also natural to expect that training examples closer to the decision boundary of a neural network have probably been informative in the training of that same neural network. Perhaps such examples can also be informative in the training of other neural networks. To test this hypothesis, we consider creating adversarial examples [Szegedy et al., 2014] for one neural network and measuring their informativeness with respect to another neural network. In particular, we fine-tune a pretrained ResNet-18 on 1000 examples from the cats vs dogs dataset. We then use the FGSM method [Goodfellow et al., 2015] with ϵ = 0.01 to create successful adversarial examples. Next, for 10% of the examples, we replace the original images with the corresponding adversarial ones (keeping the original correct label), and fine-tune a new pretrained ResNet-18. Finally, we compute the functional sample information of all examples of the modified training set, setting t = 1000 and η = 0.001.
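For reference, the FGSM perturbation used above can be written in a few lines. The sketch below is a generic PyTorch version under the assumption of a binary classifier with a single logit output; model, images, labels, and eps = 0.01 are placeholders rather than the exact training code of this chapter.

import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps=0.01):
    """One-step FGSM (Goodfellow et al., 2015): perturb inputs in the direction of the
    sign of the loss gradient, then clip back to the valid pixel range [0, 1]."""
    images = images.clone().detach().requires_grad_(True)
    logits = model(images).squeeze(1)                 # assumes a single-logit binary classifier
    loss = F.binary_cross_entropy_with_logits(logits, labels.float())
    loss.backward()
    adv_images = images + eps * images.grad.sign()    # move against the correct label
    return adv_images.clamp(0.0, 1.0).detach()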
The results reported in Figure 3.5 confirm that adversarial examples are on average more informative. This partly explains the findings of Goodfellow et al. [2015], who show that adversarial training (i.e., adding adversarial examples to the training dataset) improves adversarial robustness and generalization.

Underrepresented subpopulations. The histogram of functional sample information in the case of iCassava (Figure 3.2, bottom) suggests that examples of underrepresented classes or subpopulations might be more informative on average. However, we cannot conclude this based on the results of Figure 3.2, as there can be confounding factors affecting both representation and informativeness. Therefore, using CIFAR-10 images, we create a dataset for "pets vs deer" classification, where the class "pets" consists of two subpopulations: cats (200 examples) and dogs (4000 examples), while the class "deer" consists of 4000 deer examples. In this dataset, cats are underrepresented, but there is no other significant difference between the cat and dog subpopulations. Since there are relatively few cat images, we expect each to carry more unique information. This is confirmed when we compute F-SI for a pretrained ResNet-18 with t = 10000 and η = 0.0001 (see Figure 3.6). This result also suggests that analyzing F-SI can help to detect underrepresented subpopulations.

3.5.3 Data summarization

As we saw in Figure 3.2, the majority of examples in the considered datasets have low sample information. This hints that it might be possible to remove a significant fraction of examples from these datasets without degrading the performance of networks trained on them. To test whether such data summarization is possible, given a training dataset, we remove a fraction of its least informative examples and measure the test performance of a network trained on the remaining examples. We expect that removing the least informative training samples should not affect the performance of the model. Note however that, since we are considering the unique information, removing one example can increase the informativeness of another. For this reason, we consider two strategies (sketched below): in one we compute the informativeness scores once and remove a given percentage of the least informative samples; in the other we remove 5% of the least informative samples, recompute the scores, and iterate until we remove the desired number of samples. For comparison, we also consider removing the most informative examples (we call this the "top" baseline) and randomly selected examples (we call this the "random" baseline).

Figure 3.6: Histogram of the functional sample information of examples of the 3 subpopulations of the pets vs deer dataset. Since cats are underrepresented, cat images tend to be more informative on average compared to dog images.

Figure 3.7: Dataset summarization without training. Test accuracy as a function of the ratio of removed training examples for different strategies.

The results on the MNIST 4 vs 9 classification task with the one-hidden-layer network described earlier, t = 2000, and η = 0.001, are shown in Figure 3.7.
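A minimal sketch of the two removal strategies described above; the dummy scoring function is a stand-in for the linearized F-SI computation.

import numpy as np

def summarize_once(scores, remove_ratio):
    """Remove the remove_ratio least informative examples using scores computed once."""
    n_remove = int(len(scores) * remove_ratio)
    order = np.argsort(scores)                      # ascending: least informative first
    return np.sort(order[n_remove:])                # indices of the kept examples

def summarize_iterative(indices, score_fn, remove_ratio, step=0.05):
    """Repeatedly drop 5% of the currently least informative examples and recompute scores."""
    target = int(len(indices) * (1.0 - remove_ratio))
    kept = np.array(indices)
    while len(kept) > target:
        scores = score_fn(kept)                     # recompute F-SI on the remaining examples
        n_drop = min(max(1, int(len(kept) * step)), len(kept) - target)
        kept = kept[np.argsort(scores)[n_drop:]]
    return np.sort(kept)

# Toy usage with a random scorer; in the experiments score_fn would be the linearized F-SI.
rng = np.random.default_rng(0)
print(len(summarize_once(rng.random(1000), remove_ratio=0.8)))
print(len(summarize_iterative(np.arange(1000), lambda idx: rng.random(len(idx)), remove_ratio=0.8)))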
Indeed, removing the least informative training samples has little effect on the test error, while removing the top examples has the most impact. Moreover, recomputing the information scores after each removal step ("bottom iterative") greatly improves the performance when many samples are removed, confirming that SI and F-SI are good practical measures of the unique information of an example, but also that the total information of a group of examples is not simply the sum of the unique information of its members. Interestingly, removing more than 80% of the least informative examples degrades the performance more than removing the same number of the most informative examples. In the former case, we are left with a small number of hard, atypical, or mislabeled examples, while in the latter case we are left with the same number of easy and typical examples. Consequently, the performance is better in the latter case. In other words, when learning a classifier with just 200 examples, it is better to have these examples be typical.

3.5.4 How much does sample information depend on the algorithm?

The proposed information measures depend on the training algorithm, which includes the architecture, random seed, initialization, and training length. This is unavoidable, as one example can be more informative for one algorithm and less informative for another. Nevertheless, in this subsection we test how much informativeness depends on the network, initialization, and training time. We consider the Kaggle cats vs dogs classification task with 1000 training examples. First, fixing the training time t = 1000, we consider four pretrained architectures: ResNet-18, ResNet-34, ResNet-50, and DenseNet-121. The correlations between the F-SI scores computed for these four architectures are presented in Figure 3.8a. We see that F-SI scores computed for two completely different architectures, such as ResNet-50 and DenseNet-121, have a significant correlation, around 45%. Furthermore, there is a significant overlap in the top 10 most informative examples for these networks (see Figure 3.9). Next, fixing the network to be a pretrained ResNet-18 and fixing the training length t = 1000, we consider changing the initialization of the classification head (which is not pretrained). In this case the correlation between F-SI scores is 0.364 ± 0.066.

Figure 3.8: Correlations between functional sample information scores computed for different architectures and training lengths. On the left (a, varying architecture): correlations between the F-SI scores of the 4 pretrained networks, all computed with t = 1000 and η = 0.001. On the right (b, varying training length): correlations between F-SI scores computed for pretrained ResNet-18s, with learning rate η = 0.001 but varying training lengths t ∈ {250, 500, 1000, 2000, 4000, ∞}. All reported correlations are averages over 10 different runs. The training dataset consists of 1000 examples from the Kaggle cats vs dogs classification task.
Finally, fixing the network to be a ResNet-18 and fixing the initialization of the classification head, we consider changing the number of iterations in the training. We find strong correlations between the F-SI scores of different training lengths (see Figure 3.8b).

MNIST vs SVHN experiment for DenseNet-121. We also redo the joint MNIST and SVHN classification experiment (Figure 3.4) with a different network, a pretrained DenseNet-121, to test the dependence of the results on the architecture choice. The results are presented in Figure 3.10a and are qualitatively identical to those with a pretrained ResNet-18 (Figure 3.4).

Data summarization with a change of architecture. To test how useful sample information scores computed for one network are for another network, we reconsider the MNIST 4 vs 9 data summarization experiment (Figure 3.7). This time we compute F-SI scores for the original network with one hidden layer, but train a two-hidden-layer neural network (both layers having 1024 ReLU units). The data summarization results presented in Figure 3.10b are qualitatively and quantitatively almost identical to the original results presented in Figure 3.7. This confirms that F-SI scores computed for one network can be useful for another network.

Figure 3.9: Top 10 most informative examples from the Kaggle cats vs dogs classification task for 3 pretrained networks: (a) ResNet-18, (b) ResNet-50, and (c) DenseNet-121.

3.6 Discussion and future work

The smooth (functional) sample information depends not only on the example itself, but also on the network architecture, the initialization, and the training procedure (i.e., the training algorithm). This has to be the case, since an example can be informative with respect to one algorithm or architecture, but not informative for another one. Similarly, some examples may be more informative at the beginning of training (e.g., simpler examples) rather than at the end. Nevertheless, the results presented in Section 3.5.4 indicate that F-SI still captures something inherent in the example. This suggests that F-SI computed with respect to one network can reveal useful information for another one.

The proposed sample information measures only the unique information provided by an example. For this reason, it is not surprising that typical examples are usually the least informative, while atypical and rare ones are more informative. While the typical examples are usually less informative according to the proposed measures, they still provide information about the decision function, which is evident in the data summarization experiment – removing lots of typical examples was worse than removing the same number of random examples. Generalizing sample information to capture this kind of contribution is an interesting direction for future work. Similarly to Data Shapley [Ghorbani and Zou, 2019], one can look at the average unique information an example provides when considered along with a random subset of data. One can also consider common information, higher-order information, synergistic information, and other notions of information between samples. The relation among these quantities is complex in general, even for 3 variables, and is an open challenge [Williams and Beer, 2010].
Figure 3.10: Testing how much F-SI scores computed for different networks differ qualitatively. On the left: the MNIST vs SVHN experiment with a pretrained DenseNet-121 instead of a pretrained ResNet-18. On the right: data summarization for the MNIST 4 vs 9 classification task, where the F-SI scores are computed for a one-hidden-layer network, but a two-hidden-layer network is trained to produce the test accuracies.

3.7 Related work

Our work is related to information-theoretic stability notions [Bassily et al., 2016, Raginsky et al., 2016, Feldman and Steinke, 2018] that seek to measure the influence of a sample on the output and to measure generalization. Raginsky et al. [2016] define information stability as E_S[(1/n) Σ_{i=1}^n I(W; Z_i | S_{−i})], the expected average amount of unique (Shannon) information that the weights have about an example. This, without the expectation over S, is also our starting point (Eq. (3.1)). Bassily et al. [2016] define KL-stability as sup_{s,s'} KL(Q_{W|S=s} ‖ Q_{W|S=s'}), where s and s' are datasets that differ by one example, while Feldman and Steinke [2018] define average leave-one-out KL stability as sup_s (1/n) Σ_{i=1}^n KL(Q_{W|S=s} ‖ Q_{W|S=s_{−i}}). The latter closely resembles our definition (Eq. (3.4)). Unfortunately, while the weights are continuous, the optimization algorithm (such as SGD) is usually discrete. This generally makes the resulting quantities degenerate (infinite). Most works address this issue by replacing the discrete optimization algorithm with a continuous one, such as stochastic gradient Langevin dynamics [Welling and Teh, 2011] or continuous stochastic differential equations that approximate SGD in the limit [Li et al., 2017]. We aim to avoid such assumptions and give a definition that is directly applicable to real networks trained with standard algorithms. To do this, we apply a smoothing procedure to a standard discrete algorithm. The final result can still be interpreted as a valid bound on Shannon mutual information, but for a slightly modified optimization algorithm.

Our definitions relate the informativeness of a sample to the notion of algorithmic stability [Bousquet and Elisseeff, 2002, Hardt et al., 2016], where a training algorithm is called stable if its outputs on datasets differing by only one example are close to each other. To ensure our quantities are well-defined, we apply a smoothing technique which is reminiscent of a soft discretization of the weight space. In Section 3.3, we show that a canonical discretization is obtained using the Fisher information matrix, which relates to classical results of Rissanen [1996] on optimal coding length. It also relates to the use of a post-distribution by Achille et al. [2019], who however use it to estimate the total amount of information in the weights of a network. We use a first-order approximation (linearization) inspired by the Neural Tangent Kernel (NTK) [Jacot et al., 2018, Lee et al., 2019] to efficiently estimate the informativeness of a sample. While NTK theory predicts that, in the limit of an infinitely wide network, the linearized model is an accurate approximation, we do not observe this on more realistic architectures and datasets.
However, we show that, when using pretrained networks, as is common in practice, linearization yields an accurate approximation, similarly to what is observed by Mu et al. [2020]. Shwartz-Ziv and Alemi [2020] study the total information contained in an ensemble of randomly initialized linearized networks. They notice that, while considering ensembles makes the mutual information finite, it still diverges to infinity as training time goes to infinity. On the other hand, we consider the unique information about a single example, without the need for ensembles, by considering smoothed information, which remains bounded for any time. Other complementary works study how information about an input sample propagates through the network [Shwartz-Ziv and Tishby, 2017, Achille and Soatto, 2018, Saxe et al., 2019] or the total amount of information (complexity) of a classification dataset [Lorena et al., 2019], rather than how much information the sample itself contains.

In terms of applications, our work is related to works that estimate the influence of an example [Koh and Liang, 2017, Toneva et al., 2019, Katharopoulos and Fleuret, 2018, Ghorbani and Zou, 2019, Yoon et al., 2020]. This can be done by estimating the change in weights if a sample is removed from the training set, which is addressed by several works [Koh and Liang, 2017, Golatkar et al., 2020, Wu et al., 2020]. Influence functions [Cook, 1977, Koh and Liang, 2017] model removal of a sample as reducing its weight infinitesimally in the loss function, and show an efficient first-order approximation of its effect on other measures (such as test-time predictions). We found influence functions to be prohibitively slow for the networks and data regimes we consider. Basu et al. [2021] found that influence functions are not accurate for large DNNs. Additionally, influence functions assume that the training has converged, which does not hold when early stopping is used. We instead use linearization of neural networks to estimate the effect of removing an example efficiently. We find that this approximation is accurate in realistic settings, and that its computational cost scales better with network size, making it applicable to very large neural networks.

Our work is orthogonal to that of feature selection: while we aim to evaluate the informativeness of a subset of training samples for the final weights, feature selection aims to quantify the informativeness of a subset of features for the task variable. However, they share some high-level similarities. In particular, Kohavi et al. [1997] propose the notion of a strongly-relevant feature as one that changes the discriminative distribution when it is excluded. This notion is similar to the notion of unique sample information in Eq. (3.1).

3.8 Conclusion

There are many notions of information that are relevant to understanding the inner workings of neural networks. Recent efforts have focused on defining information in the weights or activations that does not degenerate for deterministic training. We look at the information in the training data, which ultimately affects both the weights and the activations. In particular, we focus on the most elementary case, which is the unique information contained in an example, because it can be the foundation for understanding more complex notions. However, our approach can be readily generalized to the unique information of a group of samples.
Unlike most previously introduced information measures, ours is tractable even for real datasets used to train standard network architectures, and does not require restriction to limiting cases. In particular, we can approximate our quantities without requiring the limit of small learning rate (continuous training time) or the limit of infinite network width.

Chapter 4
Information-theoretic generalization bounds for black-box learning algorithms

4.1 Introduction

Large neural networks trained with variants of stochastic gradient descent have excellent generalization capabilities, even in regimes where the number of parameters is much larger than the number of training examples. Zhang et al. [2017] showed that classical generalization bounds based on various notions of complexity of the hypothesis set fail to explain this phenomenon, as the same neural network can generalize well for one choice of training data and memorize completely for another. This observation has spurred a tenacious search for algorithm-dependent and data-dependent generalization bounds that give meaningful results in practical settings for deep learning [Jiang* et al., 2020, Dziugaite et al., 2020].

One line of attack bounds the generalization error based on the information about the training dataset stored in the weights [Xu and Raginsky, 2017, Bassily et al., 2018, Negrea et al., 2019, Bu et al., 2020, Steinke and Zakynthinou, 2020, Haghifam et al., 2020, Neu et al., 2021, Raginsky et al., 2021]. The main idea is that when the training and testing performance of a neural network are different, the network weights necessarily capture some information about the training dataset. However, the opposite might not be true: a neural network can store significant portions of the training set in its weights and still generalize well [Shokri et al., 2017, Yeom et al., 2018, Nasr et al., 2019]. Furthermore, because of their information-theoretic nature, these generalization bounds become infinite or trivial for deterministic algorithms. When such bounds are not infinite, they are notoriously hard to estimate, due to the challenges arising in the estimation of Shannon mutual information between two high-dimensional variables (e.g., the weights of a ResNet and a training dataset).

This work addresses the aforementioned challenges. We first improve some of the existing information-theoretic generalization bounds, providing a unified view and derivation of them (Section 4.2). We then derive novel generalization bounds that measure information with predictions, rather than with the output of the training algorithm (Section 4.3). These bounds are applicable to a wide range of methods, including neural networks, Bayesian algorithms, ensembling algorithms, and non-parametric approaches. In the case of neural networks, the proposed bounds improve over the existing weight-based bounds, partly because they avoid a counter-productive property of weight-based bounds: information stored in unused weights affects those bounds, even though it has no effect on generalization. The proposed bounds produce meaningful results for deterministic algorithms and are significantly easier to estimate. For example, in the case of classification, computing our most efficient bound involves estimating mutual information between a pair of predictions and a binary variable. We apply the proposed bounds to ensembling algorithms, binary classification algorithms with finite VC dimension hypothesis classes, and stable learning algorithms (Section 4.4).
We compute our most efficient bound on realistic classification problems involving neural networks, and show that the bound closely follows the generalization error, even in situations when a neural network with 3M parameters is trained deterministically on 4000 examples, achieving 1% generalization error.

4.2 Weight-based generalization bounds

We start by describing the necessary notation and definitions, after which we present some of the existing weight-based information-theoretic generalization bounds, slightly improve some of them, and prove relations between them. The purpose of this section is to introduce the relevant existing bounds and prepare the ground for the functional conditional mutual information bounds introduced in Section 4.3, which we consider our main contribution. The theorems proved in the subsequent sections rely on the following lemma.

Lemma 4.2.1. Let (Φ, Ψ) be a pair of random variables with joint distribution P_{Φ,Ψ}. If g(ϕ, ψ) is a measurable function such that E_{P_{Φ,Ψ}}[g(Φ,Ψ)] exists and g(Φ,Ψ) is σ-subgaussian under P_Φ ⊗ P_Ψ, then

|E_{P_{Φ,Ψ}}[g(Φ,Ψ)] − E_{P_Φ ⊗ P_Ψ}[g(Φ,Ψ)]| ≤ sqrt(2σ^2 I(Φ; Ψ)).   (4.1)

Furthermore, if g(ϕ, Ψ) is σ-subgaussian for each ϕ, then

E_{P_{Φ,Ψ}}[(g(Φ,Ψ) − E_{Ψ'∼P_Ψ} g(Φ,Ψ'))^2] ≤ 4σ^2 (I(Φ; Ψ) + log 3),   (4.2)

and

P_{Φ,Ψ}(|g(Φ,Ψ) − E_{Ψ'∼P_Ψ} g(Φ,Ψ')| ≥ ϵ) ≤ 4σ^2 (I(Φ; Ψ) + log 3) / ϵ^2,  ∀ϵ > 0,   (4.3)

provided that the expectation in Eq. (4.2) exists.

The first part of this lemma is equivalent to Lemma 1 of Xu and Raginsky [2017], which in turn has its roots in [Russo and Zou, 2019]. The second part generalizes Lemma 2 of Hafez-Kolahi et al. [2020] by also providing a bound on the expected squared difference.

4.2.1 Generalization bounds with input-output mutual information

Let S = (Z_1, Z_2, ..., Z_n) be a dataset of n i.i.d. examples sampled from P_Z and let Q_{W|S} be a training algorithm with hypothesis set W. Given a loss function ℓ: W × Z → R, the empirical risk of a hypothesis w is r_S(w) = (1/n) Σ_{i=1}^n ℓ(w, Z_i) and the population risk is R(w) = E_{Z'∼P_Z} ℓ(w, Z'). Let W ∼ Q_{W|S} be a random hypothesis drawn after training on S. We are interested in bounding the generalization gap R(W) − r_S(W), also sometimes referred to as the generalization error. Xu and Raginsky [2017] establish the following information-theoretic bound on the absolute value of the expected generalization gap.

Theorem 4.2.2 (Thm. 1 of Xu and Raginsky [2017]). Let W ∼ Q_{W|S}. If ℓ(w, Z'), where Z' ∼ P_Z, is σ-subgaussian for all w ∈ W, then

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ sqrt(2σ^2 I(W; S) / n).   (4.4)

We generalize this result by showing that instead of measuring information with the entire dataset, one can measure information with a subset of size m chosen uniformly at random. For brevity, hereafter we call subsets chosen uniformly at random just "random subsets".

Theorem 4.2.3. Let W ∼ Q_{W|S}. Let also U be a random subset of [n] of size m, independent of S and W. If ℓ(w, Z'), where Z' ∼ P_Z, is σ-subgaussian for all w ∈ W, then

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ E_{P_U} sqrt((2σ^2 / m) I_U(W; S_U)),   (4.5)

and

E_{P_{S,W}}[(R(W) − r_S(W))^2] ≤ (4σ^2 / n) (I(W; S) + log 3).   (4.6)

With a simple application of Markov's inequality one can get tail bounds from the second part of the theorem. Furthermore, by taking the square root of both sides of Eq. (4.6) and using Jensen's inequality on the left side, one can also construct an upper bound for the expected absolute value of the generalization gap, E_{P_{S,W}} |R(W) − r_S(W)|. These observations apply also to the other generalization gap bounds presented later in this work.
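For instance (a worked instance of the first observation, not a numbered result of this thesis), Markov's inequality applied to the squared gap in Eq. (4.6) yields, for any ε > 0,

P_{S,W}(|R(W) − r_S(W)| ≥ ε) = P_{S,W}((R(W) − r_S(W))^2 ≥ ε^2) ≤ E_{P_{S,W}}[(R(W) − r_S(W))^2] / ε^2 ≤ (4σ^2 / (n ε^2)) (I(W; S) + log 3).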
Note that the bound on the squared generalization gap is written only for the case of m = n. It is possible to derive squared generalization gap bounds of the form (4σ^2 / m) (I(W; S_U | U) + log 3). Unfortunately, for small m the log 3 constant starts to dominate, resulting in vacuous bounds. Picking a small m decreases the mutual information term in Eq. (4.5); however, it also decreases the denominator. When setting m = n, we get the bound of Xu and Raginsky [2017] (Theorem 4.2.2). When m = 1, the bound of Eq. (4.5) becomes

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ (1/n) Σ_{i=1}^n sqrt(2σ^2 I(W; Z_i)),   (4.7)

matching the result of Bu et al. [2020] (Proposition 1). A similar bound, but for a different notion of information, was derived by Alabdulmohsin [2015]. Bu et al. [2020] prove that the bound with m = 1 is tighter than the bound with m = n. We generalize this result by proving that the bound of Eq. (4.5) is non-decreasing in m.

Proposition 4.2.4. Let m ∈ [n−1], U be a random subset of [n] of size m, U' be a random subset of size m+1, and ϕ: R → R be any non-decreasing concave function. Then

E_{P_U} ϕ((1/m) I_U(W; S_U)) ≤ E_{P_{U'}} ϕ((1/(m+1)) I_{U'}(W; S_{U'})).   (4.8)

When ϕ(x) = √x, this result proves that the optimal value for m in Eq. (4.5) is 1. Furthermore, when we use Jensen's inequality to move the expectation over U inside the square root in Eq. (4.5), the resulting bound becomes sqrt((2σ^2 / m) I(W; S_U | U)) and matches the result of Negrea et al. [2019] (Thm. 2.3). These bounds are also non-decreasing with respect to m (using Proposition 4.2.4 with ϕ(x) = x).

Theorem 4.2.2 can be used to derive generalization bounds that depend on the information between W and a single example Z_i conditioned on the remaining examples Z_{−i} = (Z_1, ..., Z_{i−1}, Z_{i+1}, ..., Z_n).

Theorem 4.2.5. Let W ∼ Q_{W|S}. If ℓ(w, Z'), where Z' ∼ P_Z, is σ-subgaussian for all w ∈ W, then

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ (1/n) Σ_{i=1}^n sqrt(2σ^2 I(W; Z_i | Z_{−i})),   (4.9)

and

E_{P_{S,W}}[(R(W) − r_S(W))^2] ≤ (4σ^2 / n) (Σ_{i=1}^n I(W; Z_i | Z_{−i}) + log 3).   (4.10)

This theorem is a simple corollary of Theorem 4.2.3, using the facts that I(W; Z_i) ≤ I(W; Z_i | Z_{−i}) and that I(W; S) is upper bounded by Σ_{i=1}^n I(W; Z_i | Z_{−i}), which is also known as erasure information [Verdu and Weissman, 2008]. The first part improves the result of Raginsky et al. [2016] (Thm. 2), as the averaging over i is outside of the square root. While these bounds are looser than the corresponding bounds of Theorem 4.2.3, they are sometimes easier to manipulate analytically.

The bounds described above measure information with the output W of the training algorithm. In the case of prediction tasks with parametric methods, the parameters W might contain information about the training dataset but not use it to make predictions. Partly for this reason, our main goal is to derive generalization bounds that measure information with the prediction function, rather than with the weights. In general, there is no straightforward way of encoding the prediction function into a random variable. When the domain Z is finite, we can encode the prediction function as the collection of predictions on all examples of Z. While this approach is ineffective when |Z| is large or infinite, it suggests picking a random finite subset of examples Z' ⊂ Z on which to evaluate the learned function. Since the learned function might behave very differently on seen and unseen examples, Z' should include examples of both types.
This naturally leads us to the random subsampling setting introduced by Steinke and Zakynthinou [2020] (albeit with a different motivation), where one first fixes a set of 2n examples, and then randomly selects n of them to form the training set. Evaluations of the learned function on the 2n examples make a good representation of the learned function and allow us to derive functional generalization bounds (presented in Section 4.3). Before describing these bounds we present the setting of Steinke and Zakynthinou [2020] in detail and generalize some of the existing weight-based bounds in that setting.

4.2.2 Generalization bounds with conditional mutual information

Let Z̃ ∈ Z^{n×2} be a collection of 2n i.i.d. samples from P_Z, grouped in n pairs. The random variable J ∼ Uniform({0,1}^n) specifies which example to select from each pair to form the training set S = Z̃_J ≜ (Z̃_{i,J_i})_{i=1}^n. Let W ∼ Q_{W|S}. In this setting Steinke and Zakynthinou [2020] defined the conditional mutual information (CMI) of algorithm Q with respect to the data distribution P_Z as

CMI_P(Q) = I(W; J | Z̃) = E_{P_{Z̃}}[I_{Z̃}(W; J)],   (4.11)

and proved the following upper bound on the expected generalization gap.

Theorem 4.2.6 (Thm. 2, Steinke and Zakynthinou [2020]). Let W ∼ Q_{W|S}. If the loss function ℓ(w,z) ∈ [0,1], ∀w ∈ W, z ∈ Z, then the expected generalization gap can be bounded as follows:

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ sqrt((2/n) CMI_P(Q)).   (4.12)

Haghifam et al. [2020] improved this bound in two aspects. First, they provided bounds where the expectation over Z̃ is outside of the square root. Second, they considered measuring information with subsets of J, as we did in the previous section.

Theorem 4.2.7 (Thm. 3.1 of Haghifam et al. [2020]). Let W ∼ Q_{W|S}. Let also m ∈ [n] and U ⊆ [n] be a random subset of size m, independent of Z̃, J, and W. If the loss function ℓ(w,z) ∈ [0,1], ∀w ∈ W, z ∈ Z, then

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ E_{P_{Z̃}} sqrt((2/m) I_{Z̃}(W; J_U | U)).   (4.13)

Furthermore, for m = 1 they tighten the bound by showing that one can move the expectation over U outside of the square root (Haghifam et al. [2020], Thm. 3.4). We generalize these results by showing that for all m the expectation over U can be taken outside of the square root. Furthermore, our proof closely follows the proof of Theorem 4.2.3.

Theorem 4.2.8. Let W ∼ Q_{W|S}. Let also m ∈ [n] and U ⊆ [n] be a random subset of size m, independent of Z̃, J, and W. If ℓ(w,z) ∈ [0,1], ∀w ∈ W, z ∈ Z, then

|E_{P_{S,W}}[R(W) − r_S(W)]| ≤ E_{P_{Z̃,U}} sqrt((2/m) I_{Z̃,U}(W; J_U)),   (4.14)

and

E_{P_{S,W}}[(R(W) − r_S(W))^2] ≤ (8/n) (I(W; J | Z̃) + 2).   (4.15)

The bound of Eq. (4.14) improves over the bound of Theorem 4.2.7 and matches the special result for m = 1. Rodríguez-Gálvez et al. [2021] proved an even tighter expected generalization gap bound by replacing I_{Z̃,U}(W; J_U) with I_{Z̃_U,U}(W; J_U). Haghifam et al. [2020] showed that if one takes the expectation over Z̃ inside the square root in Eq. (4.13), then the resulting looser upper bounds become non-decreasing in m. Using this result they showed that their special-case bound for m = 1 is the tightest. We generalize their results by showing that even without taking the expectation inside the square root, the bounds of Theorem 4.2.7 are non-decreasing in m. We also show that the same holds for our tighter bounds of Eq. (4.14).

Proposition 4.2.9. Let W ∼ Q_{W|S}. Let also m ∈ [n−1], U be a random subset of [n] of size m, U' be a random subset of size m+1, and ϕ: R → R be any non-decreasing concave function. Then

E_{P_U} ϕ((1/m) I_{Z̃,U}(W; J_U)) ≤ E_{P_{U'}} ϕ((1/(m+1)) I_{Z̃,U'}(W; J_{U'})).   (4.16)
By setting ϕ(x) = x, taking the square root of both sides of Eq. (4.16), and then taking the expectation over Z̃, we prove that the bounds of Eq. (4.13) are non-decreasing in m. By setting ϕ(x) = √x and then taking the expectation over Z̃, we prove that the bounds of Eq. (4.14) are non-decreasing in m. Similarly to Theorem 4.2.5 of the previous section, the following result establishes generalization bounds with information-theoretic stability quantities.

Theorem 4.2.10. If ℓ(w,z) ∈ [0,1], ∀w ∈ W, z ∈ Z, then

|E_{P_{W,S}}[R(W) − r_S(W)]| ≤ E_{P_{Z̃}}[(1/n) Σ_{i=1}^n sqrt(2 I_{Z̃}(W; J_i | J_{−i}))],   (4.17)

and

E_{P_{W,S}}[(R(W) − r_S(W))^2] ≤ (8/n) (E_{P_{Z̃}}[Σ_{i=1}^n I_{Z̃}(W; J_i | J_{−i})] + 2).   (4.18)

4.3 Functional conditional mutual information

The bounds in Section 4.2 leverage information in the output of the algorithm, W. In this section we focus on supervised learning problems: Z = X × Y. To encompass many types of approaches, we do not assume that the training algorithm has an output W which is then used to make predictions. Instead, we assume that the learning method implements a function f: Z^n × X × R → K that takes a training set s, a test input x', and an auxiliary argument r capturing the stochasticity of training and predictions, and outputs a prediction f(s, x', r) on the test example. Note that the prediction domain K can be different from Y. This setting includes non-parametric methods (for which W is the training dataset itself), parametric methods, Bayesian algorithms, and more. For example, in parametric methods, where a hypothesis set H = {h_w: X → K | w ∈ W} is defined, f(s, x, r) = h_{A(s,r)}(x), with A(s,r) denoting the weights after training on dataset s with randomness r.

In this supervised setting, the loss function ℓ: K × Y → R measures the discrepancy between a prediction and a label. As in the previous subsection, we assume that a collection of 2n i.i.d. examples Z̃ ∼ P_Z^{n×2} is given, grouped in n pairs, and the random variable J ∼ Uniform({0,1}^n) specifies which example to select from each pair to form the training set S = Z̃_J ≜ (Z̃_{i,J_i})_{i=1}^n. Let R be an auxiliary random variable, independent of Z̃ and J, that provides stochasticity for predictions (e.g., in neural networks R can be used to make the training stochastic). The empirical risk of the learning method f trained on dataset S with randomness R is defined as r_S(f) = (1/n) Σ_{i=1}^n ℓ(f(S, X_i, R), Y_i). The population risk is defined as R(f) = E_{Z'∼P_Z} ℓ(f(S, X', R), Y').

Before moving forward we adopt two conventions. First, if z is a collection of examples, then x and y denote the collections of its inputs and labels, respectively. Second, if s is a training set and x is a collection of inputs, then f(s, x, r) denotes the collection of predictions on x after training on s with randomness r. Let F̃ denote the predictions on Z̃ after training on S with randomness R, i.e., F̃ = f(S, X̃, R). Given a subset u ⊂ [n], we use the notation F̃_u ∈ K^{|u|×2} to denote the rows of F̃ corresponding to u.

Definition 4.3.1. Let f, Z̃, J, R, and F̃ be defined as above and let u ⊆ [n] be a subset of size m. Then the pointwise functional conditional mutual information f-CMI(f, z̃, u) is defined as

f-CMI(f, z̃, u) = I_{Z̃=z̃}(F̃_u; J_u),   (4.19)

while the functional conditional mutual information f-CMI_P(f, u) is defined as

f-CMI_P(f, u) = E_{P_{Z̃}} f-CMI(f, Z̃, u).   (4.20)

When u = [n] we simply use the notations f-CMI(f, z̃) and f-CMI_P(f), instead of f-CMI(f, z̃, [n]) and f-CMI_P(f, [n]), respectively.

Theorem 4.3.1. Let U be a random subset of size m, independent of Z̃, J, and R.
If ℓ(ŷ, y) ∈ [0,1] for all ŷ ∈ K and y ∈ Y, then

|E_{P_{S,R}}[R(f) − r_S(f)]| ≤ E_{P_{Z̃,U}} sqrt((2/m) f-CMI(f, Z̃, U)),   (4.21)

and

E_{P_{S,R}}[(R(f) − r_S(f))^2] ≤ (8/n) (E_{P_{Z̃}} f-CMI(f, Z̃) + 2).   (4.22)

For parametric methods w = A(s, r), the bound of Eq. (4.21) improves over the bound of Eq. (4.14), as J_u — W — F̃_u is a Markov chain under P_{F̃,W,J|Z̃}, implying the following data processing inequality:

I_{Z̃}(F̃_u; J_u) ≤ I_{Z̃}(W; J_u).   (4.23)

For deterministic algorithms I_{Z̃}(W; J_u) is often equal to H(J_u) = m log 2, as most likely each choice of J produces a different W. In such cases the bound with I_{Z̃}(W; J_u) is vacuous. In contrast, the proposed bounds with f-CMI (especially when m = 1) do not have this problem. Even when the algorithm is stochastic, the information between W and J_u can be much larger than the information between predictions and J_u, as having access to the weights makes it easier to determine J_u (e.g., by using gradients). A similar phenomenon has been observed in the context of membership attacks, where having access to the weights of a neural network allows constructing more successful membership attacks compared to having access to predictions only [Nasr et al., 2019, Gupta et al., 2021].

Corollary 4.3.2. When m = n, the bound of Eq. (4.21) becomes

|E_{P_{S,R}}[R(f) − r_S(f)]| ≤ E_{P_{Z̃}} sqrt((2/n) f-CMI(f, Z̃)) ≤ sqrt((2/n) f-CMI_P(f)).   (4.24)

For parametric models, this improves over the CMI bound (Theorem 4.2.6), as by the data processing inequality, f-CMI_P(f) = I(F̃; J | Z̃) ≤ I(W; J | Z̃) = CMI_P(Q).

Remark 4.3.1. Note that the collection of training and testing predictions F̃ = f(Z̃_J, X̃, R) cannot be replaced with only the testing predictions f(Z̃_J, X̃_{J̄}, R). As an example, consider an algorithm that memorizes the training examples and outputs a constant prediction on any other example. This algorithm will have a non-zero generalization gap, but f(Z̃_J, X̃_{J̄}, R) will be constant and will have zero information with J conditioned on any random variable. Moreover, if we replace f(Z̃_J, X̃, R) with only the training predictions f(Z̃_J, X̃_J, R), the resulting bound can become too loose, as one can deduce J by comparing training set predictions with the labels Ỹ.

Corollary 4.3.3. When m = 1, the bound of Eq. (4.21) becomes

|E_{P_{S,R}}[R(f) − r_S(f)]| ≤ (1/n) Σ_{i=1}^n E_{P_{Z̃}} sqrt(2 I_{Z̃}(F̃_i; J_i)).   (4.25)

A great advantage of this bound compared to all other bounds described so far is that the mutual information term is computed between a relatively low-dimensional random variable F̃_i and a binary random variable J_i. For example, in the case of binary classification with K = {0,1}, F̃_i will be a pair of 2 binary variables. This allows us to estimate the bound efficiently and accurately. Note that estimating other information-theoretic bounds is significantly harder. The bounds of Xu and Raginsky [2017], Negrea et al. [2019], and Bu et al. [2020] are hard to estimate as they involve estimation of mutual information between a high-dimensional non-discrete variable W and at least one example Z_i. Furthermore, this mutual information can be infinite in the case of deterministic algorithms or when H(Z_i) is infinite. The bounds of Haghifam et al. [2020] and Steinke and Zakynthinou [2020] are also hard to estimate, as they involve estimation of mutual information between W and at least one train-test split variable J_i.

As in the case of the bounds presented in the previous section (Theorem 4.2.3 and Theorem 4.2.8), we prove that the bound of Theorem 4.3.1 is non-decreasing in m.
This stays true even when we increase the upper bounds by moving the expectation over U or the expectation over Z̃ or both under the square root. The following proposition allows us to prove all these statements.

Proposition 4.3.4. Let m ∈ [n−1], U be a random subset of [n] of size m, U' be a random subset of size m+1, and ϕ: R → R be any non-decreasing concave function. Then

E_{P_U} ϕ((1/m) I_{Z̃,U}(F̃_U; J_U)) ≤ E_{P_{U'}} ϕ((1/(m+1)) I_{Z̃,U'}(F̃_{U'}; J_{U'})).   (4.26)

By setting ϕ(x) = √x and then taking the expectation over Z̃ and U, we prove that the bounds of Theorem 4.3.1 are non-decreasing in m. By setting ϕ(x) = x, taking the expectation over Z̃, and then taking the square root of both sides of Eq. (4.26), we prove that the bounds are non-decreasing in m when both expectations are under the square root. Proposition 4.3.4 proves that m = 1 is the optimal choice in Theorem 4.3.1. Notably, the bound that is the easiest to compute is also the tightest!

Analogously to Theorem 4.2.10, we provide the following stability-based bounds.

Theorem 4.3.5. If ℓ(ŷ, y) ∈ [0,1] for all ŷ ∈ K and y ∈ Y, then

|E_{P_{S,R}}[R(f) − r_S(f)]| ≤ E_{P_{Z̃}}[(1/n) Σ_{i=1}^n sqrt(2 I_{Z̃}(F̃_i; J_i | J_{−i}))],   (4.27)

and

E_{P_{S,R}}[(R(f) − r_S(f))^2] ≤ (8/n) (E_{P_{Z̃}}[Σ_{i=1}^n I_{Z̃}(F̃; J_i | J_{−i})] + 2).   (4.28)

Note that unlike in Eq. (4.27), in the second part of Theorem 4.3.5 we measure information between the predictions on all 2n pairs and J_i conditioned on J_{−i}. Whether F̃ can be replaced with F̃_i, the predictions only on the i-th pair, is addressed in Chapter 5.

4.4 Applications

In this section we describe 3 applications of the f-CMI-based generalization bounds.

4.4.1 Ensembling algorithms

Ensembling algorithms combine the predictions of multiple learning algorithms to obtain better performance. Let us consider t learning algorithms, f_1, f_2, ..., f_t, each with its own independent randomness R_i, i ∈ [t]. Some ensembling algorithms can be viewed as a possibly stochastic function g: K^t → K that takes the predictions of the t algorithms and combines them into a single prediction. Relating the generalization gap of the resulting ensembling algorithm to that of the individual f_i's can be challenging for complicated choices of g. However, it is easy to bound the generalization gap of g(f_1, ..., f_t) in terms of the f-CMIs of the individual predictors. Let x' be an arbitrary collection of inputs. Denoting F_i = f_i(Z̃_J, x', R_i), i ∈ [t], we have that

I_{Z̃}(g(F_1, ..., F_t); J)
  (a)≤ I_{Z̃}(F_1, ..., F_t; J)   (4.29)
  (b)= I_{Z̃}(F_1; J) + I_{Z̃}(F_2, ..., F_t; J) − I_{Z̃}(F_1; F_2, ..., F_t) + I_{Z̃}(F_1; F_2, ..., F_t | J)   (4.30)
  (c)≤ I_{Z̃}(F_1; J) + I_{Z̃}(F_2, ..., F_t; J)   (4.31)
  (d)≤ ... ≤ I_{Z̃}(F_1; J) + ··· + I_{Z̃}(F_t; J),   (4.32)

where (a) follows from the data processing inequality, (b) follows from the chain rule, (c) follows from the non-negativity of mutual information and the fact that F_1 ⊥⊥ F_2, ..., F_t | J, and (d) is derived by repeating the arguments above to separate all F_i. Unfortunately, the same derivation does not work if we replace J with J_u, where u is a proper subset of [n], as I_{Z̃}(F_1; F_2, ..., F_t | J_u) ≠ 0 in general.

4.4.2 Binary classification with finite VC dimension

Let us consider the case of binary classification: Y = {0,1}, where the learning method f: Z^n × X × R → {0,1} is implemented using a learning algorithm A: Z^n × R → W that selects a classifier from a hypothesis set H = {h_w: X → Y}. If H has finite VC dimension d [Vapnik, 1998], then for any algorithm f, the quantity f-CMI(f, z̃) can be bounded in the following way.

Theorem 4.4.1.
Let Z, H, f be defined as above, and let d < ∞ be the VC dimension of H. Then for any algorithm f and z̃ ∈ Z^{n×2},

f-CMI(f, z̃) ≤ max{(d+1) log 2, d log(2en/d)}.   (4.33)

Considering the 0-1 loss function and using this result in Corollary 4.3.2, we get an expected generalization gap bound of order sqrt((d/n) log(n/d)), matching the classical uniform convergence bound [Vapnik, 1998]. The sqrt(log n) factor can be removed in some cases [Hafez-Kolahi et al., 2020]. Both Xu and Raginsky [2017] and Steinke and Zakynthinou [2020] prove similar information-theoretic bounds in the case of finite VC dimension classes, but their results hold for specific algorithms only. Even in the simple case of threshold functions, X = [0,1] and H = {h_w: x ↦ 1{x > w} | w ∈ [0,1]}, all weight-based bounds described in Section 4.2 are vacuous if one uses a training algorithm that encodes the training set in insignificant bits of W, while still getting zero error on the training set and hence achieving low test error.

4.4.3 Stable deterministic or stochastic algorithms

Theorems 4.2.5, 4.2.10 and 4.3.5 provide generalization bounds involving information-theoretic stability measures, such as I(W; Z_i | Z_{−i}), I_{Z̃}(W; J_i | J_{−i}) and I_{Z̃}(F̃; J_i | J_{−i}). In this section we build upon the prediction-based stability bounds of Theorem 4.3.5. First, we show that for any collection of examples x', the mutual information I_{Z̃}(f(Z̃_J, x', R); J_i | J_{−i}) can be bounded as follows.

Proposition 4.4.2. Let J^{i←c} denote J with J_i set to c ∈ {0,1}. Then for any z̃ ∈ Z^{n×2} and x' ∈ X^k, the mutual information I(f(z̃_J, x', R); J_i | J_{−i}) is upper bounded by

(1/4) KL(f(z̃_{J^{i←1}}, x', R) | J_{−i} ‖ f(z̃_{J^{i←0}}, x', R) | J_{−i}) + (1/4) KL(f(z̃_{J^{i←0}}, x', R) | J_{−i} ‖ f(z̃_{J^{i←1}}, x', R) | J_{−i}).

To compute the right-hand side of Proposition 4.4.2 one needs to know how much, on average, the distribution of predictions on x' changes after replacing the i-th example in the training dataset. A problem arises when we consider deterministic algorithms: in such cases the right-hand side is infinite, while the left-hand side I(f(z̃_J, x', R); J_i | J_{−i}) is always finite and could be small. Therefore, for deterministic algorithms, directly applying the result of Proposition 4.4.2 will not give meaningful generalization bounds. Nevertheless, we show that we can add an optimal amount of noise to the predictions, upper bound the generalization gap of the resulting noisy algorithm, and relate that to the generalization gap of the original deterministic algorithm. Let us consider a deterministic algorithm f: Z^n × X → R^k. We define the following notions of functional stability.

Definition 4.4.1 (Functional stability). Let S = (Z_1, ..., Z_n) be a collection of n i.i.d. samples from P_Z, and let Z' and Z_test be two additional independent samples from P_Z. Let S^{(i)} ≜ (Z_1, ..., Z_{i−1}, Z', Z_{i+1}, ..., Z_n) be the collection constructed from S by replacing the i-th example with Z'. A deterministic algorithm f: Z^n × X → R^k is

a) β self-stable if ∀i ∈ [n], E_{S,Z'} ‖f(S, Z_i) − f(S^{(i)}, Z_i)‖^2 ≤ β^2,   (4.34)

b) β_1 test-stable if ∀i ∈ [n], E_{S,Z',Z_test} ‖f(S, Z_test) − f(S^{(i)}, Z_test)‖^2 ≤ β_1^2,   (4.35)

c) β_2 train-stable if ∀i,j ∈ [n], i ≠ j, E_{S,Z'} ‖f(S, Z_j) − f(S^{(i)}, Z_j)‖^2 ≤ β_2^2.   (4.36)

Theorem 4.4.3. Let Y = R^k, let f: Z^n × X → R^k be a deterministic algorithm that is β self-stable, and let ℓ(ŷ, y) ∈ [0,1] be a loss function that is γ-Lipschitz in the first coordinate. Then

|E_{S,R}[R(f) − r_S(f)]| ≤ 2^{3/2} d^{1/4} sqrt(γβ).   (4.37)
Furthermore, if f is also β_1 test-stable and β_2 train-stable, then

E_{S,R}[(R(f) − r_S(f))^2] ≤ 32/n + 12^{3/2} √d γ sqrt(2β^2 + nβ_1^2 + nβ_2^2).   (4.38)

It is expected that β_2 is smaller than β and β_1. For example, in the case of neural networks interpolating the training data, or in the case of empirical risk minimization in the realizable setting, β_2 will be zero. It is also expected that β is larger than β_1; however, the relation between β^2 and nβ_1^2 is not trivial. The notion of pointwise hypothesis stability β'_2 defined by Bousquet and Elisseeff [2002] (Definition 4) is comparable to our notion of self-stability β. The first part of Theorem 11 in Bousquet and Elisseeff [2002] describes a generalization bound where the difference between empirical and population losses is of order 1/√n + sqrt(β'_2), which is comparable with our result of Theorem 4.4.3 (Θ(√β)). The proof there also contains a bound on the expected squared difference of empirical and population losses; that bound is of order 1/n + β'_2. In contrast, our result of Eq. (4.38) contains two extra terms related to test-stability and train-stability (the terms nβ_1^2 and nβ_2^2). If β dominates nβ_1^2 + nβ_2^2, then the bound of Eq. (4.38) will match the result of Bousquet and Elisseeff [2002].

4.5 Experiments

As mentioned earlier, the expected generalization gap bound of Corollary 4.3.3 is significantly easier to compute compared to existing information-theoretic bounds, and does not give trivial results for deterministic algorithms. To understand how well the bound does in challenging situations, we consider cases when the algorithm generalizes well despite the high complexity of the hypothesis class and the relatively small number of training examples. The code for reproducing the experiments can be found at github.com/hrayrhar/f-CMI. The exact experimental details are presented in Tables 6.7 to 6.10 of Appendix I.

4.5.1 Experimental setup

Estimation of the generalization gap. In all experiments below we draw k_1 samples of Z̃, each time by randomly drawing 2n examples from the corresponding dataset and grouping them into n pairs. For each sample z̃, we draw k_2 samples of the training/test split variable J and the randomness R. We then run the training algorithm on these k_2 splits (in total k_1 k_2 runs). For each z̃, j, and r, we estimate the population risk with the average error on the test examples z̃_{j̄}. For each z̃, we average over the k_2 samples of J and R to get an estimate ĝ(z̃) of E_{J,R | Z̃=z̃}[ℓ(F̃_{J̄}, Ỹ_{J̄}) − ℓ(F̃_J, Ỹ_J)]. Note that this latter quantity is not the expected generalization gap yet, as it still misses an expectation over Z̃. In the figures of this section, we report the mean and standard deviation of ĝ(Z̃) estimated using the k_1 samples of Z̃, unless k_1 = 1, in which case we only report the single estimate. Note that this mean is an unbiased estimate of the true expected generalization gap E_{S,R}[R(f) − r_S(f)].

Estimation of the f-CMI bound. Similarly, for each z̃ we use the k_2 samples of J and R to estimate f-CMI(f, z̃, {i}) = I_{Z̃=z̃}(F̃_i; J_i), i ∈ [n]. As all considered cases are classification problems (i.e., have discrete output variables), this is done straightforwardly by estimating all the states of the joint distribution of F̃_i and J_i given Z̃ = z̃, and then using a plug-in estimator of mutual information. The bias of this plug-in estimator is of order 1/k_2, while the variance is of order (log k_2)^2 / k_2 [Paninski, 2003]. To estimate f-CMI_P(f, {i}) = E_{Z̃}[f-CMI(f, Z̃, {i})] we use the k_1 samples of Z̃. After this step the estimation bias stays the same, while the variance increases by 1/k_1. In the figures of this section, we report the mean and standard deviation of our estimate of f-CMI_P(f, {i}) computed using the k_1 samples of Z̃.
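As an illustration, a plug-in estimate of I_{Z̃=z̃}(F̃_i; J_i) from the k_2 runs could be computed as in the following sketch; the pred_pairs and j_samples arrays are hypothetical stand-ins for the recorded predictions on the i-th pair and the corresponding values of J_i.

import numpy as np
from collections import Counter

def plug_in_mi(f_samples, j_samples):
    """Plug-in estimate of I(F_i; J_i) in nats from k_2 paired samples.
    f_samples: list of hashable predictions on the i-th pair (e.g., tuples of class labels);
    j_samples: array of the corresponding binary train/test split values J_i."""
    k2 = len(j_samples)
    joint = Counter(zip(map(tuple, f_samples), j_samples))
    pf = Counter(map(tuple, f_samples))
    pj = Counter(j_samples)
    mi = 0.0
    for (f, j), c in joint.items():
        p_fj = c / k2
        mi += p_fj * np.log(p_fj * k2 * k2 / (pf[f] * pj[j]))
    return max(mi, 0.0)

# Toy usage: k_2 = 30 runs; fake predictions that are correlated with J_i.
rng = np.random.default_rng(0)
j_samples = rng.integers(0, 2, size=30)
pred_pairs = [(int(j), rng.integers(0, 2)) for j in j_samples]
print(plug_in_mi(pred_pairs, j_samples))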
To estimate f-CMI P (f,{i}) = E ˜ Z h f-CMI(f, ˜ Z,{i}) i we usek 1 samples of ˜ Z. After this step the estimation bias stays the same, while the variance increases by 1 k 1 . In figures of this section, we report the mean and standard deviation of our estimate off-CMI P (f,{i}) computed usingk 1 samples of ˜ Z. 4.5.2 Resultsanddiscussion First, we consider the MNIST 4 vs 9 digit classification task [LeCun et al., 2010] using a 5-layer convolutional neural network (CNN) described in Table 2.1 (but with 2 output units) that has approximately 200K parame- ters. We train the network with the cross-entropy loss for 200 epochs using the ADAM algorithm [Kingma and Ba, 2014] with 0.001 learning rate,β 1 = 0.9, and mini-batches of 128 examples. Importantly, we fix the random seed that controls the initialization of weights and the shuffling of training data, making the training algorithm deterministic. To estimate generalization gap and the bound, we setk 1 =5 andk 2 =30. Figure 4.1a plots the expected generalization gap and thef-CMI bound of Eq. (4.25). We see that the bound is not vacuous and is not too far from the expected generalization gap even when considering only 75 training examples – multiple orders of magnitude smaller than the number of parameters. As shown in the 49 75 250 1000 4000 n 0.00 0.05 0.10 0.15 0.20 0.25 Error generalization gap f-CMI bound (a) using the default parameters 75 250 1000 4000 n 0.0 0.1 0.2 0.3 Error generalization gap f-CMI bound (b) using 4 times wider network 75 250 1000 4000 n 0.0 0.1 0.2 0.3 Error generalization gap f-CMI bound (c) training with a random seed Figure 4.1: Comparison of expected generalization gap andf-CMI bound for MNIST 4 vs 9 classification with a 5-layer CNN. Panel (a) shows the results for the fixed-seed deterministic algorithm. Panel (b) repeats the experiment of panel (a) while modifying the network to have 4 times more neurons at each hidden layer. Panel (c) repeats the experiment of panel (a) while making the training algorithm stochastic by randomizing the seed. 1000 5000 20000 n 0.05 0.15 0.25 0.35 Error generalization gap f-CMI bound Figure 4.2: Comparison of expected generalization gap andf-CMI bound for a pretrained ResNet-50 fine- tuned on CIFAR-10 in a standard fashion. Figure 4.1b, if we increase the width of all layers 4 times, making the number of parameters approximately 3M, the results remain largely unchanged. Next, we move away from binary classification and consider the CIFAR-10 classication task [Krizhevsky et al., 2009]. To construct a well-generalizing algorithm, we use the ResNet-50 [He et al., 2016] network pretrained on the ImageNet [Deng et al., 2009], and fine-tune it with the cross-entropy loss for 40 epochs using SGD with mini-batches of size 64, 0.01 learning rate, 0.9 momentum, and standard data augmentations. The results presented in Figure 4.2 indicate that thef-CMI bound is always approximately 3 times larger than the expected generalization gap. In particular, whenn=20000, the expected generalization gap is 5%, while the bound predicts 16%. Note that the weight-based information-theoretic bounds discussed in Section 4.2 would give either infinite or trivial bounds for the deterministic algorithm described above. Even when we make the training algorithm stochastic by randomizing the seed, the quantities likeI(W;S) still remain infinite, while both the generalization gap and thef-CMI bound do not change significantly (see Figure 4.1c). 
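Putting the two estimation steps of Section 4.5.1 together, the full procedure can be summarized by the following sketch. It assumes a user-supplied train_and_predict routine standing in for the actual training code, reuses plugin_mutual_information and the NumPy import from the sketch above, and uses generic sqrt(2·I) constants, so it illustrates the procedure rather than reproducing the released code exactly.

def estimate_gap_and_fcmi_bound(dataset, train_and_predict, n, k1=5, k2=30, seed=0):
    # dataset: list of (x, y) pairs with at least 2n elements;
    # train_and_predict(train_set, xs) -> list of predicted labels for xs.
    rng = np.random.default_rng(seed)
    gap_estimates, bound_estimates = [], []
    for _ in range(k1):                       # k1 draws of z_tilde
        idx = rng.choice(len(dataset), size=2 * n, replace=False)
        pairs = [(dataset[idx[2 * i]], dataset[idx[2 * i + 1]]) for i in range(n)]
        preds, bits, gaps = [], [], []
        for _ in range(k2):                   # k2 draws of the split J (and randomness R)
            j = rng.integers(0, 2, size=n)
            train = [pairs[i][j[i]] for i in range(n)]
            held_out = [pairs[i][1 - j[i]] for i in range(n)]
            xs = [x for x, _ in train] + [x for x, _ in held_out]
            yhat = train_and_predict(train, xs)
            train_err = np.mean([yhat[i] != train[i][1] for i in range(n)])
            test_err = np.mean([yhat[n + i] != held_out[i][1] for i in range(n)])
            gaps.append(test_err - train_err)
            # store predictions in the fixed order (z_tilde[i, 0], z_tilde[i, 1])
            preds.append([(yhat[i], yhat[n + i]) if j[i] == 0 else (yhat[n + i], yhat[i])
                          for i in range(n)])
            bits.append(j)
        gap_estimates.append(np.mean(gaps))
        mi = [plugin_mutual_information([preds[t][i] for t in range(k2)],
                                        [bits[t][i] for t in range(k2)])
              for i in range(n)]
        bound_estimates.append(np.mean([np.sqrt(2.0 * m) for m in mi]))
    return np.mean(gap_estimates), np.mean(bound_estimates)

With the settings used above (k_1 = 5, k_2 = 30), each reported point summarizes 150 training runs.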
For this reason, we change the training algorithm to Stochastic Gradient Langevin Dynamics (SGLD) [Gelfand and Mitter, 1991, Welling and Teh, 2011] and compare thef-CMI-based bound against the specialized bound of Negrea et al. [2019] (see eq. (6) of [Negrea et al., 2019]). This bound (referred as SGLD bound here) is derived from a weight-based information-theoretic generalization bound, and depends on the the hyperparameters of SGLD and on the variance of per-example gradients along the training trajectory. The SGLD algorithm is trained for 40 epochs, with learning rate and inverse temperature schedules described in Table 6.10 of Appendix I. Figure 4.3a plots the expected generalization gap, the expected test error, thef-CMI bound and 50 10 20 30 40 Epochs 0.00 0.06 0.12 0.18 0.24 0.30 Error generalization gap f-CMI bound Negrea et al. bound test error (a) MNIST 4 vs 9 2.5 5.0 7.5 10.0 12.5 15.0 Epochs 0 1 2 3 Error generalization gap f-CMI bound Negrea et al. bound test error (b) CIFAR-10 2.5 5.0 7.5 10.0 12.5 15.0 Epochs 0.02 0.04 0.06 0.08 0.10 Error generalization gap f-CMI bound test error (c) CIFAR-10, zoomed-in Figure 4.3: Comparison of expected generalization gap, Negrea et al. [2019] SGLD bound andf-CMI bound in case of a pretrained ResNet-50 fine-tuned with SGLD on a subset of CIFAR-10 of size n=20000. The figure on the right is the zoomed-in version of the figure in the middle. the SGLD bound. We see that the test accuracy plateaus after 16 epochs. Starting at epoch 8, the estimated f-CMI bound is always smaller than the SGLD bound. Furthermore, as one increases the number of epochs, the former stays small, while the latter increases to very high values. The difference between the f-CMI bound and the SGLD bound becomes more striking when we change the dataset to be a subset of CIFAR-10 consisting of 20000 examples, and fine-tune a pretrained ResNet-50 with SGLD. As shown in Figure 4.3, even after a single epoch the SGLD bound is approximately 0.45, while the generalization gap is around 0.02. For comparison, thef-CMI is approximately 0.1 after one epoch of training. 4.6 Relatedwork This work is closely related to a rich literature of information-theoretic generalization bounds, some of which were discussed earlier [Xu and Raginsky, 2017, Bassily et al., 2018, Pensia et al., 2018, Negrea et al., 2019, Bu et al., 2020, Steinke and Zakynthinou, 2020, Haghifam et al., 2020, Hafez-Kolahi et al., 2020, Alabdulmohsin, 2020, Neu et al., 2021, Raginsky et al., 2021, Esposito et al., 2021]. Most of these work derive generalization bounds that depend on a mutual information quantity measured between the output of the training algorithm and some quantity related to training data. Different from this major idea, Xu and Raginsky [2017] and Russo and Zou [2019] discussed the idea of bounding generalization gap with the information between the input and the vector of loss functions computed on training examples. This idea was later extended to the setting of conditional mutual information by Steinke and Zakynthinou [2020]. These works are similar to ours in the sense that they move away from measuring information with weights, but they did not develop this line of reasoning enough to arrive to efficient bounds similar to Corollary 4.3.3. Additionally, we believe that measuring information with the prediction function allows better interpretation and is easier to work with analytically. 
Another related line of research are the stability-based bounds [Bousquet and Elisseeff, 2002, Alab- dulmohsin, 2015, Raginsky et al., 2016, Bassily et al., 2016, Feldman and Steinke, 2018, Wang et al., 2016, Raginsky et al., 2021]. In Section 4.2 and Section 4.3 we improve existing generalization bounds that use information stability. In Section 4.4.3 we describe a technique of applying information stability bounds to deterministic algorithms. The main idea is to add noise to predictions, but only for analysis purposes. We employed the same idea in Chapter 3 when defining smooth unique information of an individual example. In fact, our notion of test-stability defined in Section 4.4.3 comes very close to the definition of functional sample information (Eq. (3.7)). A similar idea was used by Neu et al. [2021] in analyzing generalization 51 performance of SGD. More broadly our work is related to PAC-Bayes bounds and to classical generalization bounds. Please refer to the the survey by Jiang* et al. [2020] for more information on these bounds. Finally, our work has connections with attribute and membership inference attacks [Shokri et al., 2017, Yeom et al., 2018, Nasr et al., 2019, Gupta et al., 2021]. Some of these works show that having a white-box access to models allows constructing better membership inference attacks, compared to having a black-box access. This is analogous to our observation that prediction-based bounds are better than weight-based bounds. Shokri et al. [2017] and Yeom et al. [2018] demonstrate that even in the case ofblack-boxaccesstoa well-generalizing model, sometimes it is still possible to construct successful membership attacks. This is in line with our observation that thef-CMI bound can be significantly large, despite of small generalization gap (see epoch 4 of Figure 4.3a). This suggests a possible direction of improving thef-CMI-based bounds. 4.7 Conclusion We derived information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We showed experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning. 52 Chapter5 Formallimitationsofsample-wiseinformation-theoreticgeneralization bounds 5.1 Introduction In the previous chapter we discussed various information-theoretic bounds based on different notions of training set information captured by the training algorithm [Xu and Raginsky, 2017, Bassily et al., 2018, Negrea et al., 2019, Bu et al., 2020, Steinke and Zakynthinou, 2020, Haghifam et al., 2020, Neu et al., 2021, Raginsky et al., 2021, Hellström and Durisi, 2020, Esposito et al., 2021]. The data and algorithm dependent nature of these bounds make them applicable to typical settings of deep learning, where powerful and overparameterized neural networks are employed. Related to the considered information- theoretic generalization bounds are PAC-Bayes bounds, which are usually based on the Kullback-Leilber divergence from the distribution of hypotheses after training (the “posterior” distribution) to a fixed “prior” distribution [Shawe-Taylor and Williamson, 1997, McAllester, 1999b,a, Catoni, 2007, Alquier, 2021]. 
For both types of bounds the main conclusion is that when a training algorithm captures little information about the training data then the generalization gap should be small. A few works demonstrated that some of the best information-theoretic and PAC-Bayes generalization bounds produce nonvacuous results in practical settings of deep learning [Dziugaite and Roy, 2017, Pérez-Ortiz et al., 2021, Harutyunyan et al., 2021b]. A key ingredient in recent improvements of information-theoretic generalization bounds was the introduction of sample-wise information bounds by Bu et al. [2020], where one measures how much information on average the learned hypothesis has about a single training example. While PAC-Bayes and information-theoretic bounds are intimately connected to each other, the technique of measuring information with single examples has not appeared in PAC-Bayes bounds. In this chapter we explain the curious omission of single example PAC-Bayes bounds by proving the non-existence of such bounds, revealing a striking difference between information-theoretic and PAC-Bayesian perspectives. The reason for this difference is that PAC-Bayes upper bounds the probability of average population and empirical risks being far from each other, while information-theoretic generalization methods upper bound expected difference of population and empirical risks, which is an easier task. 5.1.1 Preliminaries Let us consider again the abstract learning setting (not necessarily supervised), in which the learner observes a collection ofn i.i.d examplesS =(Z 1 ,...,Z n ) sampled fromP Z and outputs a hypothesisW ∼ Q W|S belonging to a hypothesis spaceW. As before,P S andQ W|S induce a joint probabilityP W,S onW×Z n . In this setting one can consider different types of generalization bounds. 53 Expectedgeneralizationgapbounds. The simplest quantity to consider is the expected generalization gap: E P W,S [R(W)− r S (W)] , which is the left-hand side of the majority of the generalization bounds presented in the previous chapter. We saw that when ℓ(w,Z ′ ) with Z ′ ∼ P Z is σ -subgaussian for all w∈W, the following bounds hold [Xu and Raginsky, 2017, Bu et al., 2020, Theorem 4.2.3]: E P W,S [R(W)− r S (W)] ≤ 1 n n X i=1 p 2σ 2 I(W;Z i ) (5.1) ≤ E U "r 2σ 2 I U (W;S U ) m # (5.2) ≤ r 2σ 2 I(W;S) n , (5.3) where U is a uniformly random subset of [n], independent of S and W . We call bounds like Eq. (5.1) that depend on information quantities related to individual examples sample-wise bounds. The difference between sample-wise bounds and bounds that depend onI(W;S) can be significant. In fact, there are cases when the sample-wise bound of Eq. (5.1) is finite, while the bound of Eq. (5.3) is infinite [Bu et al., 2020]. PAC-Bayesbounds. A practically more useful quantity is the average difference between population and empirical risks for a fixed training set S:E P W|S [R(W)− r S (W)]. Typical PAC-Bayes bounds are of the following form: with probability at least1− δ overP S , E P W|S [R(W)− r S (W)]≤ B KL P W|S ∥π ,n,δ , (5.4) whereπ is a prior distribution overW that does not depend onS. If we choose the prior distribution to be the marginal distribution ofW (i.e.,π =P W =E S Q W|S ), then the KL term in Eq. (5.4) will be the integrand of the mutual information, asI(W;S)=E S KL P W|S ∥P W . When the functionB depends on the KL term linearly, the expectation of the bound overS will depend on the mutual informationI(W;S). 
There are no known PAC-Bayes bounds whereB depends on KL divergences of formKL P W|Z i ∥P W or on sample-wise mutual informationI(W;Z i ). Single-drawbounds. In practice, even when the learning algorithm is stochastic, one usually draws a single hypothesisW ∼ Q W|S and is interested in bounding the population risk ofW . Such bounds are called single-draw bounds and are usually of the following form: with probability at least1− δ overP W,S : R(W)− r S (W)≤ B dQ W|S dπ ,n,δ , (5.5) whereπ is a prior distribution as in PAC-Bayes bounds. Whenπ =P W the single-draw bounds depends on the information densityι (W,S)= dQ W|S dP W . Single-draw bounds are the hardest to obtain and no sample-wise versions are known for them either. Expectedsquaredgeneralizationgapbounds. In terms of difficulty, expected generalization bounds are the easiest to obtain, and as indicated above, some of those bounds are sample-wise bounds. If we consider a simple change of moving the absolute value inside:E P W,S |R(W)− r S (W)|, then no sample-wise bounds are known. The same is true for the expected squared generalization gapE P W,S (R(W)− r S (W)) 2 , 54 for which the known bounds are of the following form [Steinke and Zakynthinou, 2020, Aminian et al., 2021a, Theorem 4.2.3]: E P W,S h (R(W)− r S (W)) 2 i ≤ I(W;S)+c n , (5.6) where c is some constant. From this result one can upper bound the expected absolute value of generalization gap using Jensen’s inequality. 5.1.2 Ourcontributions In Section 5.3 we show that even for the expected squared generalization gap, sample-wise information- theoretic bounds are impossible. The same holds for PAC-Bayes and single-draw bounds as well. In Section 5.4 we discuss the consequences for other information-theoretic generalization bounds. Finally, in Section 5.5 we show that starting at subsets of size 2, there are expected squared generalization gap bounds that measure information betweenW and a subset of examples. This in turn implies that such PAC-Bayes and single-draw bounds are also possible, albeit they are not tight and high-probability bounds. 5.2 Ausefullemma We first prove a lemma that will be used in the proof of the main result of the next section. Lemma5.2.1. Consider a collection of bitsa 1 ,...,a N 0 +N 1 , such thatN 0 of them are zero and the remaining N 1 of them are one. We want to partition these numbers intok =(N 0 +N 1 )/n groups of sizen (assuming thatn dividesN 0 +N 1 ). Consider a uniformly random ordered partition(A 1 ,...,A k ). LetY i =⊕ a∈A i a be the parity of numbers of subsetA i . For any givenδ > 0, there existsN ′ such that whenmin{N 0 ,N 1 }≥ N ′ thenCov[Y i ,Y j ]≤ δ E[Y 1 ] 2 , for alli,j∈[k],i̸=j. Proof. By symmetry all random variablesY i are identically distributed. Without loss of generality let’s prove the result for the covariance ofY 1 andY 2 , which can be written as follows: Cov[Y 1 ,Y 2 ]=E[Y 1 Y 2 ]− E[Y 1 ]E[Y 2 ] =P Y 1 ,Y 2 (Y 1 =1,Y 2 =1)− P(Y 1 =1)P(Y 2 =1). (5.7) Consider the process of generating a uniformly random ordered partition by first picking n elements for the first subset, then n elements for the second subset, and so on. In this scheme, the probability that both Y 1 =1 andY 2 =1 equals to 1 M ⌊(n− 1)/2⌋ X u=0 ⌊(n− 1)/2⌋ X v=0 N 1 2u+1 N 0 n− 2u− 1 N 1 − 2u− 1 2v+1 N 0 − n+2u+1 n− 2v− 1 , (5.8) where M = N 0 +N1 n N 0 +N 1 − n n . (5.9) Letq u,v be the(u,v)-th summand of Eq. (5.8) divided byM. 
On the other hand, the product of marginals P(Y 1 =1)P(Y 2 =1) is equal to 1 M ′ ⌊(n− 1)/2⌋ X u=0 ⌊(n− 1)/2⌋ X v=0 N 1 2u+1 N 0 n− 2u− 1 N 1 2v+1 N 0 n− 2v− 1 , (5.10) 55 where M ′ = N 0 +N 1 n N 0 +N 1 n . (5.11) Letq ′ u,v be the(u,v)-th summand of Eq. (5.10) divided byM ′ . Consider the ratioq u,v /q ′ u,v : q u,v q ′ u,v = N 0 +N 1 n N 0 +N 1 − n n · N 1 − 2u− 1 2v+1 N 1 2v+1 · N 0 − n+2u+1 n− 2v− 1 N 0 n− 2v− 1 . (5.12) By pickingN 0 andN 1 large enough, we will makeq u,v close toq ′ u,v . This is possible as for any fixed n, these 3 fractions converge to 1 asmin{N 0 ,N 1 }→∞. Therefore, for anyδ > 0 there existsN ′ such that when min{N 0 ,N 1 }≥ N ′ thenq u,v ≤ (1+δ )q ′ u,v . This implies thatP Y 1 ,Y 2 (Y 1 =1,Y 2 =1)≤ (1+δ )P(Y 1 = 1)P(Y 2 =1). Combining this result with Eq. (5.7) proves the desired result. 5.3 Acounterexample Theorem 5.3.1. For any training set sizen = 2 r andδ > 0, there exists a finite input space Z, a data distributionP Z , a learning algorithmQ W|S with a finite hypothesis space W, and a binary loss function ℓ:W×Z →{ 0,1} such that (a) KL Q W|S ∥P W ≥ n− 1 with probability at least1− δ , (b) W andZ i are independent for eachi∈[n], (c) E P W,S [R(W)− r S (W)]=0, (d) P W,S R(W)− r S (W)≥ 1 4 ≥ 1 2 . This result shows that there can be no meaningful sample-wise expected squared generalization bounds, asI(W;Z i ) = 0 whileE P W,S (R(W)− r S (W)) 2 ≥ 1/16. Similarly, there can be no sample-wise PAC- Bayes or single-draw generalization bounds that depend on quantities such asKL P W|Z i ∥P W ,I(W;Z i ) orι (W,Z i ), as all of them are zero while with probability at least1/2 the generalization gap will be at least 1/4. The first property verifies that in order to make the generalization gap large W needs to capture a significant amount of information about the training set. The main idea behind the counterexample construction is to ensure thatW contains sufficient informa- tion about the whole training set but no information about individual examples. This captured information is then used to make the losses on all training examples equal to each other, but possibly different for different training sets. This way we induce significant variance of empirical risk. Along with making the population risk to be roughly the same for all hypotheses, we ensure that the generalization gap will be large on average. Satisfying the information and risk constraints separately is trivial, the challenge is in satisfying both at the same time. To better understand this result and its implications, it is instructive to go into the details of the construction. Proof of Theorem 5.3.1. Letn=2 r andZ be the set of all binary vectors of sized:Z ={0,1} d withd>r. LetN =2 d denote the cardinality of the input space. We choose the data distribution to be the uniform distribution onZ:P Z =Uniform(Z). Let the hypothesis setW be the set of all partitions ofZ into subsets of sizen: W = {A 1 ,...,A N/n }||A i |=n,∪ i A i =Z,A i ∩A j =∅ . (5.13) 56 This hypothesis space is finite, namely, |W| = N! (n!) N/n (N/n)! . When the training algorithm receives a training set S that contains duplicate examples, it outputs a hypothesis fromW uniformly at random. WhenS contains no duplicates, then the learning algorithm outputs a uniformly random hypothesis from the setW S ⊂W of hypotheses/partitions that containS as a partition subset. Formally, Q W|S = Uniform(W), ifS has duplicates, Uniform(W S ), otherwise, (5.14) whereW S ≜ A 1 ,...,A N/n ∈W |∃i s.t. A i =S . 
Let ρ (n,d) be the probability of S containing duplicate examples. By pickingd large enough we can makeρ as small as needed. Given a partitionw ={A 1 ,...,A 2 d /n }∈W and an examplez∈Z, we define [z] w to be the subset A i ∈ w that contains z. Given a set of examples A ⊂ Z , we define ⊕ (2) (A) to be xor of all bits of all examples ofA: ⊕ (2) (A)=⊕ (a 1 ,...,a d )∈A ⊕ d i=1 a i . (5.15) Finally, we define the loss function as follows: ℓ(w,z)=⊕ (2) ([z] w )∈{0,1}. (5.16) LetW ∼ Q W|S . Let us verify now that properties (a)-(d) listed in the statement of Theorem 5.3.1 hold. (a). By symmetry, the marginal distributionP W =E S Q W|S will be the uniform distribution overW. With probability1− ρ (n,d), the training setS has no duplicates andQ W|S = Uniform(W S ). In such cases, the support size ofQ W|S is equal to (N− n)! (n!) N/n− 1 (N/n− 1)! , while the support size ofP W is always equal to N! (n!) N/n (N/n)! . Therefore, KL Q W|S ∥P W =log |supp(P W )| supp(Q W|S ) ! (5.17) =log (N− n+1)(N− n+2)··· N n!(N/n) (5.18) ≥ log n(N− n+1) n n n N (5.19) =nlog(N− n+1)− (n− 1)logn− logN (5.20) ≈ (n− 1)dlog2− nlogn. (for a suff. large d) (5.21) For larged, the quantity above is approximately(n− 1)dlog2, which is expected as knowledge of anyZ i (d bits) along withW is enough to reconstruct the training setS that hasnd bits of entropy. To satisfy the property (a), we need to pickd large enough to makeρ (n,d)<δ and(n− 1)dlog2− nlogn≥ n− 1. (b). Consider anyz∈Z and anyi∈[n]. Then P W|Z i =z (W =w)= P W (W =w)P Z i |W=w (Z i =z) P Z (Z =z) (5.22) =P W (W =w), (5.23) 57 where the second equality follows from the fact that conditioned on a fixed partition w, because of the symmetry, there should be no difference between probabilities of different Z i values. (c). With probability1− ρ (n,d), r S (W)= 1 n n X i=1 ℓ(W,Z i ) (5.24) = 1 n n X i=1 ⊕ (2) ([Z i ] W ) (5.25) = 1 n n X i=1 ⊕ (2) (S) (5.26) =⊕ (2) (S)∈{0,1}. (5.27) Furthermore, R(W)=E P Z [ℓ(W,Z)]= 1 N/n X A∈W ⊕ (2) (A). (5.28) Given Eq. (5.27) and Eq. (5.28), due to symmetryE P W,S [r S (W)]=1/2 andE P W,S [R(W)]=1/2. Hence, the expected generalization gap is zero:E P W,S [R(W)− r S (W)]=0. (d). Consider a training setS that has no duplicates and letW = S,A 1 ,...,A N/n− 1 ∼ Q W|S . The population risk can be written as follows: R(W)= n N ⊕ (2) (S) | {z } ≤ n/N + n N N/n− 1 X i=1 ⊕ (2) (A i ) | {z } ≜Y i . (5.29) Consider the setZ\S. LetN 0 be the number of examples in this set with parity 0 andN 1 be the number of examples with parity 1. We use Lemma 5.2.1 to show thatY i are almost pairwise independent. Formally, for anyδ ′ >0 there existsN ′ such that whenmin{N 0 ,N 1 }≥ N ′ , we have thatCov[Y i ,Y j ]≤ δ ′ E[Y 1 ] 2 , for alli,j∈[N/n− 1],i̸=j. Therefore, Var n N N/n− 1 X i=1 Y i = n 2 N 2 N n − 1 Var[Y 1 ] (5.30) + n 2 N 2 N n − 1 N n − 2 Cov[Y 1 ,Y 2 ] (5.31) ≤ n 4N +δ ′ . (5.32) By Chebyshev’s inequality P n N N/n− 1 X i=1 Y i − n N N n − 1 E[Y 1 ] ≥ t ≤ n 4N +δ ′ t 2 . (5.33) 58 Furthermore,E[Y 1 ]→ 1/2 asd→∞. Therefore, we can pick a large enoughd, appropriateδ ′ andt to ensure that with at least 0.5 probability overP W,S ,R(W)∈[1/4,3/4] andr S (W)∈{0,1}. 5.4 Implicationsforotherbounds Information-theoretic generalization bounds have been improved or generalized in many ways. A few works have proposed to use other types of information measures and distances between distributions, instead of Shannon mutual information and Kullback-Leibler divergence respectively [Esposito et al., 2021, Aminian et al., 2021b, Rodríguez Gálvez et al., 2021]. 
In particular, Rodríguez Gálvez et al. [2021] derived expected generalization gap bounds that depend on the Wasserstein distance between P W|Z i and P W . Aminian et al. [2021b] derived similar bounds but that depend on sample-wise Jensen-Shannon information I JS (W;Z i ) ≜ JS(P W,Z i ||P W ⊗ P Z i ) or on lautum information L(W;Z i ) ≜ KL(P W ⊗ P Z i ∥P W,Z i ). Esposito et al. [2021] derive bounds on probability of an event in a joint distribution P X,Y in terms of the probability of the same event in the product of marginals distributionP X ⊗ P Y and an information measure betweenX andY (Sibson’sα -mutual information, maximal leakage,f-mutual information) or a divergence betweenP X,Y andP X ⊗ P Y (Rényiα -divergences,f-divergences, Hellinger divergences). They note that one can derive in-expectation generalization bounds from these results. These results will be sample-wise if one starts withX =W andY =Z i and then takes an average overi∈[n]. The property (b) of the counterexample implies that there are no PAC-Bayes, single-draw, or expected squared generalization bounds for the aforementioned information measures or divergences, as all of them will be zero when ∀z∈Z,P W =P W|Z i =z . PAC-Bayes bounds have been improved by comparing population and empirical risks differently (instead of just subtracting them) [Langford and Seeger, 2001, Germain et al., 2009, Rivasplata et al., 2020]. The property (d) of the counterexamples implies that these improvements will not make sample-wise PAC-Bayes bounds possible, as the changed distance function will be at least constant whenr S (W)∈{0,1} while R(W)∈[1/4,3/4]. Another way of improving information-theoretic bounds is to use the random subsampling setting introduced by Steinke and Zakynthinou [2020]. In this setting one considers2n i.i.d. samples fromP Z grouped inton pairs: ˜ Z∈Z n× 2 . A random variableJ ∼ Uniform({0,1} n ), independent of ˜ Z, specifies which example to select from each pair to form the training setS =( ˜ Z i,J i ) n i=1 . Steinke and Zakynthinou [2020] proved that ifℓ(w,z)∈[0,1],∀w∈W,z∈Z, then the expected generalization gap can be bounded as follows: |E S,W [R(W)− r S (W)]|≤ r 2 n I(W;J | ˜ Z). (5.34) This result was improved in many works, leading to the following sample-wise bounds [Haghifam et al., 2020, Harutyunyan et al., 2021b, Rodríguez-Gálvez et al., 2021, Zhou et al., 2021]: |E S,W [R(W)− r S (W)]|≤ 1 n n X i=1 E ˜ Z i q 2I ˜ Z i (W;J i ) , (5.35) |E S,W [R(W)− r S (W)]|≤ 1 n n X i=1 E ˜ Z q 2I ˜ Z (W;J i ) , (5.36) where ˜ Z i =( ˜ Z i,1 , ˜ Z i,2 ) is thei-th row of ˜ Z. Given a partitionW ∼ Q W|S and two examples ˜ Z i , one cannot tell which of the examples was in the training set because of the symmetry. Hence, the counterexample implies that expected squared, PAC-Bayes, and single draw generalization bounds that depend on quantities likeI ˜ Z i (W;J i ) cannot exist. However, if we consider the weaker sample-wise bounds of Eq. (5.36), then 59 the knowledge of ˜ Z helps to reveal the entireJ at once with high probability. This can be done by going over all possible choices ofJ and checking whether ˜ Z J =( ˜ Z i,J i ) belongs to partitionW . This will be true for the true value ofJ, but will be increasingly unlikely for all other values ofJ asn andd are increased. In fact, we derive the following expected squared generalization gap that depends on termsI ˜ Z (W;J i ). Theorem5.4.1. In the random subsampling setting, letW ∼ Q W|S . 
Ifℓ(w,z)∈[0,1], then E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n n X i=1 E ˜ Z q 2I ˜ Z (W;J i ). (5.37) Therefore, the counterexample is a discrete case where the bound of Eq. (5.35) is much better than the weaker bound of Eq. (5.36). It is also a case whereI(W;Z i ) ≪ I(W;J i | ˜ Z) (i.e., CMI bounds are not always better). Finally, another way of improving information-theoretic bounds is to use evaluated conditional mutual information (e-CMI) [Steinke and Zakynthinou, 2020] or functional conditional mutual information (f-CMI, see Chapter 4). Similarly to the functional CMI bounds, one can derive the following expected generalization gap bound. Theorem5.4.2. In the random subsampling setting, letW ∼ Q W|S . Ifℓ(w,z)∈[0,1], then |E[R(W)− r S (W)]|≤ 1 n n X i=1 q 2I(ℓ(W, ˜ Z i );J i ), (5.38) |E[R(W)− r S (W)]|≤ 1 n n X i=1 E ˜ Z i q 2I ˜ Z i (ℓ(W, ˜ Z i );J i ) , (5.39) |E[R(W)− r S (W)]|≤ 1 n n X i=1 E ˜ Z q 2I ˜ Z (ℓ(W, ˜ Z i );J i ) , (5.40) whereℓ(W, ˜ Z i )∈{0,1} 2 is the pair of losses on the two examples of ˜ Z i . In the case of the counterexample bounds of Eq. (5.38) and Eq. (5.39) will be zero as one cannot guessJ i knowing lossesℓ(W, ˜ Z i ) and possibly also ˜ Z i . This rules out the possibility of such sample-wise expected squared, PAC-Bayes and single-draw generalization bounds. Unlike the case of the weaker CMI bound of Eq. (5.36), the weaker e-CMI bound of Eq. (5.40) convergences to zero in the case of the counterexample as d→∞. Therefore, the counterexample is a discrete case where the sample-wise e-CMI bound of Eq. (5.40) can be much stronger than the sample-wise CMI bound of Eq. (5.36). 5.5 Thecaseofm = 2 In Section 5.1 we mentioned that there are expected generalization bounds that are based on information contained in sizem subsets of examples. In Section 5.3 we showed that there can be no expected squared generalization bounds withm=1. In this section we show that expected squared generalization bounds are possible for anym≥ 2. For brevity, letG (2) ≜E P W,S h (R(W)− r S (W)) 2 i denote expected squared generalization gap. 60 Theorem5.5.1. Assumeℓ(w,z)∈[0,1]. LetW ∼ Q W|S . Then G (2) ≤ 1 n + 1 n 2 X i̸=k p 2I(W;Z i ,Z k ). (5.41) Proof. We have that G (2) =E P W,S 1 n n X i=1 (ℓ(W,Z i )− R(W)) ! 2 (5.42) = 1 n 2 n X i=1 E P W,Z i h (ℓ(W,Z i )− R(W)) 2 i | {z } ≤ 1/n (5.43) + 1 n 2 X i̸=k E[(ℓ(W,Z i )− R(W))(ℓ(W,Z k )− R(W))] | {z } C i,k . (5.44) For bounding the C i,k terms we use Lemma 4.2.1 with Φ = W , Ψ = ( Z i ,Z k ), and f(Φ ,(Ψ 1 ,Ψ 2 )) = (ℓ(Φ ,Ψ 1 )− R(Φ))( ℓ(Φ ,Ψ 2 )− R(Φ)) . Asf(Φ ,Ψ) is 1-subgaussian underP Φ ⊗ P Ψ , by the lemma E P Φ ,Ψ [f(Φ ,Ψ)] − E P Φ ⊗ P Ψ [f(Φ ,Ψ)] ≤ p 2I(Φ;Ψ) , (5.45) which translates to C i,k − E P W P Z i ,Z k [(ℓ(W,Z i )− R(W))(ℓ(W,Z k )− R(W))] | {z } ¯ C i,k ≤ p 2I(W;Z i ,Z k ). (5.46) It is left to notice that ¯ C i,k = 0, as for anyw the factors(ℓ(w,Z i )− R(w)) and(ℓ(w,Z k )− R(w)) are independent and have zero mean. Corollary5.5.2. LetU m be a uniformly random subset of[n] of sizem, independent fromW andS. Then G (2) ≤ 1 n +2E Um "r I Um (W;S Um ) m # . (5.47) Proof. By Proposition 4.2.9 we have that for anym∈[n− 1], E Um " r 1 m I Um (W;S Um ) # ≤ E U m+1 " r 1 m+1 I U m+1 (W;S U m+1 ) # . (5.48) 61 Therefore, for anym=2,...,n, starting with Theorem 5.5.1, G (2) ≤ 1 n + 1 n 2 X i̸=k p 2I(W;Z i ,Z k ) (5.49) ≤ 1 n +2E U 2 "r I U 2 (W;S U 2 ) 2 # (5.50) ≤ 1 n +2E Um "r I Um (W;S Um ) m # . (5.51) At m = n this bound is weaker that the bound of Theorem 4.2.3, which depends on I(W;S)/n rather than p I(W;S)/n. 
We leave improving the bound of Theorem 5.5.1 for future work. Nevertheless, Corollary 5.5.2 shows that it is possible to bound the expected squared generalization gap with quantities that involve mutual information terms betweenW and subsets of examples of sizem, wherem≥ 2 (unlike the case ofm=1). Possibility of bounding expected squared generalization gap withm≥ 2 information terms makes it possible for single-draw and PAC-Bayes bounds as well. The simplest way is to use Markov’s inequality, even though it will not give high probability bounds. Bounds similar to Theorem 5.4.1 but with CMI and e-CMI bounds are also possible, as shown by the following result. Theorem5.5.3. In the random subsampling setting, letW ∼ Q W|S . Ifℓ(w,z)∈[0,1], then E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n 2 X i̸=k q 2I(ℓ(W, ˜ Z i ),ℓ(W, ˜ Z k );J i ,J k ), (5.52) E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n 2 X i̸=k E q 2I ˜ Z i , ˜ Z k (ℓ(W, ˜ Z i ),ℓ(W, ˜ Z k );J i ,J k ), (5.53) E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n 2 X i̸=k E q 2I ˜ Z (ℓ(W, ˜ Z i ),ℓ(W, ˜ Z k );J i ,J k ), (5.54) and E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n 2 X i̸=k E q 2I ˜ Z i , ˜ Z k (W;J i ,J k ), (5.55) E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n 2 X i̸=k E q 2I ˜ Z (W;J i ,J k ). (5.56) Finally, it is worth to mention that the bound of Theorem 5.5.1 holds for higher order moments of generalization gap too, as for[0,1]-bounded loss functions E P W,S h (R(W)− r S (W)) k i ≤ E P W,S h (R(W)− r S (W)) 2 i , (5.57) for anyk≥ 2,k∈N. 62 5.6 Conclusion In the counterexample presented in Section 5.3 the empirical risk is sometimes larger than the popula- tion risk, which is rare in practice. In fact, if empirical risk is never larger than population risk, then E[|R(W)− r S (W)|] reduces toE[R(W)− r S (W)], implying existence of sample-wise bounds. Further- more, the constructed learning algorithm intentionally captures only high-order information about samples. This suggests, that sample-wise generalization bounds might be possible if we consider specific learning algorithms. 63 Chapter6 SupervisionComplexityanditsRoleinKnowledgeDistillation 6.1 Introduction Knowledge distillation (KD) [Buciluˇ a et al., 2006, Hinton et al., 2015] is a popular method of compressing a large “teacher” model into a more compact “student” model. In its most basic form, this involves training the student to fit the teacher’s predicted label distribution or soft labels for each sample. There is strong empirical evidence that distilled students usually perform better than students trained on raw dataset labels [Hinton et al., 2015, Furlanello et al., 2018, Stanton et al., 2021, Gou et al., 2021]. Multiple works have devised novel knowledge distillation procedures that further improve the student model performance (see Gou et al. [2021] and references therein). The fact that knowledge distillation outperforms training on the raw dataset labels is surprising from a purely information-theoretic perspective. This is because the teacher itself is usually trained on the same dataset. By data-processing inequality, the teacher’s predicted soft labels add no new information beyond what is present in the original dataset. Clearly, the distillation dataset has some additional properties that enable better learning for the student. Several works have aimed to rigorously formalize why knowledge distillation can improve the student model performance. 
Some prominent observations from this line of work are that (self-)distillation induces certain favorable optimization biases in the training objective [Phuong and Lampert, 2019, Ji and Zhu, 2020], lowers variance of the objective [Menon et al., 2021, Dao et al., 2021, Ren et al., 2022], increases regularization towards learning “simpler” functions [Mobahi et al., 2020], transfers information from different data views [Allen-Zhu and Li, 2023], and scales per-example gradients based on the teacher’s confidence [Furlanello et al., 2018, Tang et al., 2020]. Despite this remarkable progress, there are still many open problems and unexplained phenomena around knowledge distillation; to name a few: — Why do soft labels (sometimes) help? It is agreed that teacher’s soft predictions carry information about class similarities [Hinton et al., 2015, Furlanello et al., 2018], and that this softness of predictions has a regularization effect similar to label smoothing [Yuan et al., 2020]. Nevertheless, knowledge distillation also works in binary classification settings with limited class similarity information [Müller et al., 2020]. How exactly the softness of teacher predictions (controlled by a temperature parameter) affects the student learning remains far from well understood. — The role of capacity gap. There is evidence that when there is a significant capacity gap between the teacher and the student, the distilled model usually falls behind its teacher [Mirzadeh et al., 2020, Cho and Hariharan, 2019, Stanton et al., 2021]. It is unclear whether this is due to difficulties in optimization, or due to insufficient student capacity. 64 (a) Offline distillation (b) Online distillation Figure 6.1: Online vs. online distillation. Figures (a) and (b) illustrate possible teacher and student function trajectories in offline and offline knowledge distillations respectively. The yellow dotted lines indicate knowledge distillation. — What makes a good teacher? Sometimes less accurate models are better teachers [Cho and Hariharan, 2019, Mirzadeh et al., 2020]. Moreover, early stopped or exponentially averaged models are often better teachers [Ren et al., 2022]. A comprehensive explanation of this remains elusive. The aforementioned wide range of phenomena suggest that there is a complex interplay between teacher accuracy, softness of teacher-provided targets, and complexity of the distillation objective. This work provides a new theoretically grounded perspective on knowledge distillation through the lens of supervision complexity. In a nutshell, this quantifies why certain targets (e.g., temperature-scaled teacher probabilities) may be “easier” for a student model to learn compared to others (e.g., raw one-hot labels), owing to better alignment with the student’sneuraltangentkernel (NTK) [Jacot et al., 2018, Lee et al., 2019]. In particular, we provide a novel theoretical analysis (Section 6.2, Theorems 6.2.2 and 6.2.3) of the role of supervision complexity on kernel classifier generalization, and use this to derive a new generalization bound for distillation (Proposition 6.3.1). The latter highlights how student generalization is controlled by a balance of the teacher generalization, the student’s margin with respect to the teacher predictions, and the complexity of the teacher’s predictions. 
Based on the preceding analysis, we establish the conceptual and practical efficacy of a simple online distillation approach (Section 6.4), wherein the student is fit to progressively more complex targets, in the form of teacher predictions at various checkpoints during its training. This method can be seen as guiding the student in the function space (see Figure 6.1), and leads to better generalization compared to offline distillation. We provide empirical results on a range of image classification benchmarks confirming the value of online distillation, particularly for students with weak inductive biases. Beyond practical benefits, the supervision complexity view yields new insights into distillation: — Theroleoftemperaturescalingandearly-stopping. Temperature scaling and early-stopping of the teacher have proven effective for knowledge distillation. We show that both of these techniques reduce the supervision complexity, at the expense of also lowering the classification margin. Online distillation manages to smoothly increase teacher complexity, without degrading the margin. — Teaching a weak student. We show that for students with weak inductive biases, and/or with much less capacity than the teacher, the final teacher predictions are often as complex as dataset labels, particularly during the early stages of training. In contrast, online distillation allows the supervision complexity to progressively increase, thus allowing even a weak student to learn. — NTK and relational transfer. We show that online distillation is highly effective at matching the teacher and student NTK matrices. This transfers relational knowledge in the form of example-pair similarity, as opposed to standard distillation which only transfers per-example knowledge. 65 Problemsetting. We focus on classification problems from input domain X tod classes. We are given a training set of n labeled examples{(x 1 ,y 1 ),...,(x n ,y n )}, with one-hot encoded labels y i ∈ {0,1} d . Typically, a modelf w :X →R d is trained with the softmax cross-entropy loss: L ce (w)=− 1 n X n i=1 y ⊤ i log(softmax(f w (x i ))). (6.1) In standard knowledge distillation, given a trainedteacher modelg :X →R d that outputs logits, one trains a student modelf w :X →R d to fit the teacher predictions. Hinton et al. [2015] propose the following knowledge distillation loss: L kd− ce (w;g,τ )=− τ 2 n X n i=1 softmax(g(x i )/τ ) ⊤ log(softmax(f w (x i )/τ )), (6.2) where temperatureτ > 0 controls the softness of teacher predictions. To highlight the effect of knowledge distillation and simplify exposition, we assume that the student is not trained with the dataset labels. 6.2 Supervisioncomplexityandgeneralization One apparent difference between standard training and knowledge distillation ((6.1) and (6.2)) is that the latter modifies the targets that the student attempts to fit. The targets used during distillation ensure a better generalization for the student; what is the reason for this? Towards answering this question, we present a new perspective on knowledge distillation in terms of supervision complexity. To begin, we show how the generalization of a kernel-based classifier is controlled by a measure of alignment between the target labels and the kernel matrix. We first treat binary kernel-based classifiers (Theorem 6.2.2), and later extend our analysis to multiclass kernel-based classifiers (Theorem 6.2.3). 
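For concreteness, the two objectives of Eqs. (6.1) and (6.2) can be written in a few lines of PyTorch. This is a minimal sketch with our own function names, not the training code used in Section 6.4.

import torch.nn.functional as F

def ce_loss(student_logits, onehot_labels):
    # Eq. (6.1): softmax cross-entropy against one-hot dataset labels.
    return -(onehot_labels * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

def kd_ce_loss(student_logits, teacher_logits, temperature):
    # Eq. (6.2): cross-entropy between temperature-scaled teacher and student
    # softmax distributions, multiplied by temperature**2.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(temperature ** 2) * (teacher_probs * student_log_probs).sum(dim=-1).mean()

The temperature-squared factor follows Hinton et al. [2015]: it compensates for the 1/τ² scaling of the soft-target gradients, keeping the distillation gradients comparable in magnitude to those of the unscaled cross-entropy.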
Finally, by leveraging the neural tangent kernel machinery, we discuss the implications of our analysis for neural classifiers in Section 6.2.2. 6.2.1 Supervisioncomplexitycontrolskernelmachinegeneralization The notion of supervision complexity is easiest to introduce and study for kernel-based classifiers. We briefly review some necessary background [Scholkopf and Smola, 2001]. Let k :X ×X → R be a positive semidefinite kernel defined over an input space X . Any such kernel uniquely determines a reproducing kernel Hilbert space (RKHS)H of functions fromX toR. This RKHS is the completion of the set of functions of formf(x) = P m i=1 α i k(x i ,x), withx i ∈X,α i ∈ R. Anyf(x) = P m i=1 α i k(x i ,x)∈H has (RKHS) norm ∥f∥ 2 H = m X i=1 m X j=1 α i α j k(x i ,x j )=α ⊤ Kα , (6.3) whereα =(α 1 ,...,α n ) ⊤ andK i,j =k(x i ,x j ). Intuitively,∥f∥ 2 H measures the smoothness off, e.g., for a Gaussian kernel it measures the Fourier spectrum decay off [Scholkopf and Smola, 2001]. For simplicity, we start with the case of binary classification. Suppose {(X i ,Y i )} i∈[n] aren i.i.d. examples sampled from some probability distribution onX ×Y , withY ⊂ R, where positive and negative labels correspond to distinct classes. LetK i,j =k(X i ,X j ) denote the kernel matrix, andY =(Y 1 ,...,Y n ) ⊤ be the concatenation of all training labels. Definition6.2.1 (Supervision complexity). The supervision complexity of targetsY 1 ,...,Y n with respect to a kernelk is defined to be Y ⊤ K − 1 Y in cases whenK is invertible, and+∞ otherwise. 66 We now establish how supervision complexity controls the smoothness of the optimal kernel classifier. Consider a classifier obtained by solving a regularized kernel classification problem: f ∗ ∈argmin f∈H 1 n X n i=1 ℓ(f(X i ),Y i )+ λ 2 ∥f∥ 2 H , (6.4) whereℓ is a loss function andλ> 0. The following proposition shows whenever the supervision complexity is small, the RKHS norm of any optimal solutionf ∗ will also be small. This is an important learning bias that shall help us explain certain aspects of knowledge distillation. Proposition 6.2.1. Assume that K is full rank almost surely; ℓ(y,y ′ ) ≥ 0,∀y,y ′ ∈ Y; and ℓ(y,y) = 0,∀y∈Y. Then, with probability 1, for any solutionf ∗ of Eq. (6.4), we have∥f ∗ ∥ 2 H ≤ Y ⊤ K − 1 Y. Equipped with the above result, we now show how supervision complexity controls generalization. In the following, letϕ γ :R→[0,1] be the margin loss [Mohri et al., 2018] with scaleγ > 0: ϕ γ (α )= 1 ifα ≤ 0 1− α/γ if0<α ≤ γ 0 ifα>γ. (6.5) Theorem6.2.2. Assume thatκ =sup x∈X k(x,x)<∞ andK is full rank almost surely. Further, assume thatℓ(y,y ′ )≥ 0,∀y,y ′ ∈Y andℓ(y,y)=0,∀y∈Y. LetM 0 =⌈γ √ n/(2 √ κ )⌉. Then, with probability at least1− δ , for any solutionf ∗ of problem in Eq. (6.4), we have P X,Y (Yf ∗ (X)≤ 0)≤ 1 n n X i=1 ϕ γ (sign(Y i )f ∗ (X i ))+ 2 √ Y ⊤ K − 1 Y +2 γn p Tr(K) +3 r ln(2M 0 /δ ) 2n . (6.6) One can compare Theorem 6.2.2 with the standard Rademacher bound for kernel classifiers [Bartlett and Mendelson, 2002]. The latter typically consider learning over functions with RKHS norm bounded by a constant M >0. The corresponding complexity term then decays asO( p M· Tr(K)/n), which is data-independent. Consequently, such a bound cannot adapt to the intrinsic “difficulty” of the targets Y . In contrast, Theorem 6.2.2 considers functions with RKHS norm bounded by the data-dependent supervision complexity term. This results in a more informative generalization bound, which captures the “difficulty” of the targets. Here, we note that Arora et al. 
[2019] characterized the generalization of an overparameterized two-layer neural network via a term closely related to the supervision complexity (see Section 6.5 for additional discussion). The supervision complexityY ⊤ K − 1 Y is small wheneverY is aligned with top eigenvectors ofK and/orY has small scale. Furthermore, one cannot make the bound close to zero by just reducing the scale of targets, as one would need a smallγ to control the margin loss that would otherwise increase due to student predictions getting closer to zero (as the student aims to matchY i ). To better understand the role of supervision complexity, it is instructive to consider two special cases that lead to a poor generalization bound: (1) uninformative features, and (2) uninformative labels. 67 Complexity under uninformative features. Suppose the kernel matrixK is diagonal, so that the kernel providesno information on example-pair similarity; i.e., the kernel is “uninformative”. An application of Cauchy-Schwarz reveals the key expression in the second term in Eq. (6.6) satisfies: 1 n q Y ⊤ K − 1 Y Tr(K)= 1 n r X n i=1 Y 2 i · k(X i ,X i ) − 1 X n i=1 k(X i ,X i ) ≥ 1 n n X i=1 |Y i |. (6.7) Consequently, this term is least constant in order, and does not vanish asn→∞. Complexityunderuninformativelabels. Suppose the labelsY i are purely random, and independent from inputsX i . Conditioned on{X i },Y ⊤ K − 1 Y concentrates around its mean by the Hanson-Wright inequality [Vershynin, 2018]. Hence, ∃ϵ (K,δ,n ) such that with probability ≥ 1− δ , Y ⊤ K − 1 Y ≥ E {Y i } Y ⊤ K − 1 Y − ϵ =E Y 2 1 Tr(K − 1 )− ϵ . Thus, with the same probability, 1 n q Y ⊤ K − 1 Y Tr(K)≥ 1 n q E Y 2 1 Tr(K − 1 )− ϵ Tr(K)≥ 1 n q E Y 2 1 n 2 − ϵ Tr(K), (6.8) where the last inequality is by Cauchy-Schwarz. For sufficiently large n, the quantityE Y 2 1 n 2 dominates ϵ Tr(K), rendering the bound of Theorem 6.2.2 close to a constant. 6.2.2 Extensions: multiclassclassificationandneuralnetworks We now show that a result similar to Theorem 6.2.2 holds for multiclass classification as well. In addition, we also discuss how our results are instructive about the behavior of neural networks. Extension to multiclass classification. Let{(X i ,Y i )} i∈[n] be drawn i.i.d. from a distribution over X ×Y , whereY ⊂ R d . Letk :X ×X → R d× d be a matrix-valued positive definite kernel and H be the corresponding vector-valued RKHS. As in the binary classification case, we consider a kernel problem in Eq. (6.4). LetY ⊤ =(Y ⊤ 1 ,...,Y ⊤ n ) andK be the kernel matrix of training examples: K = k(X 1 ,X 1 ) ··· k(X 1 ,X n ) ··· ··· ··· k(X n ,X 1 ) ··· k(X n ,X n ) ∈R nd× nd . (6.9) Forf :X →R d and a labeled example(x,y), letρ f (x,y) = f(x) y − max y ′ ̸=y f(x) y ′ be the prediction margin. Then, the following analogue of Theorem 6.2.2 holds. Theorem6.2.3. Assume thatκ =sup x∈X,y∈[d] k(x,x) y,y <∞, andK is full rank almost surely. Further, assume that ℓ(y,y ′ ) ≥ 0,∀y,y ′ ∈ Y and ℓ(y,y) = 0,∀y ∈ Y. Let M 0 = ⌈γ √ n/(4d √ κ )⌉. Then, with probability at least1− δ , for any solutionf ∗ of problem in Eq. (6.4), P X,Y (ρ f ∗ (X,Y)≤ 0)≤ 1 n n X i=1 1{ρ f ∗ (X i ,Y i )≤ γ }+ 4d(Y ⊤ K − 1 Y +1) γn p Tr(K) +3 r log(2M 0 /δ ) 2n . (6.10) 68 Implicationsforneuralclassifiers. Our analysis has so far focused on kernel-based classifiers. While neural networks are not exactly kernel methods, many aspects of their performance can be understood via a correspondinglinearizedneuralnetwork (see Ortiz-Jiménez et al. [2021] and references therein). 
We follow this approach, and given a neural networkf w with current weightsw 0 , we consider the corresponding linearized neural network comprising the linear terms of the Taylor expansion off w (x) aroundw 0 [Jacot et al., 2018, Lee et al., 2019]: f lin w (x)≜f w 0 (x)+∇ w f w 0 (x) ⊤ (w− w 0 ). (6.11) Letω≜w− w 0 . This networkf lin ω (x) is a linear function with respect to the parametersω, but is generally non-linear with respect to the inputx. Note that∇ w f w 0 (x) acts as a feature representation, and induces the neural tangent kernel (NTK)k 0 (x,x ′ )=∇ w f w 0 (x) ⊤ ∇ w f w 0 (x ′ )∈R d× d . Given a labeled datasetS ={(x i ,y i )} i∈[n] and a loss functionL(f;S), the dynamics of gradient flow with learning rateη > 0 forf lin ω can be fully characterized in the function space, and depends only on the predictions atw 0 and the NTKk 0 : ˙ f lin t (x ′ )=− η · K 0 (x ′ ,x) ∇ f L(f lin t (x);S), (6.12) where f(x) ∈ R nd denotes the concatenation of predictions on training examples and K 0 (x ′ ,x) = ∇ w f w 0 (x ′ ) ⊤ ∇ w f w 0 (x). Lee et al. [2019] show that as one increases the width of the network or when (w− w 0 ) does not change much during training, the dynamics of the linearized and original neural network become close. When f w is sufficiently overparameterized and L is convex with respect to ω, then ω t converges to an interpolating solution. Furthermore, for the mean squared error objective, the solution has the minimum Euclidean norm [Gunasekar et al., 2017]. As the Euclidean norm of ω corresponds to norm off lin ω (x)− f w 0 (x) in the vector-valued RKHSH corresponding tok 0 , training a linearized network to interpolation is equivalent to solving the following with a smallλ> 0: h ∗ =argmin h∈H 1 n X n i=1 (f w 0 (x i )+h(x i )− y i ) 2 + λ 2 ∥h∥ 2 H . (6.13) Therefore, the generalization bounds of Theorems 6.2.2 and 6.2.3 apply toh ∗ with supervision complexity of residual targetsy i − f w 0 (x i ). However, we are interested in the performance off w 0 +h ∗ . As the proofs of these results rely on bounding the Rademacher complexity of hypothesis sets of form{h∈H: ∥h∥≤ M}, and shifting a hypothesis set by a constant function does not change the Rademacher complexity (see Remark G.6 of Appendix G), these proofs can be easily modified to handle hypotheses shifted by the constant functionf w 0 . 6.3 Knowledgedistillation: asupervisioncomplexitylens We now turn to knowledge distillation, and explore how supervision complexity affects student’s gener- alization. We show that student’s generalization depends on three terms: the teacher generalization, the student’s margin with respect to the teacher predictions, and the complexity of the teacher’s predictions. 6.3.1 Trade-offbetweenteacheraccuracy,margin,andcomplexity Consider the binary classification setting of Section 6.2, and a fixed teacher g :X →R that outputs a logit. Let{(X i ,Y ∗ i )} i∈[n] ben i.i.d. labeled examples, whereY ∗ i ∈{− 1,1} denotes the ground truth labels. For temperatureτ > 0, letY i ≜2sigmoid(g(X i )/τ )− 1∈[− 1,+1] denote the teacher’s soft predictions, for 69 sigmoid functionz7→(1+exp(− z)) − 1 . Our key observation is: if the teacher predictionsY i are accurate enough and have significantly lower complexity compared to ground truth labels Y ∗ i , then a student kernel method (cf. Eq. (6.4)) trained withY i can generalize better than the one trained withY ∗ i . The following result quantifies the trade-off between teacher accuracy, student prediction margin, and teacher prediction complexity. 
Proposition6.3.1. Assume thatκ = sup x∈X k(x,x) <∞ andK is full rank almost surely,ℓ(y,y ′ )≥ 0,∀y,y ′ ∈Y, andℓ(y,y)=0,∀y∈Y. LetY i andY ∗ i be defined as above. Let M 0 =⌈γ √ n/(2 √ κ )⌉. Then, with probability at least1− δ , any solutionf ∗ of problem Eq. (6.4) satisfies P X,Y ∗ (Y ∗ f ∗ (X)≤ 0) | {z } student risk ≤ P X,Y ∗ (Y ∗ g(X)≤ 0) | {z } teacher risk + 1 n X n i=1 ϕ γ (sign(Y i )f ∗ (X i )) | {z } student’s empirical margin loss w.r.t. teacher predictions + 2 √ Y ⊤ K − 1 Y +2 p Tr(K)/(γn ) | {z } complexity of teacher’s predictions +3 p ln(2M 0 /δ )/(2n). (6.14) Note that a similar result is easy to establish for multiclass classification using Theorem 6.2.3. The first term in the above accounts for the misclassification rate of the teacher. While this term is not irreducible (it is possible for a student to perform better than its teacher), generally a student performs worse that its teacher, especially when there is a significant teacher-student capacity gap. The second term is student’s empirical margin loss w.r.t. teacher predictions. This captures the price of making teacher predictions too soft. Intuitively, the softer (i.e., closer to zero) teacher predictions are, the harder it is for the student to learn the classification rule. The third term accounts for the supervision complexity and the margin parameterγ . Thus, one has to chooseγ carefully to achieve a good balance between empirical margin loss and margin-normalized supervision complexity. Theeffectoftemperature. For a fixed margin parameter γ > 0, increasing the temperatureτ makes teacher’s predictionsY i softer. On the one hand, the reduced scale decreases the supervision complexity Y ⊤ K − 1 Y . Moreover, we shall see that in the case of neural networks the complexity decreases even further due toY becoming more aligned with top eigenvectors of K. On the other hand, the scale of predictions of the (possibly interpolating) studentf ∗ will decrease too, increasing the empirical margin loss. This suggests that setting the value ofτ is not trivial: the optimal value can be different based on the kernelk and teacher logitsg(X i ). 6.3.2 Fromofflinetoonlineknowledgedistillation We identified that supervision complexity plays a key role in determining the efficacy of a distillation procedure. The supervision from a fully trained teacher model can prove to be very complex for a student model in an early stage of its training (Figure 6.2). This raises the question: is there value in providing progressively difficult supervision to the student ? In this section, we describe a simple online distillation method, where the the teacher is updated during the student training. Over the course of their training, neural models learn functions of increasing complexity [Kalimeris et al., 2019]. This provides a natural way to construct a set of teachers with varying prediction complexities. Similar to Jin et al. [2019], for practical considerations of not training the teacher and the student simultaneously, we assume the availability of teacher checkpoints over the course of its training. Givenm teacher checkpoints at timesT ={t i } i∈[m] , during thet-th step of distillation, the student receives supervision from the teacher 70 Algorithm2 Online knowledge distillation. 
1: Require: Training sample S; teacher checkpoints{g (t1) ,...,g (tm) }; temperature τ > 0; training steps T ; minibatch sizeb 2: fort=1,...,T do 3: Draw randomb-sized minibatchS ′ fromS 4: Compute nearest teacher checkpointt ∗ =min{i∈[m]: t i >t} 5: Update studentw← w− η t ·∇ w L kd− ce (w;g (t ∗ ) ,τ ) overS ′ 6: endfor 7: returnf w Table 6.1: Knowledge distillation results on CIFAR-10. Every second line is an MSE student. Setting NoKD OfflineKD OnlineKD Teacher τ =1 τ =4 τ =1 τ =4 ResNet-56→ LeNet-5x8 81.8± 0.5 82.4± 0.5 86.0± 0.2 86.8± 0.2 88.6± 0.1 93.2 ResNet-56→ LeNet-5x8 83.4± 0.3 83.1± 0.2 84.9± 0.1 85.6± 0.1 87.1± 0.1 93.2 ResNet-110→ LeNet-5x8 81.7± 0.3 81.9± 0.5 85.8± 0.1 86.5± 0.1 88.8± 0.1 93.9 ResNet-110→ LeNet-5x8 83.2± 0.4 83.2± 0.1 85.0± 0.3 85.6± 0.1 87.1± 0.2 93.9 ResNet-110→ ResNet-20 91.4± 0.2 91.4± 0.1 92.8± 0.0 92.2± 0.3 93.1± 0.1 93.9 ResNet-110→ ResNet-20 90.9± 0.1 90.9± 0.2 91.6± 0.2 91.2± 0.1 92.1± 0.2 93.9 checkpoint at timemin{t ′ ∈T :t ′ >t}. Note that the student is trained for the same number of epochs in total as in offline distillation. We use the term “online distillation” for this approach (cf. Algorithm 2). Online distillation can be seen as guiding the student network to follow the teacher’s trajectory in functionspace (see Figure 6.1). Given that NTK can be interpreted as a principled notion of example similarity and controls which examples affect each other during training [Charpiat et al., 2019], it is desirable for the student to have an NTK similar to that of its teacher at each time step. To test whether online distillation also transfers NTK, we propose to measure similarity between the final student and final teacher NTKs. For computational efficiency we work with NTK matrices corresponding to a batch of b examples (bd× bd matrices). Explicit computation of even batch NTK matrices can be costly, especially when the number of classesd is large. We propose to view student and teacher batch NTK matrices (denoted byK f andK g respectively) as operators and measure their similarity by comparing their behavior on random vectors: sim(K f ,K g )=E V∼N (0,I bd ) ⟨K f V,K g V⟩ ∥K f V∥∥K g V∥ . (6.15) Note that the cosine distance is used to account for scale differences of K g andK f . The kernel-vector products appearing in this similarity measure above can be computed efficiently without explicitly con- structing the kernel matrices. For example, withx ′ denoting the collection of inputs of the mini-batch, K f v = ∇ w f w (x ′ ) ⊤ (∇ w f w (x ′ )v) can be computed with one vector-Jacobian product followed by a Jacobian-vector product. The former can be computed efficiently using backpropagation, while the latter can be computed efficiently using forward-mode differentiation. 6.4 Experimentalresults We now present experimental results to showcase the importance of supervision complexity in distillation, and to establish efficacy of online distillation. 71 Table 6.2: Knowledge distillation results on CIFAR-100. Setting NoKD OfflineKD OnlineKD Teacher τ =1 τ =4 τ =1 τ =4 ResNet-56→ LeNet-5x8 47.3± 0.6 50.1± 0.4 59.9± 0.2 61.9± 0.2 66.1± 0.4 72.0 ResNet-56→ ResNet-20 67.7± 0.5 68.2± 0.3 71.6± 0.2 69.6± 0.3 71.4± 0.3 72.0 ResNet-110→ LeNet-5x8 47.2± 0.5 48.6± 0.8 59.0± 0.3 60.8± 0.2 65.8± 0.2 73.4 ResNet-110→ ResNet-20 67.8± 0.3 67.8± 0.2 71.2± 0.0 69.0± 0.3 71.4± 0.0 73.4 Table 6.3: Knowledge distillation results on Tiny ImageNet. 
6.4 Experimental results

We now present experimental results to showcase the importance of supervision complexity in distillation, and to establish the efficacy of online distillation.

Table 6.2: Knowledge distillation results on CIFAR-100.
Setting | No KD | Offline KD (τ=1) | Offline KD (τ=4) | Online KD (τ=1) | Online KD (τ=4) | Teacher
ResNet-56 → LeNet-5x8 | 47.3±0.6 | 50.1±0.4 | 59.9±0.2 | 61.9±0.2 | 66.1±0.4 | 72.0
ResNet-56 → ResNet-20 | 67.7±0.5 | 68.2±0.3 | 71.6±0.2 | 69.6±0.3 | 71.4±0.3 | 72.0
ResNet-110 → LeNet-5x8 | 47.2±0.5 | 48.6±0.8 | 59.0±0.3 | 60.8±0.2 | 65.8±0.2 | 73.4
ResNet-110 → ResNet-20 | 67.8±0.3 | 67.8±0.2 | 71.2±0.0 | 69.0±0.3 | 71.4±0.0 | 73.4

Table 6.3: Knowledge distillation results on Tiny ImageNet.
Setting | No KD | Offline KD (τ=1) | Offline KD (τ=2) | Online KD (τ=1) | Online KD (τ=2) | Teacher
MobileNet-V3-125 → MobileNet-V3-35 | 58.5±0.2 | 59.2±0.1 | 60.2±0.2 | 60.7±0.2 | 62.3±0.3 | 62.7
ResNet-101 → MobileNet-V3-35 | 58.5±0.2 | 59.4±0.5 | 61.6±0.2 | 61.1±0.3 | 62.0±0.3 | 66.0
MobileNet-V3-125 → VGG-16 | 48.9±0.3 | 54.1±0.4 | 59.4±0.4 | 58.9±0.7 | 62.3±0.3 | 62.7
ResNet-101 → VGG-16 | 48.6±0.4 | 53.1±0.4 | 60.6±0.2 | 60.4±0.2 | 64.0±0.1 | 66.0

6.4.1 Experimental setup

We consider standard image classification benchmarks: CIFAR-10, CIFAR-100, and Tiny ImageNet. Additionally, we derive a binary classification task from CIFAR-100 by grouping the first and last 50 classes into two meta-classes. We consider teacher and student architectures that are ResNets [He et al., 2016], VGGs [Simonyan and Zisserman, 2015], and MobileNets [Howard et al., 2019] of various depths. As a student architecture with relatively weaker inductive biases, we also consider the LeNet-5 [LeCun et al., 1998] with 8 times wider hidden layers. We use standard hyperparameters to train these models. In particular, in all experiments we use the stochastic gradient descent optimizer with a batch size of 128 and Nesterov momentum of 0.9. The initial learning rates are presented in Table 6.4. All models for the CIFAR datasets are trained for 256 epochs, with a learning rate schedule that divides the learning rate by 10 at epochs 96, 192, and 224. All models for Tiny ImageNet are trained for 200 epochs, with a learning rate schedule that divides the learning rate by 10 at epochs 75 and 135. The learning rate is warmed up linearly to its initial value in the first 10 and 5 epochs for CIFAR and Tiny ImageNet models, respectively. All VGG and ResNet models use 2e-4 weight decay, while MobileNet models use 1e-5 weight decay. The LeNet-5 uses ReLU activations. We use the CIFAR variants of ResNets in experiments with the CIFAR-10 or (binary) CIFAR-100 datasets. Tiny ImageNet examples are resized to 224×224 resolution to suit the original ResNet, VGG, and MobileNet architectures. In all experiments we use standard data augmentation: random cropping and random horizontal flips. In all online distillation experiments we consider one teacher checkpoint per epoch.

Table 6.4: Initial learning rates for different dataset and model pairs.
Dataset | Model | Learning rate
CIFAR-10, CIFAR-100, binary CIFAR-100 | ResNet-56 (teacher) | 0.1
CIFAR-10, CIFAR-100, binary CIFAR-100 | ResNet-110 (teacher) | 0.1
CIFAR-10, CIFAR-100, binary CIFAR-100 | ResNet-20 (CE or MSE students) | 0.1
CIFAR-10, CIFAR-100, binary CIFAR-100 | LeNet-5x8 (CE or MSE students) | 0.04
Tiny ImageNet | MobileNet-V3-125 (teacher) | 0.04
Tiny ImageNet | ResNet-101 (teacher) | 0.1
Tiny ImageNet | MobileNet-V3-35 (student) | 0.04
Tiny ImageNet | VGG-16 (student) | 0.01

We compare (1) regular one-hot training (without any distillation), (2) regular offline distillation using the temperature-scaled softmax cross-entropy, and (3) online distillation using the same loss. For CIFAR-10 and binary CIFAR-100, we also consider training with the mean-squared error (MSE) loss and its corresponding KD loss:
\[
L_{\mathrm{mse}}(w) = \frac{1}{2n}\sum_{i=1}^{n} \|y_i - f_w(x_i)\|_2^2,
\tag{6.16}
\]
\[
L_{\mathrm{kd\text{-}mse}}(w; g, \tau) = \frac{\tau}{2n}\sum_{i=1}^{n} \|\operatorname{softmax}(g(x_i)/\tau) - f_w(x_i)\|_2^2.
\tag{6.17}
\]
The MSE loss allows for interpolation in the case of one-hot labels y_i, making it amenable to the analysis in Sections 6.2 and 6.3. Moreover, Hui and Belkin [2021] show that under standard training the CE and MSE losses perform similarly; as we shall see, the same is true for distillation as well. As mentioned in Section 6.1, in all KD experiments student networks receive supervision only through a knowledge distillation loss (i.e., dataset labels are not used). This choice helps us decrease the differences between the theory and experiments. Furthermore, in our preliminary experiments we observed that this choice does not result in student performance degradation (see Table 6.6).
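For concreteness, the two squared-error objectives in Eqs. (6.16) and (6.17) translate directly into code. The sketch below is illustrative only, with student_apply (mapping parameters and a batch of inputs to the student's predictions) an assumed interface rather than the thesis's actual training code.

import jax
import jax.numpy as jnp

def mse_loss(params, student_apply, x, y_onehot):
    # Eq. (6.16): 1/(2n) * sum_i ||y_i - f_w(x_i)||^2.
    preds = student_apply(params, x)                           # shape (n, num_classes)
    return 0.5 * jnp.mean(jnp.sum((y_onehot - preds) ** 2, axis=-1))

def kd_mse_loss(params, student_apply, x, teacher_logits, tau):
    # Eq. (6.17): tau/(2n) * sum_i ||softmax(g(x_i)/tau) - f_w(x_i)||^2.
    soft_targets = jax.nn.softmax(teacher_logits / tau, axis=-1)
    preds = student_apply(params, x)
    return 0.5 * tau * jnp.mean(jnp.sum((soft_targets - preds) ** 2, axis=-1))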
6.4.2 Results and discussion

Tables 6.1 to 6.3 and 6.5 present the results (mean and standard deviation of test accuracy over 3 random trials). First, we see that online distillation with proper temperature scaling typically yields the most accurate student. The gains over regular distillation are particularly pronounced when there is a large teacher-student gap. For example, on CIFAR-100, ResNet-to-LeNet distillation with temperature scaling appears to hit a limit of ≈60% accuracy. Online distillation, however, manages to further increase accuracy by +6%, which is a ≈20% increase compared to standard training. Second, the similar results on binary CIFAR-100 show that "dark knowledge" in the form of membership information in multiple classes is not necessary for distillation to succeed.

Table 6.5: Knowledge distillation results on binary CIFAR-100. Every second line is an MSE student.
Setting | No KD | Offline KD (τ=1) | Offline KD (τ=4) | Online KD (τ=1) | Online KD (τ=4) | Teacher
ResNet-56 → LeNet-5x8 | 71.5±0.2 | 72.4±0.1 | 73.6±0.2 | 74.7±0.2 | 76.1±0.2 | 77.9
ResNet-56 → LeNet-5x8 | 71.5±0.4 | 71.9±0.3 | 73.0±0.3 | 75.1±0.3 | 75.1±0.1 | 77.9
ResNet-56 → ResNet-20 | 75.8±0.5 | 76.1±0.2 | 77.1±0.6 | 77.8±0.3 | 78.1±0.1 | 77.9
ResNet-56 → ResNet-20 | 76.1±0.5 | 76.0±0.2 | 77.4±0.3 | 78.0±0.2 | 78.4±0.3 | 77.9
ResNet-110 → LeNet-5x8 | 71.4±0.4 | 71.9±0.1 | 72.9±0.3 | 74.3±0.3 | 75.4±0.3 | 78.4
ResNet-110 → LeNet-5x8 | 71.6±0.2 | 71.5±0.4 | 72.6±0.4 | 74.8±0.4 | 74.6±0.2 | 78.4
ResNet-110 → ResNet-20 | 76.0±0.3 | 76.0±0.2 | 77.0±0.1 | 77.3±0.2 | 78.0±0.4 | 78.4
ResNet-110 → ResNet-20 | 76.1±0.2 | 76.4±0.3 | 77.6±0.3 | 77.9±0.2 | 78.1±0.1 | 78.4

The results also demonstrate that knowledge distillation with the MSE loss of Eq. (6.17) has a qualitatively similar behavior to KD with the CE objective. We use these MSE models to highlight the role of supervision complexity. As an instructive case, we consider a LeNet-5x8 network trained on binary CIFAR-100 with the standard MSE loss function. For a given checkpoint of this network and a given set of m labeled (test) examples {(X′_i, Y′_i)}_{i∈[m]}, we compute the adjusted supervision complexity defined as
\[
\frac{1}{n}\sqrt{\big(Y' - f(X')\big)^{\!\top} (K')^{-1} \big(Y' - f(X')\big) \cdot \operatorname{Tr}(K')},
\tag{6.18}
\]
where f denotes the current prediction function and K′ is the kernel matrix derived from the current NTK. Note that the subtraction of the initial predictions is the appropriate way to measure complexity given the form of the optimization problem Eq. (6.13). Nevertheless, it is meaningful to consider the following quantity as well:
\[
\frac{1}{n}\sqrt{Y'^{\top} (K')^{-1} Y' \cdot \operatorname{Tr}(K')},
\tag{6.19}
\]
in order to measure the "alignment" of the targets Y′ with the NTK k. We call this quantity adjusted supervision complexity*. As the training NTK matrix becomes aligned with dataset labels during training (see Baratin et al. [2021] and Figure 6.5), we pick {X′_i}_{i∈[m]} to be a set of 2^12 test examples.
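For the moderately sized batches used here, the quantities in Eqs. (6.18) and (6.19) can be computed from an explicitly materialized NTK matrix. The sketch below is an illustrative implementation under an assumed apply_fn/params interface with scalar per-example outputs (the binary MSE case); it is not the thesis code. The same explicit matrix also yields the condition numbers shown in Figure 6.3.

import jax
import jax.numpy as jnp

def batch_ntk(apply_fn, params, x):
    # Explicit NTK matrix K'[i, j] = <grad_w f_w(x_i), grad_w f_w(x_j)> for scalar outputs.
    f = lambda p: apply_fn(p, x).reshape(-1)          # shape (m,)
    per_example_grads = jax.jacrev(f)(params)         # pytree; leaves have shape (m, ...)
    flat = jnp.concatenate(
        [g.reshape(g.shape[0], -1) for g in jax.tree_util.tree_leaves(per_example_grads)],
        axis=1)
    return flat @ flat.T                              # (m, m)

def adjusted_complexity(K, targets, preds=None, jitter=1e-6):
    # Eq. (6.18) when preds is given; Eq. (6.19) ("adjusted supervision complexity*") otherwise.
    r = targets if preds is None else targets - preds
    n = K.shape[0]
    quad = r @ jnp.linalg.solve(K + jitter * jnp.eye(n), r)   # r^T K^{-1} r
    return jnp.sqrt(quad * jnp.trace(K)) / n                  # the 1/n factor of Eq. (6.18)

# The Figure 6.3-style quantity is then simply jnp.linalg.cond(K).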
Comparison of supervision complexities. We compare the adjusted supervision complexities of random labels, dataset labels, and the predictions of an offline and an online ResNet-56 teacher with respect to various checkpoints of the LeNet-5x8 and ResNet-20 networks. The results presented in Figures 6.2a and 6.2c indicate that the dataset labels and offline teacher predictions are as complex as random labels in the beginning. After some initial decline, the complexity of these targets increases as the network starts to overfit. Given the lower bound on the supervision complexity of random labels (see Section 6.2), this increase means that the NTK spectrum becomes less uniform. This is confirmed in Figure 6.3. Unlike LeNet-5x8, for ResNet-20, random labels, dataset labels, and offline teacher predictions do not exhibit a U-shaped behavior. In this case too, the shape of these curves is in agreement with the behavior of the condition number of the NTK (see Figure 6.3). In contrast to these static targets, the complexity of the online teacher predictions increases smoothly and is significantly smaller for most of the epochs.

To account for softness differences of the various targets, we consider plotting the adjusted supervision complexity normalized by the target norm ∥Y′∥_2. As shown in Figures 6.2b and 6.2d, the normalized complexity of offline and online teacher predictions is smaller compared to that of the dataset labels, indicating a better alignment with the top eigenvectors of the corresponding NTKs. Importantly, we see that the predictions of an online teacher have significantly lower normalized complexity in the critical early stages of training.

We compare the adjusted supervision complexities* of random labels, dataset labels, and the predictions of an offline and an online ResNet-56 teacher with respect to various checkpoints of the LeNet-5x8 network. The results presented in Figure 6.4 are remarkably similar to the results with the adjusted supervision complexity (Figure 6.2a and Figure 6.2b). We therefore focus only on the adjusted supervision complexity of Eq. (6.18) when comparing various targets. The only other experiment where we compute adjusted supervision complexities* (i.e., without subtracting the current predictions from the labels) is presented in Figure 6.5, where the goal is to demonstrate that training labels become aligned with the training NTK matrix over the course of training.

Figure 6.2: Adjusted supervision complexity of various targets (dataset labels, random labels, offline teacher, online teacher) with respect to NTKs at different stages of training. The underlying dataset is binary CIFAR-100. Panels (a) LeNet-5x8 and (c) ResNet-20 show the adjusted supervision complexity; panels (b) and (d) plot the adjusted supervision complexity normalized by the norm of the targets. Note that the y-axes of the ResNet-20 plots are in logarithmic scale.

Figure 6.3: Condition number of the NTK matrix of a LeNet-5x8 (a) and ResNet-20 (b) student trained with MSE loss on binary CIFAR-100. The NTK matrices are computed on 2^12 test examples.
On early stopped teachers. Cho and Hariharan [2019] observe that offline KD sometimes works better with early stopped teachers. Such teachers have worse accuracy and perhaps result in a smaller student margin, but they also have a significantly smaller supervision complexity (see Figure 6.2), which provides a possible explanation for this phenomenon.

Figure 6.4: Adjusted supervision complexities* of various targets with respect to a LeNet-5x8 network at different stages of its training. The experimental setup of the left and right plots matches that of Figure 6.2a and Figure 6.2b, respectively.

Figure 6.5: Adjusted supervision complexity* of dataset labels measured on a subset of 2^12 training examples of binary CIFAR-100. Complexities are measured with respect to either a LeNet-5x8 (on the left) or a ResNet-20 (on the right) model trained with MSE loss and without knowledge distillation. Note that the plot on the right is in logarithmic scale.

Figure 6.6: Adjusted supervision complexity for various targets. On the left: the effect of temperature on the supervision complexity of an offline teacher for a LeNet-5x8 after training for 25 epochs. On the right: the effect of averaging teacher predictions (averaged teacher vs. current teacher).

Effect of temperature scaling. As discussed earlier, a higher temperature makes the teacher predictions softer, decreasing their norm. This has a large effect on supervision complexity (Figure 6.6a). Even when one controls for the norm of the predictions, the complexity still decreases (Figure 6.6a).

Figure 6.7: Relationship between test accuracy, train NTK similarity, and train fidelity for CIFAR-100 students trained with either a ResNet-56 teacher (panels (a) and (c)) or a ResNet-110 teacher (panel (b)). Panel (a): LeNet-5x8 student, test accuracy vs. train NTK similarity (correlation 0.93); panel (b): ResNet-20 student, test accuracy vs. train NTK similarity (correlation 0.87); panel (c): LeNet-5x8 student, test accuracy vs. train fidelity (correlation −0.57).

Average teacher complexity. Ren et al. [2022] observe that teacher predictions fluctuate over time, and show that using exponentially averaged teachers improves knowledge distillation. Figure 6.6b demonstrates that the supervision complexity of an online teacher's predictions is always slightly larger than that of the average of the predictions of the teachers from the 10 preceding epochs.

Teaching students with weak inductive biases. As we saw earlier, a fully trained teacher can have predictions as complex as random labels for a weak student at initialization. This low alignment of the student NTK and the teacher predictions can result in memorization.
In contrast, an early stopped teacher captures simple patterns and has a better alignment with the student NTK, allowing the student to learn these patterns in a generalizable fashion. This feature learning improves the student NTK and allows learning more complex patterns in future iterations. We hypothesize that this is the mechanism that allows online distillation to outperform offline distillation in some cases.

NTK similarity. Remarkably, we observe that across all of our experiments, the final test accuracy of the student is strongly correlated with the similarity of the final teacher and student NTKs (see Figures 6.7, 6.11 and 6.12). This cannot be explained by better matching of the teacher predictions. In fact, we see that the final fidelity (the rate of classification agreement of a teacher-student pair) measured on the training set has no clear relationship with test accuracy. Furthermore, we see that online KD results in better NTK transfer without an explicit regularization loss enforcing such transfer.

The effect of the frequency of teacher checkpoints. As mentioned earlier, we have used one teacher checkpoint per epoch so far. While this served our goal of establishing the efficacy of online distillation, this choice is prohibitive for large teacher networks. To understand the effect of the frequency of teacher checkpoints, we conduct an experiment on CIFAR-100 with a ResNet-56 teacher and a LeNet-5x8 student with varying frequency of teacher checkpoints. In particular, we consider checkpointing the teacher once in every {1, 2, 4, 8, 16, 32, 64, 128} epochs. The results presented in Figure 6.8 show that reducing the teacher checkpointing frequency to once in 16 epochs results in only a minor performance drop for online distillation with τ = 4.

Figure 6.8: Online KD results for a LeNet-5x8 student on CIFAR-100 with varying frequency of ResNet-56 teacher checkpoints.
Teacher update period (epochs) | Online KD (τ=1) | Online KD (τ=4)
1 (the default value) | 61.9±0.2 | 66.1±0.4
2 | 61.5±0.4 | 66.0±0.3
4 | 61.4±0.2 | 65.6±0.2
8 | 60.0±0.3 | 65.4±0.0
16 | 59.3±0.6 | 65.4±0.0
32 | 56.9±0.0 | 64.1±0.4
64 | 55.5±0.4 | 62.8±0.7
128 | 51.4±0.5 | 61.3±0.1

On label supervision in KD. So far, dataset labels have not been used as an additional source of supervision for students in any of the distillation methods. However, in practice it is common to train a student with a convex combination of the knowledge distillation and standard losses: (1−α) L_ce + α L_kd−ce. To verify that the choice of α = 1 does not produce unique conclusions regarding the efficacy of online distillation, we run experiments on CIFAR-100 with varying values of α. The results presented in Table 6.6 confirm our main conclusions on online distillation. Furthermore, we observe that picking α = 1 does not result in a significant degradation of student performance.

Table 6.6: Knowledge distillation results on CIFAR-100 with varying loss mixture coefficient α.
Setting | α | No KD | Offline KD (τ=1) | Offline KD (τ=4) | Online KD (τ=1) | Online KD (τ=4) | Teacher
ResNet-56 → LeNet-5x8 | 0.2 | 47.3±0.6 | 47.6±0.7 | 57.6±0.2 | 54.3±0.7 | 59.0±0.6 | 72.0
ResNet-56 → LeNet-5x8 | 0.4 | | 48.9±0.3 | 58.9±0.4 | 56.7±0.5 | 62.5±0.2 |
ResNet-56 → LeNet-5x8 | 0.6 | | 49.4±0.5 | 59.7±0.0 | 61.1±0.0 | 65.3±0.2 |
ResNet-56 → LeNet-5x8 | 0.8 | | 49.8±0.1 | 60.1±0.1 | 62.0±0.1 | 65.9±0.2 |
ResNet-56 → LeNet-5x8 | 1.0 | | 50.1±0.4 | 59.9±0.2 | 61.9±0.2 | 66.1±0.4 |
ResNet-56 → ResNet-20 | 0.2 | 67.7±0.5 | 67.9±0.3 | 70.3±0.3 | 68.2±0.3 | 70.3±0.1 | 72.0
ResNet-56 → ResNet-20 | 0.4 | | 67.9±0.1 | 71.0±0.2 | 68.7±0.2 | 71.4±0.2 |
ResNet-56 → ResNet-20 | 0.6 | | 68.1±0.3 | 71.3±0.1 | 69.6±0.4 | 71.5±0.2 |
ResNet-56 → ResNet-20 | 0.8 | | 68.3±0.2 | 71.4±0.4 | 69.8±0.3 | 71.1±0.3 |
ResNet-56 → ResNet-20 | 1.0 | | 68.2±0.3 | 71.6±0.2 | 69.6±0.3 | 71.4±0.3 |
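For reference, the convex combination (1−α) L_ce + α L_kd−ce studied in Table 6.6 can be sketched as below. This is illustrative code with assumed interfaces; in particular, the τ² rescaling of the distillation term is a common convention following Hinton et al. [2015] and may differ from the exact loss used in the thesis.

import jax
import jax.numpy as jnp

def ce_loss(student_logits, labels):
    # Standard cross-entropy with integer dataset labels.
    logp = jax.nn.log_softmax(student_logits, axis=-1)
    return -jnp.mean(jnp.take_along_axis(logp, labels[:, None], axis=-1))

def kd_ce_loss(student_logits, teacher_logits, tau):
    # Temperature-scaled softmax cross-entropy between teacher and student distributions.
    teacher_probs = jax.nn.softmax(teacher_logits / tau, axis=-1)
    student_logp = jax.nn.log_softmax(student_logits / tau, axis=-1)
    return -(tau ** 2) * jnp.mean(jnp.sum(teacher_probs * student_logp, axis=-1))

def mixed_loss(student_logits, teacher_logits, labels, alpha, tau):
    # (1 - alpha) * L_ce + alpha * L_kd-ce, the objective varied in Table 6.6.
    return (1.0 - alpha) * ce_loss(student_logits, labels) \
           + alpha * kd_ce_loss(student_logits, teacher_logits, tau)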
6.5 Related work

The key contributions of this work are the demonstration of the role of supervision complexity in student generalization, and the establishment of online knowledge distillation as a theoretically grounded and effective method. Both supervision complexity and online distillation have a number of relevant precedents in the literature that merit comment.

Transferring knowledge beyond logits. In the seminal works of Buciluǎ et al. [2006] and Hinton et al. [2015], the transferred "knowledge" is in the form of output probabilities. Later works suggest other notions of "knowledge" and other ways of transferring knowledge [Gou et al., 2021]. These include activations of intermediate layers [Romero et al., 2015], attention maps [Zagoruyko and Komodakis, 2017], classifier head parameters [Chen et al., 2022], and various notions of example similarity [Passalis and Tefas, 2018, Park et al., 2019, Tung and Mori, 2019, Tian et al., 2020, He and Ozay, 2021]. Transferring the teacher NTK matrix belongs to this latter category of methods. Zhang et al. [2022] propose to transfer a low-rank approximation of a feature map corresponding to the teacher NTK.

Non-static teachers. Some works on KD consider non-static teachers. In order to bridge the teacher-student capacity gap, Mirzadeh et al. [2020] propose to perform a few rounds of distillation with teachers of increasing capacity. In deep mutual learning [Zhang et al., 2018, Chen et al., 2020], codistillation [Anil et al., 2018], and collaborative learning [Guo et al., 2020], multiple students are trained simultaneously, distilling from each other or from an ensemble. In Zhou et al. [2018] and Shi et al. [2021], the teacher and the student are trained together: in the former they share a common architecture trunk, while in the latter the teacher is penalized to keep its predictions close to the student's predictions. Jin et al. [2019] study route constrained optimization, which is closest to the online distillation of Section 6.3.2; they employ a few teacher checkpoints to perform multi-round knowledge distillation. Rezagholizadeh et al. [2022] employ a similar procedure but with an annealed temperature that decreases linearly with training time, followed by a phase of training with dataset labels only. The idea of distilling from checkpoints also appears in Yang et al. [2019], where a network is trained with a cosine learning rate schedule, simultaneously distilling from the checkpoint of the previous learning rate cycle. We complement this line of work by highlighting the role of supervision complexity and by demonstrating that online distillation can be very powerful for students with weak inductive biases.

Fundamental understanding of distillation. The effects of temperature, teacher-student capacity gap, optimization time, data augmentations, and other training details are non-trivial [Cho and Hariharan, 2019, Beyer et al., 2022, Stanton et al., 2021]. It has been hypothesized and shown to some extent that teacher soft predictions capture class similarities, which is beneficial for the student [Hinton et al., 2015, Furlanello et al., 2018, Tang et al., 2020]. Yuan et al. [2020] demonstrate that this softness of teacher predictions also has a regularization effect, similar to label smoothing.
Menon et al. [2021] argue that teacher predictions are sometimes closer to the Bayes classifier than the hard labels of the dataset, reducing the variance of the training objective. The vanilla knowledge distillation loss also introduces some optimization biases. Mobahi et al. [2020] prove that for kernel methods with RKHS norm regularization, self-distillation increases the regularization strength, resulting in solutions with smaller RKHS norm. Phuong and Lampert [2019] prove that in a self-distillation setting, deep linear networks trained with gradient flow converge to the projection of the teacher parameters onto the data span, effectively recovering the teacher parameters when the number of training points is larger than the number of parameters. They derive a bound on the transfer risk that depends on the distribution of the acute angle between the teacher parameters and the data points. This is in spirit related to supervision complexity, as it measures an "alignment" between the distillation objective and the data. Ji and Zhu [2020] extend these results to linearized neural networks, showing that the quantity Δ_z^⊤ K^{-1} Δ_z, where Δ_z is the logit change during training, plays a key role in estimating the bound. The resulting bound is qualitatively different compared to ours, and Δ_z^⊤ K^{-1} Δ_z becomes ill-defined for hard labels.

Supervision complexity. The key quantity in our work is the supervision complexity Y^⊤ K^{-1} Y. Cristianini et al. [2001] introduced a related quantity Y^⊤ K Y, called kernel-target alignment, and derived a generalization bound with it for expected Parzen window classifiers. As an easy-to-compute proxy to supervision complexity, Deshpande et al. [2021] use kernel-target alignment for model selection in transfer learning. Ortiz-Jiménez et al. [2021] demonstrate that when the NTK-target alignment is high, learning is faster and generalizes better. Arora et al. [2019] prove a generalization bound for overparameterized two-layer neural networks with NTK parameterization trained with gradient flow. Their bound is approximately √(Y^⊤ (K^∞)^{-1} Y)/√n, where K^∞ is the expected NTK matrix at a random initialization. Our bound of Theorem 6.2.2 can be seen as a generalization of this result to all kernel methods, including linearized neural networks of any depth and sufficient width, with the only difference of using the empirical NTK matrix. Belkin et al. [2018] warn that bounds based on the RKHS complexity of the learned function can fail to explain the good generalization capabilities of kernel methods in the presence of label noise.

6.6 Conclusion and future work

We presented a treatment of knowledge distillation through the lens of supervision complexity. We formalized how student generalization is controlled by three key quantities: the teacher's accuracy, the student's margin with respect to the teacher labels, and the supervision complexity of the teacher labels under the student's kernel. This motivated an online distillation procedure that gradually increases the complexity of the targets that the student fits. In the broader context, our results highlight the important role of data in generalization. There are several potential directions for future work. Adaptive temperature scaling for online distillation, where the teacher predictions are smoothed so as to ensure low target complexity, is one such direction. Another avenue is to explore alternative ways of smoothing teacher predictions besides temperature scaling; e.g., can one perform sample-dependent scaling?
There is large potential for improving online KD by making more informed choices for the frequency and positions of teacher checkpoints, and controlling how much the student is trained in between teacher updates. Finally, while we demonstrated that online distillation results in a better alignment with the teacher’s NTK matrix, understanding why this happens is an open and interesting problem. 80 Bibliography Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representa- tions. The Journal of Machine Learning Research, 19(1):1947–1980, 2018. Alessandro Achille, Giovanni Paolini, and Stefano Soatto. Where is the information in a deep neural network? arXiv preprint arXiv:1905.12213, 2019. Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003. Ibrahim Alabdulmohsin. Towards a unified theory of learning and information. Entropy, 22(4):438, 2020. Ibrahim M Alabdulmohsin. Algorithmic stability and uniform generalization.AdvancesinNeuralInformation Processing Systems, 28:19–27, 2015. Zeyuan Allen-Zhu and Yuanzhi Li. Understanding ensemble, knowledge distillation and self-distillation in deep learning. In International Conference on Learning Representations, 2023. Pierre Alquier. User-friendly introduction to pac-bayes bounds. arXiv preprint arXiv:2110.11216, 2021. Gholamali Aminian, Laura Toni, and Miguel RD Rodrigues. Information-theoretic bounds on the moments of the generalization error of learning algorithms. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 682–687. IEEE, 2021a. Gholamali Aminian, Laura Toni, and Miguel RD Rodrigues. Jensen-shannon information based character- ization of the generalization error of learning algorithms. In 2020 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2021b. Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, and Thomas Hofmann. The curious case of benign memorization. In The Eleventh International Conference on Learning Representations, 2023. Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations, 2018. Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. arXiv preprint arXiv:1904.11238, 2019. Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019. Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 233–242. JMLR. org, 2017. 81 Aristide Baratin, Thomas George, César Laurent, R Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. In International Conference on Artificial Intelligence and Statistics , pages 2269–2277. PMLR, 2021. R. H. Bartels and G. W. Stewart. Solution of the matrix equation ax + xb = c [f4]. Commun. ACM, 15(9): 820–826, September 1972. ISSN 0001-0782. doi: 10.1145/361573.361582. 
Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017. Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020. Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta numerica, 30:87–201, 2021. P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998. doi: 10.1109/18.661502. Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1046–1059, 2016. Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. In Algorithmic Learning Theory, pages 25–55. PMLR, 2018. Samyadeep Basu, Phil Pope, and Soheil Feizi. Influence functions in deep learning are fragile. In International Conference on Learning Representations, 2021. Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116 (32):15849–15854, 2019. Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowl- edge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934, 2022. Léonard Blier and Yann Ollivier. The description length of deep learning models. In Advances in Neural Information Processing Systems, pages 2216–2226, 2018. Olivier Bousquet and André Elisseeff. Stability and generalization. TheJournalofMachineLearningResearch, 2:499–526, 2002. Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd annual ACM SIGACT symposium on theory of computing, pages 123–132, 2021. 82 Yuheng Bu, Shaofeng Zou, and Venugopal V Veeravalli. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020. Cristian Buciluˇ a, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 535–541, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933395. doi: 10.1145/1150402.1150464. Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. Advances in neural information processing systems, 35:25237–25250, 2022. Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. 
The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium, volume 267, 2019. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom B Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021. Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, and Florian Tramer. The privacy onion effect: Memorization is relative. Advances in Neural Information Processing Systems, 35:13263–13276, 2022. Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023a. Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In The Eleventh International Conference on Learning Representations, 2023b. Olivier Catoni. Pac-bayesian supervised classification: The thermodynamics of statistical learning. Lecture Notes-Monograph Series, 56:i–163, 2007. ISSN 07492170. Guillaume Charpiat, Nicolas Girard, Loris Felardos, and Yuliya Tarabalka. Input similarity from the neural network perspective. Advances in Neural Information Processing Systems, 32, 2019. Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019. Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. Online knowledge distillation with diverse peers. InProceedingsoftheAAAIConferenceonArtificialIntelligence , volume 34, pages 3430–3437, 2020. Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, and Chun Chen. Knowledge distillation with the reused teacher classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11933–11942, 2022. Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, pages 1062–1070. PMLR, 2019. 83 Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019. Gilad Cohen, Guillermo Sapiro, and Raja Giryes. Dnn or k-nn: That is the generalize vs. memorize question. arXiv preprint arXiv:1805.06822, 2018. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pages 2921–2926. IEEE, 2017. R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977. Thomas M Cover and Joy A Thomas. Elements of information theory. Wiley-Interscience, 2006. Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001. Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, and Lester Mackey. Knowledge distillation as semiparametric inference. In International Conference on Learning Representations, 2021. 
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. Aditya Deshpande, Alessandro Achille, Avinash Ravichandran, Hao Li, Luca Zancato, Charless Fowlkes, Rahul Bhotika, Stefano Soatto, and Pietro Perona. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084, 2021. Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems, 27, 2014. Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Gal Elidan, Kristian Kersting, and Alexander T. Ihler, editors, Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017 . AUAI Press, 2017. Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, and Daniel M Roy. In search of robust measures of generalization. Advances in Neural Information Processing Systems, 33:11723–11733, 2020. Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. Generalization error bounds via rényi-, f-divergences and maximal leakage. IEEE Transactions on Information Theory, 67(8):4986–5004, 2021. Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020. Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pages 535–544, 2018. Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33:2881–2891, 2020. 84 Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020. Spencer Frei, Niladri S Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pages 2668–2703. PMLR, 2022. B. Frenay and M. Verleysen. Classification in the presence of label noise: A survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, May 2014. ISSN 2162-2388. doi: 10.1109/TNNLS. 2013.2292894. Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018. Saul B Gelfand and Sanjoy K Mitter. Recursive stochastic algorithms for global optimization in rˆd. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991. Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. Pac-bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360, 2009. Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pages 2242–2251. PMLR, 2019. Aritra Ghosh, Himanshu Kumar, and PS Sastry. 
Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence , 2017. Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Forgetting outside the box: Scrubbing deep net- works of information accessible from input-output observations. Proceedings of the European Conference on Computer Vision (ECCV), 2020. Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017. Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018. Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015. Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021. Robert M Gray. Entropy and information theory. Springer Science & Business Media, 2011. Ming Gu and Stanley C Eisenstat. Downdating the singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 16(3):793–810, 1995. Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017. Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distillation via collaborative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11020–11029, 2020. 85 Umang Gupta, Dimitris Stripelis, Pradeep K. Lam, Paul Thompson, Jose Luis Ambite, and Greg Ver Steeg. Membership inference attacks on deep regression models for neuroimaging. In Medical Imaging with Deep Learning, 2021. Hassan Hafez-Kolahi, Zeinab Golgooni, Shohreh Kasaei, and Mahdieh Soleymani. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. Advances in Neural Information Processing Systems, 33, 2020. Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9925–9935. Curran Associates, Inc., 2020. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pages 8527–8537, 2018. Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision, pages 5138–5147, 2019. Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International conference on machine learning, pages 1225–1234. PMLR, 2016. Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, and Aram Galstyan. Improving generalization by control- ling label-noise information in neural network weights. In International Conference on Machine Learning, pages 4071–4081. PMLR, 2020. Hrayr Harutyunyan, Alessandro Achille, Giovanni Paolini, Orchid Majumder, Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. 
Estimating informativeness of samples with smooth unique information. In International Conference on Learning Representations, 2021a. Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, and Aram Galstyan. Information-theoretic gen- eralization bounds for black-box learning algorithms. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021b. Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Formal limitations of sample-wise information- theoretic generalization bounds. In 2022 IEEE Information Theory Workshop (ITW), pages 440–445, 2022. doi: 10.1109/ITW54588.2022.9965850. Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, and Sanjiv Kumar. Supervision complexity and its role in knowledge distillation. In International Conference on Learning Representations, 2023. Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022. Adi Haviv, Ido Cohen, Jacob Gidron, Roei Schuster, Yoav Goldberg, and Mor Geva. Understanding trans- former memorization recall through idioms. InProceedingsofthe17thConferenceoftheEuropeanChapter oftheAssociationforComputationalLinguistics, pages 248–264. Association for Computational Linguistics, May 2023. Bobby He and Mete Ozay. Feature kernel distillation. InInternationalConferenceonLearningRepresentations, 2021. 86 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. Fredrik Hellström and Giuseppe Durisi. Generalization bounds via information density and conditional information density. IEEE Journal on Selected Areas in Information Theory, 1(3):824–839, 2020. Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. InAdvancesinneuralinformationprocessingsystems, pages 10456–10465, 2018. Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXivpreprint arXiv:1503.02531, 2(7), 2015. Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, page 5–13, New York, NY, USA, 1993. Association for Computing Machinery. ISBN 0897916115. doi: 10.1145/168304.168306. Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019. Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In International Conference on Learning Representations, 2021. Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8571–8580. 
Curran Associates, Inc., 2018. Guangda Ji and Zhanxing Zhu. Knowledge distillation in wide neural networks: Risk bound, data efficiency and imperfect teacher. Advances in Neural Information Processing Systems, 33:20823–20833, 2020. Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International conference on machine learning, pages 2304–2313. PMLR, 2018. Yiding Jiang*, Behnam Neyshabur*, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic gen- eralization measures and where to find them. In International Conference on Learning Representations, 2020. Xiao Jin, Baoyun Peng, Yichao Wu, Yu Liu, Jiaheng Liu, Ding Liang, Junjie Yan, and Xiaolin Hu. Knowledge distillation via route constrained optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019. Kaggle. Dogs vs. Cats, 2013. URL https://www.kaggle.com/c/dogs-vs-cats/overview. Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32, 2019. 87 Angelos Katharopoulos and François Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning, pages 2525–2534. PMLR, 2018. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pages 1885–1894. PMLR, 2017. Ron Kohavi, George H John, et al. Wrappers for feature subset selection. Artificial intelligence , 97(1-2): 273–324, 1997. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Vitaly Kuznetsov, Mehryar Mohri, and U Syed. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, volume 2, 2015. John Langford and Matthias Seeger. Bounds for averaging classifiers. 2001. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATTLabs[Online].Available: http://yann.lecun.com/exdb/mnist, 2, 2010. Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991. Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019. Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018. Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. Learning to learn from noisy labeled data. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5051–5059, 2019. Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International conference on artificial intelligence and statistics, pages 4313–4324. PMLR, 2020. Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. InProceedingsofthe12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296, 2006. Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2017. 88 Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020. Ana C Lorena, Luís PF Garcia, Jens Lehmann, Marcilio CP Souto, and Tin Kam Ho. How complex is your classification problem? a survey on measuring classification complexity. ACMComputingSurveys(CSUR), 52(5):1–34, 2019. Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pages 3355–3364. PMLR, 2018. Hartmut Maennel, Ibrahim M Alabdulmohsin, Ilya O Tolstikhin, Robert Baldock, Olivier Bousquet, Sylvain Gelly, and Daniel Keysers. What do neural networks learn when trained with random labels? Advances in Neural Information Processing Systems, 33:19693–19704, 2020. Neil Rohit Mallinar, James B Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Advances in Neural Information Processing Systems, 2022. Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 18(1):4873–4907, 2017. David A McAllester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170, 1999a. David A McAllester. Some pac-bayesian theorems. Machine Learning, 37(3):355–363, 1999b. Aditya K Menon, Ankit Singh Rawat, Sashank Reddi, Seungyeon Kim, and Sanjiv Kumar. A statistical perspective on distillation. In International Conference on Machine Learning, pages 7632–7642. PMLR, 2021. Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Can gradient clipping mitigate label noise? In International Conference on Learning Representations, 2020. Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence , volume 34, pages 5191–5198, 2020. Hossein Mobahi, Mehrdad Farajtabar, and Peter Bartlett. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, 33:3351–3361, 2020. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018. Fangzhou Mu, Yingyu Liang, and Yin Li. Gradients as features for deep representation learning. In International Conference on Learning Representations, 2020. Rafael Müller, Simon Kornblith, and Geoffrey Hinton. Subclass distillation. arXiv preprint arXiv:2002.03936, 2020. 
Ernest Mwebaze, Timnit Gebru, Andrea Frome, Solomon Nsumba, and Jeremy Tusubira. icassava 2019 fine-grained visual categorization challenge. arXiv preprint arXiv:1908.02900, 2019. 89 Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019. Milad Nasr, Reza Shokri, and Amir Houmansadr. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In2019IEEEsymposium on security and privacy (SP), pages 739–753. IEEE, 2019. Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013. Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015. Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy. Information- theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11015–11025, 2019. Gergely Neu, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Mikhail Belkin and Samory Kpotufe, editors, ProceedingsofThirtyFourthConferenceonLearningTheory, volume 134 ofProceedingsofMachineLearning Research, pages 3526–3545. PMLR, 15–19 Aug 2021. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on learning theory, pages 1376–1401. PMLR, 2015. Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in neural information processing systems, 30, 2017. Guillermo Ortiz-Jiménez, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. What can linearized neural networks actually say about generalization? Advances in Neural Information Processing Systems, 34: 8998–9010, 2021. Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003. Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019. Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 268–284, 2018. Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017. Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. AdvancesinNeuralInformationProcessingSystems, 34:20596–20607, 2021. Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018. Marıa Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, and Csaba Szepesvári. Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22, 2021. 
90 Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. Language models as knowledge bases? InProceedingsofthe2019ConferenceonEmpiricalMethods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473. Association for Computational Linguistics, November 2019. doi: 10.18653/v1/D19-1250. Mary Phuong and Christoph Lampert. Towards understanding knowledge distillation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedingsofthe36thInternationalConferenceonMachine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5142–5151. PMLR, 09–15 Jun 2019. Vinaychandran Pondenkandath, Michele Alberti, Sammer Puran, Rolf Ingold, and Marcus Liwicki. Leverag- ing random label memorization for unsupervised pre-training. arXiv preprint arXiv:1811.01640, 2018. Maxim Raginsky, Alexander Rakhlin, Matthew Tsao, Yihong Wu, and Aolin Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016. Maxim Raginsky, Alexander Rakhlin, and Aolin Xu. Information-theoretic stability and generalization. Information-Theoretic Methods in Data Science, page 302, 2021. Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310. PMLR, 2019. Scott E. Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. CoRR, abs/1412.6596, 2014. Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International conference on machine learning, pages 4334–4343. PMLR, 2018. Yi Ren, Shangmin Guo, and Danica J. Sutherland. Better supervisory signals by observing learning paths. In International Conference on Learning Representations, 2022. Mehdi Rezagholizadeh, Aref Jafari, Puneeth S.M. Saladi, Pranav Sharma, Ali Saheb Pasand, and Ali Ghodsi. Pro-KD: Progressive distillation by following the footsteps of the teacher. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4714–4727, 2022. J. J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1): 40–47, Jan 1996. ISSN 0018-9448. doi: 10.1109/18.481776. Omar Rivasplata, Ilja Kuzborskij, Csaba Szepesvári, and John Shawe-Taylor. Pac-bayes analysis beyond the usual bounds. Advances in Neural Information Processing Systems, 33:16833–16845, 2020. Borja Rodríguez-Gálvez, Germán Bassi, Ragnar Thobaben, and Mikael Skoglund. On random subset generalization error bounds and the stochastic gradient langevin dynamics algorithm. In 2020 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2021. Borja Rodríguez Gálvez, Germán Bassi, Ragnar Thobaben, and Mikael Skoglund. Tighter expected general- ization error bounds via wasserstein distance. Advances in Neural Information Processing Systems, 34, 2021. 91 Adriana Romero, Samira Ebrahimi Kahou, Polytechnique Montréal, Y. Bengio, Université De Montréal, Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. IninInternationalConferenceonLearningRepresentations(ICLR, 2015. Daniel Russo and James Zou. 
How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019. Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972. Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. On learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019. Bernhard Scholkopf and Alexander J. Smola. LearningwithKernels: SupportVectorMachines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759. John Shawe-Taylor and Robert C. Williamson. A pac analysis of a bayesian estimator. In COLT ’97, 1997. Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics , 41(1):247–261, 1972. Wenxian Shi, Yuxuan Song, Hao Zhou, Bohan Li, and Lei Li. Follow your path: A progressive method for knowledge distillation. In Nuria Oliver, Fernando Pérez-Cruz, Stefan Kramer, Jesse Read, and Jose A. Lozano, editors, Machine Learning and Knowledge Discovery in Databases. Research Track, pages 596–611, Cham, 2021. Springer International Publishing. ISBN 978-3-030-86523-8. Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. InAdvancesinNeuralInformationProcessingSystems, pages 1917–1928, 2019. Ravid Shwartz-Ziv and Alexander A Alemi. Information in infinite ensembles of infinitely-wide neural networks. In Symposium on Advances in Approximate Bayesian Inference, pages 1–17. PMLR, 2020. Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni- tion. In Yoshua Bengio and Yann LeCun, editors,3rdInternationalConferenceonLearningRepresentations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35: 19523–19536, 2022. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018. 92 Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A Alemi, and Andrew G Wilson. Does knowledge distillation really work? Advances in Neural Information Processing Systems, 34:6906–6919, 2021. Thomas Steinke and Lydia Zakynthinou. Reasoning About Generalization via Conditional Mutual Informa- tion. In Jacob Abernethy and Shivani Agarwal, editors,ProceedingsofThirtyThirdConferenceonLearning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3437–3452. PMLR, 09–12 Jul 2020. Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir D. 
Bourdev, and Rob Fergus. Training convolutional networks with noisy labels. In ICLR 2015, 2014. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018. Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H Chi, and Sagar Jain. Understanding and improving knowledge distillation. arXiv preprint arXiv:2002.03532, 2020. Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations, 2020. Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. In AdvancesinNeuralInformation Processing Systems, 2022. Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Ge- offrey J. Gordon. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. InProceedingsoftheIEEE/CVF International Conference on Computer Vision, pages 1365–1374, 2019. Vladimir Vapnik. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4. Sergio Verdu and Tsachy Weissman. The information lost in erasures. IEEE Transactions on Information Theory, 54(11):5030–5058, 2008. Roman Vershynin. High-dimensionalprobability: Anintroductionwithapplicationsindatascience, volume 47. Cambridge university press, 2018. Yu-Xiang Wang, Jing Lei, and Stephen E Fienberg. On-average kl-privacy and its equivalence to generaliza- tion for max-entropy mechanisms. In International Conference on Privacy in Statistical Databases, pages 121–134. Springer, 2016. Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011. Paul L. Williams and Randall D. Beer. Nonnegative decomposition of multivariate information. CoRR, abs/1004.2515, 2010. 93 Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pages 10355–10366. PMLR, 2020. Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2691–2699, 2015. Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533, 2017. Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_dmi: A novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems, pages 6222–6233, 2019. Chenglin Yang, Lingxi Xie, Chi Su, and Alan L Yuille. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2859–2868, 2019. 
Rubing Yang, Jialin Mao, and Pratik Chaudhari. Does the data induce capacity control in deep learning? In International Conference on Machine Learning, pages 25166–25197. PMLR, 2022. Jiangchao Yao, Hao Wu, Ya Zhang, Ivor Tsang, and Jun Sun. Safeguarded dynamic label regression for noisy supervision. Proceedings of the AAAI Conference on Artificial Intelligence , 33:9103–9110, 07 2019. doi: 10.1609/aaai.v33i01.33019103. Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282. IEEE, 2018. Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. In International Conference on Learning Representations, 2020. Jinsung Yoon, Sercan Arik, and Tomas Pfister. Data valuation using reinforcement learning. In International Conference on Machine Learning, pages 10842–10851. PMLR, 2020. Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In International Conference on Machine Learning, pages 7164–7173. PMLR, 2019. Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911, 2020. Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the perfor- mance of convolutional neural networks via attention transfer. In International Conference on Learning Representations, 2017. Luca Zancato, Alessandro Achille, Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Predicting training time without training. Advances in Neural Information Processing Systems 33, 2020. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 94 Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, and Yoram Singer. Identity crisis: Memo- rization and generalization under extreme overparameterization. In International Conference on Learning Representations, 2020. Ruixiang Zhang, Shuangfei Zhai, Etai Littwin, and Joshua M. Susskind. Learning representation from neural fisher kernel with low-rank approximation. In International Conference on Learning Representations, 2022. Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4320–4328, 2018. Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, pages 8778–8788, 2018. Guorui Zhou, Ying Fan, Runpeng Cui, Weijie Bian, Xiaoqiang Zhu, and Kun Gai. Rocket launching: A universal and efficient framework for training well-performing light net. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 32, 2018. Ruida Zhou, Chao Tian, and Tie Liu. Individually conditional individual mutual information bound on generalization error. 2021 IEEE International Symposium on Information Theory (ISIT), pages 670–675, 2021. Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. 
Non-vacuous generaliza- tion bounds at the imagenet scale: a PAC-bayesian compression approach. In International Conference on Learning Representations, 2019. 95 Appendices G Proofs This appendix presents the missing proofs. G.1 ProofofTheorem2.2.1 Proof. For each example we consider the following Markov chain: Y i → X Y → X i W → b Y i . (G.1) In this setup, Fano’s inequality gives a lower bound for the error probability: H(E i )+P(E i =1)log(|Y|− 1)≥ H(Y i |X i ,W), (G.2) which can be written as: P(E i =1)≥ H(Y i |X i ,W)− H(E i ) log(|Y|− 1) . (G.3) Summing this inequality fori=1,...,n we get n X i=1 P(E i =1)≥ P n i=1 (H(Y i |X i ,W)− H(E i )) log(|Y|− 1) (G.4) ≥ P n i=1 (H(Y i |X,W)− H(E i )) log(|Y|− 1) (G.5) ≥ H(Y |X,W)− P n i=1 H(E i ) log(|Y|− 1) . (G.6) The correctness of the last step follows from the fact that total correlation is always non-negative [Cover and Thomas, 2006]: n X i=1 H(Y i |X,W)− H(Y |X,W)=TC(Y |X,W)≥ 0. (G.7) Finally, using the fact thatH(Y |X,W)=H(Y |X)− I(W;Y |X), we get that the desired result: E " n X i=1 E i # ≥ H(Y |X)− I(W;Y |X)− P n i=1 H(E i ) log(|Y|− 1) . (G.8) 96 G.2 ProofofProposition2.3.1 Proof. Given thatϵ t andµ t are independent, let us bound the expected L2 norm ofG t : E G T t G t =E (ϵ t +µ t ) T (ϵ t +µ t ) (G.9) =E ϵ T t ϵ t +E µ T t µ t (G.10) ≤ dσ 2 q +L 2 . (G.11) Among all random variablesV ∈R d withE[V T V]≤ C, the Gaussian distributionN 0, C d I d has the largest entropy, given by d 2 log 2πeC d . Therefore, H(G t )≤ d 2 log 2πe (dσ 2 q +L 2 ) d ! . (G.12) With this we can upper bound theI(G t ;Y |X,G <t ) as follows: I(G t ;Y |X,G <t )=H(G t |X,G <t )− H(G t |X,Y,G <t ) (G.13) =H(G t |X,G <t )− H(ϵ t ) (G.14) ≤ d 2 log 2πe (dσ 2 q +L 2 ) d ! − d 2 log 2πeσ 2 q (G.15) = d 2 log 1+ L 2 dσ 2 q . (G.16) Note that the proof will work for arbitraryϵ t that has zero mean and independent components, where the L2 norm of each component is bounded byσ 2 q . This holds because in such casesH(ϵ t )≤ d 2 log(2πeσ 2 q ) (as Gaussians have highest entropy for fixed L2 norm) and the transition of Eq. (G.15) remains correct. Therefore, the same result holds when ϵ t is sampled from a product of univariate zero-mean Laplace distributions with scale parameterσ q / √ 2 (which makes the second moment equal toσ 2 q ). A similar result has been derived by Pensia et al. [2018] (lemma 5) to boundI(W t ;(X t ,Y t )|W t− 1 ). 97 G.3 ProofofProposition3.2.1 Proof. Note that by definition ∀S = s,i∈ [n], Q W|S=s = P W|S=s ≪ P W|S − i =s − i . By the assumption, we also have that∀S =s,i∈[n], P W|S − i =s − i ≪ Q W|S=s − i . By transitivity,∀S =s,i∈[n], P W|S=s ≪ Q W|S=s − i . KL P W|S ∥Q W|S − i − KL P W|S − i ∥Q W|S − i =E P Z i E P W|S log dP W|S dQ W|S − i − E P W|S − i log dP W|S − i dQ W|S − i (G.17) =E P Z i E P W|S log dP W|S dQ W|S − i − E P Z i E P W|S log dP W|S − i dQ W|S − i (G.18) =E P Z i E P W|S log dP W|S dQ W|S − i − log dP W|S − i dQ W|S − i (G.19) =E P Z i E P W|S log dP W|S dP W|S − i (G.20) =KL P W|S ∥P W|S − i . (G.21) As KL-divergence is non-negative, we get thatKL P W|S ∥Q W|S − i ≥ KL P W|S ∥P W|S − i . G.4 ProofofProposition3.4.1 Proof. AssumingΛ( w) is approximately constant aroundw ∗ , the steady-state distributions of Eq. (3.16) is a Gaussian distribution with meanw ∗ and covarianceΣ such that: HΣ+Σ H T = η b Λ( w ∗ ), (G.22) where H = (∇ w f 0 (x)∇ w f 0 (x) T +λI ) is the Hessian of the loss function [Mandt et al., 2017]. 
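For concreteness, Eq. (G.22) is a continuous Lyapunov equation in Σ and can be solved numerically once H and Λ(w*) are available. The short sketch below is only an illustration of this step and is not used anywhere in the proof; it assumes H and Λ(w*) are given as dense matrices and relies on scipy's continuous Lyapunov solver.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def sgd_steady_state_covariance(H, Lam, eta, b):
    # solve_continuous_lyapunov solves A X + X A^H = Q, so taking A = H and
    # Q = (eta / b) * Lam yields Sigma with H Sigma + Sigma H^T = (eta / b) * Lam.
    return solve_continuous_lyapunov(H, (eta / b) * Lam)

# Toy stand-ins: a positive definite H (playing the role of
# grad f_0(x) grad f_0(x)^T + lambda I) and a low-rank gradient covariance Lam.
rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
H = A @ A.T + 0.1 * np.eye(d)
G = rng.normal(size=(d, 3))
Lam = G @ G.T
Sigma = sgd_steady_state_covariance(H, Lam, eta=0.01, b=32)
print(np.allclose(H @ Sigma + Sigma @ H.T, (0.01 / 32) * Lam))  # True

Returning to the argument, the claim is that N(·; w*, Σ), with Σ solving Eq. (G.22), is the steady-state distribution of Eq. (3.16).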
This can be verified by checking that the distribution N(·;w ∗ ,Σ) satisfies the Fokker-Planck equation (see Appendix H.2). HavingQ SGD W|S=s given by the Gaussian densityN(w;w ∗ ,Σ) andQ SGD W|S=s − i by the Gaussian densityN(w;w ∗ − i ,Σ − i ), we have that SI(z i ,Q SGD )= 1 2 (w ∗ − w ∗ − i ) T Σ − 1 − i (w ∗ − w ∗ − i )+tr(Σ − 1 − i Σ)+log |Σ − i Σ − 1 |− d . (G.23) By the assumption that SGD steady-state covariance stays constant after removing an example, i.e.Σ − i =Σ , equation Eq. (G.23) simplifies to: KL Q SGD W|S=s ∥Q SGD W|S=s − i = 1 2 (w ∗ − w ∗ − i ) T Σ − 1 (w ∗ − w ∗ − i ). (G.24) By the definition of the Q ERM algorithm and smooth sample information, this is equal toSI Σ (z i ,Q ERM ). G.5 ProofofLemma4.2.1 The proof of Lemma 4.2.1 uses the Donsker-Varadhan inequality and a simple result on the moment generating function of a square of a Gaussian random variable. 98 Fact G.1 (Donsker-Varadhan inequality, Thm. 5.2.1 of Gray [2011]). Let P and Q be two probability measures defined on the same measurable space (Ω ,F), such thatP is absolutely continuous with respect toQ. Then the Donsker-Varadhan dual characterization of Kullback-Leibler divergence states that KL(P∥Q)=sup f Z Ω fdP − log Z Ω e f dQ , (G.25) wheref :Ω →R is a measurable function, such that both integrals above exist. LemmaG.2. IfX is aσ -subgaussian random variable with zero mean, then Ee λX 2 ≤ 1+8λσ 2 , ∀λ ∈ 0, 1 4σ 2 . (G.26) Proof of Lemma G.2. As X is σ -subgaussian andEX = 0, the k-th moment of X can be bounded the following way [Vershynin, 2018]: E|X| k ≤ (2σ 2 ) k/2 kΓ( k/2), ∀k∈N, (G.27) whereΓ( ·) is the Gamma function. Continuing, Ee λX 2 =E " ∞ X k=0 (λX 2 ) k k! # (G.28) =1+ ∞ X k=1 E ( √ λ |X|) 2k k! ! (by Fubini’s theorem) (G.29) ≤ 1+ ∞ X k=1 (2λσ 2 ) k · 2k· Γ( k) k! (by Eq. (G.27)) (G.30) =1+2 ∞ X k=1 (2λσ 2 ) k . (G.31) Whenλ ≤ 1/(4σ 2 ), the infinite sum of Eq. (G.31) converges to a value that is at most twice of the first element of the sum. Therefore Ee λX 2 ≤ 1+8λσ 2 , ∀λ ∈ 0, 1 4σ 2 . (G.32) Proof of Lemma 4.2.1. We use the Donsker-Varadhan inequality forI(Φ;Ψ) andλg (ϕ,ψ ), whereλ ∈R is any constant: I(Φ;Ψ)=KL( P Φ ,Ψ ∥P Φ ⊗ P Ψ ) (by definition) (G.33) ≥ E P Φ ,Ψ [λg (Φ ,Ψ)] − logE P Φ ⊗ P Ψ h e λg (Φ ,Ψ) i (by Fact G.1). (G.34) The subgaussianity ofg(Φ ,Ψ) underP Φ ⊗ P Ψ implies that logE P Φ ⊗ P Ψ [exp{λ (g(Φ ,Ψ) − E P Φ ⊗ P Ψ [g(Φ ,Ψ)]) }]≤ λ 2 σ 2 2 , ∀λ ∈R. (G.35) 99 Plugging this into Eq. (G.34), we get that I(Φ;Ψ) ≥ λ E P Φ ,Ψ [g(Φ ,Ψ)] − E P Φ ⊗ P Ψ [g(Φ ,Ψ)] − λ 2 σ 2 2 . (G.36) Pickingλ to maximize the right-hand side, we get that I(Φ;Ψ) ≥ 1 2σ 2 E P Φ ,Ψ [g(Φ ,Ψ)] − E P Φ ⊗ P Ψ [g(Φ ,Ψ)] 2 , (G.37) which proves the first part of the lemma. To prove the second part of the lemma, we are going to use Donsker-Varadhan inequality again, but for a different function. Let λ ∈ 0, 1 4σ 2 and define ˜ g(ϕ,ψ )≜λ g(ϕ,ψ )− E Ψ ′ ∼ P Ψ g(ϕ, Ψ ′ ) 2 . (G.38) By assumptionE P Φ ,Ψ [˜ g(Φ ,Ψ)] exists. Note that for each fixed ϕ , the random variableg(ϕ, Ψ) − E Ψ g(ϕ, Ψ) has zero mean and is σ -subgaussian, by the additional assumptions of the second part of the lemma. As a result, E P Φ ⊗ P Ψ exp(˜ g(Φ ,Ψ)) = E P Φ [E P Ψ exp(˜ g(Φ ,Ψ))] also exists (by Lemma G.2). 
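As a quick numerical side check of Lemma G.2 (an illustration only, not part of the argument), the Gaussian case can be verified in closed form: if X ~ N(0, σ^2), which is σ-subgaussian, then E exp(λX^2) = (1 - 2λσ^2)^{-1/2}, and this stays below 1 + 8λσ^2 on the whole interval λ ∈ [0, 1/(4σ^2)].

import numpy as np

sigma = 1.3
lams = np.linspace(0.0, 1.0 / (4 * sigma**2), 50)
exact_mgf = (1.0 - 2.0 * lams * sigma**2) ** -0.5   # E exp(lam X^2) for X ~ N(0, sigma^2)
bound = 1.0 + 8.0 * lams * sigma**2                  # the bound of Lemma G.2
print(np.all(exact_mgf <= bound + 1e-12))            # True

With the existence of these exponential moments in hand, the argument continues.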
Therefore, Donsker-Varadhan is applicable for˜ g and gives the following: I(Φ;Ψ) ≥ E P Φ ,Ψ [˜ g(Φ ,Ψ)] − logE P Φ ⊗ P Ψ [exp{˜ g(Φ ,Ψ) }] (G.39) =λ E P Φ ,Ψ h g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) 2 i − logE P Φ ⊗ P Ψ exp n λ g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) 2 o (G.40) ≥ λ E P Φ ,Ψ h g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) 2 i − log 1+8λσ 2 (by Lemma G.2). (G.41) Pickingλ →1/(4σ 2 ), we get I(Φ;Ψ) ≥ 1 4σ 2 E P Φ ,Ψ h g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) 2 i − log3, (G.42) which proves the desired inequality. To prove the last part of the lemma, we just use the Markov’s inequality and combine with this last result: P Φ ,Ψ g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) ≥ ϵ =P Φ ,Ψ g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ ) 2 ≥ ϵ 2 (G.43) ≤ E P Φ ,Ψ (g(Φ ,Ψ) − E Ψ ′ ∼ P Ψ g(Φ ,Ψ ′ )) 2 ϵ 2 (G.44) ≤ 4σ 2 (I(Φ;Ψ)+log3) ϵ 2 . (G.45) G.6 ProofofTheorem4.2.3 Before proceeding to the proof of Theorem 4.2.3, we prove a simple lemma that will be used also in the proofs of Theorem 4.2.8 and Theorem 4.3.1. LemmaG.3. LetX andY be independent random variables. Ifg is a measurable function such thatg(x,Y) isσ -subgaussian andEg(x,Y)=0 for allx∈X , theng(X,Y) is alsoσ -subgaussian. 100 Proof of Lemma G.3. AsEg(X,Y)=0, we have that E X,Y exp{t(g(X,Y)− E X,Y g(X,Y))}=E X,Y exp{tg(X,Y)} (G.46) =E X [E Y exp{tg(X,Y)}] (by independence ofX andY ) (G.47) ≤ E X e t 2 σ 2 (by subgaussianity ofg(x,Y)) (G.48) =e t 2 σ 2 . (G.49) Proof of Theorem 4.2.3. Let us fix a value of U. Conditioning onU =u keeps the distribution ofW andS intact, asU is independent ofW andS. Let us setΦ= W,Ψ= J u , and g(w,J u )= 1 m m X i=1 ℓ(w,s u i )− E Z ′ ∼ P Z ℓ(w,Z ′ ) . (G.50) Note that for each value ofw, the random variableg(w,J u ) is σ √ m -subgaussian, as it is a sum ofm i.i.d. σ -subgaussian random variables. Furthermore,∀w, Eg(w,J u )=0. These two statements together and Lemma G.3 imply thatg(W,J u ) is also σ √ m -subgaussian underP W ⊗ P Ju . Therefore, with these choices of Φ ,Ψ , andg, Lemma 4.2.1 gives that E P S,W " 1 m m X i=1 ℓ(W,S u i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) # ≤ r 2σ 2 m I(W;J u ). (G.51) Taking expectation overu on both sides, then swapping the order between expectation overu and absolute value (using Jensen’s inequality), we get E P S,W,U " 1 m m X i=1 ℓ(W,S u i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) # ≤ E P U r 2σ 2 m I U (W;J u ). (G.52) This proves the first part of the theorem as the left-hand side is equal to the absolute value of the expected generalization gap, E P S,W [R(W)− r S (W)] . The second part of Lemma 4.2.1 gives that E P S,W,U 1 m m X i=1 ℓ(W,S u i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 ≤ 4σ 2 m (I(W;J u )+log3). (G.53) Whenu=[n], Eq. (G.53) becomes E P S,W (R(W)− r S (W)) 2 ≤ 4σ 2 n (I(W;S)+log3), (G.54) proving the second part of the theorem. RemarkG.1. Note that in case ofm=1 andu={i}, Eq. (G.53) becomes E P S,W ℓ(W,Z i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) 2 ≤ 4σ 2 (I(W;Z i )+log3). (G.55) 101 Unfortunately, this result is not useful for boundingE P W,S (R(W)− r S (W)) 2 , as for largen thelog3 term will likely dominate overI(W;Z i ). G.7 ProofofProposition4.2.4 Before we prove Proposition 4.2.4, we establish two useful lemmas that will be helpful also in the proofs of Proposition 4.2.9, Proposition 4.3.4, Theorem 4.2.5, Theorem 4.2.10 and Theorem 4.3.5. LemmaG.4. LetΨ=(Ψ 1 ,...,Ψ n ) be a collection ofn independent random variables andΦ be another random variable defined on the same probability space. Then ∀i∈[n], I(Φ;Ψ i )≤ I(Φ;Ψ i |Ψ − i ), (G.56) and I(Φ;Ψ) ≤ n X i=1 I(Φ;Ψ i |Ψ − i ). (G.57) Proof of Lemma G.4. 
First, for alli∈[n], I(Φ;Ψ i |Ψ − i )=I(Φ;Ψ i )− I(Ψ i ;Ψ − i )+I(Ψ i ;Ψ − i |Φ) (chain rule of MI) (G.58) =I(Φ;Ψ i )+I(Ψ i ;Ψ − i |Φ) (Ψ i ⊥ ⊥Ψ − i ) (G.59) ≥ I(Φ;Ψ i ) (nonnegativity of MI). (G.60) Second, I(Φ;Ψ)= n X i=1 I(Φ;Ψ i |Ψ i )+I(Ψ i ;Ψ >i |Ψ i |Ψ i )− I(Ψ i ;Ψ >i |Ψ i ) (G.64) = n X i=1 I(Φ;Ψ i |Ψ − i ). (G.65) The first two equalities above use the chain rule of mutual information, while the third one uses the independence of Ψ 1 ,...,Ψ n . The inequality of the fourth line relies on the nonnegativity of mutual information. The quantity P n i=1 I(Φ;Ψ i |Ψ − i ) is also known as erasure information [Verdu and Weissman, 2008]. LemmaG.5. LetS =(Z 1 ,...,Z n ) be a collection ofn independent random variables andΦ be an arbitrary random variable defined on the same probability space. Then for any subset u ′ ⊆ { 1,2,...,n} of size m+1 the following holds: I(Φ; J u ′)≥ 1 m X k∈u ′ I(Φ; S u ′ \{k} ). (G.66) 102 Proof of Lemma G.5. (m+1)I(Φ; J u ′)= X k∈u ′ I(Φ; S u ′ \{k} )+ X k∈u ′ I(Φ; Z k |S u ′ \{k} ) (chain-rule of MI) (G.67) ≥ X k∈u ′ I(Φ; S u ′ \{k} )+I(Φ; J u ′) (second part of Lemma G.4). (G.68) Proof of Proposition 4.2.4. By Lemma G.5 withΦ= W , for any subsetu ′ of sizem+1 the following holds: I(W;J u ′)≥ 1 m X k∈u ′ I(W;S u ′ \{k} ). (G.69) Therefore, ϕ 1 m+1 I(W;J u ′) ≥ ϕ 1 m(m+1) X k∈u ′ I(W;S u ′ \{k} ) ! (G.70) ≥ 1 m+1 X k∈u ′ ϕ 1 m I(W;S u ′ \{k} ) . (by Jensen’s inequality) (G.71) Taking expectation overu ′ on both sides, we have that E P U ′ ϕ 1 m+1 I U ′ (W;J u ′) ≥ E P U ′ " 1 m+1 X k∈U ′ ϕ 1 m I U ′ (W;S U ′ \{k} ) # (G.72) = X u α u ϕ 1 m I(W;J u ) . (G.73) For each subsetu of sizem, the coefficient α u is equal to α u = 1 n m+1 · 1 m+1 · (n− m)= 1 n m . (G.74) Therefore X u α u ϕ 1 m I(W;J u ) =E P U ϕ 1 m+1 I U (W;J u ) . (G.75) G.8 ProofofTheorem4.2.5 Proof. Using Lemma G.4 withΦ= W andΨ= S, we get that I(W;Z i )≤ I(W;Z i |Z − i ) and I(W;S)≤ X i=1 I(W;Z i |Z − i ). (G.76) Plugging these upper bounds into Theorem 4.2.3 completes the proof. 103 G.9 ProofofTheorem4.2.8 Proof. Let us consider the joint distribution of W and J given ˜ Z = ˜ z and U = u: P W,J| ˜ Z=˜ z,U=u . Let Φ= W ,Ψ= J u , and g(ϕ,ψ )= 1 m m X i=1 ℓ(ϕ, (˜ z u ) i,ψ i )− ℓ(ϕ, (˜ z u ) i, ¯ ψ i ) . (G.77) Note that by our assumption, for any w ∈ W each summand of g(w,J u ) ∈ [− 1,+1], hence is a 1- subgaussian random variable underP Ju| ˜ Z=˜ z,U=u =P Ju . Furthermore, each of these summands has zero mean. As the average ofm independent and zero-mean1-subgaussian variables is 1 √ m -subgaussian, then g(w,J u ) is 1 √ m -subgaussian for each w ∈ W. Additionally,∀w ∈ W, Eg(w,J u ) = 0. Therefore, by Lemma G.3,g(W,J u ) is 1 √ m -subgaussian underP W| ˜ Z=˜ z,U=u ⊗ P Ju| ˜ Z=˜ z,U=u . With these choices ofΦ ,Ψ , andg(ϕ,ψ ), we use Lemma 4.2.1. First, E P W,Ju| ˜ Z=˜ z,U=u [g(W,J u )]=E P W,Ju| ˜ Z=˜ z,U=u " 1 m m X i=1 ℓ(W,(˜ z u ) i,(Ju) i )− ℓ(W,(˜ z u ) i,( ¯ Ju) i ) # . (G.78) Second, E P W| ˜ Z=˜ z,U=u ⊗ P Ju| ˜ Z=˜ z,U=u [g(W,J u )]=0. (G.79) Therefore, Lemma 4.2.1 gives E P W,Ju| ˜ Z=˜ z,U=u " 1 m m X i=1 ℓ(W,(˜ z u ) i,(Ju) i )− ℓ(W,(˜ z u ) i,( ¯ Ju) i ) # ≤ r 2 m I ˜ Z=˜ z,U=u (W;J U ). (G.80) Taking expectation overu on both sides and using Jensen’s inequality to switch the order of absolute value and expectation ofu, we get E P U E P W,J| ˜ Z=˜ z,U " 1 m m X i=1 ℓ(W,(˜ z U ) i,(J U ) i )− ℓ(W,(˜ z U ) i,( ¯ J U ) i ) # ≤ E P U r 2 m I ˜ Z=˜ z,U (W;J U ), (G.81) which reduces to E P W,J| ˜ Z=˜ z " 1 n n X i=1 ℓ(W,˜ z i,J i )− ℓ(W,˜ z i, ¯ J i ) # ≤ E P U r 2 m I ˜ Z=˜ z,U (W;J U ). 
(G.82) This can be seen as bounding the expected generalization gap for a fixed ˜ z. Taking expectation over ˜ z on both sides, and then using Jensen’s inequality to switch the order of absolute value and expectation of ˜ z, we get E P ˜ Z,W,J " 1 n n X i=1 ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) # ≤ E ˜ Z,U r 2 m I ˜ Z,U (W;J U ). (G.83) Finally, noticing that left-hand side is equal to the absolute value of the expected generalization gap, E P S,W [R(W)− r S (W)] , completes the proof of the first part of this theorem. 104 Whenu=[n], applying the second part of Lemma 4.2.1 gives E P W,J| ˜ Z=˜ z 1 n n X i=1 ℓ(W,˜ z i,J i )− ℓ(W,˜ z i, ¯ J i ) ! 2 ≤ 4 n (I ˜ Z=˜ z (W;J)+log3). (G.84) Taking expectation over ˜ z, we get E P ˜ Z,W,J 1 n n X i=1 ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) ! 2 | {z } B ≤ 4 n E P ˜ Z (I ˜ Z (W;J)+log3). (G.85) Continuing, E P W,S (R(W)− r S (W)) 2 =E P ˜ Z,W,J 1 n n X i=1 ℓ(W, ˜ Z i,J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.86) ≤ 2B+2E P ˜ Z,W,J 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.87) =2B+2E P ˜ Z,W,J 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.88) =2B+2E P ˜ Z ¯ J ,W 1 n n X i=1 ℓ(W,( ˜ Z¯ J ) i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.89) =2B+2E P W E P ˜ Z ¯ J |W 1 n n X i=1 ℓ(W,( ˜ Z¯ J ) i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 . (G.90) Note that as ˜ Z¯ J is independent ofW , conditioning onW does not change its distribution, implying that its components stay independent of each other. For each fixed value W =w the inner part of the outer expectation in Eq. (G.90) becomes E P ˜ Z ¯ J 1 n n X i=1 w,( ˜ Z¯ J ) i )− E Z ′ ∼ P Z ℓ(w,Z ′ ) ! 2 , (G.91) which is equal to E P Z ′ 1 ,Z ′ 2 ,...,Z ′ n 1 n n X i=1 ℓ(w,Z ′ i )− E Z ′ ∼ P Z ℓ(w,Z ′ ) ! 2 , (G.92) 105 whereZ ′ 1 ,...,Z ′ n aren i.i.d. samples fromP Z . The expression in Eq. (G.92) is simply the variance of the average ofn i.i.d[0,1]-bounded random variables. Hence, it can be bounded by1/(4n). Connecting this result with Eq. (G.90), we get E P S,W (R(W)− r S (W)) 2 ≤ 2B+ 1 2n (G.93) ≤ 2E P ˜ Z 4 n (I ˜ Z (W;J)+log3) + 1 2n (G.94) ≤ E P ˜ Z 8 n (I ˜ Z (W;J)+2) . (G.95) G.10 ProofofProposition4.2.9 The proof follows that of Proposition 4.2.4 but with conditioning on ˜ Z. G.11 ProofofTheorem4.2.10 Proof. For a fixed ˜ z, using Lemma G.4 we get that I ˜ Z=˜ z (W;J i )≤ I ˜ Z=˜ z (W;J i |J − i ), (G.96) and I ˜ Z=˜ z (W;J)≤ n X i=1 I ˜ Z=˜ z (W;J i |J − i ). (G.97) Using these upper bounds in Theorem 4.2.8 proves the theorem. G.12 ProofofTheorem4.3.1 Proof. First, with a slight abuse of notation, we will use ℓ(b y,y) to denote the average loss between a collection of predictionsb y and collection of labelsy. Let us consider the joint distribution of ˜ F u andJ u given ˜ Z = ˜ z andU =u. We are going to use Lemma 4.2.1 forP ˜ Fu,Ju| ˜ Z=˜ z,U=u withΦ= ˜ F u ,Ψ= J u , g(ϕ,ψ )=ℓ(ϕ ψ ,(˜ y u ) ψ )− ℓ(ϕ ¯ ψ ,(˜ y u )¯ ψ ) (G.98) = 1 m m X i=1 ℓ(ϕ i,ψ i ,(˜ y u ) i,ψ i )− ℓ(ϕ i, ¯ ψ i ,(˜ y u ) i, ¯ ψ i ) ! . (G.99) The function g(ϕ,ψ ) computes the generalization gap measured on pairs of the examples specified by subset u, assuming that predictions are given by ϕ and the training/test set split is given by ψ . Note that by our assumption, for anyϕ each summand ofg(ϕ,J u ) is a1-subgaussian random variable under P ˜ Fu,Ju| ˜ Z=˜ z,U=u . Furthermore, each of these summands has zero mean. As the average ofm independent and zero-mean 1-subgaussian variables is 1 √ m -subgaussian, then g(ϕ,J u ) is 1 √ m -subgaussian for each possibleϕ . Additionally,∀ϕ ∈ K m× 2 , E Ju g(ϕ,J u ) = 0. 
By Lemma G.3, g( ˜ F u ,J u ) is 1 √ m -subgaussian 106 underP ˜ Fu| ˜ Z=˜ z,U=u ⊗ P Ju| ˜ Z=˜ z,U=u . Hence, these choices ofΦ ,Ψ , andg(ϕ,ψ ) satisfy the assumptions of Lemma 4.2.1. We have that E ˜ Fu,Ju| ˜ Z=˜ z,U=u g( ˜ F u ,J u )=E ˜ Fu,Ju| ˜ Z=˜ z,U=u h ℓ(( ˜ F u ) Ju ,(˜ y u ) Ju )− ℓ(( ˜ F u )¯ Ju ,(˜ y u )¯ Ju i , (G.100) and E P ˜ Fu| ˜ Z=˜ z,U=u ⊗ P Ju| ˜ Z=˜ z,U=u h g( ˜ F u ,J u ) i =0. (G.101) Therefore, the first part of Lemma 4.2.1 gives E ˜ Fu,Ju| ˜ Z=˜ z,U=u h ℓ(( ˜ F u ) Ju ,(˜ y u ) Ju )− ℓ(( ˜ F u )¯ Ju ,(˜ y u )¯ Ju i ≤ r 2 m I ˜ Z=˜ z ( ˜ F u ;J u ). (G.102) Taking expectation overU on both sides, and then using Jensen’s inequality to swap the order of absolute value and expectation ofU, we get E U E U, ˜ F U ,J U | ˜ Z=˜ z,U h ℓ(( ˜ F U ) J U ,(˜ y U ) J U )− ℓ(( ˜ F U )¯ J U ,(˜ y U )¯ J U i ≤ E U r 2 m I ˜ Z=˜ z,U ( ˜ F U ;J U ). (G.103) which reduces to E ˜ F,J| ˜ Z=˜ z h ℓ( ˜ F J ,˜ y J )− ℓ( ˜ F¯ J ,˜ y¯ J ) i ≤ E U r 2 m I ˜ Z=˜ z,U ( ˜ F U ;J U ). (G.104) This can be seen as bounding the expected generalization gap for a fixed ˜ z. Taking expectation over ˜ Z on both sides and using Jensen’s inequality to switch the order of absolute value and expectation of ˜ Z, we get E ˜ Z,J, ˜ F h ℓ( ˜ F J , ˜ Y J )− ℓ( ˜ F¯ J , ˜ Y¯ J ) i ≤ E ˜ Z,U r 2 m I ˜ Z,U ( ˜ F U ;J U ). (G.105) Noticing that the left-hand side is equal to|E S,R [R(f)− r S (f)]| completes the proof of the first part of the theorem. Whenu=[n], applying the second part of Lemma 4.2.1 gives E ˜ F,J| ˜ Z=˜ z ℓ( ˜ F J ,˜ y J )− ℓ( ˜ F¯ J ,˜ y¯ J ) 2 ≤ 4 n I ˜ Z=˜ z ( ˜ F;J)+log3 . (G.106) Taking expectation over ˜ Z, we get E ˜ Z,J, ˜ F ℓ( ˜ F J , ˜ Y J )− ℓ( ˜ F¯ J , ˜ Y¯ J ) 2 | {z } B ≤ E ˜ Z 4 n I ˜ Z ( ˜ F;J)+log3 . (G.107) 107 Continuing, E S,R (R(f)− r S (f)) 2 =E ˜ Z,J,R ℓ( ˜ F J , ˜ Y J )− E Z ′ ∼ P Z ℓ(f( ˜ Z J ,X ′ ,R),Y ′ ) 2 (G.108) ≤ 2B+2E ˜ Z,J,R ℓ( ˜ F¯ J , ˜ Y¯ J )− E Z ′ ∼ P Z ℓ(f( ˜ Z J ,X ′ ,R),Y ′ ) 2 (G.109) =2B+2E ˜ Z,J,R 1 n n X i=1 ℓ( ˜ F i, ¯ J i , ˜ Y i, ¯ J i )− E Z ′ ∼ P Z ℓ(f( ˜ Z J ,X ′ ,R),Y ′ ) ! 2 (G.110) =2B+2E ˜ Z J ,R, ˜ Z¯ J 1 n n X i=1 ℓ(f( ˜ Z J , ˜ X¯ J ,R) i ,( ˜ Y¯ J ) i )− E Z ′ ∼ P Z ℓ(f( ˜ Z J ,X ′ ,R),Y ′ ) ! 2 (G.111) =2B+2E ˜ Z J ,R E ˜ Z¯ J | ˜ Z J ,R 1 n n X i=1 ℓ(f( ˜ Z J , ˜ X¯ J ,R) i ,( ˜ Y¯ J ) i )− E Z ′ ∼ P Z ℓ(f( ˜ Z J ,X ′ ,R),Y ′ ) ! 2 . (G.112) Note that as ˜ Z¯ J is independent of( ˜ Z J ,R), conditioning on( ˜ Z J ,R) does not change its distribution, implying that its components stay independent of each other. For each fixed values ˜ Z J =s andR=r, the inner part of the expectation in Eq. (G.112) becomes E ˜ Z¯ J 1 n n X i=1 ℓ(f(s, ˜ X¯ J ,r) i ,( ˜ Y¯ J ) i )− E Z ′ ∼ P Z ℓ(f(s,X ′ ,r),Y ′ ) ! 2 , (G.113) which is equal to E Z ′ 1 ,Z ′ 2 ,...,Z ′ n 1 n n X i=1 ℓ(f(s,X ′ i ,Y ′ i )− E Z ′ i ℓ(f(s,X ′ i ,r),Y ′ i ) ! 2 , (G.114) whereZ ′ 1 ,...,Z ′ n aren i.i.d. samples fromP Z . The expression in Eq. (G.114) is simply the variance of the average ofn i.i.d[0,1]-bounded random variables. Hence, it can be bounded by1/(4n). Connecting this result with Eq. (G.112), we get E S,R (R(f)− r S (f)) 2 ≤ 2B+ 1 2n (G.115) ≤ 2E ˜ Z 4 n I ˜ Z ( ˜ F;J)+log3 + 1 2n (G.116) ≤ E ˜ Z 8 n I ˜ Z ( ˜ F;J)+2 . (G.117) G.13 ProofofProposition4.3.4 Proof. The proof closely follows that of Proposition 4.2.9. The only important difference is that ˜ F u depends onu, whileW does not. 108 Let us fix a value of ˜ Z and consider the conditional joint distributionP ˜ F,J| ˜ Z . 
If we fix a subset u ′ of sizem+1, setΦ= ˜ F u ′, and use Lemma G.5 underP ˜ F,J| ˜ Z , we get I ˜ Z ( ˜ F u ′;J u ′)≥ 1 m X k∈u ′ I ˜ Z ˜ F u ′;J u ′ \{k} (G.118) ≥ 1 m X k∈u ′ I ˜ Z ˜ F u ′ \{k} ;J u ′ \{k} , (G.119) as removing predictions on pairk can not increase the mutual information. Therefore, ϕ 1 m+1 I ˜ Z ( ˜ F u ′;J u ′) ≥ ϕ 1 m(m+1) X k∈u ′ I ˜ Z ( ˜ F u ′ \{k} ;J u ′ \{k} ) ! (G.120) ≥ 1 m+1 X k∈u ′ ϕ 1 m I ˜ Z ( ˜ F u ′ \{k} ;J u ′ \{k} ) . (by Jensen’s inequality) (G.121) Taking expectation overU ′ on both sides, we have E U ′ϕ 1 m+1 I ˜ Z,U ′ ˜ F U ′;J U ′ ≥ E U ′ " 1 m+1 X k∈U ′ ϕ 1 m I ˜ Z,U ′ ( ˜ F U ′ \{k} ;J U ′ \{k} ) # (G.122) = X u α u ϕ 1 m I ˜ Z ( ˜ F u ;J u ) . (G.123) For each subsetu of sizem, the coefficient α u is equal to α u = 1 n m+1 · 1 m+1 · (n− m)= 1 n m . (G.124) Therefore X u α u ϕ 1 m I ˜ Z ( ˜ F u ;J u ) =E U ϕ 1 m I ˜ Z,U ( ˜ F U ;J U ) . (G.125) G.14 ProofofTheorem4.3.5 Proof. Let us fix ˜ Z = ˜ z. SettingΦ= ˜ F i ,Ψ= J, and using the first part of Lemma G.4 under P ˜ F,J| ˜ Z=˜ z , we get that I ˜ Z=˜ z ( ˜ F i ;J i )≤ I ˜ Z=˜ z ( ˜ F i ;J i |J − i ). (G.126) Next, settingΦ= ˜ F ,Ψ= J, and using the second part of Lemma G.4 underP ˜ F,J| ˜ Z=˜ z , we get that I ˜ Z=˜ z ( ˜ F;J)≤ n X i=1 I ˜ Z=˜ z ( ˜ F;J i |J − i ). (G.127) Using these upper bounds in Theorem 4.3.1 proves this theorem. 109 G.15 ProofofTheorem4.4.1 Proof. Letk denote the number of distinct values ˜ F can take by varyingJ andR for a fixed ˜ Z = ˜ z. Clearly, k is not more than the growth function ofH evaluated at2n. Applying the Sauer-Shelah lemma [Sauer, 1972, Shelah, 1972], we get that k≤ d X i=0 2n i . (G.128) The Sauer-Shelah lemma also states that if2n>d+1 then d X i=0 2n i ≤ 2en d d . (G.129) If2n≤ d+1, one can upper boundk by2 2n ≤ 2 d+1 . Therefore k≤ max ( 2 d+1 , 2en d d ) . (G.130) Finally, as a ˜ F is a discrete variable withk states, f-CMI(f,˜ z)≤ H( ˜ F | ˜ Z = ˜ z)≤ log(k). (G.131) 110 G.16 ProofofProposition4.4.2 Proof. The proof below uses the independence ofJ 1 ,...,J n and the convexity of KL divergence, once for the first and once for the second argument. I(f(˜ z J ,x ′ ,R);J i |J − i )=KL f(˜ z J ,x ′ ,R)|J∥f(˜ z J ,x ′ ,R)|J − i (G.132) = 1 2 KL f(˜ z J i← 0,x ′ ,R)|J − i ∥f( ˜ Z J ,x ′ ,R)|J − i + 1 2 KL f(˜ z J i← 1,x ′ ,R)|J − i ∥f( ˜ Z J ,x ′ ,R)|J − i (G.133) = 1 2 KL f(˜ z J i← 0,x ′ ,R)|J − i ∥ 1 2 f(˜ z J i← 0,x ′ ,R)+f(˜ z J i← 1,x ′ ,R) |J − i + 1 2 KL f(˜ z J i← 1,x ′ ,R)|J − i ∥ 1 2 f(˜ z J i← 0,x ′ ,R)+f(˜ z J i← 1,x ′ ,R) |J − i (G.134) ≤ 1 4 KL f(˜ z J i← 0,x ′ ,R)|J − i ∥f(˜ z J i← 0,x ′ ,R)|J − i + 1 4 KL f(˜ z J i← 0,x ′ ,R)|J − i ∥f(˜ z J i← 1,x ′ ,R)|J − i + 1 4 KL f(˜ z J i← 1,x ′ ,R)|J − i ∥f(˜ z J i← 0,x ′ ,R)|J − i + 1 4 KL f(˜ z J i← 1,x ′ ,R)|J − i ∥f(˜ z J i← 1,x ′ ,R)|J − i (G.135) = 1 4 KL f(˜ z J i← 0,x ′ ,R)|J − i ∥f(˜ z J i← 1,x ′ ,R)|J − i + 1 4 KL f(˜ z J i← 1,x ′ ,R)|J − i ∥f(˜ z J i← 0,x ′ ,R)|J − i . (G.136) G.17 ProofofTheorem4.4.3 Proof. Given a deterministic algorithm f, we consider the algorithm that adds Gaussian noise to the predictions off: f σ (s,x,R)=f(s,x)+ξ (s,x), (G.137) whereξ (s,x)∼N (0,σ 2 I k ). The functionf σ is constructed in a way that the noise terms are independent for each possible combination ofs andx. This can be achieved by viewingR as an infinite collection of independent Gaussian variables, one of which is selected for each possible combination ofs andx. 111 Let us consider the random subsample setting and letZ ′ ∼ P Z be a test example independent of ˜ Z,J and randomnessR. 
First we relate the generalization gap off σ to that off: |E S [R(f)− r S (f)]| = E ˜ Z,J,R,Z " ℓ(f σ ( ˜ Z J ,X ′ ,R),Y ′ )− 1 n n X i=1 ℓ(f σ ( ˜ Z J , ˜ X i,J i ,R), ˜ Y i,J i ) # (G.138) = E ˜ Z,J,R,Z " ℓ(f( ˜ Z J ,X ′ )+ξ ′ ( ˜ Z J ,X ′ ),Y ′ )− 1 n n X i=1 ℓ(f( ˜ Z J , ˜ X i,J i )+ξ ( ˜ Z J , ˜ X i,J i ), ˜ Y i,J i ) # (G.139) = E ˜ Z,J,R,Z " ℓ(f( ˜ Z J ,X ′ ),Y ′ )+∆ ′ − 1 n n X i=1 ℓ(f( ˜ Z J , ˜ X i,J i ), ˜ Y i,J i )+∆ i # , (G.140) where ∆ ′ =ℓ(f( ˜ Z J ,X ′ )+ξ ( ˜ Z J ,X ′ ),Y ′ ) | {z } ≜ξ ′ − ℓ(f( ˜ Z J ,X ′ ),Y ′ ), ∆ i =ℓ(f( ˜ Z J , ˜ X i,J i )+ξ ( ˜ Z J , ˜ X i,J i ) | {z } ≜ξ i , ˜ Y i,J i )− ℓ(f( ˜ Z J , ˜ X i,J i ), ˜ Y i,J i ). Asℓ(b y,y) isγ -Lipschitz in its first argument, |∆ ′ |≤ γ ∥ξ ′ ∥ and|∆ i |≤ γ ∥ξ i ∥. Connecting this to Eq. (G.140) we get |E S,R [R(f σ )− r S (f σ )]|≥| E S [R(f)− r S (f)]|− γ E ξ ′ − γ n n X i=1 E∥ξ i ∥ (G.141) =|E S [R(f)− r S (f)]|− 2 √ dγσ. (G.142) Similarly, we relate the expected squared generalization gap off σ to that off: E S,R (R(f σ )− r S (f σ )) 2 =E ˜ Z,J,R E Z ′ ∼ P Z h ℓ(f( ˜ Z J ,X ′ ),Y ′ )+∆ ′ i − 1 n n X i=1 ℓ(f( ˜ Z J , ˜ X i,J i ), ˜ Y i,J i )+∆ i ! 2 (G.143) =E S (R(f)− r S (f)) 2 +E ˜ Z,J,R E Z ′ ∼ P Z [∆ ′ ]− 1 n n X i=1 ∆ i ! 2 +2E ˜ Z,J,R " (R(f)− r S (f)) E Z ′ ∼ P Z [∆ ′ ]− 1 n n X i=1 ∆ i !# (G.144) ≥ E S (R(f)− r S (f)) 2 − 2E ˜ Z,J,R " |R(f)− r S (f)| E Z ′ ∼ P Z [∆ ′ ]− 1 n n X i=1 ∆ i # (G.145) =E S (R(f)− r S (f)) 2 − 2E ˜ Z,J " |R(f)− r S (f)|E R " E Z ′ ∼ P Z [∆ ′ ]− 1 n n X i=1 ∆ i ## . (G.146) 112 As E R " E Z ′ ∼ P Z [∆ ′ ]− 1 n n X i=1 ∆ i # ≤ E R E Z ′ ∼ P Z ∆ ′ + 1 n n X i=1 E R |∆ i | (G.147) ≤ E R E Z ′ ∼ P Z γ ξ ′ + 1 n n X i=1 E R [γ ∥ξ ∥] (G.148) =2γ √ dσ, (G.149) we can write Eq. (G.146) as E S,R (R(f σ )− r S (f σ )) 2 ≥ E S (R(f)− r S (f)) 2 − 4γ √ dσ E ˜ Z,J [|R(f)− r S (f)|] (G.150) ≥ E S (R(f)− r S (f)) 2 − 4γ √ dσ q E S (R(f)− r S (f)) 2 , (G.151) where the second line follows from Jensen’s inequality ((E|X|) 2 ≤ EX 2 ). Summarizing, Eq. (G.142) and Eq. (G.151) relate expected generalization gap and expected squared generalization gap off σ to those off. Boundingexpectedgeneralizationgapoff. |E S [R(f)− r S (f)]| ≤| E S,R [R(f σ )− r S (f σ )]|+2 √ dγσ (by Eq. (G.142)) (G.152) ≤ 1 n n X i=1 E ˜ Z q 2I ˜ Z (f σ ( ˜ Z J , ˜ X i ,R);J i |J − i )+2 √ dγσ (by Theorem 4.3.5) (G.153) ≤ 1 n n X i=1 E ˜ Z v u u u t 1 2 KL ˜ Z f σ ( ˜ Z J i← 1, ˜ X i ,R)|J − i ∥f σ ( ˜ Z J i← 0, ˜ X i ,R)|J − i + 1 2 KL ˜ Z f σ ( ˜ Z J i← 0, ˜ X i ,R)|J − i ∥f σ ( ˜ Z J i← 1, ˜ X i ,R)|J − i +2 √ dγσ (G.154) = 1 n n X i=1 E ˜ Z r 1 2σ 2 E J − i f( ˜ Z J i← 0, ˜ X i )− f( ˜ Z J i← 1, ˜ X i ) 2 2 +2 √ dγσ (G.155) ≤ 1 n n X i=1 r 1 2σ 2 E ˜ Z,J − i f( ˜ Z J i← 0, ˜ X i )− f( ˜ Z J i← 1, ˜ X i ) 2 2 +2 √ dγσ (G.156) ≤ r β 2 σ 2 +2 √ dγσ (byβ self-stability off). (G.157) Pickingσ 2 = β 2 √ dγ , we get |E S [R(f)− r S (f)]|≤ 2 3 2 d 1 4 p γβ. (G.158) 113 Boundingexpectedsquaredgeneralizationgapoff. For brevity below, letG≜E S (R(f)− r S (f)) 2 . Starting with Eq. (G.151), we get G≤ E S,R (R(f σ )− r S (f σ )) 2 +4γ √ dσ √ G (G.159) ≤ 8 n E ˜ Z " n X i=1 I ˜ Z (f σ ( ˜ Z J , ˜ X,R);J i |J − i ) # +2 ! +4γ √ dσ √ G (by Theorem 4.3.5) (G.160) ≤ 16 n + 8 n n X i=1 1 4 KL ˜ Z f σ ( ˜ Z J i← 1, ˜ X,R)|J − i ∥f σ ( ˜ Z J i← 0, ˜ X,R)|J − i + 1 4 KL ˜ Z f σ ( ˜ Z J i← 0, ˜ X,R)|J − i ∥f σ ( ˜ Z J i← 1, ˜ X,R)|J − i +4γ √ dσ √ G (G.161) = 16 n + 8 n n X i=1 E ˜ Z,J 1 4σ 2 f( ˜ Z J i← 0, ˜ X)− f( ˜ Z J i← 1, ˜ X) 2 2 +4γ √ dσ √ G (G.162) ≤ 16 n + 2 σ 2 2β 2 +nβ 2 1 +nβ 2 2 +4γ √ dσ √ G. 
(G.163) The optimalσ is given by σ = 2β 2 +nβ 2 1 +nβ 2 2 γ √ G √ d 1 3 , (G.164) and gives G≤ 16 n +6d 1 3 γ 2 3 2β 2 +nβ 2 1 +nβ 2 2 1 3 G 1 3 . (G.165) We discuss 2 cases. (i) 16 n ≥ 6d 1 3 γ 2 3 2β 2 +nβ 2 1 +nβ 2 2 1 3 G 1 3 . In this caseG≤ 32 n . (ii) 16 n <6d 1 3 γ 2 3 2β 2 +nβ 2 1 +nβ 2 2 1 3 G 1 3 . In this case, we have G≤ 12d 1 3 γ 2 3 2β 2 +nβ 2 1 +nβ 2 2 1 3 G 1 3 , (G.166) which simplifies to G≤ 12 3 2 √ dγ q 2β 2 +nβ 2 1 +nβ 2 2 . (G.167) Combining these cases we can write that G≤ max 32 n ,12 3 2 √ dγ q 2β 2 +nβ 2 1 +nβ 2 2 (G.168) ≤ 32 n +12 3 2 √ dγ q 2β 2 +nβ 2 1 +nβ 2 2 . (G.169) RemarkG.2. The bounds of this theorem work even whenY =K =[a,b] k instead ofR k . To see this, we first clip the noisy predictions to be in [a,b] k : f c σ (z,x) i ≜clip(f σ (z,x),a,b) i , ∀i∈[k]. (G.170) Inequalities Eq. (G.142) and Eq. (G.151) that relate the expected generalization gap and expected squared generalization gap off σ to those off stay true when replacingf σ withf c σ . Furthermore, by data processing 114 inequality, mutual informations measured withf c σ can always be upper bounded by the corresponding mutual informations informations measures withf σ . Therefore, generalization bounds that hold forf σ will also forf c σ , allowing us to follow the exact same proofs above. RemarkG.3. In the construction off σ we used Gaussian noise with zero mean andσ 2 I covariance matrix. A natural question arises whether a different type of noise would give better bounds. Inequalities Eq. (G.142) and Eq. (G.151) only use the facts that noise components are independent, have zero-mean andσ 2 variance. Therefore, if we restrict ourselves to noise distributions with independent components, each of which has zero mean andσ 2 variance, then the best bounds will be produced by noise distributions that result in the smallest KL divergence of formKL ˜ Z f σ ( ˜ Z J i← 1,x ′ ,R)|J − i ∥f σ ( ˜ Z J i← 0,x ′ ,R)|J − i . An informal argument below hints that the Gaussian distribution might be the optimal choice of the noise distribution when for fixed ˜ Z = ˜ z,f σ (˜ z J i← 1,x ′ ,R) andf σ (˜ z J i← 0,x ′ ,R) are close to each other. Let us fix σ 2 and consider two means µ 1 < µ 2 ∈ R. LetF = {p(x;µ )|µ ∈R} be a family of probability distributions with one mean parameterµ , such that every distribution of it has varianceσ 2 and KL divergences between members ofF exist. LetX 1 ∼ p(x,µ 1 ) andX 2 ∼ p(x,µ 2 ). We are interested in finding such a family F thatKL(X∥Y) is minimized. For smallµ 2 − µ 1 , we know that KL(P X ∥P Y )≈ 1 2 (µ 2 − µ 1 )I(µ 1 )(µ 2 − µ 1 ), (G.171) whereI(µ ) is the Fisher information ofp(x;µ ). Furthermore, letc µ 1 ≜X. AsEc µ 1 =µ 1 and var[c µ 1 ]=σ 2 , the Cramer-Rao bound gives σ 2 =var[c µ 1 ]≥ 1 I(µ 1 ) . (G.172) This gives us the following lower bound on the KL divergence betweenX andY : KL(P X ∥P Y )⪆ 1 2σ 2 (µ 2 − µ 1 ) 2 , (G.173) which is matched by the Gaussian distribution. 115 G.18 ProofofTheorem5.4.1 Proof. Let ¯ J≜(1− J 1 ,...,1− J n ) be the negation ofJ. We have that E P W,S h (R(W)− r S (W)) 2 i =E P W,S 1 n n X i=1 ℓ(W,Z i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.174) =E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.175) =E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− E Z ′ ∼ P Z ℓ(W,Z ′ )+ 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i ) ! 2 (G.176) ≤ 2E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i ) ! 2 | {z } B (G.177) +2E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 
2 (G.178) =2B+2E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 (G.179) =2B+2E W E ˜ Z,J|W 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) ! 2 . (G.180) For any fixed w∈W, the termsℓ(W, ˜ Z i, ¯ J i ) are independent of each other underP ˜ Z,J|W=w . Furthermore, W and ˜ Z i, ¯ J i are independent. Therefore, the average in Eq. (G.180) is an average ofn i.i.d. random variables with zero mean. The variance of this average is at most 1 4n . Hence, E P W,S h (R(W)− r S (W)) 2 i ≤ 2B+ 1 2n . (G.181) 116 Let us boundB now: B =E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) ! 2 (G.182) = 1 n 2 n X i=1 E ˜ Z,J,W ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) 2 (G.183) + 1 n 2 X i̸=k E ˜ Z,J,W h ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) ℓ(W, ˜ Z k,J k )− ℓ(W, ˜ Z k, ¯ J k ) i (G.184) ≤ 1 n +E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) 1 n X k̸=i ℓ(W, ˜ Z k,J k )− ℓ(W, ˜ Z k, ¯ J k ) . (G.185) Let us consider a fixed i∈[n]. Then E ˜ Z,J,W ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) 1 n X k̸=i ℓ(W, ˜ Z k,J k )− ℓ(W, ˜ Z k, ¯ J k ) =E ˜ Z E J i ,W| ˜ Z E J − i |J i ,W, ˜ Z ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) 1 n X k̸=i ℓ(W, ˜ Z k,J k )− ℓ(W, ˜ Z k, ¯ J k ) | {z } f(W,J i , ˜ Z) . (G.186) Note thatf(w,j i ,˜ z)∈[− 1,+1] for allw∈W,j∈{0,1} n , and ˜ z∈Z n× 2 . Therefore, by Lemma 4.2.1, for any value of ˜ Z E W,J i | ˜ Z h f(W,J i , ˜ Z) i ≤ q 2I ˜ Z (W;J i )+E W| ˜ Z E J i | ˜ Z h f(W,J i , ˜ Z) i . (G.187) It is left to notice that for anyw ∈ W,E J i | ˜ Z h f(w,J i | ˜ Z) i = 0, as underP J i | ˜ Z the termℓ(w, ˜ Z i,J i )− ℓ(w, ˜ Z i, ¯ J i ) has zero mean. Therefore, Eq. (G.187) reduces to E W,J i | ˜ Z h f(W,J i , ˜ Z) i ≤ q 2I ˜ Z (W;J i ). (G.188) Putting together Eq. (G.181), Eq. (G.185), Eq. (G.186) and Eq. (G.188), we get that E P W,S h (R(W)− r S (W)) 2 i ≤ 5 2n + 2 n n X i=1 E ˜ Z q 2I ˜ Z (W;J i ). (G.189) 117 G.19 ProofofTheorem5.4.2 Proof. Let ˜ Λ ∈[0,1] n× 2 be the losses on examples of ˜ Z: ˜ Λ i,c =ℓ(W, ˜ Z i,c ), ∀i∈[n],c∈{0,1}. (G.190) Let ¯ J≜(1− J 1 ,...,1− J n ) be the negation ofJ. We have that E P W,S [r S (W)− R(W)]= 1 n n X i=1 E ℓ(W,Z i )− E Z ′ ∼ P Z ℓ(W,Z ′ ) (G.191) = 1 n n X i=1 E ℓ(W,Z i,J i )− ℓ(W,Z i, ¯ J i ) (G.192) = 1 n n X i=1 E h ˜ Λ i,J i − ˜ Λ i, ¯ J i i . (G.193) If we use Lemma 4.2.1 withΦ= ˜ Λ i ,Ψ= J i , andf(Φ ,Ψ)= ˜ Λ i,J i − ˜ Λ i, ¯ J i , we get that E h ˜ Λ i,J i − ˜ Λ i, ¯ J i i − E ˜ Λ i E J i h ˜ Λ i,J i ˜ Λ i, ¯ J i i ≤ q 2I( ˜ Λ i ,J i ) (G.194) This proves the first part of the theorem, as E ˜ Λ i E J i h ˜ Λ i,J i ˜ Λ i, ¯ J i i = 0. The second part can be proven by first conditioning on ˜ Z i : E h ˜ Λ i,J i − ˜ Λ i, ¯ J i i =E ˜ Z i E ˜ Λ i ,J i | ˜ Z i h ˜ Λ i,J i − ˜ Λ i, ¯ J i i , (G.195) and then applying the lemma to upper bound E ˜ Λ i ,J i | ˜ Z i h ˜ Λ i,J i − ˜ Λ i, ¯ J i i with q 2I ˜ Z i ( ˜ Λ i ;J i ). Finally, the third part can be proven by first conditioning on ˜ Z: E h ˜ Λ i,J i − ˜ Λ i, ¯ J i i =E ˜ Z E ˜ Λ i ,J i | ˜ Z h ˜ Λ i,J i − ˜ Λ i, ¯ J i i , (G.196) and then applying the lemma to upper bound E ˜ Λ i ,J i | ˜ Z h ˜ Λ i,J i − ˜ Λ i, ¯ J i i with q 2I ˜ Z ( ˜ Λ i ;J i ). RemarkG.4. As ˜ Z andJ i are independent, we have that I ℓ(W, ˜ Z i );J i ≤ E ˜ Z i h I ˜ Z i ℓ(W, ˜ Z i );J i i ≤ E ˜ Z h I ˜ Z ℓ(W, ˜ Z i );J i i . (G.197) However, if we consider expected square root of disintegrated mutual informations (as in the bound of this theorem), then this relation might not be true. G.20 ProofofTheorem5.5.3 Proof. It is enough to prove only Eq. (5.52), Eq. (5.53), and Eq. (5.54). 
Inequality Eq. (5.55) can be derived from Eq. (5.53) using data processing inequality, while Eq. (5.56) can be derived from Eq. (5.54). 118 As in the proof of Theorem 5.4.1, E P W,S h (R(W)− r S (W)) 2 i ≤ 2E ˜ Z,J,W 1 n n X i=1 ℓ(W, ˜ Z i,J i )− 1 n n X i=1 ℓ(W, ˜ Z i, ¯ J i ) ! 2 | {z } B + 1 2n . (G.198) where ¯ J≜(1− J 1 ,...,1− J n ) and B≤ 1 n + 1 n 2 X i̸=k E ˜ Z,J,W h ℓ(W, ˜ Z i,J i )− ℓ(W, ˜ Z i, ¯ J i ) ℓ(W, ˜ Z k,J k )− ℓ(W, ˜ Z k, ¯ J k ) i | {z } C i,k . (G.199) Let ˜ Λ ∈[0,1] n× 2 be the losses on examples of ˜ Z: ˜ Λ i,c =ℓ(W, ˜ Z i,c ), ∀i∈[n],c∈{0,1}. (G.200) Then we can write C i,k =E ˜ Λ i , ˜ Λ k ,J i ,J k h ˜ Λ i,J i − ˜ Λ i, ¯ J i ˜ Λ k,J k − ˜ Λ k, ¯ J k i . (G.201) and use Lemma 4.2.1 to arrive at: C i,k ≤ E ˜ Λ i , ˜ Λ k E J i ,J k h ˜ Λ i,J i − ˜ Λ i, ¯ J i ˜ Λ k,J k − ˜ Λ k, ¯ J k i + q 2I( ˜ Λ i , ˜ Λ k ;J i ,J k ) (G.202) = q 2I( ˜ Λ i , ˜ Λ k ;J i ,J k ), (G.203) where the last equality holds as for any fixed value of ( ˜ Λ i , ˜ Λ k ) the difference terms ˜ Λ i,J i − ˜ Λ i, ¯ J i and ˜ Λ k,J k − ˜ Λ k, ¯ J k are independent and have zero mean underP J i ,J k . To derive Eq. (5.53), one can condition on ˜ Z i , ˜ Z k : C i,k =E ˜ Z i , ˜ Z k E ˜ Λ i , ˜ Λ k ,J i ,J k | ˜ Z i , ˜ Z k h ˜ Λ i,J i − ˜ Λ i, ¯ J i ˜ Λ k,J k − ˜ Λ k, ¯ J k i , (G.204) and then apply Lemma 4.2.1 for the inner expectation. Similarly, Eq. (5.54) can be derived by conditioning on ˜ Z and then applying Lemma 4.2.1. RemarkG.5. Wit the expectation inside the square root, inequalities Eq. (5.52), Eq. (5.53), and Eq. (5.54) would be in non-decreasing order. With the expectation outside the square root, this relation might not be true. G.21 ProofofProposition6.2.1 Proof. AsK is a full rank matrix almost surely, then with probability 1 there exists a vectorα ∈R n , such that Kα = Y . Consider the function f(x) = P n i=1 α i k(X i ,x) ∈ H. Clearly, f(X i ) = Y i , ∀i ∈ [n]. Furthermore,∥f∥ 2 H =α ⊤ Kα =Y ⊤ K − 1 Y . The existence of suchf ∈H with zero empirical loss and the assumptions on the loss function imply that any optimal solution of problem Eq. (6.4) has a norm at mostY ⊤ K − 1 Y . 119 G.22 ProofofTheorem6.2.2 To prove Theorem 6.2.2 we need the following definition of Rademacher complexity [Mohri et al., 2018]. DefinitionG.1 (Rademacher complexity). LetG be a family of functions fromZ toR, andZ 1 ,...,Z n be n i.i.d. examples from a distributionP Z onZ. Then, theempiricalRademachercomplexity ofG with respect to(Z 1 ,...,Z n ) is defined as b R n (G)=E σ 1 ,...,σ n " sup g∈G 1 m n X i=1 σ i g(Z i ) # , (G.205) whereσ i are independent Rademacher random variables (i.e., uniform random variables taking values in {− 1,1}). The Rademacher complexity ofG is then defined as R n (G)=E Z 1 ,...,Zn h b R n (G) i . (G.206) Remark G.6. Shifting the hypothesis class G by a constant function does not change the empirical Rademacher complexity: b R n ({f +g :g∈G})=E σ 1 ,...,σ n " sup g∈G 1 m n X i=1 σ i (f(Z i )+g(Z i )) # (G.207) =E σ 1 ,...,σ n " 1 m n X i=1 σ i f(Z i )+sup g∈G 1 m n X i=1 σ i g(Z i ) # (G.208) =E σ 1 ,...,σ n " sup g∈G 1 m n X i=1 σ i g(Z i ) # = b R n (G). (G.209) Given the kernel classification setting described in Section 6.2.1, we first prove a slightly more general variant of a classical generalization gap bound in Bartlett and Mendelson [2002, Theorem 21]. LemmaG.6. Assumesup x∈X k(x,x)<∞. Fix any constantM >0. Then with probability at least1− δ , every functionf ∈H with∥f∥ H ≤ M satisfies P X,Y (Yf(X)≤ 0)≤ 1 n n X i=1 ϕ γ (sign(Y i )f(X i ))+ 2M γn p Tr(K)+3 r ln(2/δ ) 2n . 
(G.210) Proof of Lemma G.6. LetF ={f ∈H :∥f∥≤ M} and consider the following class of functions: G ={(x,y)7→ϕ γ (sign(y)f(x)):f ∈F}. (G.211) By the standard Rademacher complexity classification generalization bound [Mohri et al., 2018, Theorem 3.3], for anyδ > 0, with probability at least1− δ , the following holds for allf ∈F: E X,Y [ϕ γ (sign(Y)f(X))]≤ 1 n n X i=1 ϕ γ (sign(Y i )f(X i ))+2 b R n (G)+3 r log(2/δ ) 2n . (G.212) 120 Therefore, with probability at least1− δ , for allf ∈F P X,Y (Yf(X)≤ 0)≤ 1 n n X i=1 ϕ γ (sign(Y i )f(X i ))+2 b R n (G)+3 r log(2/δ ) 2n . (G.213) To finish the proof, we upper bound b R n (G): b R n (G)=E σ 1 ,...,σ n " sup g∈G 1 n n X i=1 σ i g(X i ,Y i ) # (G.214) =E σ 1 ,...,σ n " sup f∈F 1 n n X i=1 σ i ϕ γ (sign(Y i )f(X i )) # (G.215) ≤ 1 γ E σ 1 ,...,σ n " sup f∈F 1 n n X i=1 σ i sign(Y i )f(X i ) # (G.216) = 1 γ E σ 1 ,...,σ n " sup f∈F 1 n n X i=1 σ i f(X i ) # (G.217) = 1 γ b R n (F), (G.218) where the third line is due to Ledoux and Talagrand [1991]. By Lemma 22 of Bartlett and Mendelson [2002], we thus conclude that b R n (F)≤ M n p Tr(K). (G.219) Proof of Theorem 6.2.2. To get a generalization bound forf ∗ it is tempting to use Lemma G.6 withM =∥f ∗ ∥. However,∥f ∗ ∥ is a random variable depending on the training data and is an invalid choice for the constant M. This issue can be resolved by paying a small logarithmic penalty. For anyM ≥ M 0 = l γ √ n 2 √ κ m the bound of Lemma G.6 is vacuous. Let us consider the set of integers M={1,2,...,M 0 } and write Lemma G.6 for each element ofM withδ/M 0 failure probability. By union bound, we have that with probability at least1− δ , all instances of Lemma G.6 withM chosen fromM hold simultaneously. IfY ⊤ K − 1 Y ≥ M 0 , then the desired bound holds trivially, as the right-hand side becomes at least 1. Otherwise, we setM = l √ Y ⊤ K − 1 Y m ∈M and consider the corresponding part of the union bound. We thus have that with at least1− δ probability, every functionf ∈F with∥f∥≤ M satisfies P X,Y (Yf(X)≤ 0)≤ 1 n n X i=1 ϕ γ (sign(Y i )f(X i ))+ 2M γn p Tr(K)+3 r ln(2M 0 /δ ) 2n . (G.220) As by Proposition 6.2.1 any optimal solutionf ∗ has norm at most √ Y ⊤ K − 1 Y andM ≤ √ Y ⊤ K − 1 Y +1, we have with probability at least1− δ , P X,Y (Yf ∗ (X)≤ 0)≤ 1 n n X i=1 ϕ γ (sign(Y i )f ∗ (X i ))+ 2 √ Y ⊤ K − 1 Y +2 γn p Tr(K)+3 r ln(2M 0 /δ ) 2n . 121 G.23 ProofofTheorem6.2.3 The proof of Theorem 6.2.3 follows closely that of Theorem 6.2.2. We first introduce some concepts related to vector-valued positive semidefinite kernels. For the given matrix-valued positive definite kernel k :X ×X → R d× d , inputx∈X , anda∈R d , let k x a=k(·,x)a be the function fromX toR d defined the following way: k x a(x ′ )=k(x ′ ,x)a, for allx ′ ∈X. (G.221) With any such kernelk there is a unique vector-valued RKHSH of functions fromX toR d . This RKHS is the completion ofspan k x a:x∈X,a∈R d , with the following inner product: * n X i=1 k x i a i , m X j=1 k x ′ j a ′ j + H = n X i=1 m X j=1 a ⊤ i k(x i ,x ′ j )a ′ j . (G.222) For anyf ∈H, the norm∥f∥ H is defined as p ⟨f,f⟩ H . Therefore, iff(x)= P n i=1 k x i a i then ⟨f,f⟩ 2 H = n X i,j=1 a ⊤ i k(x i ,x j )a j (G.223) =a ⊤ Ka, (G.224) wherea ⊤ =(a ⊤ 1 ,...,a ⊤ n )∈R nd and K = k(x 1 ,x 1 ) ··· k(x 1 ,x n ) ··· ··· ··· k(x n ,x 1 ) ··· k(x n ,x n ) ∈R nd× nd . (G.225) Suppose{(X i ,Y i )} i∈[n] aren i.i.d. examples sampled from some probability distribution onX ×Y , whereY ⊂ R d . As in the binary classification case, we consider the regularized kernel problem Eq. (6.4). 
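Before turning to the vector-valued case, note that the quantities appearing in Proposition 6.2.1 and Theorem 6.2.2 for the binary setting are straightforward to compute from data. The sketch below is only illustrative: it assumes a Gram matrix K and labels Y in {-1, +1} are given, adds a small jitter purely for numerical stability, and evaluates the norm bound Y^T K^{-1} Y together with the (2 sqrt(Y^T K^{-1} Y) + 2) sqrt(Tr(K)) / (γn) complexity term of Theorem 6.2.2; the empirical margin loss and the confidence term are omitted.

import numpy as np

def rkhs_norm_bound(K, Y, jitter=1e-10):
    # Y^T K^{-1} Y, computed by solving K alpha = Y instead of inverting K.
    alpha = np.linalg.solve(K + jitter * np.eye(len(Y)), Y)
    return float(Y @ alpha)

def complexity_term(K, Y, gamma):
    # The (2 sqrt(Y^T K^{-1} Y) + 2) sqrt(Tr(K)) / (gamma n) term of Theorem 6.2.2.
    n = len(Y)
    return (2.0 * np.sqrt(rkhs_norm_bound(K, Y)) + 2.0) / (gamma * n) * np.sqrt(np.trace(K))

# Toy usage with an RBF kernel on random inputs and labels given by a linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = np.sign(X[:, 0])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists)
print(rkhs_norm_bound(K, Y), complexity_term(K, Y, gamma=1.0))

Returning to the vector-valued setting: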
LetY ⊤ = (Y ⊤ 1 ,...,Y ⊤ n ) be the concatenation of targets. The following proposition is the analog of Proposition 6.2.1 in this vector-valued setting. Proposition G.7. Assume K is full rank almost surely. Assume also ℓ(y,y ′ ) ≥ 0,∀y,y ′ ∈ Y, and ℓ(y,y)=0,∀y∈Y. Then, with probability 1, for any solutionf ∗ of Eq. (6.4), we have that ∥f ∗ ∥ 2 H ≤ Y ⊤ K − 1 Y. (G.226) Proof of Proposition G.7. With probability 1, the kernel matrix K is full rank. Therefore, there exists a vectora ⊤ = (a ⊤ 1 ,...,a ⊤ n ) ∈ R nd , with a i ∈ R d , such that Ka = Y . Consider the function f(x) = P n i=1 k X i a i ∈H. Clearly,f(X i )=Y i , ∀i∈[n]. Furthermore, ∥f∥ 2 H =a ⊤ Ka (G.227) =Y ⊤ K − 1 Y. (G.228) The existence of suchf(x)∈H with zero empirical loss and assumptions on the loss function imply that any optimal solution of problem Eq. (6.4) has a norm at mostY ⊤ K − 1 Y . 122 Proof of Theorem 6.2.3. Consider the class of functionsF = {f ∈H :∥f∥≤ M} for some M > 0. By Theorem 2 of Kuznetsov et al. [2015], for anyγ > 0 andδ > 0, with probability at least1− δ , the following bound holds for allf ∈F: ∗ P X,Y (ρ f (X,Y)≤ 0)≤ 1 n n X i=1 1{ρ f (X i ,Y i )≤ γ }+ 4d γ b R n ( ˜ F)+3 r log(2/δ ) 2n , (G.229) where ˜ F ={(x,y)7→f(x) y :f ∈F,y∈[d]}. Next we upper bound the empirical Rademacher complexity of ˜ F: b R n ( ˜ F)=E σ 1 ,...,σ n " sup y∈[d],h∈H,∥h∥≤ M 1 n n X i=1 σ i h(X i ) y # (G.230) =E σ 1 ,...,σ n " sup y∈[d],h∈H,∥h∥≤ M 1 n n X i=1 σ i h(X i ) ⊤ y # (y is the one-hot enc. ofy) (G.231) =E σ 1 ,...,σ n " sup y∈[d],h∈H,∥h∥≤ M * h, 1 n n X i=1 σ i k X i y + H # (reproducing property) (G.232) ≤ M n E σ 1 ,...,σ n " sup y∈[d] n X i=1 σ i k X i y H # (Cauchy-Schwarz) (G.233) = M n v u u u tE σ 1 ,...,σ n sup y∈[d] n X i=1 σ i k X i y 2 H (Jensen’s inequality) (G.234) ≤ M n v u u u tE σ 1 ,...,σ n d X y=1 n X i=1 σ i k X i y 2 H (G.235) = M n v u u u t d X y=1 E σ 1 ,...,σ n n X i=1 ∥σ i k X i y∥ 2 H + X i̸=j σ i k X i y,σ j k X j y (G.236) = M n v u u t d X y=1 E σ 1 ,...,σ n " n X i=1 ∥σ i k X i y∥ 2 H # (independence ofσ i ) (G.237) = M n v u u t d X y=1 " n X i=1 y ⊤ k(X i ,X i )y # (G.238) = M n p Tr(K). (G.239) The proof is concluded with the same reasoning of the proof of Theorem 6.2.2. ∗ Note that their result is in terms of Rademacher complexity rather than empirical Rademacher complexity. The variant we use can be proved with the same proof, with a single modification of bounding R(g) with empirical Rademacher complexity of ˜ G using Theorem 3.3 of Mohri et al. [2018]. 123 G.24 ProofofProposition6.3.1 Proof. We have that P X,Y ∗ (Y ∗ f ∗ (X)≤ 0)=P X,Y ∗ (Y ∗ f ∗ (X)≤ 0∧Y ∗ g(X)≤ 0) +P X,Y ∗ (Y ∗ f ∗ (X)≤ 0∧Y ∗ g(X)>0) (G.240) ≤ P X,Y ∗ (Y ∗ g(X)≤ 0)+P X (g(X)f ∗ (X)≤ 0). (G.241) The rest follows from boundingP X (g(X)f ∗ (X)≤ 0) using Theorem 6.2.2. H Derivations H.1 SGDnoisecovariance Assume we haven examples and the batch size isb. Letg i ≜∇ w ℓ i (w),g≜ 1 n P n i=1 g i , and ˜ G≜ 1 b P b i=1 g k i , wherek i are sampled independently from{1,...,n} uniformly at. Then Cov[ ˜ G, ˜ G]=E 1 b b X i=1 g k i − g ! 1 b b X i=1 g k i − g ! T (H.1) = 1 b 2 b X i=1 E (g k i − g)(g k i − g) T (H.2) = 1 bn n X i=1 (g i − g)(g i − g) T (H.3) = 1 b 1 n n X i=1 g i g T i ! − gg T ! . (H.4) We denote the per-sample covariance,b· Cov h ˜ G, ˜ G i withΛ( w): Λ( w)= 1 n n X i=1 g i g T i ! − gg T . (H.5) We can see that whenever the number of samples times number of outputs is less than number of parameters (nk < d), thenΛ( w) will be rank deficient. 
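For illustration, Λ(w) in Eq. (H.5) is simply the empirical covariance of the per-example gradients, and the minibatch noise covariance is Λ(w)/b. The sketch below assumes the per-example gradients have already been collected into a matrix (for instance by per-example backpropagation) and is not tied to any particular model.

import numpy as np

def sgd_noise_covariance(grads):
    # grads: array of shape (n, d) whose i-th row is g_i, the gradient on example i.
    g_bar = grads.mean(axis=0)                                         # full-batch gradient g
    return grads.T @ grads / grads.shape[0] - np.outer(g_bar, g_bar)   # Eq. (H.5)

# With n = 10 examples and d = 50 parameters, Lambda(w) has rank at most n - 1 < d,
# illustrating the rank deficiency noted above (here k = 1).
rng = np.random.default_rng(0)
grads = rng.normal(size=(10, 50))
Lam = sgd_noise_covariance(grads)
print(np.linalg.matrix_rank(Lam))   # 9; the minibatch covariance is Lam / b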
Also, note that if we add weight decay to the total loss then covarianceΛ( w) will not change, as all gradients will be shifted by the same vector. H.2 Steady-statecovarianceofSGD In this section we verify that the normal distributionN(·;w ∗ ,Σ) , with w ∗ being the global minimum of the regularization MSE loss and covariance matrix Σ satisfying the continuous Lyapunov equation Σ H +HΣ= η b Λ( w ∗ ), is the steady-state distribution of the stochastic differential equation of Eq. (3.16). We assume that (a)Λ( w) is constant in a small neighborhood ofw ∗ and (b) the steady-state distribution is unique. We start with the Fokker-Planck equation: ∂p(w,t) ∂t = n X i=1 ∂ ∂w i [η ∇ w i L(w)p(w,t)]+ η 2 2b n X i=1 n X j=1 ∂ 2 ∂w i ∂w j [Λ( w) i,j p(w,t)]. (H.6) 124 Ifp(w)=N(w;w ∗ ,Σ)= 1 Z exp − 1 2 (w− w ∗ ) T Σ − 1 (w− w ∗ ) is the steady-state distribution, then the Fokker-Planck becomes: 0= d X i=1 ∂ ∂w i [η ∇ w i L(w)p(w)]+ η 2 2b d X i=1 d X j=1 ∂ 2 ∂w i ∂w j [Λ( w) i,j p(w)]. (H.7) In the case of MSE loss: ∇ w L(w)= n X k=1 ∇f 0 (x k )(f(x k )− y k )+λw =∇f 0 (x)(f(x)− y)+λw, (H.8) ∇ 2 w L(w)=∇f 0 (x)∇f 0 (x) T +λI. (H.9) Additionally, forp(w) the following two statements hold: ∂ ∂w i p(w)=− p(w)Σ − 1 i (w− w ∗ ), (H.10) ∂ 2 ∂w i ∂w j p(w)=− p(w)Σ − 1 i,j +p(w)Σ − 1 j (w− w ∗ )Σ − 1 i (w− w ∗ ), (H.11) whereΣ − 1 i is thei-th row ofΣ − 1 . Let’s compute the first term of Eq. (H.7): d X i=1 ∂ ∂w i [∇ w i L(w)p(w)]= d X i=1 p(w) ∇f 0 (x) i ∇f 0 (x) T i +λw i −∇ w i L(w)· p(w)Σ − 1 i (w− w ∗ ) (H.12) =p(w)Tr ∇f 0 (x)∇f 0 (x) T +λI − p(w) d X i=1 (∇f 0 (x) i (f(x)− y)+λw i )Σ − 1 i (w− w ∗ ) (H.13) =p(w)Tr(H)− p(w) (f(x)− y) T ∇f 0 (x) T +λw T Σ − 1 (w− w ∗ ). (H.14) Asw ∗ is a critical point ofL(w), we have that∇f 0 (x)(f w ∗ (x)− y)+λw ∗ =0. Therefore, we can subtract p(w) (f w ∗ (x)− y) T ∇f 0 (x) T +λ (w ∗ ) T Σ − 1 (w− w ∗ ) from Eq. (H.14): d X i=1 ∂ ∂w i [∇ w i L(w)p(w)]= =p(w)Tr(H)− p(w) (f(x)− f w ∗ (x)) T ∇f 0 (x) T +λ (w− w ∗ ) T Σ − 1 (w− w ∗ ) (H.15) =p(w)Tr(H)− p(w)(w− w ∗ ) T ∇f 0 (x)∇f 0 (x) T +λI Σ − 1 (w− w ∗ ) (H.16) =p(w)Tr(H)− p(w)(w− w ∗ ) T HΣ − 1 (w− w ∗ ). (H.17) Isotropiccase:Λ( w)=σ 2 I d . In the case whenΛ( w)=σ 2 I d , we have X i,j ∂ 2 ∂w i ∂w j [Λ( w) i,j p(w)]=σ 2 Tr(∇ 2 w p(w))=− σ 2 p(w)Tr(Σ − 1 )+σ 2 p(w)∥Σ − 1 (w− w ∗ )∥ 2 2 . (H.18) 125 Putting everything together in the Fokker-Planck we get: η p(w)Tr(H)− p(w)(w− w ∗ ) T HΣ − 1 (w− w ∗ )) + η 2 2b − σ 2 p(w)Tr(Σ − 1 )+σ 2 p(w)∥Σ − 1 (w− w ∗ )∥ 2 2 =0. (H.19) It is easy to verify thatΣ − 1 = 2b ησ 2 H is a valid inverse covariance matrix and satisfies the equation above. Hence, it is the unique steady-state distribution of the stochastic differential equation. The result confirms that variance is high when batch size is low or learning rate is large. Additionally, the variance is low along directions of low curvature. Non-isotropic case. We assume Λ( w) is constant around w ∗ and is equal to Λ . This assumption is acceptable to a degree, because SGD converges to a relatively small neighborhood, in which we can assume Λ( w) to not change much. With this assumption, X i,j ∂ 2 ∂w i ∂w j [Λ i,j p(w)]= X i,j Λ i,j h − p(w)Σ − 1 i,j +p(w)Σ − 1 j (w− w ∗ )Σ − 1 i (w− w ∗ ) i (H.20) =− p(w)Tr(Σ − 1 Λ)+ p(w) X i,j Λ i,j (Σ − 1 (w− w ∗ )(w− w ∗ ) T Σ − 1 ) i,j (H.21) =− p(w)Tr(Σ − 1 Λ)+ p(w)Tr(Σ − 1 (w− w ∗ )(w− w ∗ ) T Σ − 1 Λ) (H.22) =− p(w)Tr(Σ − 1 Λ)+ p(w)(w− w ∗ ) T Σ − 1 ΛΣ − 1 (w− w ∗ )). 
H.3 Fast update of NTK inverse after data removal

For computing the weights or predictions of a linearized network at some time $t$, we need to compute the inverse of the NTK matrix. To compute the informativeness scores, we need to do this inversion $n$ times, each time with one data point excluded. In this section, we describe how to update the inverse of the NTK matrix after removing one example in $O(n^2 k^3)$ time, instead of doing the straightforward $O(n^3 k^3)$ computation. Without loss of generality, let us assume we remove the last $r$ rows and the corresponding columns of the NTK matrix. We can represent the NTK matrix as a block matrix:
\[ \Theta_0 = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}. \]  (H.24)
The goal is to compute $A_{11}^{-1}$ from $\Theta_0^{-1}$. We start with the block matrix inverse formula:
\[ \Theta_0^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} F_{11}^{-1} & -F_{11}^{-1} A_{12} A_{22}^{-1} \\ -A_{22}^{-1} A_{21} F_{11}^{-1} & F_{22}^{-1} \end{pmatrix}, \]  (H.25)
where
\[ F_{11} = A_{11} - A_{12} A_{22}^{-1} A_{21}, \]  (H.26)
\[ F_{22} = A_{22} - A_{21} A_{11}^{-1} A_{12}. \]  (H.27)
From Eq. (H.26) we have $A_{11} = F_{11} + A_{12} A_{22}^{-1} A_{21}$. Applying the Woodbury matrix identity to this, we get
\[ A_{11}^{-1} = F_{11}^{-1} - F_{11}^{-1} A_{12} \left( A_{22} + A_{21} F_{11}^{-1} A_{12} \right)^{-1} A_{21} F_{11}^{-1}. \]  (H.28)
Eq. (H.28) gives the recipe for computing $A_{11}^{-1}$: note that $F_{11}^{-1}$ can be read off from $\Theta_0^{-1}$ using Eq. (H.25), while $A_{12}$, $A_{21}$, and $A_{22}$ can be read off from $\Theta_0$. Finally, the complexity of computing $A_{11}^{-1}$ using Eq. (H.28) is $O(n^2 k^3)$ when we remove one example.

H.4 Adding weight decay to linearized neural network training

Let us consider the loss function $L(w) = \sum_{i=1}^n \ell_i(w) + \frac{\lambda}{2} \|w - w_0\|_2^2$. In this case, continuous-time gradient descent is described by the following ODE:
\[\begin{aligned}
\dot{w}(t) &= -\eta \nabla_w f_0(x) (f_t(x) - y) - \eta \lambda (w(t) - w_0) && \text{(H.29)}\\
&= -\eta \nabla_w f_0(x) \left( \nabla_w f_0(x)^T (w(t) - w_0) + f_0(x) - y \right) - \eta \lambda (w(t) - w_0) && \text{(H.30)}\\
&= \underbrace{-\eta \left( \nabla_w f_0(x) \nabla_w f_0(x)^T + \lambda I \right)}_{A} (w(t) - w_0) + \underbrace{\eta \nabla_w f_0(x) (y - f_0(x))}_{b}. && \text{(H.31)}
\end{aligned}\]
Let $\omega(t) \triangleq w(t) - w_0$; then we have
\[ \dot{\omega}(t) = A \omega(t) + b. \]  (H.32)
Since all eigenvalues of $A$ are negative, this ODE is stable and has the steady state
\[\begin{aligned}
\omega^* &= -A^{-1} b && \text{(H.33)}\\
&= \left( \nabla_w f_0(x) \nabla_w f_0(x)^T + \lambda I \right)^{-1} \nabla_w f_0(x) (y - f_0(x)). && \text{(H.34)}
\end{aligned}\]
The solution $\omega(t)$ is given by
\[\begin{aligned}
\omega(t) &= \omega^* + e^{At} (\omega_0 - \omega^*) && \text{(H.35)}\\
&= (I - e^{At}) \omega^*, && \text{(H.36)}
\end{aligned}\]
since $\omega_0 = 0$. Let $\Theta_w \triangleq \nabla_w f_0(x) \nabla_w f_0(x)^T$ and $\Theta_0 \triangleq \nabla_w f_0(x)^T \nabla_w f_0(x)$. If the SVD of $\nabla_w f_0(x)$ is $U D V^T$, then $\Theta_w = U D U^T$ and $\Theta_0 = V D V^T$. Additionally, we can extend the columns of $U$ to a full basis of $\mathbb{R}^d$ (denoted $\tilde{U}$) and append zeros to $D$ (denoted $\tilde{D}$) to write down the eigendecomposition $\Theta_w = \tilde{U} \tilde{D} \tilde{U}^T$. With this, we have $(\Theta_w + \lambda I)^{-1} = \tilde{U} (\tilde{D} + \lambda I)^{-1} \tilde{U}^T$. Continuing Eq. (H.36), we have
\[\begin{aligned}
\omega(t) &= (I - e^{At}) \omega^* && \text{(H.37)}\\
&= (I - e^{At}) (\Theta_w + \lambda I)^{-1} \nabla_w f_0(x) (y - f_0(x)) && \text{(H.38)}\\
&= (I - e^{At})\, \tilde{U} (\tilde{D} + \lambda I)^{-1} \tilde{U}^T U D V^T (y - f_0(x)) && \text{(H.39)}\\
&= \tilde{U} \left( I - e^{-\eta t (\tilde{D} + \lambda I)} \right) \tilde{U}^T \tilde{U} (\tilde{D} + \lambda I)^{-1} \tilde{U}^T U D V^T (y - f_0(x)) && \text{(H.40)}\\
&= \tilde{U} \left( I - e^{-\eta t (\tilde{D} + \lambda I)} \right) (\tilde{D} + \lambda I)^{-1} I_{d \times nk}\, D V^T (y - f_0(x)) && \text{(H.41)}\\
&= \tilde{U} \left( I - e^{-\eta t (\tilde{D} + \lambda I)} \right) \tilde{Z} V^T (y - f_0(x)), && \text{(H.42)}
\end{aligned}\]
where $\tilde{Z} = \begin{pmatrix} (D + \lambda I)^{-1} D \\ 0 \end{pmatrix} \in \mathbb{R}^{d \times nk}$. Denoting $Z \triangleq (D + \lambda I)^{-1} D$ and continuing,
\[\begin{aligned}
\omega(t) &= \tilde{U} \left( I - e^{-\eta t (\tilde{D} + \lambda I)} \right) \tilde{Z} V^T (y - f_0(x)) && \text{(H.43)}\\
&= U \left( I - e^{-\eta t (D + \lambda I)} \right) Z V^T (y - f_0(x)) && \text{(H.44)}\\
&= U Z \left( I - e^{-\eta t (D + \lambda I)} \right) V^T (y - f_0(x)) && \text{(H.45)}\\
&= U Z V^T V \left( I - e^{-\eta t (D + \lambda I)} \right) V^T (y - f_0(x)) && \text{(H.46)}\\
&= U Z V^T \left( I - e^{-\eta t (\Theta_0 + \lambda I)} \right) (y - f_0(x)) && \text{(H.47)}\\
&= \nabla_w f_0(x) (\Theta_0 + \lambda I)^{-1} \left( I - e^{-\eta t (\Theta_0 + \lambda I)} \right) (y - f_0(x)). && \text{(H.48)}
\end{aligned}\]

Solving for outputs. Having derived $w(t)$, we can write down the dynamics of $f_t(x)$ for any $x$:
\[\begin{aligned}
f_t(x) &= f_0(x) + \nabla_w f_0(x)^T \omega(t) && \text{(H.49)}\\
&= f_0(x) + \nabla_w f_0(x)^T \nabla_w f_0(x) (\Theta_0 + \lambda I)^{-1} \left( I - e^{-\eta t (\Theta_0 + \lambda I)} \right) (y - f_0(x)) && \text{(H.50)}\\
&= f_0(x) + \Theta_0(x, X) (\Theta_0 + \lambda I)^{-1} \left( I - e^{-\eta t (\Theta_0 + \lambda I)} \right) (y - f_0(x)). && \text{(H.51)}
\end{aligned}\]
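To illustrate Eq. (H.51), here is a minimal sketch (made-up data and hyperparameters, not the thesis code) that assumes a model that is linear in its parameters with a single output, so that the linearization is exact and $\Theta_0$ is an $n \times n$ matrix. It evaluates the closed-form predictions at the training inputs and checks that they approach the ridge-like limit as $t \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, eta, t = 30, 100, 0.1, 0.05, 5000.0

# Toy model that is linear in its parameters: f_w(x) = phi(x)^T w, so
# grad_w f_0(x) = phi(x), the linearization is exact, and k = 1 (scalar output).
Phi = rng.normal(size=(n, d))            # rows are phi(X_i)
w0 = rng.normal(size=d)
y = rng.normal(size=n)
f0 = Phi @ w0                            # f_0(X)

Theta0 = Phi @ Phi.T                     # NTK Gram matrix, (n x n) since k = 1
A = Theta0 + lam * np.eye(n)

# e^{-eta t (Theta_0 + lam I)} via eigendecomposition (A is symmetric PSD).
evals, V = np.linalg.eigh(A)
expm_term = V @ np.diag(np.exp(-eta * t * evals)) @ V.T

# Eq. (H.51) evaluated at the training inputs, where Theta_0(x, X) = Theta_0.
f_t = f0 + Theta0 @ np.linalg.solve(A, (np.eye(n) - expm_term) @ (y - f0))

# As t -> infinity the predictions approach f_0 + Theta_0 (Theta_0 + lam I)^{-1} (y - f_0).
f_inf = f0 + Theta0 @ np.linalg.solve(A, y - f0)
print(np.max(np.abs(f_t - f_inf)))       # ~0 for large t
```

For a real network, `Phi` would be replaced by the Jacobian of the network outputs with respect to the weights at initialization, evaluated on the training set.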
I Additional experimental details

This appendix presents some of the experimental details that were not included in the main text for better readability.

Table 6.7: Experimental details for MNIST 4 vs 9 classification in the case of standard training.
  Network: The 5-layer CNN of Table 2.1 with 2 output units.
  Optimizer: ADAM with 0.001 learning rate and $\beta_1 = 0.9$.
  Batch size: 128
  Number of examples ($n$): [75, 250, 1000, 4000]
  Number of epochs: 200
  Number of samples for $\tilde{Z}$ ($k_1$): 5
  Number of samplings for $S$ for each $\tilde{z}$ ($k_2$): 30

Table 6.8: Experimental details for MNIST 4 vs 9 classification in the case of SGLD training.
  Network: The 5-layer CNN of Table 2.1 with 2 output units.
  Learning rate schedule: Starts at 0.004 and decays by a factor of 0.9 after each 100 iterations.
  Inverse temperature schedule: $\min(4000, \max(100, 10 e^{t/100}))$, where $t$ is the iteration.
  Batch size: 100
  Number of examples ($n$): 4000
  Number of epochs: 40
  Number of samples for $\tilde{Z}$ ($k_1$): 5
  Number of samplings for $S$ for each $\tilde{z}$ ($k_2$): 30

Table 6.9: Experimental details for CIFAR-10 classification using fine-tuned ResNet-50 networks.
  Network: ResNet-50 pretrained on ImageNet.
  Optimizer: SGD with 0.01 learning rate and 0.9 momentum.
  Data augmentations: Random horizontal flip and random 28x28 cropping.
  Batch size: 64
  Number of examples ($n$): [1000, 5000, 20000]
  Number of epochs: 40
  Number of samples for $\tilde{Z}$ ($k_1$): 1
  Number of samplings for $S$ for each $\tilde{z}$ ($k_2$): 40

Table 6.10: Experimental details for the CIFAR-10 classification experiment where a pretrained ResNet-50 is fine-tuned using SGLD.
  Network: ResNet-50 pretrained on ImageNet.
  Data augmentations: Random horizontal flip and random 28x28 cropping.
  Learning rate schedule: Starts at 0.01 and decays by a factor of 0.9 after each 300 iterations.
  Inverse temperature schedule: $\min(16000, \max(100, 10 e^{t/300}))$, where $t$ is the iteration.
  Batch size: 64
  Number of examples ($n$): 20000
  Number of epochs: 16
  Number of samples for $\tilde{Z}$ ($k_1$): 1
  Number of samplings for $S$ for each $\tilde{z}$ ($k_2$): 40

J Additional results

Mislabeled and confusing examples. Figures 6.9 and 6.10 present incorrectly labeled or confusing examples for each class in the MNIST, CIFAR-10, and Clothing1M datasets.

Teacher-student NTK similarity. Figures 6.11 and 6.12 present additional evidence that (a) the training NTK similarity between the final student and the teacher is correlated with the final student test accuracy, and (b) online distillation manages to transfer the teacher NTK better.
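For readers who want to reproduce this type of analysis, the sketch below compares two NTK Gram matrices with centered kernel alignment. This particular similarity measure is an assumption made for the illustration; the exact train NTK similarity used in the figures is the one defined in the corresponding chapter and may differ. The names `J_teacher` and `J_student` are hypothetical placeholders for per-example Jacobians computed on the same training inputs.

```python
import numpy as np

def centered_kernel_alignment(K1, K2):
    """CKA-style similarity between two n x n kernel (Gram) matrices."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    return float(np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c)))

# Hypothetical usage: J_teacher and J_student would be per-example Jacobians on the
# same training inputs, so that J @ J.T is the corresponding NTK Gram matrix.
rng = np.random.default_rng(0)
J_teacher = rng.normal(size=(64, 256))
J_student = rng.normal(size=(64, 512))
print(centered_kernel_alignment(J_teacher @ J_teacher.T, J_student @ J_student.T))
```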
Figure 6.9: Most confusing 8 labels per class in the MNIST (a, left) and CIFAR-10 (b, right) datasets, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instances of LIMIT.
Figure 6.10: Most confusing 16 labels per class in the Clothing1M dataset, according to the distance between predicted and cross-entropy gradients. The gradient predictions are done using the best instance of LIMIT.

Figure 6.11: Relationship between test accuracy, train NTK similarity, and train fidelity for various teacher, student, and dataset configurations. Panels: (a) CIFAR-100, ResNet-20 student, ResNet-110 teacher, train fidelity (correlation -0.74); (b) CIFAR-10, LeNet-5x8 student, ResNet-56 teacher, train NTK similarity (correlation 0.80); (c) CIFAR-10, LeNet-5x8 student, ResNet-56 teacher, train fidelity (correlation 0.24); (d) CIFAR-10, ResNet-20 student, ResNet-101 teacher, train NTK similarity (correlation 0.90); (e) binary CIFAR-100, LeNet-5x8 student, ResNet-56 teacher, train NTK similarity (correlation 0.71); (f) binary CIFAR-100, LeNet-5x8 student, ResNet-56 teacher, train fidelity (correlation -0.57). Each panel plots the quantity against test accuracy for no KD, offline KD, and online KD runs.
Figure 6.12: Relationship between test accuracy, train NTK similarity, and train fidelity for various teacher, student, and dataset configurations. Panels: (a) Tiny ImageNet, MobileNet-V3-35 student, MobileNet-V3-125 teacher, train NTK similarity (correlation 0.40); (b) Tiny ImageNet, MobileNet-V3-35 student, MobileNet-V3-125 teacher, train fidelity (correlation -0.38); (c) Tiny ImageNet, VGG-16 student, ResNet-101 teacher, train NTK similarity (correlation 0.98); (d) Tiny ImageNet, VGG-16 student, ResNet-101 teacher, train fidelity (correlation -0.42); (e) Tiny ImageNet, VGG-16 student, MobileNet-V3-125 teacher, train NTK similarity (correlation 0.99); (f) Tiny ImageNet, VGG-16 student, MobileNet-V3-125 teacher, train fidelity (correlation -0.58). Each panel plots the quantity against test accuracy for no KD, offline KD, and online KD runs.
Abstract
Despite the popularity and success of deep learning, there is limited understanding of when, how, and why neural networks generalize to unseen examples. Since learning can be seen as extracting information from data, we formally study the information captured by neural networks during training. Specifically, we start by viewing learning in the presence of noisy labels from an information-theoretic perspective and derive a learning algorithm that limits label noise information in weights. We then define a notion of unique information that an individual sample provides to the training of a deep network, shedding some light on the behavior of neural networks on examples that are atypical, ambiguous, or belong to underrepresented subpopulations. We relate example informativeness to generalization by deriving nonvacuous generalization gap bounds. Finally, by studying knowledge distillation, we highlight the important role of data and label complexity in generalization. Overall, our findings contribute to a deeper understanding of the mechanisms underlying neural network generalization.