LEVERAGING TRAINING INFORMATION FOR EFFICIENT AND ROBUST DEEP LEARNING

by Zhiyun Lu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2020

Copyright 2020 Zhiyun Lu

Acknowledgements

I would like to express my gratitude to my amazing mentors, colleagues, friends, and family for their support during my PhD study and the completion of this thesis.

First, I would like to thank my advisor, Prof. Fei Sha, for his continuous support of my PhD study and research, and for his patience, motivation, and immense knowledge. I started out knowing little about machine learning or how to conduct research. Fei graciously helped me expand my knowledge and develop my skills in machine learning from the ground up, and patiently guided me, step by step, in reading, thinking, experimentation, and communication to become a much better researcher today. His consistent dedication to high standards, his emphasis on being honest and critical of oneself, and his conviction in challenging oneself outside the comfort zone have been incredibly influential to me, and have reshaped me beyond the academic arena. I am also grateful for the stimulating reading groups and research environment Fei curates, where we are exposed to a broad range of research fields, including machine learning, computer vision, NLP, and reinforcement learning.

Besides my advisor, I would like to thank my dissertation defense committee members, Prof. Haipeng Luo and Prof. C.-C. Jay Kuo, for their helpful comments and suggestions in improving this dissertation. Thanks also go to my proposal committee members, Prof. Haipeng Luo, Prof. Nora Ayanian, Prof. Leana Golubchik, Prof. Jason Lee, and Prof. Nanyun Peng, for their insightful comments and encouragement, and also for their hard questions, which motivated me to improve my research from various perspectives.

My sincere thanks also go to Dr. Tara Sainath, Dr. Liangliang Cao, Dr. Yu Zhang, and Dr. Vikas Sindhwani, who provided me great opportunities to join their teams as an intern. In particular, I would like to thank Tara, whose insight in pushing cutting-edge research into production-level systems, and whose extreme efficiency, inspired me. The same thanks go to Liangliang and Yu for their great support in exploring various research directions during my internships, and for generously sharing career advice, both related and unrelated to research, that has helped me grow.

I thank ShaLab members for their generous help and great support, both in research and in life, throughout my graduate years. My collaboration with Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, and Anqi Wu on the BABEL project helped develop my experimentation skills. I learned the meaning of grit from my collaboration with Zi Wang during her short visit at USC. Chao-Kai Chiang and Liyu Chen provided valuable feedback that helped me improve my paper-writing and communication skills. Wei-Lun (Harry) Chao, Soravit (Beer) Changpinyo, and Hexiang (Frank) Hu taught me precious life lessons throughout the years. Harry's "fight on" spirit will always be a source of inspiration for me in times of hardship. Beer's "spread love" motto planted its seed in my mind. Frank taught me the important quality, which I lack, of being optimistic and courageous when facing difficulties.
I would like to express my special thanks to Melissa Ailem, Yiming Yan, and Ivy Xiao for their "girl power" support. I also thank Séb Arnold, Aaron Chan, Boqing Gong, Chin-Cheng (Jeremy) Hsu, Shariq Iqbal, Michiel de Jong, Yury Zemlyanskiy, Bowen Zhang, Ke Zhang, and Prof. Marius Kloft for their stimulating discussions about research and their encouragement. I will forever remember the sleepless nights we worked together before deadlines, and all the conversations, drinks, and songs we have had in and outside the lab. I am deeply thankful to have worked with such gifted and caring people.

I cannot imagine my PhD life without the love and support of my friends. I thank Joan Chen, Wen Ma, and Sunny Gao for their love and patience over the phone, and for the simple check-ins when they hadn't heard from me for a while. I thank Will Oakley and Bon-Soon Lin for providing me a solid Green Bay Packers offensive line. I thank Chi-Chi Huang, Yacoub Kureh, and Jean-Michel Maldague for sharing with me their pot of curiosity. I thank Heming Zhang and Liu Yang for the big and small things they helped with. I thank the Oxygen community for giving me a "home" when I feel lonely. I would also like to thank Zhaojun Yang, Chen Sun, Tomer Levinboim, Franziska Meier, Yevgen Chebotar, and Liam Li.

Lastly, I am very grateful for having supportive parents who give me unconditional love and always believe in me. This thesis is dedicated to them. I would also like to thank myself for believing in myself and not giving up.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1 INTRODUCTION
  1.1 Deep Learning Recipe
  1.2 Challenges
    1.2.1 Challenge I: Computation Efficiency (AutoML)
    1.2.2 Challenge II: Data Efficiency
    1.2.3 Challenge III: Robustness (Uncertainty Estimation)
  1.3 Thesis Organization

Part I Leverage Training Information for Efficient Deep Learning

Chapter 2 AUTOML: HYPER-PARAMETER OPTIMIZATION
  2.1 Introduction: Hyper-parameter Tuning under a Budget Constraint
  2.2 Related Work
  2.3 Problem Statement
    2.3.1 Preliminaries
    2.3.2 Optimal Solution with Perfect Information
    2.3.3 Sequential Decision Making
  2.4 Approach
    2.4.1 Action-value Function Q
    2.4.2 Belief Model
    2.4.3 Behavior of Q
    2.4.4 Practical Budgeted Hyper-parameter Tuning Algorithm
    2.4.5 Discussion
  2.5 Experiment
    2.5.1 Experiment Description
    2.5.2 Results on Synthetic Data
    2.5.3 Results on Real-world Data
  2.6 Conclusion
  2.7 Details
    2.7.1 Freeze-thaw GP
    2.7.2 Experimental Details
    2.7.3 Other Results and Analysis

Chapter 3 SPEECH SENTIMENT ANALYSIS
  3.1 Introduction: Speech Sentiment Analysis
  3.2 Approach
    3.2.1 End-to-end ASR Models
    3.2.2 Sentiment Decoder on ASR Features
    3.2.3 Spectrogram Augmentation
  3.3 Experiment
    3.3.1 Experiment Setup
    3.3.2 Sentiment Classification
    3.3.3 Ablation Study
    3.3.4 Attention Visualization
  3.4 Error Analysis
  3.5 Conclusion

Part II Leverage Training Information for Robust Deep Learning

Chapter 4 BACKGROUND: UNCERTAINTY IN MACHINE LEARNING
  4.1 What is Uncertainty?
    4.1.1 Frequentist Inference
    4.1.2 Bayesian Inference
  4.2 How Do We Model Uncertainty?
    4.2.1 Bayesian Neural Network
      4.2.1.1 What is a Bayesian Neural Network?
      4.2.1.2 Scale up Bayesian Neural Networks
    4.2.2 Frequentist Approaches
    4.2.3 Discussion and Comparison
  4.3 How Do We Evaluate Uncertainty?
    4.3.1 Calibration
    4.3.2 Out-of-distribution (OOD) Samples Detection
    4.3.3 Adversarial Robustness
  4.4 Improving Uncertainty and Robustness for Deep Learning
  4.5 Uncertainty Beyond Supervised Learning

Chapter 5 UNCERTAINTY ESTIMATION WITH INFINITESIMAL JACKKNIFE AND MEAN-FIELD APPROXIMATION
  5.1 Introduction
  5.2 Approach
    5.2.1 Infinitesimal Jackknife and Its Distribution
    5.2.2 Sampling Based Uncertainty Estimation with the Pseudo-ensemble Distributions
    5.2.3 Mean-field Approximation for Gaussian-softmax Integration
    5.2.4 Other Implementation Considerations
  5.3 Related Work
  5.4 Experiments
    5.4.1 Setup
    5.4.2 Infinitesimal Jackknife on MNIST
    5.4.3 Comparison to Other Approaches
  5.5 Details and More Experiment Results
    5.5.1 Derivation of Mean-Field Approximation for Gaussian-softmax Integration
    5.5.2 Definitions of Evaluation Metrics
    5.5.3 Hyper-parameters of Main Experiments
    5.5.4 Apply the Mean-field Approximation to Other Gaussian Posterior Inference Tasks
    5.5.5 More Results on CIFAR-10 and CIFAR-100
    5.5.6 Regression Experiments
    5.5.7 More Results on Comparing All Layers versus Just the Last Layer
  5.6 Conclusion

Part III Large-scale Kernel Learning

Chapter 6 RANDOM KITCHEN SINKS FOR LARGE-SCALE MACHINE LEARNING
  6.1 Introduction
  6.2 Related work
  6.3 Kernel Methods and Random Features
    6.3.1 Use Random Features in Classifiers
  6.4 Our Approach
    6.4.1 Parallel Optimization for Large-scale Kernel Models
    6.4.2 Learning Kernel Features
  6.5 Experiments
    6.5.1 General Setup
    6.5.2 Computer Vision Tasks
    6.5.3 Automatic Speech Recognition (ASR)
    6.5.4 Computational Efficiency
    6.5.5 Do Kernel and Deep Models Learn the Same Thing?
  6.6 Conclusion
  6.7 Details
    6.7.1 Proof of Theorem 3
    6.7.2 Image Recognition
    6.7.3 Automatic Speech Recognition

Reference List

List of Tables

2.1 Comparison of tuning algorithms
2.2 Notations
2.3 Data and Evaluation Metric
2.4 Computation time of different algorithms
3.1 Speech sentiment datasets
3.2 Speech sentiment analysis performances of different methods
3.3 Ablation study on IEMOCAP dataset
3.4 Examples of disagreements between annotators
4.1 Notations in Chapter 4
5.1 Datasets for Classification Tasks and Out-of-Distribution (OOD) Detection
5.2 Performance of Infinitesimal Jackknife Pseudo-Ensemble Distribution on MNIST
5.3 Comparing different uncertainty estimate methods on in-domain tasks (lower is better)
5.4 Comparing different methods on out-of-distribution detection (higher is better)
5.5 Hyper-parameters of neural network trainings
5.6 Uncertainty estimation with SWAG on MNIST
5.7 Uncertainty estimation on CIFAR-10
5.8 Out-of-Distribution Detection on CIFAR-100
5.9 NLLs and RMSEs of different methods on regression benchmark datasets
5.10 Performance of infinitesimal jackknife pseudo-ensemble distribution on NotMNIST, comparing all layers versus just the last layer
6.1 Gaussian and Laplacian kernels, together with their sampling distributions p(ω)
6.2 Handwritten digit recognition error rates (%)
6.3 Object recognition error (%)
6.4 Best Token Error Rates on Test Set (%)
6.5 Combining best performing kernel and DNN models
6.6 Kernel Methods on MNIST-6.7M (error rates %)
6.7 DNN on MNIST-6.7M (error rates %)
6.8 Kernel Methods on CIFAR-10 (error rate %)
6.9 DNN on CIFAR-10 (error rate %)
6.10 Best perplexity and accuracy by different models
6.11 Performance of RBM acoustic models
6.12 Performance of single Laplacian kernel
6.13 Best Token Error Rates on Test Set (%)
6.14 Detailed Comparison on TER for Bengali
6.15 Token Error Rates (%) for Combined Models

List of Figures

1.1 Illustration of a deep learning training pipeline
1.2 Illustration of hyper-parameter optimization
1.3 Overfitting under inadequate labeled training data, as in speech sentiment analysis
1.4 Examples of uncertainty estimation
2.1 Visualization of Q[a]
2.2 Normalized Regret R_B on 100 Synthetic Sets
2.3 Budgeted Optimization on 100 Synthetic Sets
2.4 Real-world Hyper-parameter Tuning Tasks
2.5 Training curves of AlexNets and ResNets
2.6 Budget-adaptive behavior of BHPT versus Hyperband on an architecture selection task
2.7 Budgeted Optimization on 100 Synthetic Sets: ablations of BHPT
3.1 Pre-trained features from an e2e ASR model for sentiment analysis
3.2 Class distribution of speech sentiment datasets
3.3 Attention visualization on IEMOCAP utterances from different sentiment classes
3.4 Confusion matrix for inter-annotator agreement
3.5 Confusion matrix for evaluation using different input feature types
4.1 Overview of Chapter 4
4.2 Illustration of a Bayesian neural network, compared to a conventional deterministic neural network
4.3 Demonstration of undesired predictions of popular neural network architectures on out-of-distribution samples
4.4 caption
4.5 An example of adversarial examples
5.1 Effects of the softmax and ensemble temperatures on NLL and AuROC
6.1 t-SNE embeddings of data representations learnt by kernel and by DNN
6.2 Contrast of the learned representations by kernel models and DNNs

Abstract

Deep neural nets have exhibited great success on a wide range of machine learning problems across various domains, such as speech, image, and text. Despite their strong prediction performance, there are rising concerns about deploying these "in-the-lab" machine learning models widely in the wild. In this thesis, we study two of the main challenges in deep learning: efficiency, both computational and statistical, and robustness. We describe a set of techniques that address these challenges by intelligently utilizing information from the training process, going beyond the common recipe of a single point estimate of the optimal model.

The first part of the thesis studies the efficiency challenge. We propose a budgeted hyper-parameter tuning algorithm to improve the computational efficiency of hyper-parameter tuning in deep learning. It estimates and utilizes the trend of training curves to adaptively allocate resources for tuning, and demonstrates improved efficiency over state-of-the-art tuning algorithms. We then study statistical efficiency on tasks with limited labeled data, focusing on the task of speech sentiment analysis.
We apply pre-training using automatic speech recognition data and solve sentiment analysis as a downstream task, which greatly improves the data efficiency of sentiment labels.

The second part of the thesis studies the robustness challenge. Motivated by resampling methods in statistics, we propose the mean-field infinitesimal jackknife (mfIJ) – a simple, efficient, and general-purpose plug-in estimator for uncertainty estimation. The main idea is to approximate the ensemble of an infinite number of infinitesimal jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to classification predictions when the softmax layer is sampled from a Gaussian. mfIJ improves the model's robustness against invalid inputs by leveraging the curvature information of the training loss.

Chapter 1
INTRODUCTION

Deep neural networks (NNs) are powerful black-box predictors that have achieved state-of-the-art performance on a wide range of machine learning problems. New architectures, like skip connections [78] and Transformers [212], as well as "tricks of the trade", like dropout [204] and batch normalization [96], are frequently invented to further improve prediction accuracies on particular applications, sometimes even exceeding human benchmarks [148]. However, beyond high test accuracies, there are challenges and rising concerns for machine learning systems deployed in real-world scenarios. For instance, multiple fatalities involving autonomous driving systems have been reported so far [218], which raises the issue of AI safety. As another example, a policy report [165] points out that discrimination can sometimes "be the inadvertent outcome of the way big data technologies are structured and used", which raises concerns about machine learning applications that involve data about people and "the potential of encoding discrimination in automated decisions". An incomplete list of challenges includes computation and data efficiency, robustness, security, fairness, privacy, and interpretability.

In this chapter, we first describe the common recipe used in deep neural network training in Section 1.1, and then introduce three challenges in the current pipeline: computation efficiency, data efficiency, and robustness in Section 1.2. In Section 1.3 we briefly summarize our contributions toward addressing these challenges.

1.1 Deep Learning Recipe

We first introduce the necessary notation and describe the problem setup. We use supervised learning as a concrete example to illustrate the idea, but the recipe generalizes to other settings with proper modifications.

Training data. Assume we are given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $(x_i, y_i)$ drawn i.i.d. from an unknown joint distribution $P_{XY}$, where $\mathcal{Y}$ can be a discrete domain in a classification problem, or a continuous domain in a regression problem.

Predictive model. Assume we have a deep model which defines a prediction function $f_\theta: \mathcal{X} \to \mathcal{Y}$, where $\theta$ is the parameter of the neural network, i.e., its weights and biases. We will also call $f_\theta(x)$ a predictor.

Loss function. We measure the success of a prediction by a loss function $\ell(f_\theta(x), y)$, for example the squared loss $\|y - f_\theta(x)\|^2$ in a regression problem. We want to find the parameter $\theta$ which minimizes $\ell_{\mathcal{D}}(f_\theta)$, the sum of empirical losses over the training set,

$$\min_\theta \ell_{\mathcal{D}}(f_\theta) = \min_\theta \sum_{(x_i, y_i) \in \mathcal{D}} \ell(f_\theta(x_i), y_i). \tag{1.1}$$

The minimization in Eq. 1.1 is usually done by stochastic optimization algorithms, for example stochastic gradient descent (SGD) and Adam [105], due to their scalability and good generalization performance.
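To make the recipe concrete, the following is a minimal sketch of minimizing Eq. 1.1 with SGD in PyTorch; the model, data loader, loss, and learning rate are hypothetical placeholders for illustration, not the exact setup used in this thesis.

```python
import torch

def train(model, loader, epochs, lr=0.1, weight_decay=0.0):
    """Minimize Eq. 1.1 over theta with SGD; returns a point estimate theta*."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = torch.nn.CrossEntropyLoss()   # the loss l(f_theta(x), y)
    for _ in range(epochs):
        for x, y in loader:                 # mini-batches drawn from D
            opt.zero_grad()
            loss = loss_fn(model(x), y)     # empirical loss on the batch
            loss.backward()                 # gradient w.r.t. theta
            opt.step()                      # SGD update of theta
    return model                            # theta*: a single point estimate
```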
Hyper-parameters. Since the loss $\ell$ is non-convex w.r.t. $\theta$ in neural networks, there exist many local minimizers of the loss landscape. Therefore, the minimizer obtained from training can be greatly affected by high-level knobs of the optimization, such as the learning rate and regularization. We refer to these high-level design choices as hyper-parameters, and denote them as $\lambda$. Sometimes $\lambda$ can also include design choices about $f$, for example the architecture of $f$ (number of layers, number of hidden units, types of activation functions). Most of the time, $\lambda$ is a vector.

Training pipeline. To sum up, given a training set $\mathcal{D}$ and a configuration of hyper-parameters $\lambda$, the training of a neural network model can be described as

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}; \lambda). \tag{1.2}$$

The output model $\theta^*$ depends on both $\mathcal{D}$ and $\lambda$. Here we do not explicitly specify the form of $\mathcal{L}(\theta; \mathcal{D}; \lambda)$, but we emphasize that it is determined by both the empirical risk $\ell_{\mathcal{D}}(f_\theta)$ (Eq. 1.1) and (some dimensions of) the hyper-parameters $\lambda$. For example, $\mathcal{L}(\theta; \mathcal{D}; \lambda) = \ell_{\mathcal{D}}(f_\theta) + \lambda\|\theta\|^2$ is the sum of the empirical risk and a weight decay regularizer. As another example, $\mathcal{L}$ can be a stochastic version of $\ell_{\mathcal{D}}(f_\theta)$ when dropout is applied. The whole pipeline is summarized in Figure 1.1.

[Figure 1.1: Illustration of a deep learning training pipeline: Given a training set $\mathcal{D}$ and a configuration of hyper-parameters $\lambda$, we run stochastic optimization on $\min_\theta \mathcal{L}(\theta; \mathcal{D}; \lambda)$, which leads to an optimal solution $\theta^*$. We can view $\theta^*$ as a point estimate of the optimal model parameter.]

Test phase. At test time, we are given a test input $x'$, and the model outputs a prediction $f_{\theta^*}(x')$. Ideally $x'$ should be drawn independently from the same distribution $P_{XY}$ as the training set (only that $y'$ is not revealed to us), in which case statistical guarantees over the behavior of $f_{\theta^*}(x')$ are provided. However, in real-life deployment of learning systems, we have little or no control over the test input, and the i.i.d. assumption can be violated.

This simple yet highly effective training recipe has achieved unprecedented success in test prediction performance, on both well-benchmarked tasks and real-world applications, and has generated major impact. Some of the factors that contribute to its success are: (1) learning from a huge amount of (labeled) training data; (2) highly flexible models with millions to billions of parameters; (3) simple but effective optimization methods such as stochastic gradient descent; (4) combatting overfitting with new schemes such as dropout [204]; and (5) computing with massive parallelism on CPUs and GPUs.

1.2 Challenges

Despite high test prediction performance, many challenges arise when deploying the deep learning pipeline in the wild to solve real-world problems. In what follows, we describe three: computation efficiency, data efficiency, and robustness.

1.2.1 Challenge I: Computation Efficiency (AutoML)

[Figure 1.2: Illustration of hyper-parameter optimization: We run the training pipeline for a set of K different hyper-parameter configurations $\{\lambda_1, \lambda_2, \ldots, \lambda_K\}$, and obtain K output models $\{\theta_1^*, \ldots, \theta_K^*\}$. We select the best one $\theta_m^*$ among the K according to some evaluation metric.]
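As a companion to Figure 1.2, here is a minimal sketch of the outer selection loop, reusing the hypothetical train function from the previous snippet; the configuration dictionaries and the validation_error helper are illustrative assumptions.

```python
def tune(configs, make_model, train_loader, val_loader, epochs):
    """Run the pipeline of Eq. 1.2 for each config and keep the best model."""
    best_err, best_model = float("inf"), None
    for lam in configs:                              # {lambda_1, ..., lambda_K}
        model = train(make_model(lam), train_loader, epochs,
                      lr=lam["lr"], weight_decay=lam["weight_decay"])
        err = validation_error(model, val_loader)    # assumed evaluation metric
        if err < best_err:                           # keep theta*_m, the best so far
            best_err, best_model = err, model
    return best_model
```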
Nowadays, developing the deep learning pipeline (Figure 1.1) typically requires substantial human effort, sometimes even expert knowledge, to determine the high-level knob $\lambda$, such as the type of data preprocessing, the choice of network architecture, and optimization hyper-parameters, all of which can affect the output performance significantly. There is an ever-growing demand to automate this process so that learning systems can be used off-the-shelf without a human in the loop [174]. Automated Machine Learning, or AutoML [56], which automatically (without human input) outputs test set predictions given a training dataset as input, possibly subject to some computational constraints, has been proposed to achieve this goal.

One of the key tasks in AutoML is hyper-parameter optimization (HPO): searching for a configuration of optimal hyper-parameters $\lambda^*$ which generates the best output model $\theta^*$, for example in terms of generalization performance on a separate validation set. See Figure 1.2 for an illustration. We introduce the notation

$$\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}; \lambda) := \mathcal{T}(\lambda) \tag{1.3}$$

to denote the mapping of a hyper-parameter $\lambda$ to the output model $\theta^*$. This is a challenging optimization task: $\lambda$ is related to $\theta^*$ through $\mathcal{T}(\lambda)$ (Eq. 1.3), a black-box function which has no simple closed form (nor gradient), and whose properties, such as smoothness and differentiability, are unknown. Besides, $\mathcal{T}(\lambda)$ is very expensive to evaluate: it involves running the primary training algorithm to completion (one full training curve in the blue box of Figure 1.2). How to efficiently search for a good $\lambda$ under practical computational constraints remains an open problem.

1.2.2 Challenge II: Data Efficiency

A critical component of the deep learning pipeline is having hundreds of thousands of labeled training examples for the models to learn from. However, labeling data requires significant human labor, which is very expensive, and it is not always feasible to have labels available for a specific task. For example, in speech sentiment analysis the largest public dataset consists of only 12 hours of annotated audio.

[Figure 1.3: For applications where we don't have adequate labeled training data, like speech sentiment analysis, overfitting is a big challenge.]

Under this limited-label regime, neural networks are notoriously prone to overfitting. Figure 1.3 demonstrates this phenomenon by plotting the error curves of the training set and the test set throughout the optimization procedure. There is a critical point after which the learning overfits: further decrease in training error leads to increased test error. How to push back the overfitting point – in other words, how to improve data efficiency under limited labels – is a big challenge in practice.

1.2.3 Challenge III: Robustness (Uncertainty Estimation)

[Figure 1.4: Examples of uncertainty estimation: We have a classifier $\theta^*$ trained on handwritten digit images $\mathcal{D}$. At test time, we feed in different input images. In one case (the top image in the first column), an image of the digit "6" is adversarially perturbed, but is still recognizable to a human as "6". The classifier, however, wrongly predicts it as "5" with 92.7% confidence. In the second case (the bottom image in the first column), the image is a letter "J" that does not belong to the training data distribution, i.e., it is not a digit. The classifier, however, wrongly predicts it as "3" with 99.9% confidence. The predictions do not faithfully reflect the uncertainty.]
Uncertainty estimation is a long-standing topic in machine learning, studied both in the Bayesian community, including Bayesian neural networks [140] and Gaussian processes [182], and from the frequentist perspective, as in the statistical bootstrap [22, 50]. There has been a resurgence of interest in this topic recently, especially for popular deep learning models, due to its close connection to robustness – for example, can our handwritten digit classifier output "I do not know" if we feed it a cat picture at test time? – and to even more critical issues of AI safety – can the network protect users against adversarial attacks [170]? These questions are no longer futuristic academic inquiries, but urgent practical concerns if we want to advance the impact of machine learning in real-world applications like autonomous driving and medical diagnosis.

In supervised learning, uncertainty estimation characterizes the property and behavior of our predictions at test time. See Figure 1.4 for two illustrative examples. Moreover, uncertainty estimation can be crucial in tasks where only a small amount of data is available. For example, in active learning, where the model chooses what data (labels) to learn from, good uncertainty estimates can help identify the most informative data points to query and train on [59]. In (deep) reinforcement learning, the correct notion of uncertainty is critical for efficient exploration [168].

Despite the importance of accurate uncertainty estimation, deep learning recipes have been found to fail miserably on related tasks, such as calibration [73, 156] and out-of-distribution sample detection (detecting test inputs which violate the i.i.d. assumption) [194]. Despite rapid recent progress, obtaining accurate uncertainty estimates for deep neural networks is far from solved.

1.3 Thesis Organization

In this thesis, we investigate methods that leverage information generated during neural network training, such as intermediate model parameters and the curvature of the loss function, to address the efficiency and robustness challenges. The insight is that much of the information available in the stochastic optimization of neural network training is helpful for improving efficiency and model robustness.

The dissertation is divided into three parts. In the first part (Chapters 2 and 3), we study the efficiency challenges: Chapter 2 focuses on the computation efficiency challenge (AutoML) and Chapter 3 on the data efficiency challenge. Concretely, in Chapter 2 we examine the trend in neural network training curves, for example error rate versus training epochs in a classification task, and use it to speed up hyper-parameter optimization. In Chapter 3, we investigate the benefit of pre-trained features from end-to-end ASR models for improving data efficiency in speech sentiment analysis as a downstream task. In the second part (Chapters 4 and 5), we study uncertainty estimation and robustness. In particular, we propose to utilize the curvature of the training loss to reason about predictive uncertainty. In the third part (Chapter 6), we present a separate research effort on scaling up kernel methods and comparing them with deep neural networks.
The remainder of the dissertation is outlined as follows. Chapter 2 introduces a novel approach to speed up hyper-parameter optimization by examining trends of learning curves. Chapter 3 proposes a novel pre-training approach for speech sentiment analysis, where labeled data is scarce. Chapter 4 is a survey of methods for uncertainty estimation and robustness in deep learning, from both Bayesian and frequentist points of view. Chapter 5 proposes a method to estimate uncertainty for deep neural networks based on the curvature information of the loss. Chapter 6 presents separate work on large-scale kernel machines.

Part I
Leverage Training Information for Efficient Deep Learning

Chapter 2
AUTOML: HYPER-PARAMETER OPTIMIZATION

In this chapter, we study the AutoML challenge and improve the computational efficiency of hyper-parameter optimization in deep learning systems by leveraging the trend in training curves. Specifically, we introduce the Budgeted Hyper-parameter Tuning algorithm (BHPT), a state-of-the-art hyper-parameter tuning algorithm. We formulate a sequential decision making problem and propose an algorithm which uses long-term predictions with an action-value function to balance the exploration-exploitation tradeoff; it exhibits budget-adaptive tuning behavior. We empirically validate the proposed approach on real-world tuning tasks and compare against various strong baselines.

2.1 Introduction: Hyper-parameter Tuning under a Budget Constraint

Hyper-parameter tuning is of crucial importance to designing and deploying machine learning systems. Broadly, hyper-parameters include the architecture of the learning models, regularization parameters, optimization methods and their parameters, and other "knobs" to be tuned. It is challenging to explore the vast space of hyper-parameters efficiently to identify the optimal configuration. Quite a few approaches have been proposed and investigated: random search, Bayesian optimization (BO) [200, 195], bandits-based Hyperband [100, 130], and meta-learning [30, 11, 58]. Many of those prior studies have focused on reducing, as much as possible, the computation cost of obtaining the optimal configuration.

In this work, we look at a different but important perspective on hyper-parameter optimization: under a fixed time/computation cost, how can we improve performance as much as possible? Concretely, we study the problem of hyper-parameter tuning under a budget constraint. The budget offers practitioners a tradeoff: affordable time and resources balanced with models that are good – the best models that one can afford. Often this is a more realistic scenario in developing large-scale learning systems, and it is especially applicable, for example, when the practitioner searches for the best model under the pressure of a deadline.

The budget constraint certainly complicates the hyper-parameter tuning strategy. While a strategy without the constraint is simply to explore and exploit in the hyper-parameter configuration space, a budget-aware strategy needs to decide how much to explore and exploit with respect to the remaining resource/time. As most learning algorithms are iterative in nature, a human operator would monitor the training progress of different configurations and make judgement calls on their potential future performance, based on what the tuning procedure has achieved so far and how much resource remains.
For example, as the deadline approaches, he/she might decide to exploit current best configuration to further establish its performance, instead of exploring a potentially better configuration as if he/she had unlimited time. Then how can we automate this process? We formalize this inquiry into a sequential decision making problem, and propose an algo- rithm to automatically achieve good resource utilization in the tuning. The algorithm uses a belief model to predict future performances of configurations. We design an action-value function, in- spired by the idea of the Value of Information [43], to select the configurations. The action-value function balances the tradeoff between exploration – gather information to estimate the training curves, and exploitation – achieve good performance under the budget constraint. We first discuss related work in Section 2.2. The rest of the chapter is organized as follows. In Section 2.3, we formally define the problem and introduce the sequential decision making formulation. In Section 2.4, we describe the proposed algorithm with analysis. Experiments can be found in Section 2.5 respectively. Implementation details and more discussions can be found in Section 2.7. 2.2 Related Work Automated design of machine learning system is an important research topic [200]. Traditionally, hyper-parameter optimization (HPO) is formulated as a black-box optimization and solved by Bayesian optimization (BO). It uses a probabilistic model together with an acquisition function, to adaptively select configurations to train in order to identify the optimal one [195]. However of- tentimes, fully training a single configuration can be expensive. Therefore recent advances focus to exploit cheaper surrogate tasks to speed up the tuning. For example, Fabolas [109] evaluates configurations on subsets of the data and extrapolates on the whole set; FreezeThaw [206] uses the partial training curve to predict the final performance. Recently Hyperband [130] formulates the HO as a non-stochastic best-arm identification problem. It proposes to adaptively evaluate a configuration’s intermediate results and quickly eliminate poor performing arms. In a latest work, [54] combines the benefits ofBO and Hyper- band to achieve fast convergence to the optimal configuration. Other HO work includes gradient- based approach [142, 57], meta-learning [98, 58], and the spectral approach [77]. Please also refer to Table 2.1 in Section 2.5 for a comparison. While all these works improve the efficiency of hyper-parameter tuning for large-scale learn- ing systems, none of them has an explicit notion of (hard) budget constraint for the tuning process. Neither do they consider to adapt the tuning strategy across different budgets. On the contrary, we propose to take the resource constraint as an input to the tuning algorithm, and balance the exploration and exploitation tradeoff w.r.t. a specific budget constraint. Table. 2.1 summarizes the comparisons of popular tuning algorithms from three perspectives: whether it uses a (probabilistic) model to adaptively predict and identify good configurations; whether it supports early stop and resume; and whether it adapts to different budgets. There are other budgeted optimization formulations solving problems in different domains. [124] proposed an approximate dynamic programming approach (Rollout) to solve BO under a finite budget, see Section 2.4.5 and 2.5 for discussions and comparisons. 
For the budget optimiza- tion in online advertising, [21] studies the constrained MDP [205], where in a Markov Decision 9 Table 2.1: Comparison of tuning algorithms algorithm early stop adaptive (future) budget and resume prediction aware random search BO (GP-EI/SMAC) X Fabolas X X FreezeThaw X X Hyperband X Rollout X X X BHPT, BHPT-" (ours) X X X Process (MDP) each action also incurs some cost, and the policy should satisfy constraints on the total cost. However, unlike in advertising that the cost of different actions can vary or even be random, in hyper-parameter tuning the human operator usually can pre-specify and determine the cost (of actions) – train the configuration for an epoch and stop. Thus, a finite-horizon formulation is simpler and more suitable for the tuning task compared to the constrained formulation. The sequential decision making problem introduced in this work is an instance of extreme bandits [162]. A known challenge in extreme bandits is that the optimal policy for the remaining horizon depends on the past history, which implies that the policy should not be state-less. On the other hand, if we formulate the hyper-parameter tuning as a MDP, we suffer from that the states are never revisited. Existing approaches in MDP [64] can not be directly applied. 2.3 Problem Statement In this section, we start by introducing notations, and formally define the budgeted hyper-parameter tuning problem. Finally we formulate the budgeted tuning task into a sequential decision making problem. 2.3.1 Preliminaries Configuration (arm) Configuration denotes the hyper-parameter setting, e.g. the architecture, the optimization method. We use [K] =f1; 2;:::;Kg to index the set of configurations. The term configuration and arm are used interchangeably. (See Table 2.2 below for a full list notations.) Model Model refers to the (intermediate) training outcome, e.g. the weights of neural nets, of a particular configuration. We evaluate the model on a heldout set periodically, for example every epoch. We consider loss or error rate as the evaluation metric. Assume we have run configuration k for b epochs, and we keep track of the loss of the best model as k b 2 [0; 1), the minimum loss fromb epochs. Note that k b is a non-increasing function inb. We always use superscript to denote the configuration, and the subscript for the budget/epoch. 10 Budget The budget defines a computation constraint imposed on the tuning process. In this work, we consider a training epoch as the budget unit 1 , as this is an abstract notion of computation resource in most empirical studies of iterative learning algorithms. Given a total budgetB2N + , a strategyb = (b 1 ;:::;b K )2 N K allocates the budget amongK configurations, i.e.k runsb k epochs respectively.b should satisfy that the total epochs fromK arms add up toB:b T 1 K =B. We use epoch and budget interchangeably when there is no confusion. Constrained Optimization The goal of the budgeted hyper-parameter tuning task is to obtain a well-optimized model under the constraint. Under the allocation strategyb, arm k returns a model with loss k b k . We search for the strategy, which optimizes the loss of the best model out ofK configurations,` B = minf 1 b 1 ;:::; K b K g. Concretely, the constrained optimization problem is min b ` B ; s.t.b T 1 K =B: (2.1) 2.3.2 Optimal Solution with Perfect Information Despite a combinatorial optimization, if we know the training curves of all configurations (known k b 8k andb B), the solution to Eq. 
2.3.2 Optimal Solution with Perfect Information

Although Eq. 2.1 is a combinatorial optimization, if we knew the training curves of all configurations (i.e., $\nu_b^k$ for all $k$ and all $b \leq B$), the solution would have a simple structure: (one of) the optimal planning paths must be a greedy one, in which the entire budget is invested in a single configuration, due to the non-increasing nature of $\nu_{b_k}^k$ w.r.t. $b_k$. Hence the problem becomes finding the optimal arm $c^*$, which attains the smallest loss after $B$ epochs; the optimal plan is simply to invest all $B$ units in $c^*$:

$$\ell_B^* = \min_k \nu_B^k, \qquad c^*(B) = \arg\min_k \nu_B^k. \tag{2.2}$$

Namely, $\nu_B^{c^*} = \ell_B^*$. Note that $c^*$ is budget dependent. In practice, however, computing Eq. 2.2 is infeasible: knowing all $K$ values of $\nu_B^k$ already uses up $BK$ epochs, exceeding the budget $B$! In other words, the challenge of budgeted tuning lies in trying to attain the minimum loss $\ell_B^*$ (which reflects $BK$ epochs of information) with partial information of the curves from only $B$ observations/epochs.

2.3.3 Sequential Decision Making

The difficulty of unknown $\nu_B^k$ comes from the fact that training an iterative learning algorithm is innately sequential – we cannot know the $B$-th epoch loss $\nu_B^k$ without actually running the first $B-1$ epochs, using up the budget. On the other hand, this challenge can be attacked naturally in the framework of sequential decision making: the partial training curve is indicative of its future, and thus informs decisions early on, without wasting budget to finish the whole curve – we have all had the experience of looking at a training curve and "killing" the running job because "this training is going nowhere"! The key insight is that we want to use the information observed from past epochs to decide which configuration to run or stop in the future. This motivates us to formulate budgeted tuning as a finite-horizon sequential decision making problem.

The training curves can be seen as an oblivious adversary environment [23]: the loss sequences of the $K$ configurations are pre-generated before tuning starts. At the $n$-th step, the tuning algorithm selects an action/configuration $a_n \in [K]$ to run for the next budget unit/epoch, and the curve returns the corresponding observation/loss $z_n \in [0, 1)$ from configuration $a_n$. Hence we get a trajectory of configurations and losses, $\tau_n = (a_1, z_1, \ldots, a_n, z_n)$. We define the policy $\pi$ that selects actions as $a_n = \pi(\tau_{n-1}) = \pi(a_1, z_1, \ldots, a_{n-1}, z_{n-1})$, a function from the history to the next arm. This process is repeated for $B$ steps until the budget is exhausted, and the final tuning output is

$$\ell_B^\pi = \min_{1 \leq n \leq B} z_n. \tag{2.3}$$

Note that $\ell_B^\pi$ coincides with the original objective $\ell_B$ in Eq. 2.1. In this way, we can solve the budgeted optimization problem (Eq. 2.1) in a sequential manner, leveraging information from partial training curves to make better decisions. The sequential formulation enables us to stop or resume the training of a configuration at any time, which is economical in budget. Besides, the algorithm can exploit knowledge of the budget $B$ to make better judgements and decisions. The framework can be extended to other settings, e.g., running multiple configurations in parallel, resources of heterogeneous types (CPU, GPU), etc. While some of these extensions are straightforward, others require more careful design, which is left as future work.

Regret. The goal of the sequential problem is to optimize the policy $\pi$ such that the regret of our output against the optimal solution (Eq. 2.2) is minimized:

$$\min_\pi \ell_B^\pi - \ell_B^* = \min_\pi \left( \min_{1 \leq n \leq B} z_n \right) - \nu_B^{c^*}. \tag{2.4}$$

Note that $\ell_B^*$ assumes knowledge of all curves, which is infeasible in practice.
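To make the oracle of Eq. 2.2 and the regret of Eq. 2.4 concrete, here is a minimal sketch that computes both from fully known (e.g., simulated) training curves; the curve array layout is an assumption for illustration.

```python
import numpy as np

def oracle(curves, B):
    """curves[k, t-1] = y^k(t). Returns the oracle loss l*_B and arm c* (Eq. 2.2)."""
    nu_B = curves[:, :B].min(axis=1)     # nu^k_B: best loss of arm k in B epochs
    c_star = int(np.argmin(nu_B))
    return nu_B[c_star], c_star

def regret(observed_losses, curves):
    """observed_losses: z_1..z_B seen by a policy; returns l^pi_B - l*_B (Eq. 2.4)."""
    l_pi = min(observed_losses)          # l^pi_B, Eq. 2.3
    l_star, _ = oracle(curves, len(observed_losses))
    return l_pi - l_star
```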
Challenges. First of all, optimizing under unknown curves leads to the well-known exploration-exploitation tradeoff, the same as in multi-armed bandits. On one hand, it is tempting to pick configurations that have achieved small losses (exploitation); on the other hand, there is also an incentive to pick other configurations to see whether they admit even smaller losses (exploration). However, Eq. 2.4 is different from typical bandits, because the output $\ell_B^\pi$ takes the form of the minimal loss instead of the sum of losses. This leads to two key differences in the optimal policy: first, the optimal policy/configuration depends on the horizon $B$; second, at every step, the optimal policy for the remaining horizon depends on the history of actions. Therefore, the key step is to re-plan, at every step, given what has been done so far. A complete list of notations is summarized in Table 2.2.

Table 2.2: Notations

Problem statement:
  $[K]$ — set of configurations/arms
  $B$ — total budget, sum of epochs over all configurations
  $\mathbf{b} = (b_1, \ldots, b_K)$ — allocation: $b_k$ epochs run on configuration $k$; $\mathbf{b}^\top \mathbf{1}_K = B$
  $r$ — remaining budget at step $n$; $r = B - n$
  $y^k(t)$ — loss of configuration $k$ at the $t$-th epoch
  $\nu_{b_k}^k$ — best (minimum) loss of $k$ after $b_k$ epochs; $\nu_{b_k}^k = \min_{1 \leq t \leq b_k} y^k(t)$
  $\ell_B$ — final loss; $\ell_B = \min_k \nu_{b_k}^k$
  $c^*$ — optimal configuration; $c^* = \arg\min_k \nu_B^k$
  $\ell_B^*$ — optimal final loss; $\ell_B^* = \nu_B^{c^*} = \min_k \nu_B^k$

Sequential decision making:
  $a_n$ — action/configuration selected at the $n$-th step; $a_n \in [K]$
  $z_n$ — observed loss at the $n$-th step; $z_n = y^{a_n}(t)$ for some $t$
  $\tau_n$ — trajectory of configurations and losses; $\tau_n = (a_1, z_1, \ldots, a_n, z_n)$
  $\ell_B^\pi$ — loss of policy $\pi$; $\ell_B^\pi = \min_{1 \leq n \leq B} z_n$

Approach:
  $S_n$ — belief model at step $n$; $S_n = g(\tau_n)$, a function of the past trajectory
  $\nu_r^k$ — best (future) loss of $k$ with $r$ more epochs; $\nu_r^k = \min_{1 \leq t \leq r} y^k(t_0 + t)$, a random variable under the belief
  $\mu_r^k$ — expected best loss of $k$ with $r$ more epochs; $\mu_r^k = \mathbb{E}[\nu_r^k]$
  $\hat{c}$ — predicted top configuration; $\hat{c} = \arg\min_k \mu_r^k$
  $\mu_r^{1st}$ — predicted top configuration loss; $\mu_r^{1st} = \min_k \mu_r^k = \mu_r^{\hat{c}}$
  $\mu_r^{2nd}$ — predicted runner-up configuration loss; $\mu_r^{2nd} = \min_{k \neq \hat{c}} \mu_r^k$
  $\delta^{\hat{c}}$ — epochs short of convergence on arm $\hat{c}$; $\delta^{\hat{c}} = \arg\min_{1 \leq t \leq r} \mathbb{E}[y^{\hat{c}}(t_0 + t)]$

Others:
  $m_n$ — minimum loss observed so far; $m_n = \min_{1 \leq s \leq n-1} z_s$

2.4 Approach

In this section, we describe our budgeted tuning algorithm, which asymptotically attains the same performance as $\ell_B^*$ (Eq. 2.4). At every step, we re-plan for the remaining horizon by leveraging the optimal planning structure (Section 2.3.2), and employ the idea of Value of Information [43] to handle the exploration-exploitation tradeoff. Specifically, we design an action-value function (Section 2.4.1) to select the next action/configuration. The action-value function takes future predictions of the curves as input, computed with a Bayesian belief model (Section 2.4.2), and outputs a score measuring expected future performance. We obtain a new observation from the selected arm and update the belief model; a sketch of this loop is given below. We discuss how the action-value function balances the exploration-exploitation tradeoff in Section 2.4.3, and present the algorithm in Section 2.4.4. Lastly, in Section 2.4.5, we discuss and compare with the Bayes optimal solution.
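Below is a minimal sketch of this loop, with placeholder objects for the belief model and the per-epoch training step; the method names (q_value, update, run_one_epoch) are illustrative, not the thesis's actual implementation.

```python
def bhpt_loop(belief, arms, B):
    """Budgeted tuning loop: select by Q, observe one epoch, update the belief."""
    losses = []
    for n in range(1, B + 1):
        r = B - n                                       # remaining budget
        q = {a: belief.q_value(a, r) for a in arms}     # Q_r[a | S_{n-1}], Eq. 2.5
        a_n = min(q, key=q.get)                         # argmin_a Q_r[a]
        z_n = run_one_epoch(a_n)                        # train arm a_n for one epoch
        belief.update(a_n, z_n)                         # posterior update -> S_n
        losses.append(z_n)
    return min(losses)                                  # l^pi_B, Eq. 2.3
```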
S n1 is a random process derived from the past trajectory n1 , which allows us to simulate, and predict future outcomes. We compute an action-value function Q r [a] for each arm, and select the next action by minimizing it, a n =( n1 ) = arg min a2[K] Q r [ajS n1 ]: (2.5) We drop the S n1 in Q r [a] to simplify the notation when it’s clear. In what follows, we define Q as a form of expected best loss we are likely to obtain of` r , if we follow configurationa. First of all, recall the optimal planning of hyper-parameter tuning in Section 2.3.2, there are K candidates k r (the minimum loss from k in r epochs 3 ) for the optimal ` r . This structure effectively reduces the combinatorial search space 4 of to a constant factor ofK. Thus we use the predictive value of a r to construct Q r [a]. Since the belief S n1 is uncertain about the true curves, there is a distribution over values of k r . We would like do exploration to improve the estimates of k r s. Note that our focus is to correctly identify the best arm, instead of estimating all arms equally well. Inspired by the idea of Value of Information exploration in Bayesian reinforcement learning [43, 44], we quantify the gains of exploration in Q, by the amount of improvement on future loss. Specifically, what can be an informative outcome that updates the agent’s knowledge and leads to an improved future loss? There are two scenarios where the outcomes contradict to our prior belief: (a) when the new sample surprises us by showing that an action previously considered sub-optimal turns out to be the best arm, and (b) when the new sample indicates a surprise that an action that previously considered best is actually inferior to other actions. 3 With a slight abuse of notation, we use k r to denote its best performance in r more epochs, starting from the current epoch. 4 There are exactly r +K 1 K 1 ! K! different possible outcomes (ignore the sequence ordering). 14 Before we formally define Q r [a], we introduce the following notations. Define k r =E h k r jS n1 i ask’s expected best future loss. We can sort all arms based on their expected loss, and we call b c = arg min k k r the predicted top arm, our current guess ofc (Eq. 2.2). Denote 1st r = b c r as the top arm’s expected long-term performance, and 2nd r that of the runner up. Now consider in case (a) for a sub-optimal arma6=b c, it will update our estimate of` r when the sample a r < 1st r outperforms the previously considered best expected loss. We expect to gain 1st r a r by takinga instead ofb c. Define Q r [a] =E minf a r ; 1st r g = 1st r E h 1st r a r + i : (2.6) The second term on the r.h.s. computes the area when a r falls smaller than 1st r , which is called the value of perfect information (V oI) in decision theory [93]. It is a numerical value that measures the reduction of uncertainty, thus can assess the value for exploring action a. In our problem, intuitively it quantifies the average surprise ofa over all draws from the belief. Minimizing Q r [a] favorsa with large surprise/V oI. Similarly in case (b) when we consider the predicted top arma =b c, there is a surprise that the sample b c r > 2nd r falls behind with other candidates. We define Q r [b c] =E h minf b c r ; 2nd r g i = 1st r E b c r 2nd r + ; (2.7) where the second term computes the area when b c r unluckily falls right to 2nd r . Continuing withb c is favorable if it has high V oI, i.e. gains us much knowledge from this surprise. 
See Figure 2.1 for a visualization of the action-value function, where we assume a Gaussian distribution for the random variables $\lambda^k_r$.

Figure 2.1: Visualization of $Q[a]$.

Combining Eq. 2.6 and 2.7, the action-value function is

$Q_r[a] = \mathbb{E}\big[\min\{\lambda^a_r, \min_{k \ne a} \bar\lambda^k_r\}\big]. \qquad (2.8)$

Finally, note that $Q$ depends on the remaining horizon $r$, through the index in $\lambda^k_r$ and $\bar\lambda^k_r$, and on the history $\tau_{n-1}$, through the expectations w.r.t. the belief $S_{n-1}$. To analyze the exploration-exploitation tradeoff in $Q$, we need details of the belief model, which are explained next. We come back to the properties and behaviors of $Q$ in Section 2.4.3.

2.4.2 Belief Model

In this section, we briefly describe the Bayesian belief model we use, and explain how Eq. 2.8 is computed with the posterior distribution. The belief model captures our current knowledge of the training curves to predict the future $\lambda^k_r$, and gets updated as new observations become available. Our proposed algorithm can work with any Bayesian belief which models the training curves. In this work, we adopt the Freeze-Thaw GP [206].

We use $(k, t)$ to index the hyper-parameters and epochs respectively, and the loss of $k$ at the $t$-th epoch is $y_k(t)$. The joint distribution of losses from all configurations and epochs, $\mathbf{y} = [y_1(1), y_1(2), \ldots, y_1(n_1), \ldots, y_K(n_K)]^T$ (arm $k$ has $n_k$ epochs/losses), is given by

$\Pr(\mathbf{y} \mid \{(k, t)\}) = \mathcal{N}(\mathbf{y}; \mathbf{0}, \mathbf{K}_t + \mathbf{O} \mathbf{K}_x \mathbf{O}^T). \qquad (2.9)$

Assuming $N = \sum_k n_k$, the dimensionalities of the variables are $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{K}_x \in \mathbb{R}^{K \times K}_+$, $\mathbf{O} \in \{0, 1\}^{N \times K}$, and $\mathbf{K}_t \in \mathbb{R}^{N \times N}_+$. $\mathbf{O} = \mathrm{block\,diag}(\mathbf{1}_{n_1}, \ldots, \mathbf{1}_{n_K})$ is a block-diagonal matrix of vectors of ones. Kernel $\mathbf{K}_x$ models the correlation of asymptote losses across different configurations, while kernel $\mathbf{K}_t$ characterizes the correlation of losses from different epochs of the same configuration. The entries in $\mathbf{K}_t$ are computed via a specific Freeze-Thaw kernel, to capture the decay of losses over time. We can use the joint distribution (Eq. 2.9) to predict future performances of $y_k(t)$ for all $k, t$, and update the posteriors with new observations by applying Bayes' rule. Details of the belief model and the posterior distribution are postponed to Section 2.7.1 at the end of this chapter.

Computing $Q$ Note that the future best loss is a random variable, $\lambda^k_r = \min_{1 \le t \le r} y_k(t_0 + t)$. However, $\lambda^k_r$'s distribution is non-trivial to compute, as it is the minimum of $r$ correlated Gaussians. To simplify the computation (we could always use a Monte Carlo approximation instead; asymptotically, the simplification does not affect the behavior of $Q$, see Section 2.4.3), we approximate $\lambda^k_r \approx y_k(t_0 + t^\star_k)$, where we fix the time index $t^\star_k$ deterministically to be the one which achieves the minimum loss in expectation, $t^\star_k = \arg\min_{1 \le t \le r} \mathbb{E}[y_k(t_0 + t)]$. The intuition is that, over different random draws of the curve, $\lambda^k_r$ can be any one of $y_k(t_0 + t)$ for $1 \le t \le r$; but with high probability, $\Pr[\lambda^k_r = y_k(t_0 + t^\star_k)] \ge \Pr[\lambda^k_r = y_k(t_0 + t)]$ for all $t$. Thus we use the Gaussian $y_k(t_0 + t^\star_k)$ as $\lambda^k_r$. The advantage is that $Q_r[a]$ can then be computed efficiently in closed form:

$Q_r[a] = \mathbb{E}\big[\min\{\lambda^a_r, \theta\}\big] = \theta - \sigma[\lambda^a_r]\,\big(s\,\Phi(s) + \phi(s)\big), \qquad (2.10)$

where $s = (\theta - \mathbb{E}[\lambda^a_r]) / \sigma[\lambda^a_r]$ is the normalized distance of $\lambda^a_r$ to $\theta$, and $\theta$ is a constant, either $\bar\lambda^{\text{1st}}_r$ or $\bar\lambda^{\text{2nd}}_r$, depending on which $a$ we are looking at. $\sigma[\cdot]$ is the standard deviation, and $\Phi(\cdot)$ and $\phi(\cdot)$ are the cdf and pdf of the standard Gaussian respectively.
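The closed form is straightforward to implement. Below is a minimal Python sketch of Eq. 2.10 and the selection rule of Eq. 2.5/2.8, assuming the Gaussian beliefs (means and standard deviations of the $\lambda^k_r$'s) are given; the function names are ours, for illustration only.

from scipy.stats import norm

def q_value(mu, sigma, theta):
    # Q_r[a] = E[min(lambda, theta)] for lambda ~ N(mu, sigma^2), Eq. 2.10.
    # theta is lambda_bar^{1st} for a suboptimal arm a, or lambda_bar^{2nd}
    # for the predicted top arm c-hat.
    if sigma <= 0:                       # degenerate belief: no uncertainty left
        return min(mu, theta)
    s = (theta - mu) / sigma             # normalized distance of lambda to theta
    return theta - sigma * (s * norm.cdf(s) + norm.pdf(s))

def select_arm(means, stds):
    # Eq. 2.5 with Eq. 2.8: a_n = argmin_a E[min(lambda_a, min_{k != a} mean_k)].
    scores = []
    for a, (mu, sigma) in enumerate(zip(means, stds)):
        theta = min(m for k, m in enumerate(means) if k != a)
        scores.append(q_value(mu, sigma, theta))
    return min(range(len(means)), key=scores.__getitem__)

Note that for a suboptimal arm, min over the other means equals $\bar\lambda^{\text{1st}}_r$, and for $\hat{c}$ it equals $\bar\lambda^{\text{2nd}}_r$, so the single loop covers both cases of Eq. 2.6 and 2.7.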
2.4.3 Behavior of $Q$

In this section, we analyze the behavior of the proposed action-value function. With Gaussian-distributed $\lambda^k_r$'s, the proposed $Q$ has nice properties. We provide an asymptotic analysis of the behavior as the budget goes to infinity; nonetheless, it sheds light on the behavior under finite budgets. A finite-time analysis is more challenging, and is left as future work.

When $r$ goes to infinity, $\lambda^k_\infty$ is the asymptote loss of $k$. Besides, the posterior variance of $\lambda^k_\infty$ given by the GP always decreases with more observations, and is independent of the observed values. This leads to the following three properties.

First of all, all arms are picked infinitely often under $Q$. Note that the surprise area/VoI of any $k$ (the second term on the r.h.s. of Eq. 2.6 and 2.7) is always greater than 0 at any finite time, and approaches 0 in the limit. Hence an arm which has not been picked for a while will have a higher VoI relative to the rest (since the VoI of the other arms goes to 0), and will get selected again. Therefore $Q$ is asymptotically consistent: the best arm will be discovered as the budget goes to infinity, and for the objective in Eq. 2.4 we have $\ell^\pi_B - \ell^\star_B \to 0$ as $B \to \infty$.

Secondly, $Q$ balances exploration with exploitation, since configurations with either a smaller mean (exploitation) or a larger variance (exploration) are preferred; in either case the surprise area (the second term on the r.h.s. of Eq. 2.6 and 2.7) is large.

Thirdly, the limiting ratio of the number of pulls between the top and the second best arm approaches 1, assuming the same observation noise parameter; and the limiting ratio between any pair of suboptimal arms is a function of the sub-optimality gap [188].

2.4.4 Practical Budgeted Hyper-parameter Tuning Algorithm

In this section, we discuss the practical use of $Q$ in the budgeted tuning algorithm. Imagine the behavior of $Q$ when applied to hyper-parameter tuning. Intuitively, all configurations get selected often at the beginning, due to the high uncertainty. Gradually, as our curve estimates become more accurate with smaller uncertainty, we focus the actions on a few promising configurations with good future losses (small sub-optimality gaps). Finally, $Q$ mostly allocates budget among the top two configurations. In practice this can be a waste of resources, because we aim to finalize one model, instead of distinguishing the top two. We propose the following heuristics to fix it.

Budget Exhaustion Recall that $\hat{c}$ is the predicted best arm, and define $t^\star = \arg\min_{1 \le t \le r} \mathbb{E}[y_{\hat{c}}(t_0 + t)]$, the number of epochs configuration $\hat{c}$ is short of convergence, if we expect $\hat{c}$ to hit its minimum. We propose to check the condition $t^\star < r$, where $r$ is the remaining budget, to keep track of budget exhaustion. If it is false, we pick $\hat{c}$ (the else statement in Alg. 1). Note that since we still update our belief after the new observation, we can switch back to the selection rule $Q$ (the if statement in Alg. 1).

$\varepsilon$-Greedy with Confident Top Arm Another drawback of $Q$ is that, with a large number $K$ of arms, $Q$ pulls the top arm less frequently, because the suboptimal arms can aggregately take up a considerable amount of the budget. In fact, we might want to cut down the exploration when we are confident of the top arm. Thus we design the following action selection rule: with probability $\varepsilon$, we follow $Q_r[a] = \mathbb{E}[\min\{\lambda^a_r, \bar\lambda^{\text{1st}}_r\}]$ to select among the sub-optimal arms, and we select $\hat{c}$ the rest of the time,

$a_n = \pi_\varepsilon(\tau_{n-1}) = \begin{cases} \arg\min_{a \ne \hat{c}} Q_r[a \mid S_{n-1}], & \text{w.p. } \varepsilon, \\ \hat{c}, & \text{otherwise.} \end{cases} \qquad (2.11)$

We set $\varepsilon = 0.5$ in the algorithm.
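A minimal sketch of the selection rule in Eq. 2.11, reusing q_value from the earlier sketch (again, names are ours):

import random

def select_arm_eps(means, stds, eps=0.5):
    c_hat = min(range(len(means)), key=means.__getitem__)  # predicted top arm
    if random.random() < eps:
        # explore among suboptimal arms, scored against lambda_bar^{1st}
        theta = means[c_hat]
        candidates = [a for a in range(len(means)) if a != c_hat]
        return min(candidates, key=lambda a: q_value(means[a], stds[a], theta))
    return c_hat                                           # exploit c_hat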
Although we do not tune $\varepsilon$, $\pi_\varepsilon$ performs well, as demonstrated in the experiments. The algorithm is summarized in Alg. 1. We refer to the proposed algorithm as BHPT and BHPT-$\varepsilon$ in the experiment section.

Algorithm 1: Budgeted Hyperparameter Tuning (BHPT)
Input: Budget $B$, and configurations $[K]$.
1 for $n = 1, 2, \ldots, B$ do
2   if $t^\star < r$ then
3     $a_n = \pi(\tau_{n-1})$, or $a_n = \pi_\varepsilon(\tau_{n-1})$  // Eq. 2.5 and 2.8, or 2.11
4   else
5     $a_n = \hat{c}$.
6   Run $a_n$ and obtain loss $z_n = y_{a_n}(t)$ (for some $t$).
7   Use $z_n$ to update the belief $S_n$, $t^\star$ and $r$.  // Section 2.4.2 and 2.7.1
8 Output $\min_{1 \le n \le B} z_n$.

2.4.5 Discussion

In this section, we discuss and compare with the Bayes-optimal solution to the budgeted tuning problem. The Bayes-optimal solution [64] handles the challenge of simultaneously estimating the unknown curves and optimizing over the future horizon, by solving the following objective,

$\min_\pi \mathbb{E}_0[\ell^\pi_B] = \min_\pi \mathbb{E}[\ell^\pi_B \mid S_0] = \min_\pi \mathbb{E}\big[\min_{1 \le n \le B} z_n \mid S_0\big], \qquad (2.12)$

where $S_0$ is our prior belief over the unknown curves, and we use $\mathbb{E}_0[\cdot]$ as an abbreviation.

To start with, Eq. 2.12 minimizes the loss averaged across all curves from the prior. On the contrary, we try to find the optimal solution $\ell^\star_B$ w.r.t. a fixed set of unknown curves. Besides the conceptual difference, the Bayes-optimal solution is computationally taxing due to the exponential growth of the search space [72], as it considers subsequent belief updates. The reduction from the combinatorial space to a constant, as in our approach, is no longer applicable, because the optimal planning (Section 2.3.2) does not hold for Eq. 2.12.

Rollout There is abundant literature on efficient approximate solutions to Eq. 2.12 in dynamic programming (DP) [15]. We briefly sketch the rollout method in [124], which is the most applicable to our task. It applies the one-step lookahead technique, with two approximations to compute the future rewards: first, it truncates the horizon of belief updates to at most $h$ rolling steps, where $h$ is a parameter; secondly, it uses expected improvement (EI) from the BO literature as the sub-optimal rollout heuristic to collect the future rewards.

The derivation is as follows. Starting from Eq. 2.12, we have to be careful when identifying the optimal sub-structure to write out the Bellman equation, since we cannot trivially swap the expectation with the minimum. Rewrite Eq. 2.12 as an accumulated sum of intermediate improvements, by applying the trick $(c - x)_+ = c - \min\{c, x\}$:

$1 - \min\{z_1, \ldots, z_B\} = 1 - z_1 + \sum_{n=2}^{B} \big[\min\{z_1, \ldots, z_{n-1}\} - \min\{z_1, \ldots, z_n\}\big] = \sum_{n=1}^{B} \big[\theta_n - \min\{\theta_n, z_n\}\big] = \sum_{n=1}^{B} (\theta_n - z_n)_+,$

where we define $\theta_1 = 1$, and $\theta_n = \min_{1 \le s \le n-1} z_s$, the best loss in the history prior to step $n$. Then we have the following recursion for $\max_\pi V_1 = \max_\pi \mathbb{E}_0[1 - \ell^\pi_B] = \max_\pi \mathbb{E}_0[1 - \min_{1 \le n \le B} z_n]$:

$V_n := \mathbb{E}_{n-1}\big[\theta_n - \min\{\theta_n, \min_{n \le m \le B} z_m\}\big] = \mathbb{E}_{n-1}\Big[(\theta_n - z_n)_+ + \mathbb{E}_{n|n-1}\big[\underbrace{\theta_{n+1} - \min\{\theta_{n+1}, \min_{n+1 \le m \le B} z_m\}}_{V_{n+1}}\big]\Big]. \qquad (2.13)$

The Bellman equation between the tail value $V_n$ and $V_{n+1}$ is given in Eq. 2.13. $\theta_n$ is the "state" of the system, as it summarizes the past information (the best loss) and determines the future reward. We use $\mathbb{E}_{n-1}[\cdot]$ and $\mathbb{E}_{n|n-1}[\cdot]$ to distinguish the beliefs at different steps. The computational challenges in solving this DP are two-fold: firstly, the reachable states grow exponentially with the remaining horizon, due to the update $\mathbb{E}_{n|n-1}[\cdot]$; secondly, we do not have the optimal policy when computing the tail value $V_{n+1}$.

The rollout solution proposed in [124] is as follows. At step $n$, we expand to the rolling horizon $\tilde{N} = \min\{n + h, B\}$:

$H_{\tilde{N}}(\tilde{S}_{\tilde{N}}) = \mathbb{E}\big[(\theta_{\tilde{N}} - z_{a_{\tilde{N}}})_+ \mid \tilde{S}_{\tilde{N}}\big] = \mathrm{EI}\big[z_{a_{\tilde{N}}}; \theta_{\tilde{N}} \mid \tilde{S}_{\tilde{N}}\big], \qquad (2.14)$

$H_k(\tilde{S}_k) = \mathbb{E}\big[(\theta_k - z_{a_k})_+ + H_{k+1}(\tilde{S}_{k+1}) \mid \tilde{S}_k\big] \approx \sum_{q=1}^{N_q} \omega^{(q)} \Big[\big(\theta_k - z_{a_k}^{(q)}\big)_+ + H_{k+1}\big(\tilde{S}_{k+1}^{(q)}\big)\Big], \quad k = n+1, \ldots, \tilde{N}-1, \qquad (2.15)$

where the expectation in Eq. 2.14 can be computed exactly, and Eq. 2.15 is approximated by Gauss-Hermite quadrature, with quadrature weights $\omega^{(q)}$. EI is the expected improvement heuristic used in the BO community.

As we will see in the experiment section, the truncated horizon and the lack of long-term prediction are detrimental to the tuning performance. On the contrary, predicting $\lambda^k_r$ directly, as in our approach, is both computationally efficient and conceptually advantageous for the hyper-parameter tuning problem.
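Before turning to the experiments, here is a minimal sketch of the main loop of Alg. 1, reusing select_arm and select_arm_eps from the earlier sketches. The belief object stands for any Bayesian model of the training curves (e.g., the Freeze-Thaw GP of Section 2.4.2); its method names, and run_one_step, are our own assumptions for illustration, not an actual API.

def bhpt(belief, run_one_step, K, B, eps=None):
    losses = []
    for n in range(1, B + 1):
        r = B - n                                    # remaining budget
        means, stds = belief.predict_best_future_loss(r)   # beliefs over lambda_r^k
        c_hat = min(range(K), key=means.__getitem__)
        t_star = belief.epochs_to_convergence(c_hat, r)
        if t_star < r:                               # enough budget left: plan
            if eps is None:
                a = select_arm(means, stds)          # Eq. 2.5 with Eq. 2.8
            else:
                a = select_arm_eps(means, stds, eps) # Eq. 2.11
        else:                                        # budget nearly exhausted
            a = c_hat                                # commit to the top arm
        z = run_one_step(a)                          # one more epoch on arm a
        belief.update(a, z)                          # posterior update
        losses.append(z)
    return min(losses)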
2.5 Experiment

In this section, we first compare and validate the conceptual advantage of the BHPT algorithm over other methods on synthetic data, by analyzing the exploration-exploitation tradeoff under different budgets. Then we demonstrate the performance of the BHPT algorithm on real-world hyper-parameter tuning tasks. In particular, we include a task that tunes network architectures, because selecting the optimal architecture under a budget constraint is of great practical importance. We start by describing the experimental setup, i.e., the data, evaluation metrics and baseline methods, in Section 2.5.1 (with further details in Section 2.7.2), and then provide results and analysis in Sections 2.5.2 and 2.5.3.

2.5.1 Experiment Description

Table 2.3: Data and Evaluation Metric
dataset                    | K (# arms) | # hyper-params. | evaluation
synthetic                  | 84         | N/A             | Eq. 2.16
ResNet on CIFAR-10         | 96         | 5               | error rate
FCNet on MNIST             | 50         | 10              | error rate
VAE on MNIST               | 49         | 4               | ELBO
ResNet/AlexNet on CIFAR-10 | 49         | 6               | error rate

Data and Evaluation Metric For the data preparation, we generate and save the learning curves of all configurations, to avoid repeated training when tuning under different budget constraints. For the synthetic set, we generate 100 sets of training curves drawn from a Freeze-Thaw GP. For the real-world set, we create 4 tuning tasks, as summarized in Table 2.3. For more details, please refer to Section 2.7.2.

As the evaluation metric on synthetic data, we define the normalized regret

$R_B = \frac{\ell^\pi_B - \ell^\star_B}{\ell_0 - \ell^\star_B}, \qquad (2.16)$

where $\ell_0$ is the initial loss of all arms, $\ell^\pi_B$ is the tuning output, and $\ell^\star_B$ is the optimal solution with known curves. We normalize the regret by the data range $\ell_0 - \ell^\star_B$. For the real-world tasks, see Table 2.3 for the evaluation metrics. As all tasks are loss minimization problems, smaller reported measures are better.

Baseline Methods We compare to Hyperband, Bayesian Optimization (BO) and its variants (Fabolas and FreezeThaw), as well as the rollout solution described in Section 2.4.5. For descriptions and comparisons, see Section 2.2, and refer to Table 2.1 for a summary. Implementation details can be found in Section 2.7.2.

2.5.2 Results on Synthetic Data

On synthetic data, because we use the correct prior, we expect the belief model to learn to accurately predict the future as the budget increases. Thus we can examine the behaviors of the proposed algorithms under different budgets. All results are averaged over the 100 synthetic sets. First, we plot the normalized regrets $R_B$ (Eq. 2.16) over budgets in Figure 2.2. Our policies consistently outperform competing methods under different budget constraints. As the budget increases, the proposed BHPT methods have $R_B \to 0$, as discussed in Section 2.4.3.
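For reference, the plotted metric is a one-liner (variable names are ours):

def normalized_regret(ell_B, ell_star, ell_0):
    # Eq. 2.16: regret of the tuning output ell_B, normalized by the data range.
    return (ell_B - ell_star) / (ell_0 - ell_star)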
Exploration vs. Exploitation There are two factors that determine the performance of an algorithm: whether it correctly identifies the optimal arm, and whether it spends sufficient budget to achieve a small loss on that arm. Loosely speaking, the first task reflects whether the algorithm achieves effective exploration, such that it can accurately estimate the curves and identify the top arm, and the second task indicates sufficient exploitation. To examine the "exploration", we plot the hit rate of the output arm on the top 5 arms (out of all 84) across different budgets in Figure 2.3(a); the three stacks in each bar are the hit rates @ top-1, top-3, and top-5 in the ground truth, respectively. To check the "exploitation", we visualize the percentage of the budget spent on the output arm in Figure 2.3(b). In both (a) and (b), the higher the bar, the better.

In (a), both BHPTs do better in exploration than all baselines. The "adaptive prediction" column in Table 2.1 explains the exploration behavior: the probabilistic prediction modules in BO and our methods improve with more budget, which explains the increase in hit rate as the budget increases. In (b), our methods perform well in terms of exploitation. The "early stop" column in Table 2.1 partially explains the exploitation behavior: BO does strong exploitation under small budgets because it does not early-stop configurations, which also explains its (relatively) good performance under small budgets in (a).

Although the Rollout has a belief model and makes future predictions like BHPT, it does not perform well on either task: its hit rate does not improve with more budget, nor does it exploit sufficiently on the output arm. This is because the rollout truncates the planning horizon due to the computational challenge, which leads to myopic behaviors and the poor results. Indeed, in both subplots of Figure 2.3, Rollout performs and behaves similarly to Hyperband, which only uses current performance to select actions. This demonstrates the importance of long-term predictions and planning in the budgeted tuning task.

Figure 2.2: Normalized regret $R_B$ on 100 synthetic sets. For BHPT, $R_B \to 0$ as $B$ increases. BO is better under small $B$, while Hyperband is better under large $B$.

Figure 2.3: Budgeted optimization on 100 synthetic sets: (a) exploration: output-arm hit rate @ top 5; (b) exploitation: % of budget on the output arm. The hit rate of BO and our methods improves as $B$ increases. The percentage of budget on the output arm of BO and our methods decreases as $B$ increases.

Comparing (a) and (b), there is a clear tradeoff between exploration and exploitation: the hit rate decreases as the budget percentage increases. Note that BHPT adjusts this tradeoff automatically across different budgets. Comparing BHPT against its $\varepsilon$-greedy variant, BHPT-$\varepsilon$ does slightly better in exploitation and worse in exploration, as expected.

Computation In Table 2.4, we report the computation cost in seconds of the tuning algorithms under $B = 100$. Hyperband is the fastest because it is based on random search.
BO updates the Gaussian process once a full curve is revealed (one fully trained configuration in hyper-parameter tuning), so it is slightly more expensive. BHPT updates the GP and re-plans at every step (100 re-plannings in this example), which adds to the computation cost. Rollout is significantly more expensive than our BHPT due to the imaginary rollouts and belief updates. The proposed BHPT achieves better tuning performance with a moderate increase in computation cost.

Table 2.4: Computation time of different algorithms.
algorithm | BHPT  | Hyperband | BO   | Rollout
time (s)  | 11.89 | 0.01      | 0.40 | 1007.06

2.5.3 Results on Real-world Data

Figure 2.4: Real-world hyper-parameter tuning tasks: (a) ResNet for classification; (b) FCNet for classification; (c) VAE for generative modeling. BHPT (blue) converges to the globally optimal model at the rightmost budget for all tasks, i.e., the regrets go to 0. BHPT-$\varepsilon$ is better under small budgets, while BHPT is better under large budgets.

In this section, we report the tuning performances on real-world tuning tasks across different budget constraints. We plot the tuning outcomes (error rate or ELBO) over budgets in Figure 2.4 and Figure 2.6(a). Each curve is the average of 10 runs with different random seeds, and the mean with one standard deviation is shown in the figures.

Figure 2.5: Training curves of AlexNets and ResNets (error rate on the heldout set vs. training time, log scale). ResNet converges more slowly than AlexNet, but reaches a smaller error rate.

BHPT methods work well under a wide range of budgets, and outperform BO, Hyperband, FreezeThaw and the state-of-the-art algorithm Fabolas, across all 4 tuning tasks. The trend of the different methods across budgets is consistent with the observations on the synthetic data. As explained in the synthetic-data experiment, the vanilla BHPT does better in exploration, while the $\varepsilon$-greedy variant does more exploitation. This explains the superior performance of the $\varepsilon$-greedy variant under small budgets. However, its lack of exploration results in a worse belief model, which damages the performance as the budget increases. This phenomenon is most salient on the architecture selection task, explained next. In that task, the belief modeling is challenging because there are two different learning-curve patterns, one from ResNet and one from AlexNet, so it is important to have the right amount of exploration. Figure 2.5 illustrates the training curves of all configurations from the two architectures (together with the other hyper-parameters).

Budget Adaptive Behaviors An important motivation to study the budgeted tuning problem is that it is practically desirable to have an adaptive tuning strategy under different constraints; for example, the optimal configuration might change under different budgets. We would like to examine whether the proposed BHPT exhibits such adaptive behavior. We take the architecture selection task between ResNet and AlexNet on CIFAR-10, and visualize the ratio of the resource spent on ResNet over budgets in Figure 2.6(b). ResNet converges more slowly than AlexNet, but reaches a smaller error rate. Thus it is rewarding to focus the tuning on AlexNet under small budgets, and on ResNet under large budgets. Indeed, BHPT allocates more resource to ResNet as the budget increases, compared to the baseline Hyperband, which samples the two networks more or less uniformly at random.
2.6 Conclusion

Figure 2.6: On an architecture selection task ((a) AlexNet and ResNet tuning outcome; (b) budget spent on ResNet), the proposed BHPT adapts its behavior to different budget constraints, while Hyperband employs a fixed strategy.

In this work, we study the budgeted hyper-parameter tuning problem to tackle the AutoML challenge. We formulate a sequential decision making problem, and propose an algorithm which uses long-term predictions with an action-value function to balance the exploration-exploitation tradeoff. It exhibits budget-adaptive tuning behavior, and achieves state-of-the-art performance across different budgets on real-world tuning tasks.

For future work, it would be interesting to develop more theoretical understanding of the proposed action-value function. It is also promising to extend the sequential decision framework to more general setups, for example where several configurations can run in parallel, and where different configurations can take different amounts of budget units in each step.

2.7 Details

In this section, we provide details omitted from our algorithm and experiments.

2.7.1 Freeze-thaw GP

In this section, we provide details of how to compute the posterior distribution of future predictions in the Freeze-Thaw Gaussian process [206]. We also write out the log-likelihood function, which is used to sample the hyper-parameters of the GP in the real-world experiments.

We use $(\mathbf{x}_k, t)$ to index the hyper-parameters $\mathbf{x}_k$ and epochs $t$ respectively, and the loss of configuration $k$ at the $t$-th epoch is $y_k(t)$. The joint distribution of losses from all configurations and epochs, $\mathbf{y} = [y_1(1), y_1(2), \ldots, y_1(n_1), \ldots, y_K(1), \ldots, y_K(n_K)]^T$ (arm $k$ has $n_k$ epochs/losses), is given by

$\Pr(\mathbf{y} \mid \{(k, t)\}) = \int \Big[\prod_{k=1}^{K} \mathcal{N}(\mathbf{y}_k; f_k \mathbf{1}_{n_k}, \mathbf{K}_{t_k})\Big] \mathcal{N}(\mathbf{f}; \mathbf{m}, \mathbf{K}_x)\, d\mathbf{f} = \mathcal{N}(\mathbf{y}; \mathbf{0}, \mathbf{K}_t + \mathbf{O} \mathbf{K}_x \mathbf{O}^T). \qquad (2.17)$

Assuming $N = \sum_k n_k$, the dimensionalities are $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{K}_x \in \mathbb{R}^{K \times K}_+$, $\mathbf{O} \in \{0, 1\}^{N \times K}$, and $\mathbf{K}_t \in \mathbb{R}^{N \times N}_+$. $\mathbf{O} = \mathrm{block\,diag}(\mathbf{1}_{n_1}, \ldots, \mathbf{1}_{n_K})$ is a block-diagonal matrix, where each block is a vector of ones of length $n_k$. Kernel $\mathbf{K}_x$ models the correlation of asymptote losses across different configurations, i.e., final losses of similar hyper-parameters should be close, as in most conventional hyper-parameter tuning literature. Additionally, kernel $\mathbf{K}_t$ characterizes the correlation of losses from different epochs on the same training curve: $\mathbf{K}_t = \mathrm{block\,diag}(\mathbf{K}_{t_1}, \ldots, \mathbf{K}_{t_K})$, where the $k$-th block computes the covariance of the losses from arm $k$. Each entry in $\mathbf{K}_{t_k}$ is computed via the Freeze-Thaw kernel, given by

$k(t, t') = \frac{\beta^\alpha}{(t + t' + \beta)^\alpha}.$

It is derived from an infinite mixture of exponentially decaying basis functions, to capture the decay of losses over time.

Note that, conditioned on the asymptotes $\mathbf{f}$, each training curve is drawn independently. The distribution of the $N$ training curves $\{\mathbf{y}_n\}_{n=1}^N$ is given by

$\Pr(\{\mathbf{y}_n\}_{n=1}^N \mid \{\mathbf{x}_n\}_{n=1}^N) = \int \Big[\prod_{n=1}^{N} \mathcal{N}(\mathbf{y}_n; f_n \mathbf{1}_{n_n}, \mathbf{K}_{t_n})\Big] \mathcal{N}(\mathbf{f}; \mathbf{m}, \mathbf{K}_x)\, d\mathbf{f},$

where we assume each training curve $\mathbf{y}_n$ is drawn independently from a Gaussian process with kernel $\mathbf{K}_{t_n}$, conditioned on the prior mean $f_n$, which is itself drawn from a global Gaussian process with kernel $\mathbf{K}_x$ and mean $\mathbf{m}$. $\mathbf{K}_x$ can be any common covariance function which models the correlation of performances of different hyper-parameters.
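A minimal sketch of the Freeze-Thaw covariance matrix over the epochs of one curve; the default parameters below match the ones used to generate our synthetic data in Section 2.7.2, and the function name is ours.

import numpy as np

def freeze_thaw_kernel(ts, alpha=1.5, beta=5.0):
    # K_{t_k} over epoch indices ts (e.g., ts = np.arange(1, n_k + 1)),
    # with entries k(t, t') = beta^alpha / (t + t' + beta)^alpha.
    T = ts[:, None] + ts[None, :]          # pairwise sums t + t'
    return beta ** alpha / (T + beta) ** alpha

The entries decay toward zero as the epochs grow, mirroring how losses flatten out late in training.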
We summarize and introduce the following notations:

$\mathbf{O} = \mathrm{block\,diag}(\mathbf{1}_{n_1}, \mathbf{1}_{n_2}, \ldots, \mathbf{1}_{n_N})$
$\mathbf{K}_t = \mathrm{block\,diag}(\mathbf{K}_{t_1}, \ldots, \mathbf{K}_{t_N})$
$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)^T$
$\boldsymbol{\Lambda} = \mathbf{O}^T \mathbf{K}_t^{-1} \mathbf{O}$
$\boldsymbol{\gamma} = \mathbf{O}^T \mathbf{K}_t^{-1} (\mathbf{y} - \mathbf{O}\mathbf{m}) = \mathbf{O}^T \mathbf{K}_t^{-1} \mathbf{y} - \boldsymbol{\Lambda} \mathbf{m}$
$\mathbf{C} = (\mathbf{K}_x^{-1} + \boldsymbol{\Lambda})^{-1}$
$\boldsymbol{\mu} = \mathbf{m} + \mathbf{C} \boldsymbol{\gamma}$

The log-likelihood of the Gaussian process is given by

$\log P(\mathbf{y} \mid \{\mathbf{x}_k\}_{k=1}^K) = -\frac{1}{2} (\mathbf{y} - \mathbf{O}\mathbf{m})^T \mathbf{K}_t^{-1} (\mathbf{y} - \mathbf{O}\mathbf{m}) + \frac{1}{2} \boldsymbol{\gamma}^T (\mathbf{K}_x^{-1} + \boldsymbol{\Lambda})^{-1} \boldsymbol{\gamma} - \frac{1}{2} \big(\log|\mathbf{K}_x^{-1} + \boldsymbol{\Lambda}| + \log|\mathbf{K}_x| + \log|\mathbf{K}_t|\big) + \mathrm{const}.$

The posterior distributions of the predicted means of the seen and unseen curves are given by

$\Pr(\mathbf{f} \mid \mathbf{y}, \{\mathbf{x}_k\}_{k=1}^K) = \mathcal{N}(\mathbf{f};\, \mathbf{m} + \mathbf{C}\boldsymbol{\gamma},\, \mathbf{C}),$
$\Pr(f_* \mid \mathbf{y}, \{\mathbf{x}_k\}_{k=1}^K, \mathbf{x}_*) = \mathcal{N}\big(f_*;\, m + \mathbf{K}_{x_*}^T \mathbf{K}_x^{-1} \mathbf{C} \boldsymbol{\gamma},\, K_{x_{**}} - \mathbf{K}_{x_*}^T (\mathbf{K}_x + \boldsymbol{\Lambda}^{-1})^{-1} \mathbf{K}_{x_*}\big),$

respectively. The posterior distribution of a new observation on a seen training curve is given by

$\Pr(y_{n*} \mid \mathbf{y}, \{\mathbf{x}_k\}_{k=1}^K, t_*) = \mathcal{N}\big(y_{n*};\, \mathbf{K}_{t_{n*}}^T \mathbf{K}_{t_n}^{-1} \mathbf{y}_n + \Omega \mu_n,\, K_{t_{n**}} - \mathbf{K}_{t_{n*}}^T \mathbf{K}_{t_n}^{-1} \mathbf{K}_{t_{n*}} + \Omega C_{nn} \Omega^T\big),$

where $\Omega = 1 - \mathbf{K}_{t_{n*}}^T \mathbf{K}_{t_n}^{-1} \mathbf{1}_{n_n}$. And the posterior distribution of a new observation on an unseen curve is given by

$\Pr(y_{n*} \mid \mathbf{y}, \{\mathbf{x}_k\}_{k=1}^K, \mathbf{x}_*, t_*) = \mathcal{N}\big(y_{n*};\, m + \mathbf{K}_{x_*}^T \mathbf{K}_x^{-1} \mathbf{C} \boldsymbol{\gamma},\, K_{t_*} + K_{x_{**}} - \mathbf{K}_{x_*}^T (\mathbf{K}_x + \boldsymbol{\Lambda}^{-1})^{-1} \mathbf{K}_{x_*}\big).$

2.7.2 Experimental Details

In this section, we provide more details of the experiments.

Data We create synthetic data for budgeted optimization, and hyperparameter tuning tasks on the well-established real-world datasets CIFAR-10 [115] and MNIST [127].

i. Synthetic loss functions. We generate 100 synthetic sets of time-varying loss functions, which resemble the learning-curve structure over configurations in hyperparameter tuning. Each synthetic set consists of $K = 84$ configurations with a maximum length of 288 training epochs. The configurations' converged performances are drawn from a zero-mean, magnitude-1 Gaussian process with a squared exponential kernel of length-scale 0.8. For each configuration, the loss decay curve is sampled from the Freeze-Thaw kernel with $\alpha = 1.5$, $\beta = 5$, and magnitude 10. The budget unit is 6 epochs.

For the real-world experiments, we create 4 hyperparameter tuning tasks using the CIFAR-10 and MNIST datasets. We tune hyperparameters of convolutional neural networks on the CIFAR-10 dataset [115]. We split the data into 40k/10k training and heldout sets, and report the error rate on the heldout set. We create two hyperparameter tuning tasks on CIFAR-10. All the CNN models are trained on Tesla K20m GPUs using TensorFlow.

ii. ResNet on CIFAR-10. We randomly sample 96 configurations of 5 hyperparameters of the ResNet [78] model as follows: optimizer from {stochastic gradient descent, momentum gradient descent, adagrad}, batch size in {64, 128}, learning rate in {0.01, 0.05, 0.1, 0.5, 1.0}, momentum in {0.5, 0.9}, and different exponential learning rate decay rates. The budget unit is approximately 16 minutes of GPU training time for this task.

iii. AlexNet and ResNet on CIFAR-10. We tune the architectural selection between ResNet [78] and AlexNet [117], as well as other hyperparameters, in this task. We have 49 candidate configurations roughly evenly split between the two architectures, with 24 ResNets and 25 AlexNets. The range of the other 5 hyperparameters is the same as described in (ii). The budget unit is approximately 14 minutes of GPU training time for this task.

For the hyperparameter tuning tasks on MNIST, we use the openly available LC-dataset (http://ml.informatik.uni-freiburg.de/people/klein/index.html) [110], where training curves of different hyperparameters are provided.
We choose the unsupervised task of learning the image distribution using a variational autoencoder (VAE), and a classification task using a fully connected neural network. Please see [110] for more details on the generation of the learning curves.

iv. VAE on MNIST. We subsample 49 configurations of 4 different hyperparameters for training a VAE on MNIST [110]. The loss observations are lower bounds of the heldout log-likelihood. The budget unit is 10 epochs.

v. FCNet on MNIST. We subsample 50 configurations of 10 different hyperparameters for classification using a two-layer fully connected network on MNIST. The loss observations are error rates on the heldout set. The budget unit is 10 epochs.

Implementation Details On the synthetic task, for both our method and GP-EI, we use the same GP hyperparameters as the ones used in data generation. For the real-world tasks, we use the Freeze-Thaw GP in our algorithm. We assume independence between different configurations for computational efficiency. The hyperparameters of the GP are sampled using slice sampling [141] in a fully-Bayesian treatment. We find the method generally not sensitive to the sampling parameters; we did not tune them, and set the step size to 0.5, the number of burn-in samples to 10, and the max attempts to 10. We implemented Hyperband with $\eta = 3$ as recommended by the authors (we also tried other values, but did not find significant differences in performance). For SMAC and Fabolas, we adapt the implementations from https://github.com/automl/. For the Rollout [124] method, we use rolling horizon $h = 3$ and Gauss-Hermite quadrature to approximate the imaginary belief updates.

We adapt Fabolas [109] in the following way. The original Fabolas adaptively selects the training subset size, as well as the configuration to evaluate, by an acquisition function which trades off information gain against computation cost. We interpret the intermediate results from the curve as the subset training result, and input them to Fabolas, to decide which configuration to run next and for how many epochs. In other words, we use the epoch, instead of the subset size, as the additional input to the black-box optimization problem; the cost of the surrogate task grows linearly w.r.t. the epoch, and can be modeled by the kernel for the computation cost in their work.

2.7.3 Other Results and Analysis

On the budget exhaustion heuristic To understand the impact of the budget exhaustion heuristic in our algorithm, we compare our methods (BHPT) with $Q$-alone and $Q_\varepsilon$-alone (without the else statement in Alg. 1), shown in light blue and red respectively in Figure 2.7(a). $Q$-alone and $Q_\varepsilon$-alone select configurations by the action-value functions (Eq. 2.8 and Eq. 2.11) alone, without the heuristic of committing to improve the top configuration when the budget runs out. The performance deteriorates, especially for $Q$-alone. The impact on $Q_\varepsilon$-alone is smaller, because $Q_\varepsilon$ already has sufficient exploitation ($\varepsilon$-greedy, with $\varepsilon = 0.5$) in its design.

On the full Bayesian treatment of GP hyper-parameters In Figure 2.7(b), we demonstrate the effect of full Bayesian sampling of the hyper-parameters of the Freeze-Thaw GP versus fixed GP hyper-parameters on the synthetic data set.

Figure 2.7: Budgeted optimization on 100 synthetic sets: (a) BHPT without the budget exhaustion heuristic; $Q$-alone does less exploitation than BHPT, and thus results in worse output performance. (b) BHPT with full Bayesian treatment of the GP hyper-parameters.
The results are similar to those with the GP hyper-parameters set to the ground-truth values.

Chapter 3
SPEECH SENTIMENT ANALYSIS

In this chapter, we study the task of speech sentiment analysis, and improve the data efficiency of deep learning systems under limited labeled data by leveraging pre-trained representations. Concretely, we propose to use pre-trained representations from end-to-end ASR models to solve speech sentiment analysis as a downstream task. We show that end-to-end ASR features, which integrate both acoustic and text information from speech, achieve promising results. We use an RNN with self-attention as the sentiment classifier, which also provides an easy visualization through attention weights to help interpret model predictions. We use the well-benchmarked IEMOCAP dataset and a new large-scale speech sentiment dataset, SWBD-sentiment, for evaluation. Our approach improves the state-of-the-art accuracy on IEMOCAP from 66.6% to 71.7%, and achieves an accuracy of 70.10% on SWBD-sentiment with more than 49,500 utterances.

3.1 Introduction: Speech Sentiment Analysis

Speech sentiment analysis is an important problem for interactive intelligence systems, with broad applications in many industries, e.g., customer service, health care, and education. The task is to classify a speech utterance into one of a fixed set of categories, such as positive, negative or neutral. Despite its importance, speech sentiment analysis remains a challenging problem, due to rich variations in speech, like different speakers and acoustic conditions. In addition, existing sentiment datasets are relatively small-scale, which has limited the research development.

The key challenge in speech sentiment analysis is how to learn a good representation that captures the emotional signals and remains invariant under different speakers, acoustic conditions, and other natural speech variations. Traditional approaches employed acoustic features, such as band energies, filter banks, and MFCC features [132, 131, 223, 224], or the raw waveform [211], to predict sentiment. However, models trained on low-level features can easily overfit to noise or sentiment-irrelevant signals. One way to remove variations in speech is to transcribe the audio into text, and use text features to predict sentiment [122]. Nonetheless, sentiment signals in the speech, like laughter, can be lost in the transcription. The latest works [104, 177, 33, 71] try to combine acoustic features with text, but it is unclear what the best way to fuse the two modalities is. Other general feature learning techniques, like unsupervised learning [53] and multi-task learning [230], have also been explored.

In this work, we introduce a new direction to tackle the challenge. We propose to use end-to-end (e2e) automatic speech recognition (ASR) [177, 181, 32, 79] as pre-training, and solve speech sentiment as a downstream task. This approach is partially motivated by the success of pre-training in solving tasks with limited labeled data in both computer vision and language.
Figure 3.1: We propose to use pre-trained features from an e2e ASR model to solve sentiment analysis: (a) pre-train an encoder-decoder model end-to-end on ASR (speech to text); (b) sentiment analysis: the ASR features $f_1, \ldots, f_T$ are fed to a bi-LSTM with attention and a softmax classifier. The best-performing sentiment decoder is the RNN with self-attention. We apply SpecAugment to reduce overfitting in training.

Moreover, the e2e model combines both the acoustic and language models of traditional ASR, and thus can seamlessly integrate the acoustic and text features into one representation. We hypothesize that the ASR pre-trained representation works well on sentiment analysis. We compare different sentiment decoders on ASR features, and apply spectrogram augmentation [171] to reduce overfitting.

To further advance the study of the speech sentiment task, we annotate a subset of switchboard telephone conversations [67] with sentiment labels, and create the SWBD-sentiment dataset. It contains 140 hours of speech with 49,500 labeled utterances, which is 10 times larger than IEMOCAP [25], the current largest dataset. We evaluate the performance of pre-trained ASR features on both IEMOCAP and SWBD-sentiment. On IEMOCAP, we improve the state-of-the-art sentiment analysis accuracy from 66.6% to 71.7%. On SWBD-sentiment, we achieve 70.10% accuracy on the test set, outperforming strong baselines.

3.2 Approach

In this section, we introduce our method of using ASR features for sentiment analysis. We first describe the end-to-end ASR model, and then introduce how we use ASR features for the sentiment task. Lastly, we describe SpecAugment, a technique we use to reduce overfitting in training. Assume the input speech waveform has been transformed into a sequence of vectors, for example log-mel features, which we denote as $x_{1:T} = (x_1, x_2, \ldots, x_T)$. For the sentiment task, the output is a label for the sequence, denoted as $y$.

3.2.1 End-to-end ASR Models

Automatic speech recognition (ASR) is the problem of finding the most likely word sequence given a speech sequence. Recently, end-to-end models have produced state-of-the-art performances on ASR tasks [177], without a separate pronunciation or language model. In this work, we focus on one type of e2e model, the RNN Transducer (RNN-T) [69], due to its simplicity. Let $z_{1:L}$ be the output word sequence in ASR, where $z_l \in \mathcal{Z}$, the set of graphemes. RNN-T predicts the output with an encoder-decoder model: the encoder maps the input $x_{1:T}$ to a hidden representation $f_{1:T}$, while the decoder maps the hidden representation $f_{1:T}$ to the text $z_{1:L}$.

The encoder is analogous to the acoustic model in traditional ASR systems, where information in the speech is encoded into a higher-level vector representation. The decoder serves two purposes: first, it functions as the language model in traditional ASR; secondly, it computes a distribution over all possible input-output alignments [69]. In practice, the language model module is called the prediction network, and the alignment module is called the joint network in RNN-T [177]. The encoder-decoder network can be jointly trained end-to-end. Details on the training and inference algorithms can be found in [70].

The encoder can be viewed as a feature extractor, which encodes useful information in the speech into $f_{1:T}$. We hypothesize that $f_{1:T}$ contains rich sentiment signal, as it preserves both linguistic and acoustic characteristics.
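As a concrete picture of the feature-extraction step, here is a minimal PyTorch-style sketch; the encoder module and function name are our own illustrative assumptions (our actual implementation uses the Lingvo toolkit, see Section 3.3.1).

import torch

def extract_asr_features(encoder, x):
    # encoder: pre-trained RNN-T encoder; x: log-mel features of shape (T, 80).
    encoder.eval()                       # freeze: disable dropout/batch-norm updates
    for p in encoder.parameters():
        p.requires_grad_(False)          # encoder weights stay fixed during tuning
    with torch.no_grad():
        f = encoder(x.unsqueeze(0))      # (1, T, 80) -> (1, T, d_k)
    return f.squeeze(0)                  # f_{1:T}, the input to the sentiment decoder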
Next, we describe how we use the ASR encoder features to classify sentiment.

3.2.2 Sentiment Decoder on ASR Features

Figure 3.1 illustrates the proposed framework. We take the encoder from the end-to-end ASR system, and freeze the encoder weights to use it as a feature extractor. Note that the ASR feature $f_{1:T}$ is a sequence of variable length. Our sentiment classifier first transforms $f_{1:T}$ into a fixed-length embedding, and the embedding is then fed into a softmax classifier to predict the sentiment label $y$. We call the process of mapping ASR features into a fixed-length embedding the sentiment decoder. The sentiment decoder is trained with the cross-entropy loss. In what follows, we compare three sentiment decoders.

MLP with pooling. A straightforward strategy is to build a multi-layer perceptron (MLP) on $f_{1:T}$, and apply average (max) pooling over time to convert it to a fixed-length representation.

RNN with pooling. However, MLP treats each time step in the sequence independently, which is not ideal, as sentiment depends on the context. RNN is a prominent tool for modeling sequential data, and is particularly good at storing and accessing context information over a sequence. We feed $f_{1:T}$ into a bi-directional LSTM (bi-LSTM), and concatenate the outputs of the forward and backward LSTMs into $h_{1:T}$. We can take either the last hidden state $h_T$ or the average (max) pooling over $h_{1:T}$ as the sequence embedding.

RNN with multi-head self-attention. Self-attention [134] is good at handling long-term dependencies, which is critical for the sentiment task. To this end, we propose to use a multi-head self-attention layer, on top of the RNN, to replace the pooling or the last hidden state. The multi-head attention layer can jointly attend to different subspaces of the input representation. Furthermore, the attention alleviates the burden of learning long-term dependencies in the LSTM via direct access to the hidden states of the whole sequence [212]. As a convenient side product, the soft attention weights provide an easy-to-visualize interpretation of the prediction.

Concretely, the multi-head self-attention computes multiple weighted sums of the LSTM hidden states, where the weights are given by soft attention vectors. Assume the attention layer has $n_a$ heads with $d_a$ dimensions per head; $n_a$ and $d_a$ are hyper-parameters of our choice. The input to the attention heads are the bi-LSTM hidden states $h_{1:T} \in \mathbb{R}^{T \times d_k}$, of length $T$ and dimension $d_k$. The attention vector $a^i$ and the output $v^i$ of the $i$-th head are computed as

$a^i = \mathrm{softmax}\Big(\frac{w^i_Q (w^i_K h^T_{1:T})}{\sqrt{d_a}}\Big), \qquad v^i = w^i_V h^T_{1:T} a^i,$

where the query token $w^i_Q \in \mathbb{R}^{1 \times d_a}$, the key projection matrix $w^i_K \in \mathbb{R}^{d_a \times d_k}$ and the value projection matrix $w^i_V \in \mathbb{R}^{d_a \times d_k}$ are learnable parameters. $a^i \in \mathbb{R}^T$ is the scaled dot-product attention [212] probability, and the weighted sum of hidden states $v^i \in \mathbb{R}^{d_a}$ is the output. By concatenating the outputs from the $n_a$ heads, we obtain a fixed-size embedding of length $n_a d_a$, which is then fed into the softmax to classify sentiment.
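The following minimal numpy sketch implements the two equations above; the shapes follow the text, and the weight initialization here is purely illustrative (in practice the weights are learned with the cross-entropy loss).

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(h, heads):
    # h: bi-LSTM states, shape (T, d_k); heads: list of (w_q, w_k, w_v) with
    # shapes (1, d_a), (d_a, d_k), (d_a, d_k). Returns a (n_a * d_a,) embedding.
    outputs = []
    for w_q, w_k, w_v in heads:
        d_a = w_q.shape[1]
        scores = w_q @ (w_k @ h.T) / np.sqrt(d_a)  # (1, T) scaled dot-products
        a = softmax(scores.ravel())                # attention over time, a^i
        outputs.append(w_v @ h.T @ a)              # weighted sum, v^i in R^{d_a}
    return np.concatenate(outputs)                 # input to the softmax classifier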
3.2.3 Spectrogram Augmentation

To reduce overfitting when training the sentiment classifiers, we employ Spectrogram Augmentation (SpecAugment) [171], a data augmentation technique for speech. At training time, SpecAugment applies random transformations to the input, namely warping, and masking of frequency channels or time steps, while keeping the label unchanged. Sentiment decoders trained with SpecAugment are invariant to small variations in the acoustic features, which improves generalization.

3.3 Experiment

In this section, we first describe the experimental setup in Section 3.3.1, and then show that our method using pre-trained ASR features can help sentiment analysis in Section 3.3.2. We also examine the contributions of the different components in Section 3.3.3. Lastly, in Section 3.3.4, we visualize the attention weights to interpret the predictions, which sheds light on why ASR features can benefit the sentiment task.

3.3.1 Experiment Setup

Data. We use two datasets, IEMOCAP and SWBD-sentiment, in the experiments. IEMOCAP [25] is a well-benchmarked speech emotion recognition dataset. It contains approximately 12 hours of audiovisual recordings of both scripted and improvised interactions performed by actors (we use the speech data only; we do not use the human-annotated transcripts). Following the protocol in [132, 104, 33, 71], we experiment on a subset of the data which contains 4 emotion classes, {happy+excited, neutral, sad, angry}, with {1708, 1084, 1636, 1103} utterances respectively. We report 10-fold (leave-one-speaker-out) cross-validation results.

To further investigate the speech sentiment task, we annotate a subset of switchboard telephone conversations [67] with three sentiment labels, i.e., negative, neutral and positive, and create the SWBD-sentiment dataset. SWBD-sentiment has over 140 hours of speech, containing approximately 49.5k utterances. We split 10% of SWBD-sentiment into a holdout set and a test set, with 5% each. We report the accuracy on the test set.

Figure 3.2: Class distribution of speech sentiment datasets.

Table 3.1 provides a summary and comparison of the two datasets. Emotional expressions in IEMOCAP are elicited through acting in hypothetical scenarios, while sentiment in SWBD-sentiment comes from natural conversations between friends. As a result, the class distribution in SWBD-sentiment is lightly imbalanced, with neutral, positive, and negative taking up 52.6%, 30.4%, and 17.0% respectively; see Figure 3.2 for the class distribution. Besides, utterances in SWBD-sentiment are generally longer. SWBD-sentiment is more challenging than IEMOCAP, but closer to real applications.

Table 3.1: Speech sentiment datasets
dataset        | # utterances / hours | # classes | # speakers | elicitation
IEMOCAP        | 10k / 12h            | 4         | 10         | play-acted
SWBD-sentiment | 49.5k / 140h         | 3         | 543        | conversation

Metrics. We report both the weighted accuracy WA and the unweighted accuracy UA, commonly used for speech sentiment analysis [104, 71]. WA is the conventional classification accuracy, i.e., the percentage of samples correctly labeled by the model. To sidestep the effect of class imbalance on WA, we also report UA, the average accuracy over the different classes.

Baselines. We compare the proposed approach to both single-modality and multi-modality models. A single-modality model uses audio inputs only, while a multi-modality model uses both audio and the ASR transcription as inputs. On IEMOCAP, we use the state-of-the-art results [132, 104] from both approaches reported in the literature as baselines. On SWBD-sentiment, we implemented our own baselines. We thoroughly tuned the architectures over MLPs, CNNs, LSTMs, and combinations of them. We tuned the number of layers over {2, 3, 4, 5}, the filter sizes and strides for the CNNs, and the number of hidden units for the LSTMs and MLPs. We report the best results from tuning.

Model architecture. All experiments use 80-dimensional features, computed with a 25ms window and shifted every 10ms.
We use a pre-trained RNN-T model trained on YouTube videos, as described in [202]. The encoder stacks a macro layer 3 times, where the macro layer consists of 512 1-D convolutions with filter width 5 and stride 1, a 1-D max pooling layer with width 2 and stride 2, and 3 bidirectional LSTM layers with 512 hidden units in each direction and a 1536-dimensional projection per layer. The prediction network has a unidirectional LSTM with 1024 hidden units. The joint network has 512 hidden units, and the final output uses graphemes. For the sentiment classifier, we use 1 bidirectional LSTM layer with 64 hidden units in each direction, and 1 multi-head self-attention layer with 8 heads and 32 units per head. For the other classifiers in Section 3.3.3, we use respective layers matching the number of hidden nodes. The SpecAugment parameters are the same as the LibriSpeech basic (LB) policy of Table 1 in [171]. All models are trained with the Adam optimizer in the open-source Lingvo toolkit [196], with learning rate $10^{-4}$ and gradient clipping norm 4.0.

Table 3.2: Speech sentiment analysis performances of different methods.
                |            IEMOCAP dataset               |       SWBD-sentiment dataset
Input features  | Architecture             | WA (%) | UA (%) | Architecture      | WA (%) | UA (%)
acoustic        | DRN + Transformer [132]  | -      | 67.4   | CNN               | 54.23  | 39.63
acoustic + text | DNN [104]                | 66.6   | 68.7   | CNN and LSTM      | 65.65  | 54.59
e2e ASR         | RNN w/ attention         | 71.7   | 72.6   | RNN w/ attention  | 70.10  | 62.39
-               | human                    | 91.0   | 91.2   | human             | 85.76  | 84.61

3.3.2 Sentiment Classification

Table 3.2 summarizes the main results of our approach and the state-of-the-art methods on IEMOCAP and SWBD-sentiment. We provide human performance as a reference, which is essentially the average agreement percentage among annotators. Since sentiment evaluation is subjective, human performance is an upper bound on the accuracies. Comparing human performance on SWBD-sentiment with that on IEMOCAP, we confirm that SWBD-sentiment is more challenging.

On IEMOCAP, ASR features with the RNN and self-attention decoder achieve 71.7% accuracy (WA), which improves the state-of-the-art by 5.1%. On SWBD-sentiment, a naive baseline that predicts the neutral class for all samples achieves 52.6% WA and 33.3% UA. Training deep models directly on acoustic features suffers from severe overfitting, and only improves over this baseline by 1.6% in WA. Fusing acoustic features with text significantly improves the performance, to 65.65%. The sentiment decoder with e2e ASR features achieves the best accuracy, 70.10%. Experiments on both datasets demonstrate that our method can improve speech sentiment prediction.

3.3.3 Ablation Study

In this section, we study the contributions of the different components of the proposed method. We compare the performances of different sentiment decoders, and analyze the effect of SpecAugment.

In Table 3.3, the first row is our best-performing model, which trains a bi-LSTM with attention decoder on ASR features using SpecAugment; we refer to it as the base model. The second and third rows are based on different sentiment decoders, MLP with pooling and RNN with pooling, respectively. RNN with attention is better than RNN with pooling, which is in turn better than MLP with pooling. Somewhat surprisingly, a simple MLP on ASR features achieves 68.68% accuracy, which already improves over the state-of-the-art of 66.6% in the literature. This demonstrates the effectiveness of pre-trained ASR models. The last row reports the accuracy of the base model trained without SpecAugment, which is roughly 1% worse.
In all our experiments, we find SpecAugment consistently helpful across different sentiment decoders and datasets.

Table 3.3: Ablation study on the IEMOCAP dataset
description                            | WA (%) | UA (%)
RNN w/ attention + SpecAugment (base)  | 71.72  | 72.56
decoder: MLP w/ pooling                | 68.68  | 68.98
decoder: RNN w/ pooling                | 70.71  | 71.55
w/o SpecAugment                        | 70.77  | 71.77

3.3.4 Attention Visualization

We interpret the predictions by examining the attention $a^i$ (see Section 3.2.2) on utterances. $a^i$ has the same length as the ASR features, and its elements indicate the contribution of each frame to the prediction. However, it is hard to illustrate attention over audio on paper. Instead, we visualize attention over ASR transcripts: we run alignment and take the average attention of the frames aligned to one word as its weight. We add tokens, like [LAUGHTER] or [BREATHING], to the ASR text to annotate non-verbal vocalizations for visualization purposes. We quantize the attention weights into three bins, and draw a heat map in Figure 3.3. The visualization demonstrates how our model integrates both acoustic and language features to solve sentiment analysis.

Figure 3.3: Attention visualization on IEMOCAP utterances from different sentiment classes. HAPPY: 1. "Yeah, so [LAUGHTER] he's calling now." 2. "Yay, well congratulations, that's so cool. [BREATHING] I can't wait." 3. "Exactly, [LAUGHTER] I think that'll go over great, don't you?" 4. "That would be wonderful, that would be great seriously." SAD/ANGRY: 1. "I'm so sorry." 2. "Listen, indeed. I am sick and tired of listening to you, you damn sadistic bully." 3. "God damn it, Augie. Seriously, you always ask me that. Why do you ask me that? I hate it. It's so insulting."

There are two notable patterns with larger weights: specific vocalizations, like [LAUGHTER] and [BREATHING], and indicative words, like "great" and "damn". This supports our hypothesis that ASR features contain both acoustic and text information.

3.4 Error Analysis

In this section, we provide an error analysis of different methods on the SWBD-sentiment dataset.

Sentiment is often subject to the listener's own interpretation; sometimes there is no clear-cut sentiment label for a given utterance. We first look into utterances in the SWBD-sentiment dataset where the three annotators do not agree on the same sentiment label, to understand the difficulty of the dataset. We categorize disagreements into the following two types:

• 2-way agreement: when the majority of annotators, e.g., 2 out of 3, agree on a sentiment label.
• 3-way disagreement: when every annotator disagrees on the sentiment label.

Table 3.4 illustrates some examples of inter-annotator disagreements. To better present the emotional cues embedded in the audio component of these samples, we augment each transcript with our own interpretation of the disagreement.

The distribution of sentiment labels, after annotator disagreement is reconciled, is as follows: 30.4% of the speech segments are labeled positive, 17% are labeled negative, and 52.6% are labeled neutral. For comparison, the IEMOCAP [25] corpus has 30.9% positive (happy), 49.5% negative (angry or sad), and 19.6% neutral.
The intuition behind the larger share of neutral sentiment labels in SWBD-sentiment is as follows: because Switchboard participants are paid to converse on a given topic not of their choosing, it is less likely for the participants to have strong emotional attachments to their topic, and participants are less inclined to have extreme emotional responses, such as heated arguments or intense joy. This lack of extreme emotional variance makes labels in the SWBD-sentiment corpus harder to compare to other corpora.

Table 3.4: Examples of disagreements between annotators. "Context" provides our own interpretation of the disagreement.

(a) 3-way disagreement
Context: the sentiment of "old" (referring to a song) is open to interpretation. Transcript: "It was really old."
Context: switching sentiment. Transcript: "I do think the jury system works, but I also feel ..."
Context: confusing tone. Transcript: "I should say on the west side, I mean everything is on the west side ..."

(b) 2-way agreed positive
Context: slight doubt in response to a positive statement. Transcript: Speaker 1: "I think they will need me more when they are older (laughter)." Speaker 2: "Well (questioning tone)."
Context: influenced by the religious orientation of annotators. Transcript: "Videos, like music videos that go along with songs about churches and Jesus ..."

(c) 2-way agreed negative
Context: switching sentiment. Transcript: "I really don't like this stuff, but my husband does, he loves to cook it ..."
Context: ambiguous tone. Transcript: "Gee (grumbling tone), we have so much going on here ..."

(d) 2-way agreed neutral
Context: "advantage" is a positive word. Transcript: "He would have family close by, there are advantages ..."
Context: slightly positive tone. Transcript: "I imagine where you live, you wear warm clothing quite a bit of the year."

To further understand the inter-annotator disagreement, we illustrate the distribution of agreements with the confusion matrix in Figure 3.4. We only consider samples with a clear majority label (i.e., samples where all three annotators disagree are discarded). The vertical dimension represents the sentiment of the most-voted label, which can be interpreted as the ground-truth label. The horizontal dimension represents sentiment labels from individual annotators. The top-left to bottom-right diagonal of Figure 3.4(b) can be interpreted as the accuracy of human annotators, that is, the likelihood that any single human annotator produces the same sentiment label as the one voted for by the majority of annotators. The average human accuracy on the SWBD-sentiment corpus is around 85%. For contrast, the IEMOCAP [25] corpus has 91% human accuracy. This suggests that SWBD-sentiment is a harder sentiment prediction dataset, even for humans. Besides, a notable observation from the confusion matrix is that the negative class is generally harder than the neutral and positive classes, even for humans.

Figure 3.4: Confusion matrix for inter-annotator agreement: (a) total counts; (b) normalized.

In what follows, we analyze the error patterns of different neural-network sentiment analysis methods on the SWBD-sentiment corpus using confusion matrices, as shown in Figure 3.5 (a minimal sketch of the matrix computation follows the list below). The details of the three models and their input feature types are:

1. Using only acoustic features: a CNN with 3 layers of convolution filters with max pooling.
2. Using only transcripts: an RNN with a word embedding layer followed by 2 LSTM layers and 2 fully connected layers.
3. Using multimodal features (acoustic features and transcripts): 1 LSTM layer followed by a multi-head self-attention layer [212].
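A minimal sketch of the row-normalized confusion matrices used throughout this section, assuming integer labels such as {0: negative, 1: neutral, 2: positive}; the function name is ours.

import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3, normalize=True):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                     # rows: true label, columns: prediction
    if normalize:
        cm = cm / cm.sum(axis=1, keepdims=True)   # each row sums to 1
    return cm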
Figure 3.5: Confusion matrices for evaluation using different input feature types: (a) acoustic features; (b) transcript features; (c) multimodal features.

Several observations can be made. First, the model trained using only acoustic features failed to learn from negative examples. This seems to suggest that our baseline model could not effectively utilize acoustic signals to predict negative sentiment. Furthermore, comparing Figure 3.5(b) with Figure 3.5(c), we can see that adding additional information on top of transcript features does not improve the accuracy of negative sentiment prediction (top-left cell). Human annotators seem to have an easier time making such predictions (81.6% in Figure 3.4(b), compared to 43% in Figure 3.5(b) and (c)). One possible explanation is that Switchboard participants are paid to conduct free-form conversations and have little incentive to get into heated arguments or display strong negative emotions, while humans excel at detecting subtle hints of negativity. Effectively detecting negative sentiment in SWBD-sentiment is an interesting problem that warrants future research.

3.5 Conclusion

In this work, we demonstrate that pre-trained features from an end-to-end ASR model are effective for sentiment analysis. They improve the state-of-the-art accuracy on the IEMOCAP dataset from 66.6% to 71.7%. Moreover, we create a large-scale speech sentiment dataset, SWBD-sentiment, to facilitate future research in this field. Our future work includes experimenting with unsupervised learned speech features, as well as applying end-to-end ASR features to other downstream tasks like diarization, speaker identification, etc.

Part II
Leverage Training Information for Robust Deep Learning

Chapter 4
BACKGROUND: UNCERTAINTY IN MACHINE LEARNING

Uncertainty quantification is the science of identifying, quantifying, and reducing uncertainties associated with models, numerical algorithms, experiments and predicted outcomes in computational and real-world systems [198]. Traditionally it was studied in various engineering and science disciplines, such as structural mechanics, materials science, meteorology, geophysics, and aeronautics. Recently, uncertainty estimation has gained attention in the machine learning community, due to the increasing impact of machine learning in real-world applications and the rising concerns about the robustness and reliability of learning systems.

In this chapter, we provide a survey of uncertainty estimation in machine learning, with an emphasis on deep neural networks. We start by defining what uncertainty is from both schools of statistical inference, Bayesian and frequentist, in Section 4.1. Then, in Section 4.2, we discuss approaches in both views to model uncertainty, their applications in deep learning, and the challenges in doing so. In Section 4.3, we describe benchmark tasks in supervised learning to evaluate the quality of uncertainty estimates. In Section 4.4, we summarize recent advances in the deep learning community in addressing these challenges. Lastly, in Section 4.5, we highlight the importance of uncertainty estimates in tasks beyond supervised learning. Figure 4.1 gives an overview of the structure.

4.1 What is Uncertainty?

Recall the setup of supervised learning introduced in Chapter 1: we are given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$, drawn i.i.d. from the distribution $P_{XY}$. We train a predictive model $f_\theta : \mathcal{X} \to \mathcal{Y}$, where $\theta$ denotes the parameters.
From a probabilistic point of view, we can interpret $f_\theta$ as a likelihood function, which defines a probability distribution over the output space: it characterizes how likely it is to observe a certain output $y$ given $x$, under the model specified by $\theta$. For example, in a classification task (we assume $f_\theta \in \Delta^{K-1}$, the probability simplex over the $K$ classes), the $k$-th component of $f_\theta(x)$, $k \in [K]$, represents the probability that the model believes sample $x$ is from class $k$,

$$f_\theta(x)_k = p(y = k \mid x, \theta).$$

To adopt the conventional notation of statistical inference, $f_\theta(x)$ plays the role of $P(D \mid H)$, where $H$ is the hypothesis (the value of $\theta$ in our case) and $D$ is the observed data (the training set in our case). Table 4.1 provides a summary of the notations used in this chapter and their meanings.

Figure 4.1: Overview of Chapter 4. The machine learning community studies uncertainty quantification using the two major schools of statistical inference, frequentist (classical) and Bayesian. They use different languages to measure uncertainty (p-values and resampling versus posteriors, sampling, and approximate inference) and develop different models and methods (ensembles, bootstrap, and jackknife versus Gaussian processes and Bayesian neural networks). The quality of uncertainty estimates can be evaluated on tasks from supervised learning (calibration, OOD, adversarial examples), active learning, and RL.

4.1.1 Frequentist Inference

The frequentist interprets probability as a long-run limiting frequency, and considers each experiment, i.e., training a predictor on a training set, as a hypothetically repeatable process; the training set we observe is understood as a sample representing the population of potential observations. Thus the frequentist models the uncertainty arising from the randomness in the data-generating process, $P(D \mid H)$, and only that. Note that a frequentist refuses to describe a lack of knowledge as uncertainty: a hypothesis is either true or false, regardless of whether one knows which is the case. The frequentist measures uncertainty as the probability or likelihood of the observation if the hypothesis is assumed to be correct. Concretely in our case, given a new sample at test time, the frequentist makes a prediction and provides the uncertainty estimate of this prediction using a p-value or a confidence interval. Besides, instead of choosing a specific significance level, frequentist methods usually make explicit the trade-off between type I error (rejecting a true null hypothesis) and type II error (failing to reject a false null hypothesis).

Table 4.1: Notations in Chapter 4.

  supervised learning | statistical inference | Bayesian or frequentist
  $\theta$: model parameter | $H$: hypothesis | --
  $\mathcal{D}$: training set | $D$: observed data | --
  likelihood $f_\theta(x) = p(y \mid x, \theta)$ | $P(D \mid H)$ | both
  prior $p(\theta)$ | $P(H)$ | Bayesian
  posterior $p(\theta \mid \mathcal{D})$ | $P(H \mid D)$ | Bayesian

4.1.2 Bayesian Inference

In sharp contrast, the Bayesian associates probability with the hypothesis: probability describes how likely a hypothesis is to be true or false in the observer's view. Therefore, the uncertainty arises from the observer's lack of knowledge about the world, as well as from the randomness in the data. The uncertainty lies in the observer's mind, and we use probability, prior and posterior, to measure a degree of belief. In our setting, the Bayesian assigns a subjective prior to the model parameters, $p(\theta)$.
After the data is observed, we conduct training and update our belief to the posterior distribution, following Bayes' rule,

$$p(\theta \mid \mathcal{D}) = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{p(\mathcal{D})}.$$

The posterior distribution $p(\theta \mid \mathcal{D})$ measures the uncertainty in the model parameters. Multiple hypotheses, i.e., different values of $\theta$, can explain the limited observed data $\mathcal{D}$, and we assign probability accordingly to all hypotheses to represent our confusion, based on how compatible they are with the observations, $p(\mathcal{D} \mid \theta)$, and on our prior belief, $p(\theta)$.

Two types of uncertainty. At test time, given a new sample $x'$, the Bayesian makes predictions by integrating over all possible models,

$$p(y' \mid x') = \int \underbrace{p(y' \mid x', \theta)}_{\text{aleatoric uncertainty}} \; \underbrace{p(\theta \mid \mathcal{D})}_{\text{epistemic uncertainty}} \; d\theta. \tag{4.1}$$

In the literature, $p(y' \mid x')$ is referred to as the predictive distribution. The uncertainty in the prediction, or predictive uncertainty, arises from two sources. The first is our lack of knowledge, i.e., the uncertainty in the model parameters after training, $p(\theta \mid \mathcal{D})$, as described in the last paragraph. It is often referred to as epistemic uncertainty, or model uncertainty, and it can be reduced in the limit of infinite training data. The second source is the test point itself, the inherent randomness in the data $p(y' \mid x', \theta)$, such as class overlap, label noise, and homoscedastic or heteroscedastic observation noise. It is referred to as aleatoric or data uncertainty [48, 59, 146]. (In all analysis in this chapter, we assume that the prior class, the posterior class, and the likelihood function are correctly specified w.r.t. the real-world data distribution.) In problems like decision making, this is also called risk [166], as it measures the stochasticity in the data over which we have no control. In summary:

  posterior $p(\theta \mid \mathcal{D})$: epistemic uncertainty, reducible, "uncertainty";
  likelihood $p(y' \mid x', \theta)$: aleatoric uncertainty, irreducible, "risk".

We conclude this section with a coin-tossing example that demonstrates the procedure of uncertainty estimation as a frequentist and as a Bayesian.

Example 1. Beer has a coin, and he suspects that it is biased towards heads. He runs an experiment to find out: he tosses the coin 6 times and observes the sequence HHHHHT. Now Beer wants to estimate the uncertainty that the coin is indeed biased towards heads, given the observed data. Let $\theta$ be the probability of heads. We explain the uncertainty in both the frequentist's and the Bayesian's eyes.

Frequentist. We have the null hypothesis and a one-sided alternative,

$$H_0: \theta = 0.5, \qquad H_a: \theta > 0.5.$$

We compute the one-sided p-value as the probability of an outcome at least as extreme as the sequence HHHHHT, i.e., 5 or 6 heads in 6 tosses, and we get $0.109$, greater than the common significance level of $0.05$. As a result, a frequentist quantifies his uncertainty with the following statement: given the evidence of the sequence HHHHHT, there is not enough evidence to reject $H_0$; we cannot conclude that the coin is unfair. Equivalently, if the coin is fair, $10.9\%$ of the time one would observe outcomes as extreme as HHHHHT or worse, and would wrongly reject $H_0$ due to the randomness in coin flipping.

Bayesian. We choose an uninformative prior over $\theta$, $p(\theta) = \mathrm{Beta}(1, 1)$. After observing the data, the posterior is $p(\theta \mid HHHHHT) = \mathrm{Beta}(6, 2)$. To estimate the uncertainty, a Bayesian would say: my posterior belief that the coin is biased towards heads is $p(\theta > 0.5 \mid \mathcal{D}) = 0.9375$. In this example, the epistemic uncertainty is $p(\theta \mid \mathcal{D}) = \mathrm{Beta}(6, 2)$, and the aleatoric uncertainty is $p(H \mid \theta) = \theta$. The epistemic uncertainty will collapse to a $\delta$-distribution in the limit of infinitely many coin observations (reducible uncertainty), while the aleatoric uncertainty $\mathrm{Bernoulli}(\theta)$ is irreducible, since the outcome of any single flip will always be random.
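For concreteness, both numbers in Example 1 can be reproduced in a few lines of Python. The sketch below (a minimal illustration using scipy.stats; the variable names are ours) computes the frequentist binomial-tail p-value and the Bayesian posterior tail probability.

```python
from scipy.stats import binom, beta

n, heads = 6, 5  # observed HHHHHT: 5 heads in 6 tosses

# Frequentist: one-sided p-value under H0 (theta = 0.5),
# i.e. the probability of observing 5 or more heads in 6 tosses.
p_value = 1.0 - binom.cdf(heads - 1, n, 0.5)
print(f"p-value = {p_value:.3f}")  # 0.109 -> cannot reject H0 at 0.05

# Bayesian: the Beta(1, 1) prior updated with 5 heads / 1 tail
# gives the posterior Beta(6, 2); report P(theta > 0.5 | D).
posterior = beta(1 + heads, 1 + (n - heads))
print(f"P(theta > 0.5 | D) = {1.0 - posterior.cdf(0.5):.4f}")  # 0.9375
```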
4.2 How Do We Model Uncertainty?

In this section, we survey algorithms and models developed under both the Bayesian and the frequentist views to solve uncertainty-related problems in supervised learning.

4.2.1 Bayesian Neural Networks

Bayesian inference provides a principled way to model uncertainty: it represents our strength of belief in terms of posterior probability, over both the trained parameters and the prediction outputs at test time. Moreover, the uncertainty measure is handy to use when combined with an action component in decision-making applications. It has therefore gained much popularity in the machine learning community.

Figure 4.2: Illustration of a Bayesian neural network (right), compared to a conventional deterministic neural network (left). (Adapted from [17].)

Bayesian neural networks and Gaussian processes are the two most popular models for Bayesian inference, and the connection between Gaussian processes and infinite neural networks has long been noted [159, 219]. The Gaussian process provides a powerful non-parametric framework for reasoning over functions. Despite its appealing theory, its super-linear computational and space complexities make it less suitable for large-scale datasets compared to neural networks. We thus focus on Bayesian neural networks in this survey, but many of the methodologies and techniques discussed below apply to Gaussian processes as well.

4.2.1.1 What is a Bayesian Neural Network?

A Bayesian neural network (BNN) assigns probabilities to the hypothesis class of a neural network: all weights are represented by probability distributions over possible values, instead of deterministic fixed values. See Figure 4.2 for an illustration. As discussed in Section 4.1, the uncertainty in a BNN is measured by the posterior distribution over the weight parameters, $p(\theta \mid \mathcal{D})$. The paradigm shift from deterministic to random weights leads to computational challenges at both training and inference time, especially when the neural network has hundreds of thousands of parameters. At training time, the challenge is how to learn the posterior distribution efficiently on a large dataset, which will be the main focus of Section 4.2.1.2. At test time, the predictive uncertainty is computed by sampling from the posterior and aggregating the predictions from the sampled networks, as in Eq. (4.2); this step can be computationally expensive when the model is large.

$$p(y' \mid x') = \int p(y' \mid x', \theta)\, p(\theta \mid \mathcal{D})\, d\theta \approx \frac{1}{M} \sum_{m=1}^{M} p(y' \mid x', \theta_m) \tag{4.2}$$

In Chapter 5, we propose a closed-form approximation to Eq. (4.2) for the case where the likelihood is a softmax function and the posterior is a Gaussian distribution, as is common in deep learning classification models. The analytical approximation is shown to work well on uncertainty tasks.
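As a concrete illustration of the Monte Carlo estimate in Eq. (4.2), the following sketch averages softmax predictions over weight samples drawn from a hypothetical Gaussian posterior over the last-layer weights; the dimensions, posterior parameters, and feature vector are all placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical posterior over last-layer weights: mean and (diagonal) std.
d, K, M = 32, 10, 500            # feature dim, classes, posterior samples
W_mean = rng.normal(size=(d, K))
W_std = 0.1 * np.ones((d, K))

g = rng.normal(size=d)           # feature of one test point

# Monte Carlo estimate of Eq. (4.2): average the per-sample softmax outputs.
probs = np.zeros(K)
for _ in range(M):
    W = W_mean + W_std * rng.normal(size=(d, K))  # theta_m ~ p(theta | D)
    probs += softmax(g @ W)
probs /= M
print(probs.sum())               # 1.0: a proper predictive distribution
```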
4.2.1.2 Scaling up Bayesian Neural Networks

In this section, we describe popular approaches to learning posterior distributions in BNNs. As the posterior distribution is generally intractable once we go beyond simple linear networks and conjugate priors, we resort to approximation schemes. The approximations fall broadly into two categories. The first is stochastic techniques such as Markov chain Monte Carlo (MCMC) sampling, which generate samples from the posterior without explicitly specifying the density function. The second is a broad range of deterministic approximation schemes, which use an analytical approximation $q(\theta)$ to estimate the posterior distribution $p(\theta \mid \mathcal{D})$ via the calculus of variations. Recent advances have focused on improving the scalability of these methods both to large-scale datasets and to deep models with thousands of parameters. We summarize these methods and recent results as follows.

Posterior Sampling. For the sampling-based approach, the pioneering work of Neal [158] first proposed "hybrid Monte Carlo", now mostly referred to as Hamiltonian Monte Carlo (HMC), and tested it on a small Bayesian neural network with 16 hidden units. The idea is to adopt physical-system dynamics, which uses gradient steps on the log-likelihood of the data to propose future states in the Markov chain, reducing the random-walk behavior of MCMC and leading to faster convergence. To scale up to large datasets, stochastic gradients were introduced to HMC by Welling and Teh [215], followed by many others [2, 28]; see [139] for more discussion. The drawbacks of the sampling-based approach include memory inefficiency (we need to store multiple copies of the network weights, which is expensive for modern architectures), slow mixing times, and the difficulty of guaranteeing convergence to the true posterior in practice [155].

Variational Inference. The main idea of approximate inference is to learn an analytical approximation to the posterior distribution via the calculus of variations. One of the most widely used approximation techniques is variational inference (VI), or variational Bayes (VB). In VI, inference is cast as an optimization problem: we optimize the parameters $\nu$ of a parameterized approximate posterior $q_\nu(\theta)$ such that the KL-divergence between the approximation $q_\nu$ and the true posterior $p(\theta \mid \mathcal{D})$ is minimized:

$$\min_\nu \; \mathrm{KL}[q_\nu(\theta) \,\|\, p(\theta \mid \mathcal{D})] \;\Longleftrightarrow\; \max_\nu \; \underbrace{\mathbb{E}_{q_\nu(\theta)}[\log p(\mathcal{D} \mid \theta)]}_{\text{expected log-likelihood}} - \underbrace{\mathrm{KL}[q_\nu(\theta) \,\|\, p(\theta)]}_{\text{regularization}}. \tag{4.3}$$

In practice, we equivalently maximize the variational lower bound of the marginal log-likelihood of the data $\log p(\mathcal{D})$, the ELBO (the right-hand side of Eq. (4.3)). Note that the ELBO decomposes into two terms: a data-dependent term, which we call the expected log-likelihood, and a regularization term $\mathrm{KL}[q_\nu(\theta) \,\|\, p(\theta)]$, which measures the divergence between the posterior $q_\nu(\theta)$ and the prior $p(\theta)$. Oftentimes we parameterize $q_\nu(\theta)$ as an exponential-family distribution, for example a Gaussian, so that the regularization term can be computed in closed form. The expected log-likelihood term is computed by sampling from the posterior $q_\nu(\theta)$ and by minibatch sampling from the dataset. This approach is referred to as stochastic gradient variational Bayes [106].
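To make the objective in Eq. (4.3) concrete, here is a minimal sketch of a single ELBO evaluation for a toy Bayesian logistic regression with a mean-field Gaussian posterior and a standard normal prior; the data, dimensions, and function names are hypothetical, and in practice one would ascend the gradient of this quantity w.r.t. (mu, log_sigma).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a Bayesian logistic-regression "network".
N, d = 200, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=N) > 0).astype(float)

def elbo(mu, log_sigma, n_samples=8):
    """Monte Carlo ELBO for q(w) = N(mu, diag(sigma^2)), prior N(0, I)."""
    sigma = np.exp(log_sigma)
    ell = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.normal(size=d)      # reparameterization trick
        logits = X @ w
        # Bernoulli log-likelihood log p(D | w), numerically stable form.
        ell += np.sum(y * logits - np.logaddexp(0.0, logits))
    ell /= n_samples
    # Closed-form KL(q || p) between two diagonal Gaussians.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return ell - kl

mu, log_sigma = np.zeros(d), np.zeros(d)
print(f"ELBO at initialization: {elbo(mu, log_sigma):.2f}")
```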
One challenge of this procedure is that the minibatch-based Monte Carlo estimator of the expected log-likelihood, and the estimator of its gradient, exhibit high variance. Various variance-reduction techniques have been proposed in the context of Bayesian neural networks. [17, 107] proposed to sample the hidden-unit activations instead of the weight parameters to reduce the variance of the gradient, known as the "local re-parameterization trick". Another idea is to use deterministic approximations in place of Monte Carlo samples, such as moment approximation of the hidden activations [222] and a decomposition of the output of the ReLU [101].

Besides high variance, another criticism of VI is that the approximate posterior family, such as the Gaussian distribution, is oversimplified and cannot characterize the true posterior, which is oftentimes multimodal. Matrix-variate Gaussian posteriors [137] and flow-based multiplicative noise [138] have been proposed to improve the flexibility of the posterior distribution. Besides increasing the expressiveness of an explicit variational posterior, people have also tried to use implicit distributions as BNN posteriors, inspired by the success of implicit distributions in deep generative models such as generative adversarial networks (GANs). Implicit posteriors do not admit a tractable density function but can generate samples easily [172, 197]. The difficulty with implicit posteriors is that the regularization term $\mathrm{KL}[q_\nu(\theta) \,\|\, p(\theta)]$ is no longer tractable, and approximate estimators must be applied. Other advances in VI for BNNs include automatic selection of an informative prior for a task [74, 222]; alternative VI training algorithms [103, 228] inspired by the connection between VI and natural-gradient optimization [4]; and Stein-operator-based inference algorithms [135]. Besides VI, other approximate inference methods can also be applied to Bayesian neural networks, including the Laplace approximation [185] and expectation propagation [76, 88, 87].

Wild Approximations. Lastly, we describe a "wild" approximation method, which demonstrates strong empirical performance on real-world uncertainty tasks despite theoretical deficiencies. Gal [59] and Kingma et al. [107] proposed to interpret dropout [204] as VI, where the Bernoulli distribution in dropout is interpreted as the variational posterior. We can thus obtain uncertainty information for free from existing deep learning models trained with dropout. Despite the handsome empirical performance [154], several criticisms have been raised regarding logical flaws in the VI interpretation of dropout: firstly, Osband [166] points out that the Bayesian posterior of dropout does not collapse to a point mass in the limit of infinite data; secondly, Hron et al. [94] show that the improper priors used in variational dropout lead to unbounded posteriors, and that the regularization term $\mathrm{KL}[q_\nu(\theta) \,\|\, p(\theta)]$ in Eq. (4.3) is ill-defined under the dropout posterior.

How accurate is the Bayesian posterior? The rapid progress in scaling up Bayesian neural networks has made it possible to benchmark the uncertainty quantification performance of BNNs on large-scale datasets. However, recent works [217, 221] cast doubt on how accurately the Bayesian posterior represents uncertainty. [217] finds that the true Bayes posterior can be even worse than a simple point-estimate model, unless one scales the posterior with a temperature $T < 1$, the so-called "cold posterior". Lastly, we point out that whether the posterior is learnt by MCMC or VI, one always needs to sample from it to compute the predictive uncertainty at test time, as in Eq. (4.2). This sampling procedure, which requires multiple feedforward passes through deep networks, can be cumbersome in real applications.

4.2.2 Frequentist Approaches

In contrast to the Bayesian approach, [123] proposed to use deep ensembles, together with adversarial training [68], which demonstrate state-of-the-art performance on several uncertainty estimation tasks [201]. The method is motivated by the classical statistical technique of bootstrapping, with simple yet effective modifications to the vanilla recipe. Another popular resampling method, the jackknife, has also found applications for deep models: Barber et al. [8] show that jackknife estimators can achieve worst-case coverage guarantees on regression problems.
Other frequentist works focus on post-processing procedures that obtain a confidence measure for the prediction. Hechtlinger et al. [80] construct conformal prediction sets [214] based on the predictions of a deep neural network. The method outputs multiple class labels when the predictor is unsure among more than one category, and it can output the null set, indicating "I don't know", when the test sample does not resemble any training examples. Classification with a rejection option, or learning with abstention, has been studied in the past for different classifiers, for example SVMs [9] and other kernel-based classifiers [39]. Recently, Geifman and El-Yaniv [62] and Geifman et al. [63] developed selective classification algorithms for deep neural network classifiers based on the softmax output.

4.2.3 Discussion and Comparison

The Bayesian approach provides a unified treatment of uncertainty at both training and test time. Besides, Bayesian training has several advantages over the deterministic risk-minimization predictor: it performs automatic complexity control through the Bayesian Occam's razor [140], and it makes the model's inductive bias explicit through the prior distribution. However, the drawback of the Bayesian approach is that the model can be completely wrong if the assumed prior or the likelihood function is mis-specified, which nullifies the uncertainty estimate obtained via the posterior distribution. The statistical guarantees provided by the frequentist, on the other hand, remain valid as long as the i.i.d. assumption on the data holds. The frequentist approach is also favored over the Bayesian one on computational-complexity grounds.

4.3 How Do We Evaluate Uncertainty?

We have described various approaches to estimating uncertainty for deep neural networks. However, evaluating and comparing them in practice is not a trivial task, as we do not have the ground-truth distribution of the data. In what follows, we describe three benchmark tasks for evaluating uncertainty estimates.

4.3.1 Calibration

Probability calibration refers to the requirement that the classifier's probability output for the predicted label should match its ground-truth likelihood of correctness. In other words, among all cases where the algorithm predicts an event with probability 0.8, 80% of them should be correct. Calibration has been studied in supervised learning, via Platt scaling [175] and isotonic regression [161], as well as in online learning [27, 42].

Recently, in the context of deep learning, Guo et al. [73] reported that popular deep neural networks are poorly calibrated: the prediction probability of the model tends to be overconfident and does not reflect the true likelihood of correctness. They propose to address mis-calibration with temperature scaling, a single-parameter variant of Platt scaling, as post-processing. Follow-ups include [121], where a mis-calibration score is penalized during training as a regularizer, and [119], where recalibration is extended to regression. Many recent works on uncertainty estimation evaluate calibration scores in their experiments [123, 61, 118, 150]. A drawback of calibration, however, is that it is numerically sensitive to the binning procedure, and it does not monotonically improve as predictions approach the ground truth.
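As an illustration of the post-processing recipe of Guo et al. [73], the sketch below fits the single temperature scalar by minimizing held-out NLL; the held-out logits and labels here are synthetic stand-ins, and in practice they would come from a trained model on a validation set.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(logits, labels, T):
    """Average negative log-likelihood of softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    """Fit the scalar T of temperature scaling on held-out NLL;
    the model weights (and hence its accuracy) stay untouched."""
    res = minimize_scalar(lambda t: nll(val_logits, val_labels, t),
                          bounds=(0.05, 20.0), method="bounded")
    return res.x

# Hypothetical overconfident held-out logits with 10% label noise.
rng = np.random.default_rng(0)
val_logits = rng.normal(scale=5.0, size=(1000, 10))
val_labels = val_logits.argmax(axis=1)
noise = rng.uniform(size=1000) < 0.1
val_labels[noise] = rng.integers(0, 10, size=noise.sum())

T = fit_temperature(val_logits, val_labels)
print(f"fitted T = {T:.2f}; NLL {nll(val_logits, val_labels, 1.0):.3f} "
      f"-> {nll(val_logits, val_labels, T):.3f}")
```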
4.3.2 Out-of-distribution (OOD) Samples Detection

Figure 4.3: Demonstration of undesired predictions of popular neural network architectures on out-of-distribution samples. The networks are trained on ImageNet [187]. The red predictions are wrong, the green predictions are reasonable, and the orange predictions are in between. See [194] for more details. (Adapted from [194].)

Another important application scenario for uncertainty estimation in supervised learning is detecting and rejecting invalid inputs: a learning system can receive a test sample that does not belong to the population distribution of the training data. This violates the i.i.d. assumption of supervised learning; nevertheless, we need to detect and reject such samples to prevent unpredictable behavior in real-world systems. Figure 4.3 demonstrates failure cases on out-of-distribution examples in object recognition for popular neural networks. Hendrycks and Gimpel [82] first proposed the task of out-of-distribution (OOD) sample detection in the context of deep neural networks, and also provided a simple baseline that rejects OOD samples based on the softmax output. Liang et al. [133] improved over the baseline with temperature scaling and adversarial perturbations of the input; Lee et al. [128] further improved robustness by feeding synthetic OOD samples to the network during training, where the OOD samples are generated by a separately trained GAN. A recent work [129] trains a Gaussian discriminant analysis component on the output of the penultimate layer of DNNs and uses the density from the Gaussian mixture to reject OOD samples.

Sensoy et al. [193], Malinin and Gales [146], and Milios et al. [150] proposed to place a Dirichlet distribution over the softmax probabilities of neural network outputs, to evaluate the confidence of the softmax predictor and solve the OOD task. Moreover, [146] introduces the idea of distributional uncertainty, which arises for OOD samples, i.e., under a mismatch between the test distribution and the training distribution. They point out that previous works often conflate distributional uncertainty with data uncertainty, or implicitly capture it under model uncertainty (see Section 4.1). Figure 4.4 provides a motivating example of distinguishing distributional uncertainty from data uncertainty via a Dirichlet distribution over the softmax in classification problems.

Figure 4.4: Demonstration of the difference between data uncertainty and distributional uncertainty in classification: the predictor outputs $f_\theta(x') \in \Delta^{K-1}$. There are three possible scenarios for the distribution $\Pr[f_\theta(x')]$ over the simplex. In (a), when $f_\theta(x')$ is confident in its prediction, $\Pr[f_\theta(x')]$ concentrates at one of the corners of the simplex. In (b), sample $x'$ lies in a region with a high degree of noise or class overlap, i.e., high data uncertainty; $\Pr[f_\theta(x')]$ is a sharp distribution focused at the center of the simplex, confidently predicting a flat categorical distribution over labels. In (c), for an out-of-distribution sample $x'$, which they refer to as distributional uncertainty, $\Pr[f_\theta(x')]$ is a flat distribution over the simplex, indicating large uncertainty in $f_\theta(x')$.

Shafaei et al. [194] provide a comprehensive and exhaustive evaluation of recent algorithms on the OOD task for deep models, and also propose a benchmark procedure for evaluating OOD performance. Beyond supervised learning, deep models for unsupervised learning also suffer from OOD samples: Nalisnick et al. [156] reveal that deep generative models can wrongly assign higher likelihood to inputs drawn from a distribution different from that of the training data.
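The max-softmax baseline of Hendrycks and Gimpel [82] is simple enough to sketch in a few lines. Below, the logits for in-domain and OOD inputs are synthetic placeholders (in-domain logits are made artificially sharper); the threshold-independent quality of the detector is then summarized with AUROC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def max_softmax_score(logits):
    """Confidence score of the max-softmax baseline: the maximum softmax
    probability per input; low scores flag likely OOD inputs."""
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1)

# Hypothetical logits: in-domain logits are sharper than OOD logits.
rng = np.random.default_rng(0)
logits_in = rng.normal(scale=5.0, size=(500, 10))
logits_out = rng.normal(scale=1.0, size=(500, 10))

scores = np.concatenate([max_softmax_score(logits_in),
                         max_softmax_score(logits_out)])
is_in_domain = np.concatenate([np.ones(500), np.zeros(500)])

# Threshold-independent detection quality (in-domain as the positive class).
print(f"AUROC = {roc_auc_score(is_in_domain, scores):.3f}")
```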
4.3.3 Adversarial Robustness

A related but more challenging task than OOD detection is adversarial attacks [170], where an adversary intentionally constructs inputs to force misclassification. The adversary is given a certain level of information about the target model, for example the model parameters in a white-box attack. The goal of adversarial robustness is to make classifiers robust to test-time adversarial perturbations of the inputs. Since its inception, adversarial robustness has drawn much attention due to its importance in AI safety.

Concretely, we use $\delta$ to denote the input perturbation crafted by the adversary. The contaminated input is fed into a trained model, which predicts $f_\theta(x + \delta)$. The goal of the adversary is to maximize the loss the model suffers on the attacked input, namely,

$$\max_{\delta \in \Delta} \; \ell(f_\theta(x + \delta), y). \tag{4.4}$$

Usually the domain $\Delta$ is selected such that the perturbation is imperceptible to humans, and thus hard to detect. Examples include $\ell_p$-norm balls [68, 169, 207] and rotations and/or translations [52]. Figure 4.5 illustrates an example of such an adversarially generated image.

Figure 4.5: An example of an adversarial example: a slightly perturbed image, almost indistinguishable from the natural image to human eyes, causes state-of-the-art classifiers to make incorrect predictions. See [113] for more details. (Adapted from [113].)

Starting from the formulation of Eq. (4.4), many algorithms have been proposed to solve the optimization problem, leading to different adversarial attack methods. For example, Szegedy et al. [207] apply L-BFGS, the Fast Gradient Sign Method (FGSM) [68] applies a single gradient step, and Madry et al. [145] apply projected gradient descent to the maximization objective. Loss functions other than the cross-entropy loss are investigated in [26].

The objective of adversarial robustness, on the other hand, is to defend models against adversarial attacks, as shown in Eq. (4.5):

$$\min_\theta \; \sum_{(x_i, y_i) \in \mathcal{D}} \; \max_{\delta \in \Delta} \; \ell(f_\theta(x_i + \delta), y_i). \tag{4.5}$$

The most popular way to achieve adversarial robustness is to optimize the model on perturbed inputs during training. This is inspired by Danskin's theorem, which, ignoring some technicalities [113], states that

$$\nabla_\theta \max_{\delta \in \Delta} \ell(f_\theta(x_i + \delta), y_i) = \nabla_\theta\, \ell(f_\theta(x_i + \delta^\star(x_i)), y_i),$$

where $\delta^\star(x_i) = \arg\max_{\delta \in \Delta} \ell(f_\theta(x_i + \delta), y_i)$. This adversarial training procedure is outlined in Algorithm 2. This simple recipe is highly effective in defending against popular attacks, assuming we know the attack a priori.

Algorithm 2: Adversarial training [68]
  Input: Training set $\mathcal{D}$, learning rate $\eta$.
  1: repeat
  2:   Select minibatch $B \subset \mathcal{D}$.
  3:   For each $(x_i, y_i) \in B$, compute the adversarial perturbation $\delta^\star(x_i)$ that approximately solves Eq. (4.4).
  4:   Gradient update: $\theta \leftarrow \theta - \frac{\eta}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta\, \ell(f_\theta(x_i + \delta^\star(x_i)), y_i)$.
  5: until convergence
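The inner step of Algorithm 2 can be instantiated with FGSM, the single-gradient-step attack [68]. The sketch below assumes a PyTorch model; the tiny linear model and the input batch are placeholders for illustration only.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, y, eps):
    """FGSM: delta = eps * sign(grad_x loss), a single-step approximate
    maximizer of Eq. (4.4) over the l_inf ball of radius eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return eps * x.grad.sign()

# Hypothetical model and batch.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))

delta = fgsm_perturbation(model, x, y, eps=0.1)
x_adv = (x + delta).clamp(0.0, 1.0)   # keep the attacked image a valid input
print((model(x_adv).argmax(1) == y).float().mean())  # accuracy under attack
```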
4.4 Improving Uncertainty and Robustness for Deep Learning

The problem of uncertainty and robustness has become one of the most important research areas in the development of deep learning, due to its critical practical impact. Deep models suffer from over-confident uncertainty estimates and are particularly vulnerable to input perturbations and corruptions. Many works have been proposed to understand this phenomenon theoretically and to improve models' uncertainty and robustness empirically. We briefly summarize the main themes below. On the theoretical side, [49, 55] derive upper bounds on adversarial robustness via concentration-of-measure arguments. On the empirical side, besides adversarial training [68] (see Algorithm 2), mix-up [229], AugMix (data augmentation) [85], as well as pre-training and self-supervised learning [83, 84], have been shown to be effective in improving models' robustness against invalid inputs and adversarial attacks. Datasets for evaluating and benchmarking model robustness have been curated, including distributional-shift datasets [183], common image corruptions and perturbations [81], and natural adversarial examples [86]. Since there is no ground truth in the evaluation of uncertainty, new evaluation metrics are proposed in [5, 163, 120].

4.5 Uncertainty Beyond Supervised Learning

Beyond supervised learning, uncertainty estimation for deep models has also been studied and evaluated in active learning and reinforcement learning.

Depeweg et al. [48] proposed to use entropy and variance as metrics to measure, in practice, the decomposition of predictive uncertainty into model uncertainty and data uncertainty (see Section 4.1 and Eq. (4.1)):

$$\underbrace{\mathcal{H}[p(y \mid x)]}_{\text{Total Uncertainty}} = \underbrace{\mathcal{I}(y; \theta \mid x)}_{\text{Model Uncertainty}} + \underbrace{\mathbb{E}_{q(\theta)}\big[\mathcal{H}[p(y \mid \theta, x)]\big]}_{\text{Expected Data Uncertainty}} \quad \text{(entropy)};$$
$$\underbrace{\sigma^2[y \mid x]}_{\text{Total Variance}} = \underbrace{\sigma^2_{q(\theta)}\big[\mathbb{E}[y \mid \theta, x]\big]}_{\text{Model Variance}} + \underbrace{\mathbb{E}_{q(\theta)}\big[\sigma^2[y \mid \theta, x]\big]}_{\text{Expected Data Variance}} \quad \text{(variance)}, \tag{4.6}$$

where $\mathcal{I}(y; \theta \mid x) = \mathbb{E}_{q(\theta)}\big[\mathrm{KL}[p(y \mid \theta, x) \,\|\, p(y \mid x)]\big]$ is the mutual information between $\theta$ and $y$. A large value of $\mathcal{I}(y; \theta \mid x)$ indicates that knowing $y$ would reduce a large amount of uncertainty in $\theta$, i.e., model uncertainty. This decomposition allows us to identify informative samples in active learning: a point with a large value of $\mathcal{I}(y; \theta \mid x)$ is favorable, as it contains more information for learning about the parameters. (A numerical sketch of this decomposition is given at the end of this section.)

In reinforcement learning applications, uncertainty estimation provides intrinsic motivation for efficient exploration [48, 164, 167, 168], which can partially solve the problem of premature convergence to a local optimum.
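The entropy decomposition of Eq. (4.6) is easy to compute from Monte Carlo posterior samples of class probabilities. In the minimal sketch below, the per-sample predictions are synthetic Dirichlet draws standing in for the outputs of sampled models.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(probs):
    """probs: array of shape (M, K), class probabilities p(y | theta_m, x)
    from M posterior samples. Returns the three terms of Eq. (4.6)."""
    mean_probs = probs.mean(axis=0)                 # p(y | x)
    total = entropy(mean_probs)                     # total uncertainty
    expected_data = entropy(probs, axis=-1).mean()  # E_q[H[p(y | theta, x)]]
    model = total - expected_data                   # I(y; theta | x)
    return total, model, expected_data

# Hypothetical posterior samples: each row is one sampled model's prediction.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=np.ones(3), size=100)   # (M=100, K=3)
total, model, data = uncertainty_decomposition(probs)
print(f"total={total:.3f}  model={model:.3f}  expected data={data:.3f}")
```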
Chapter 5 UNCERTAINTY ESTIMATION WITH INFINITESIMAL JACKKNIFE AND MEAN-FIELD APPROXIMATION

In this chapter, we address uncertainty estimation for deep models by leveraging the curvature information of the training loss.

Uncertainty quantification is an important research area in machine learning. Many approaches have been developed to improve the representation of uncertainty in deep models to avoid overconfident predictions. Existing ones, such as Bayesian neural networks and ensemble methods, require modifications to the training procedure and are computationally costly for both training and inference. Motivated by this, we propose the mean-field infinitesimal jackknife (mfIJ): a simple, efficient, and general-purpose plug-in estimator for uncertainty estimation. The main idea is to use the infinitesimal jackknife, a classical statistical tool for uncertainty estimation, to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution, without retraining. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference step, in which Gaussian random variables must be integrated against the softmax nonlinearity to produce probabilities for multinomial variables. The approach has many appealing properties: it functions as an ensemble without requiring multiple models, and it enables closed-form approximate inference using only the first and second moments of Gaussians. Empirically, mfIJ performs competitively with state-of-the-art methods, including deep ensembles, temperature scaling, dropout, and Bayesian neural networks, on important uncertainty tasks. It especially outperforms many methods on out-of-distribution detection.

5.1 Introduction

Recent advances in deep neural nets have dramatically improved predictive accuracy in supervised tasks. For many applications, such as autonomous vehicle control and medical diagnosis, decision making also needs an accurate estimate of the uncertainty pertinent to the prediction. Unfortunately, deep neural nets are known to output overconfident, mis-calibrated predictions [73]. It is crucial to improve deep models' ability to represent uncertainty.

There has been steady development of new methods for uncertainty quantification in deep neural nets. One popular idea is to introduce additional stochasticity (such as temperature annealing or dropout in the network architecture) to existing trained models to represent uncertainty [60, 73]. Another line of work uses an ensemble of models, collectively representing the uncertainty about the predictions; this ensemble can be obtained by varying training with respect to initialization [123], hyper-parameters [5], or data partitions, i.e., bootstraps [50, 190]. Yet another line of work uses Bayesian neural networks (BNN), which can be seen as an ensemble of an infinite number of models, characterized by the posterior distribution [17, 140]; in practice, one samples models from the posterior or uses variational inference. Each of these methods offers different trade-offs among computational cost, memory consumption, parallelization, and modeling flexibility. For example, while ensemble methods are often state-of-the-art, they are both computationally and memory intensive, requiring repeated training or the storage of the resulting models.

These costs stand in stark contrast to many practitioners' desiderata. Ideally, neither training models nor inference with models should incur additional memory and computational costs for estimating uncertainty beyond what is needed for making predictions. Additionally, it is desirable to be able to quantify uncertainty on an existing model when re-training is not possible.

In this work, we propose a new method to bridge the gap. The main idea of our approach is to use the infinitesimal jackknife, a classical statistical tool for uncertainty estimation [99], to construct a pseudo-ensemble that can be described with a closed-form Gaussian distribution. We then use this Gaussian distribution for uncertainty estimation. While the standard way is to sample models from this distribution and combine each sample's prediction, we develop a mean-field approximation to the inference step, where Gaussian random variables are integrated against the softmax nonlinearity to produce probabilities for multinomial variables. We show that the proposed approach, which we refer to as the mean-field infinitesimal jackknife (mfIJ), often surpasses or is competitive with existing approaches on the evaluation metrics of NLL, ECE, and out-of-distribution detection accuracy on several benchmark datasets.
mfIJ shares appealing properties with several recent approaches to uncertainty estimation: constructing the pseudo-ensemble with the infinitesimal jackknife does not require changing existing training procedures [190, 8, 65, 66, 112]; approximating the ensemble with a distribution removes the need to store many models, an impractical task for modern learning models [29, 143]; and the pseudo-ensemble distribution has a similar form to the Laplace approximation for Bayesian inference [7, 140], so existing approaches from computational statistics, such as Kronecker-product factorization, can be applied directly.

The mean-field approximation brings an additional appeal. It is in closed form and needs only the first and second moments of the Gaussian random variables. In our case, the first moments are simply the predictions of the networks, while the second moments involve the product between the inverse Hessian and a vector, which can be computed efficiently [1, 149, 173]. Additionally, the mean-field approximation can be applied whenever integrals of a similar form need to be approximated. In Section 5.5.4, we demonstrate its utility by applying it to the recently proposed SWAG algorithm for uncertainty estimation, where the Gaussian distribution is derived differently [29, 97, 143, 147].

We describe our approach in Section 5.2, followed by a discussion of related work in Section 5.3. Empirical studies and details of the approach are reported in Sections 5.4 and 5.5. We conclude in Section 5.6.

5.2 Approach

In this section, we start by introducing the necessary notation and defining the task of uncertainty estimation. We then describe the technique of the infinitesimal jackknife in Section 5.2.1, where we derive a closed-form Gaussian distribution over an infinite number of models estimated with the infinitesimal jackknife; we call them a pseudo-ensemble. We describe how to use this distribution for uncertainty estimation in Section 5.2.2, and present our efficient mean-field approximation to the Gaussian-softmax integral in Section 5.2.3. Lastly, we discuss the hyper-parameters of our method and present the algorithm in Section 5.2.4.

Notation. We are given a training set of $N$ i.i.d. samples $\mathcal{D} = \{z_i\}_{i=1}^N$, where $z_i = (x_i, y_i)$ with input $x_i \in \mathcal{X}$ and target $y_i \in \mathcal{Y}$. We fit the data with a parametric predictive model $y = f(x; \theta)$. We define the loss on a sample $z$ as $\ell(z; \theta)$ and optimize the model's parameters via empirical risk minimization on $\mathcal{D}$. The minimizer is given by

$$\bar{\theta} = \arg\min_\theta L(\mathcal{D}; \theta), \quad \text{where } L(\mathcal{D}; \theta) \stackrel{\text{def}}{=} \frac{1}{N} \sum_{i=1}^N \ell(z_i; \theta). \tag{5.1}$$

In practice, we are interested not only in the prediction $f(x; \bar{\theta})$ but also in quantifying the uncertainty of making such a prediction. In this paper, we consider (deep) neural networks as the predictive model.

5.2.1 Infinitesimal Jackknife and Its Distribution

The jackknife is a well-known resampling method for estimating the confidence interval of an estimator [210, 51]. The procedure is straightforward: each element $z_i$ is left out from the dataset $\mathcal{D}$ to form a unique "leave-one-out" jackknife sample $\mathcal{D}_{-i} = \mathcal{D} \setminus \{z_i\}$. A jackknife sample's estimate of $\bar{\theta}$ is given by

$$\hat{\theta}_i = \arg\min_\theta L(\mathcal{D}_{-i}; \theta). \tag{5.2}$$

We obtain $N$ such estimates $\{\hat{\theta}_i\}_{i=1}^N$ and use them to estimate the variance of $\bar{\theta}$ and of the predictions made with $\bar{\theta}$. In this vein, the jackknife is a form of ensemble method. However, it is not feasible to retrain modern neural networks $N$ times, where $N$ is often in the order of millions.

The infinitesimal jackknife is a classical tool to approximate $\hat{\theta}_i$ without re-training on $\mathcal{D}_{-i}$.
It is often used as a theoretical tool for asymptotic analysis [99], and is closely related to influence functions in robust statistics [37]. Recent studies have brought renewed interest in applying this methodology to machine learning problems [65, 112]. Here, we briefly summarize the method.

Linear approximation. The basic idea behind the infinitesimal jackknife is to treat $\bar{\theta}$ and $\hat{\theta}_i$ as special cases of an estimator on weighted samples,

$$\hat{\theta}(w) = \arg\min_\theta \sum_i w_i\, \ell(z_i; \theta), \tag{5.3}$$

where the weights $w_i$ form an $(N-1)$-simplex: $\sum_i w_i = 1$. Thus the maximum likelihood estimate $\bar{\theta}$ is $\hat{\theta}(w)$ at $w = \frac{1}{N}\mathbf{1}$. A jackknife sample's estimate $\hat{\theta}_i$, on the other end, is $\hat{\theta}(w)$ at $w = \frac{1}{N}(\mathbf{1} - e_i)$, where $e_i$ is the all-zero vector except for a value of 1 at the $i$-th coordinate.

Using a first-order Taylor expansion around $w = \frac{1}{N}\mathbf{1}$, we obtain (under the conditions of twice-differentiability and invertibility of the Hessian)

$$\hat{\theta}_i \approx \bar{\theta} + \frac{1}{N} H^{-1}(\bar{\theta})\, \nabla \ell(z_i; \bar{\theta}) \stackrel{\text{def}}{=} \bar{\theta} + \frac{1}{N} H^{-1} \nabla_i, \tag{5.4}$$

where $H(\bar{\theta})$ is the Hessian matrix of $L$ evaluated at $\bar{\theta}$, and $\nabla \ell(z_i; \bar{\theta})$ is the gradient of $\ell(z_i; \theta)$ evaluated at $\bar{\theta}$. We use $H$ and $\nabla_i$ as shorthand when there is enough context to avoid confusion.

An infinite number of infinitesimal jackknife samples. If the number of samples $N \to +\infty$, we can characterize the "infinite" number of $\hat{\theta}_i$ with a closed-form Gaussian distribution, using the sample mean and covariance as the distribution's mean and covariance:

$$\hat{\theta}_i \sim \mathcal{N}(m, \Sigma_I), \quad \text{with } m = \frac{1}{N}\sum_i \hat{\theta}_i = \bar{\theta}, \quad \text{and} \tag{5.5}$$
$$\Sigma_I = \frac{1}{N}\sum_i (\hat{\theta}_i - \bar{\theta})(\hat{\theta}_i - \bar{\theta})^T = \frac{1}{N^2}\, H^{-1} \Big[\frac{1}{N}\sum_i \nabla_i \nabla_i^T\Big] H^{-1} = \frac{1}{N^2}\, H^{-1} J H^{-1},$$

where $J$ denotes the observed Fisher information matrix.

Infinite infinitesimal bootstraps. The above procedure and analysis extend to bootstrapping (i.e., sampling with replacement). Similarly, to characterize the estimates from the bootstraps, we can use a Gaussian distribution (details omitted here for brevity),

$$\hat{\theta}_i \sim \mathcal{N}(\bar{\theta}, \Sigma_B), \quad \text{with } \Sigma_B = \frac{1}{N}\, H^{-1} J H^{-1}. \tag{5.6}$$

We refer to the distributions $\mathcal{N}(\bar{\theta}, \Sigma_I)$ and $\mathcal{N}(\bar{\theta}, \Sigma_B)$ as the pseudo-ensemble distributions. We can approximate further by $H = J$ to obtain $\Sigma_I \approx H^{-1}/N^2$ and $\Sigma_B \approx H^{-1}/N$.

Lakshminarayanan et al. [123] observed that using models trained on bootstrapped samples does not work as well empirically as other approaches, since the learner only sees about 63% of the dataset in each bootstrap sample. We note that this is an empirical limitation rather than a theoretical one: in practice, one can only train a very limited number of models. We hypothesize that we can get the benefits of combining an infinite number of models without training. Empirical results validate this hypothesis.
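A minimal sketch of Eqs. (5.4) and (5.5) follows; the Hessian and per-sample gradients here are synthetic stand-ins for quantities that would come from automatic differentiation on a trained model, and the dense inverse is only sensible for small parameter counts.

```python
import numpy as np

def pseudo_ensemble_covariance(H, grads):
    """Eq. (5.5): Sigma_I = (1/N^2) H^{-1} J H^{-1}, with J the observed
    Fisher information (1/N) * sum_i grad_i grad_i^T.
    H: (d, d) Hessian of the training loss at theta_bar.
    grads: (N, d) per-sample gradients at theta_bar."""
    N = grads.shape[0]
    J = grads.T @ grads / N
    H_inv = np.linalg.inv(H)      # fine for small d; use solves otherwise
    return H_inv @ J @ H_inv / N**2

# Hypothetical quantities for a small model (d parameters, N samples).
rng = np.random.default_rng(0)
d, N = 10, 1000
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)           # a dampened, positive-definite Hessian
grads = rng.normal(size=(N, d))
grads -= grads.mean(axis=0)       # gradients sum to zero at the minimizer

Sigma_I = pseudo_ensemble_covariance(H, grads)

# Eq. (5.4): one leave-one-out estimate, without any retraining.
theta_bar = np.zeros(d)
theta_1 = theta_bar + np.linalg.solve(H, grads[0]) / N
print(Sigma_I.shape, theta_1.shape)
```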
5.2.2 Sampling-Based Uncertainty Estimation with the Pseudo-ensemble Distributions

Given the general form of the pseudo-ensemble distributions $\mathcal{N}(\bar{\theta}, \Sigma)$, it is straightforward to see that, if we approximate the predictive function $f(\cdot; \theta)$ with a linear function,

$$f(x; \theta) \approx f(x; \bar{\theta}) + \nabla f^T (\theta - \bar{\theta}),$$

we can regard the models' predictions as a Gaussian-distributed random variable,

$$f(x; \theta) \sim \mathcal{N}\big(f(x; \bar{\theta}),\, \nabla f^T \Sigma\, \nabla f\big).$$

For predictive functions whose outputs are approximately Gaussian, it may be adequate to characterize the uncertainty with the approximated mean and variance. However, for predictive functions that are categorical, this approach is not applicable. A standard way is to use sampling to combine the discrete predictions from the models in the ensemble. For example, for classification $y = \arg\max_c f_c(x; \theta)$, where $f_c$ is the probability of labeling $x$ with the $c$-th category, the averaged prediction from the ensemble is

$$e_k = \mathbb{E}[P(y = k)] \approx \frac{1}{M}\sum_{i=1}^M \mathbb{1}\big[\arg\max_c f_c(x; \theta_i) = k\big], \tag{5.7}$$

where $\theta_i \sim \mathcal{N}(\bar{\theta}, \Sigma)$ and $\mathbb{1}$ is the indicator function. In the next section, we propose a new approach that avoids sampling and directly approximates the ensemble prediction of discrete labels.

5.2.3 Mean-field Approximation for Gaussian-softmax Integration

In deep neural networks for classification, the predictions $f_k(\cdot)$ are the outputs of the softmax layer,

$$f_k = \mathrm{SOFTMAX}(a_k) \propto \exp\{a_k\}, \quad \text{with } a = g(x; \phi)^T W, \tag{5.8}$$

where $g(x; \phi)$ is the transformation of the input through the layers before the fully connected layer, and $W$ is the weight matrix of the softmax layer. We focus on the case where $g(x; \phi)$ is deterministic; extending to random $\phi$ is straightforward. As discussed, we assume the pseudo-ensemble over $W$ forms a Gaussian distribution $W \sim \mathcal{N}(\bar{W}, \Sigma)$. Then we have $a \sim \mathcal{N}(\mu, S)$ such that

$$\mu = g(x; \phi)^T \bar{W}, \qquad S = g(x; \phi)^T\, \Sigma\, g(x; \phi), \tag{5.9}$$

where, with a slight abuse of notation, the form of $S$ is after proper vectorization and padding of $g$ and $\Sigma$. In Section 5.5.1, we give a detailed derivation of the expectation of $f_k$,

$$e_k = \mathbb{E}[f_k] = \int \mathrm{SOFTMAX}(a_k)\, \mathcal{N}(a; \mu, S)\, da. \tag{5.10}$$

The key idea is to apply the mean-field approximation $\mathbb{E}[f(x)] \approx f(\mathbb{E}[x])$ and use the following well-known formula for the Gaussian integral of the sigmoid function $\sigma(\cdot)$,

$$\int \sigma(a)\, \mathcal{N}(a; \mu, s^2)\, da \approx \sigma\left(\frac{\mu}{\sqrt{1 + \lambda_0 s^2}}\right),$$

where $\lambda_0$ is a constant, usually chosen to be $\pi/8$ or $3/\pi^2$. In the softmax case, we arrive at:

$$\text{Mean-Field 0 (mf0):}\quad e_k \approx \Bigg(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 s_k^2}}\Big)\Bigg)^{-1} \tag{5.11}$$
$$\text{Mean-Field 1 (mf1):}\quad e_k \approx \Bigg(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_k^2 + s_i^2)}}\Big)\Bigg)^{-1} \tag{5.12}$$
$$\text{Mean-Field 2 (mf2):}\quad e_k \approx \Bigg(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_k^2 + s_i^2 - 2s_{ik})}}\Big)\Bigg)^{-1} \tag{5.13}$$

The three approximations differ in how much information from $i \neq k$ is considered: ignoring $s_i$, considering its variance $s_i^2$, or considering the covariance $s_{ik}$ as well. Note that mf0 and mf1 are computationally preferable to mf2, which uses $K^2$ covariances, where $K$ is the number of classes. ([41] derived an approximation in the form of mf2 but did not apply it to uncertainty estimation.)

Intuition. The simple form of the mean-field approximations makes them easy to understand intuitively. We focus on mf0 and first rewrite it in the familiar "softmax" form:

$$\text{(mf0)}\quad e_k \approx \frac{e^{\mu_k / \sqrt{1 + \lambda_0 s_k^2}}}{\sum_i e^{\mu_i / \sqrt{1 + \lambda_0 s_k^2}}} \stackrel{\text{def}}{=} \hat{e}_k.$$

Note that this form resembles a softmax with temperature scaling, $P(y = k) \propto \exp\{\mu_k / T\}$. However, there are several important differences. In mf0, the temperature scaling factor is category-specific:

$$\hat{e}_k \propto \exp\Big\{\frac{\mu_k}{T_k}\Big\} \quad \text{with } T_k = \sqrt{1 + \lambda_0 s_k^2}. \tag{5.14}$$

Importantly, the factor $T_k$ depends on the variance of the category. For a prediction with high variance, the temperature for that category is high, reducing the corresponding "probability" $\hat{e}_k$; specifically, $\hat{e}_k \to \frac{1}{K}$ as $s_k \to +\infty$. In other words, the scaling factor is both category-specific and data-dependent, providing additional flexibility over a global temperature scaling factor.

Implementation nuances. Because of this category-specific temperature scaling, $\hat{e}_k$ (and the approximations from mf1 and mf2) no longer form a multinomial probability. Proper normalization should be performed: $e_k \leftarrow \hat{e}_k / \sum_i \hat{e}_i \stackrel{\text{def}}{=} p_k$.
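The mf0 approximation of Eq. (5.11), together with the normalization above, is a few lines of code. The sketch below is a direct transcription of the formula with $\lambda_0 = 3/\pi^2$ (the constant used in our implementation, see Section 5.5.3); the input means and variances are hypothetical.

```python
import numpy as np

LAMBDA0 = 3.0 / np.pi**2  # lambda_0; pi/8 is the other common choice

def mf0(mu, s2):
    """Mean-Field 0, Eq. (5.11): closed-form pseudo-ensemble softmax from
    logit means mu (K,) and logit variances s2 (K,), then normalized."""
    temp = np.sqrt(1.0 + LAMBDA0 * s2)                   # per-class T_k
    diff = (mu[:, None] - mu[None, :]) / temp[:, None]   # (mu_k - mu_i)/T_k
    e_hat = 1.0 / np.exp(-diff).sum(axis=1)
    return e_hat / e_hat.sum()            # normalization (Section 5.2.3)

mu = np.array([2.0, 1.0, 0.0])
print(mf0(mu, np.zeros(3)))               # s2 = 0 recovers plain softmax
print(mf0(mu, np.array([50.0, 0.0, 0.0])))  # the high-variance class shrinks
```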
5.2.4 Other Implementation Considerations

Temperature scaling. Temperature scaling was shown to be useful for obtaining calibrated probabilities [73]. This can easily be included as well, with

$$f_k \propto \exp\big\{T_{\text{act}}^{-1}\, a_k\big\}. \tag{5.15}$$

We can also combine it with another "temperature" scaling factor, representing how concentrated the models in the pseudo-ensemble are:

$$\theta \sim \mathcal{N}(\bar{\theta},\, T_{\text{ens}}^{-1}\, \Sigma). \tag{5.16}$$

Here $T_{\text{ens}}$ is for the pseudo-ensemble, or the posterior [217]. Note that these two temperatures control variability differently. As $T_{\text{ens}} \to 0$, the ensemble focuses on one model; as $T_{\text{act}} \to 0$, each model in the ensemble moves to "hard" decisions, as in Eq. (5.7). Using mf0 as an example,

$$e_k = \Bigg(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{T_{\text{act}}^2 + \lambda_0\, s_k^2 / T_{\text{ens}}}}\Big)\Bigg)^{-1}, \tag{5.17}$$

where $s_k$ is computed at $T_{\text{ens}} = 1$. Empirically, we tune the temperatures $T_{\text{ens}}$ and $T_{\text{act}}$ as hyper-parameters on a held-out set to optimize predictive performance.

Computational complexity and scalability. The bulk of the computation, as in Bayesian approximate inference, lies in computing $H^{-1}$ in $\Sigma_I$ and $\Sigma_B$, as in Eq. (5.5) and Eq. (5.6), or more precisely the product between the inverse Hessian and vectors, cf. Eq. (5.9). For smaller models, computing and storing $H^{-1}$ exactly is attractive. For large models, one can compute the inverse Hessian-vector product approximately using multiple Hessian-vector products [173, 1, 112]. Alternatively, we can approximate the inverse Hessian using Kronecker factorization [185]. In short, any advances in computing the inverse Hessian and related quantities can be used to accelerate the computation needed in this paper.

An application example. In Algorithm 3, we exemplify the application of the mean-field approximation to Gaussian-softmax integration for uncertainty estimation. For brevity, we consider the last/fully-connected layer's parameters $W$ as part of the deep neural network's parameters $\theta = (\phi, W)$, and we assume the mapping $g(x; \phi)$ prior to the fully connected layer is deterministic.

Algorithm 3: Uncertainty estimation with the Mean-Field Infinitesimal Jackknife (mfIJ)
  Input: A deep neural net model $\bar{\theta} = (\bar{\phi}, \bar{W})$, a training set $\mathcal{D}$, temperatures $T_{\text{ens}}, T_{\text{act}}$, and a test point $x$.
  1: Compute the pre-fully-connected-layer features $g(x; \bar{\phi})$.
  2: Compute the moments $\mu, S$ of the logits via Eq. (5.9), using either of the following two approaches:
     (a) compute $H^{-1}(\bar{W})$ on $\mathcal{D}$ and the inverse Hessian-vector product $H^{-1}(\bar{W})\, v$ exactly; or
     (b) compute the inverse Hessian-vector product approximately [1].
  3: Compute $e_k$ via the mean-field approximation, Eq. (5.11), (5.12) or (5.13), and normalize $e_k$ to obtain $p_k$.
  Output: The predictive uncertainty $p_k$.
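For step 2(b) of Algorithm 3, one standard way to approximate an inverse Hessian-vector product is conjugate gradient, which needs only a Hessian-vector product routine. In the sketch below, the hvp callable is a dense stand-in for what would, in practice, be computed by automatic differentiation; the dampening term mirrors the stabilization described in Section 5.4.1.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inverse_hvp(hvp, v, damping=1e-3):
    """Approximate H^{-1} v with conjugate gradient, given only a
    Hessian-vector product routine hvp(u) = H u."""
    d = v.shape[0]
    op = LinearOperator((d, d), matvec=lambda u: hvp(u) + damping * u)
    x, info = cg(op, v, maxiter=100)
    assert info == 0, "CG did not converge"
    return x

# Hypothetical dense Hessian standing in for an autodiff-based hvp.
rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
H = A @ A.T / d + np.eye(d)       # symmetric positive definite
v = rng.normal(size=d)

x = inverse_hvp(lambda u: H @ u, v)
print(np.allclose(H @ x + 1e-3 * x, v, atol=1e-3))  # (H + damping I) x = v
```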
5.3 Related Work

Resampling methods, such as the jackknife and the bootstrap, are classical statistical tools for assessing confidence intervals [210, 151, 51, 50]. Recent results have shown that carefully designed jackknife estimators [8] can achieve worst-case coverage guarantees on regression problems. However, the exhaustive re-training in the jackknife or bootstrap can be cumbersome in practice. Recent works have leveraged the idea of influence functions [112, 37] to alleviate this computational challenge. [190] combines the influence function with random data-weight samples to approximate the variance of predictions in bootstrapping; [3] derives higher-order influence-function approximations for jackknife estimators. Theoretical properties of the approximation are studied in [65, 66]. [144] applies this approximation to identify underdetermined test points. The infinitesimal jackknife follows the same idea as those works [99, 65, 66]. To avoid explicitly storing those models, we seek a Gaussian distribution that approximately characterizes them. This connects to several existing lines of research.

The posterior of a Bayesian model can be approximated with a Gaussian distribution [140, 185], $\mathcal{N}(\bar{\theta}, H^{-1})$, where $\bar{\theta}$ is interpreted as the maximum a posteriori estimate (thus incorporating any prior the model might include). If we approximate the observed Fisher information matrix $J$ in Eq. (5.5) with the Hessian $H$, the two pseudo-ensemble distributions and the Laplace approximation to the posterior have identical forms, except that the Hessians are scaled differently, which can be captured by the ensemble temperature of Eq. (5.16). Note that despite the similarity, the infinitesimal jackknife is a "frequentist" method and does not assume well-formed Bayesian modeling.

The trajectory of stochastic gradient descent gives rise to a sequence of models whose covariance matrix among batch means converges to $H^{-1} J H^{-1}$ [28, 29, 143], similar in form to the pseudo-ensemble distributions. Note, however, that those approaches do not collect information around the maximum likelihood estimate, while we do. It is also a classical result that the maximum likelihood estimator converges in distribution to a normal distribution; $\bar{\theta}$ and $\frac{1}{N} H^{-1}$ are simply plug-in estimators of the true mean and covariance of this asymptotic distribution.

5.4 Experiments

We first describe the setup for our empirical studies. We then demonstrate the effectiveness of the proposed approach on the MNIST dataset, mainly contrasting the results from sampling with those from the mean-field approximation and other implementation choices. We then provide a detailed comparison to popular approaches for uncertainty estimation. In the main text, we focus on classification problems and evaluate on commonly used benchmark datasets, summarized in Table 5.1. In Section 5.5.6, we report results on regression tasks.

Table 5.1: Datasets for classification tasks and out-of-distribution (OOD) detection. For ImageNet-O, the number of samples is limited; best results on the held-out set are reported.

  Dataset | # of classes | train / held-out / test | OOD dataset | held-out / test
  MNIST [127] | 10 | 55k / 5k / 10k | NotMNIST [24] | 5k / 13.7k
  CIFAR-10 [115] | 10 | 45k / 5k / 10k | LSUN (resized) [226] | 1k / 9k
  CIFAR-100 [115] | 100 | 45k / 5k / 10k | SVHN [160] | 5k / 21k
  ILSVRC-2012 [46] | 1,000 | 1,281k / 25k / 25k | ImageNet-O [86] | 2k / --

5.4.1 Setup

Model and training details for classifiers. For MNIST, we train a two-layer MLP with 256 ReLU units per layer, using the Adam optimizer for 100 epochs. For CIFAR-10, we train a ResNet-20 with Adam for 200 epochs. On CIFAR-100, we train a DenseNet-BC-121 with the SGD optimizer for 300 epochs. For ILSVRC-2012, we train a ResNet-50 with SGD for 90 epochs.

Evaluation tasks and metrics. We evaluate on two tasks: predictive uncertainty on in-domain samples, and detection of out-of-distribution samples. For in-domain predictive uncertainty, we report the classification error rate (ε), negative log-likelihood (NLL), and expected calibration error in $\ell_1$ distance (ECE) [73] on the test set. NLL is a proper scoring rule [123] and measures the KL-divergence between the ground-truth data distribution and the predictive distribution of the classifier. ECE measures the discrepancy between the histogram of the probabilities predicted by the classifier and the observed ones in the data; properly calibrated classifiers yield matching histograms. Both metrics are commonly used in the literature, and lower is better. In Section 5.5.2, we give precise definitions.
For the task of out-of-distribution (OOD) detection, we assess how well $p(x)$, the classifier's output interpreted as a probability, can be used to distinguish invalid samples from normal in-domain images. Following common practice [82, 133, 129], we report two threshold-independent metrics: the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). Since the precision-recall curve is sensitive to the choice of positive class, we report "AUPR in : out", where in-distribution and out-of-distribution images are specified as positives respectively. We also report detection accuracy, the optimal accuracy achieved among all thresholds in classifying in-/out-of-domain samples. All three metrics are the higher the better.

Competing approaches for uncertainty estimation. We compare to popular approaches: (i) frequentist approaches, namely the maximum likelihood point estimator (MLE) as a baseline, temperature-scaling calibration (T. SCALE) [73], the deep ensemble method (ENSEMBLE) [123], and the resampling bootstrap RUE [190]; (ii) variants of Bayesian neural networks (BNN), namely DROPOUT [60], which approximates the Bayesian posterior using stochastic forward passes of the network with dropout; BNN(VI), which trains the network via stochastic variational inference to maximize the evidence lower bound; and BNN(KFAC), which applies the Laplace approximation to construct a Gaussian posterior via layer-wise Kronecker-product-factorized covariance matrices [185].

Hyper-parameter tuning. For in-domain uncertainty estimation, we use the NLL on the held-out sets to tune hyper-parameters. For the OOD detection task, we use the held-out AUROC to select hyper-parameters. We report the results of the best hyper-parameters on the test sets. The key hyper-parameters are the temperatures, the regularization or prior in the BNN methods, and the dropout rates.

Other implementation details. When the Hessian needs to be inverted, we add a dampening term, $\tilde{H} = (H + \lambda I)$, following [185, 190], to ensure positive semi-definiteness and that the smallest eigenvalue of $\tilde{H}$ is 1. For BNN(VI), we use Flipout [216] to reduce gradient variance, and we follow [201] for variational inference on deep ResNets. On ImageNet, we compute a Kronecker-product-factorized Hessian matrix rather than the full one, due to the high dimensionality. For BNN(KFAC) and RUE, we use mini-batch approximations on subsets of the training set to scale up to ImageNet, as suggested in [185, 190].

5.4.2 Infinitesimal Jackknife on MNIST

Most uncertainty quantification methods, including the ones proposed in this paper, have "knobs" to tune. We first concentrate on MNIST and perform extensive studies of the proposed approaches to understand several design choices. Table 5.2 contrasts them.

Table 5.2: Performance of the infinitesimal jackknife pseudo-ensemble distribution on MNIST. In-domain metrics (lower is better): ε (%) / NLL / ECE. NotMNIST OOD detection (higher is better): Acc. (%) / AUROC / AUPR (in : out).

  layer(s) | approx. method | ε (%) / NLL / ECE | Acc. / AUROC / AUPR (in : out)
  all | sampling (Section 5.2.2), M = 500 | 1.66 / 0.06 / 0.42 | 87.46 / 92.50 / 87.01 : 93.52
  last | sampling, M = 20 | 1.74 / 0.06 / 0.41 | 87.58 / 93.48 / 91.66 : 94.27
  last | sampling, M = 100 | 1.68 / 0.05 / 0.47 | 90.10 / 95.55 / 94.62 : 96.08
  last | sampling, M = 500 | 1.67 / 0.05 / 0.50 | 90.77 / 96.20 / 95.79 : 96.47
  last | mean-field (M = 0), mf0 | 1.67 / 0.05 / 0.20 | 91.93 / 96.91 / 96.67 : 96.99
  last | mean-field, mf1 | 1.67 / 0.05 / 0.47 | 91.91 / 96.93 / 96.72 : 97.03
  last | mean-field, mf2 | 1.67 / 0.05 / 0.46 | 91.91 / 96.94 / 96.72 : 97.03

Use all layers or just the last layer.
Uncertainty quantification on deep neural nets is computationally costly, given their large number of parameters, especially when the methods need information about the curvature of the loss function. To this end, many approaches assume layer-wise independence [185] or low-rank structure [143, 152], and in some cases restrict uncertainty quantification to only a few layers [227], in particular the last layer [114]. The top portion of Table 5.2 shows that restricting to the last layer harms the in-domain ECE slightly but improves OOD detection significantly.

Effectiveness of the mean-field approximation. Table 5.2 also shows that the mean-field approximation performs similarly to sampling (from the distribution) on the in-domain tasks but noticeably improves OOD detection. mf0 performs the best among the 3 variants.

Effect of ensemble and activation temperatures. We study the roles of the ensemble and activation temperatures in mfIJ (Section 5.2.4). We grid-search the two and generate heatmaps of NLL and AUROC on the held-out sets, shown in Figure 5.1. Note that $(T_{\text{ens}} = 0, T_{\text{act}} = 1)$ corresponds to MLE. What is particularly interesting is that for NLL, a higher activation temperature ($T_{\text{act}} \geq 1$) and a lower ensemble temperature ($T_{\text{ens}} \leq 10^{-2}$) work best. For AUROC, however, lower temperatures on both work best. That a lower $T_{\text{ens}}$ is preferred was also observed in [217], and using $T_{\text{act}} > 1$ for better calibration is noted in [73]. On the other end, for OOD detection, [133] suggests a very high activation temperature ($T_{\text{act}} = 1000$ in their work, likely due to using a single model instead of an ensemble).

Figure 5.1: Best viewed in color. (a) NLL (lower is better) on held-out data and (b) AUROC (higher is better) on in-/out-of-domain held-out data, as functions of the activation temperature (0.001 to 5) and the ensemble temperature ($10^{-4}$ to $10^2$); the yellow star marks the best pair of temperatures. (c) Calibration (ECE) under distribution shift (rotated MNIST, 0° to 180°) for MLE, temperature scaling, ensemble, RUE, dropout, BNN-VI, BNN-KFAC, and mf-IJ. See text for details.

Table 5.3: Comparison of different uncertainty estimation methods on in-domain tasks (lower is better); columns per dataset are ε (%) / NLL / ECE (%). For CIFAR-100 and ImageNet, RUE and BNN(VI) use only the last layer due to the high computational cost.

  Method | MNIST | CIFAR-100 | ImageNet
  MLE | 1.67 / 0.10 / 1.18 | 24.3 / 1.03 / 10.4 | 23.66 / 0.92 / 3.03
  T. SCALE | 1.67 / 0.06 / 0.74 | 24.3 / 0.92 / 3.13 | 23.66 / 0.92 / 2.09
  ENSEMBLE | 1.25 / 0.05 / 0.30 | 19.6 / 0.71 / 2.00 | 21.22 / 0.83 / 3.10
  RUE | 1.72 / 0.08 / 0.85 | 24.3 / 0.99 / 8.60 | 23.63 / 0.92 / 2.83
  DROPOUT | 1.67 / 0.06 / 0.68 | 23.7 / 0.84 / 3.43 | 24.93 / 0.99 / 1.62
  BNN(VI) | 1.72 / 0.14 / 1.13 | 25.6 / 0.98 / 8.35 | 26.54 / 1.17 / 4.41
  BNN(KFAC) | 1.71 / 0.06 / 0.16 | 24.1 / 0.89 / 3.36 | 23.64 / 0.92 / 2.95
  mfIJ | 1.67 / 0.05 / 0.20 | 24.3 / 0.91 / 1.49 | 23.66 / 0.91 / 0.93
5.4.3 Comparison to Other Approaches

Given our results in the previous section, we report the mf0 variant of the infinitesimal jackknife in the rest of the paper. Table 5.3 contrasts various methods on the in-domain tasks of MNIST, CIFAR-100, and ImageNet. Table 5.4 contrasts performances on the out-of-distribution detection task (OOD). Results on CIFAR-10 (both in-domain and OOD), as well as CIFAR-100 OOD against SVHN, are in Section 5.5.5.

While the deep ensemble [123] achieves the best performance on in-domain tasks most of the time, the proposed approach mfIJ typically outperforms the other approaches, especially on the calibration metric ECE. On the OOD detection task, mfIJ significantly outperforms all other approaches on all metrics.

ImageNet-O is a particularly hard dataset for OOD [86]. The images are sampled from ImageNet-22K and thus share similar low-level statistics with the in-domain data. Moreover, the images are chosen such that they are misclassified by existing networks (ResNet-50) with high confidence, so-called "natural adversarial examples". We follow [86] and use the 200-class subset of the test set, containing the classes confusable with the OOD images, as the in-distribution examples. [86] further demonstrates that many popular approaches to improve neural network robustness, like adversarial training, hardly help on ImageNet-O. mfIJ improves over the other baselines by a clear margin.

Table 5.4: Comparing different methods on out-of-distribution detection (higher is better)

Method       MNIST vs. NotMNIST          CIFAR-100 vs. LSUN          ImageNet vs. ImageNet-O
             Acc.†  ROC‡  PR$            Acc.†  ROC‡  PR$            Acc.†  ROC‡  PR$
MLE          67.6   53.8  40.1 : 72.5    72.7   80.0  83.5 : 75.2    58.4   51.6  78.4 : 26.3
T. SCALE     67.4   66.7  48.8 : 77.0    76.6   84.3  86.7 : 80.3    58.5   54.5  79.2 : 27.8
ENSEMBLE     86.5   88.0  70.4 : 92.8    74.4   82.3  85.7 : 77.8    60.1   50.6  78.9 : 25.7
RUE          61.1   64.7  60.5 : 68.4    75.2   83.0  86.7 : 77.8    58.4   51.6  78.3 : 26.3
DROPOUT      88.8   91.4  78.7 : 93.5    69.8   77.3  81.1 : 72.9    59.5   51.7  79.0 : 26.3
BNN(VI) #    86.9   81.1  59.8 : 89.9    62.5   67.7  71.4 : 63.1    57.8   52.0  75.7 : 26.8
BNN(KFAC)    88.7   93.5  89.1 : 93.8    72.9   80.4  83.9 : 75.5    60.3   53.1  79.6 : 26.9
mfIJ         91.9   96.9  96.7 : 97.0    82.2   89.9  92.0 : 86.6    63.2   62.9  83.5 : 33.3

†: Accuracy (%). ‡: Area under ROC. $: Area under precision-recall, with in-distribution and out-of-distribution as the positive class respectively. #: for CIFAR-100 and ImageNet, only the last layer is used due to the high computational cost.

Robustness to distributional shift. [201] points out that many uncertainty estimation methods are sensitive to distributional shift. Thus, we evaluate the robustness of mfIJ on rotated MNIST images, with rotations from 0° to 180°. The ECE curves in Figure 5.1(c) show that mfIJ is as robust as, or more robust than, the other approaches.

5.5 Details and More Experiment Results

5.5.1 Derivation of the Mean-Field Approximation for Gaussian-Softmax Integration

In this section, we derive the mean-field approximation for the Gaussian-softmax integration, eq. (5.10) in the main text. Assume the same notation as in Section 5.2.3, where the activation fed to the softmax follows a Gaussian $a \sim \mathcal{N}(\mu, S)$. Then

$e_k = \mathbb{E}[f_k] = \int \mathrm{SOFTMAX}(a)_k \, \mathcal{N}(a; \mu, S)\, da = \int \frac{1}{1 + \sum_{i \neq k} e^{-(a_k - a_i)}}\, \mathcal{N}(a; \mu, S)\, da$
$= \int \Big(2 - K + \sum_{i \neq k} \frac{1}{\sigma(a_k - a_i)}\Big)^{-1} \mathcal{N}(a; \mu, S)\, da \approx \Big(2 - K + \sum_{i \neq k} \frac{1}{\mathbb{E}_{p(a_i, a_k)}[\sigma(a_k - a_i)]}\Big)^{-1}$  (5.18)

where the approximation integrates each term in the summand independently ("integrate independently"), turning each into an expectation under the marginal distribution over the pair $(a_i, a_k)$. This approximation is prompted by the mean-field approximation $\mathbb{E}[f(x)] \approx f(\mathbb{E}[x])$ for a nonlinear function $f(\cdot)$.²

²Similar to the classical use of the mean-field approximation on Ising models, we use the term mean-field approximation to capture the notion that the expectation is computed by considering the weak, pairwise coupling effect from points on the lattice, i.e., $a_i$ with $i \neq k$.
Next we plug into this expression a well-known approximation to $\mathbb{E}[\sigma(\cdot)]$, where $\sigma(\cdot)$ is the sigmoid function:

$\int \sigma(x)\, \mathcal{N}(x; \mu, s^2)\, dx \approx \sigma\Big(\frac{\mu}{\sqrt{1 + \lambda_0 s^2}}\Big).$  (5.19)

$\lambda_0$ is a constant and is usually chosen to be $\pi/8$ or $3/\pi^2$. This is a well-known result; see [16]. We further approximate by considering different ways to compute the bivariate expectations in the denominator; a short numerical sketch of the resulting approximations follows at the end of this subsection.

Mean-Field 0 (mf0). In the denominator, we ignore the variance of $a_i$ for $i \neq k$, replace $a_i$ with its mean $\mu_i$, and compute the expectation only with respect to $a_k$. We arrive at

$e_k \approx \Big(2 - K + \sum_{i \neq k} \frac{1}{\mathbb{E}_{p(a_k)}[\sigma(a_k - \mu_i)]}\Big)^{-1}.$  (5.20)

Applying eq. (5.19), we have

$e_k \approx \Big(2 - K + \sum_{i \neq k} \frac{1}{\sigma\big(\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 s_k^2}}\big)}\Big)^{-1} = \Big(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 s_k^2}}\Big)\Big)^{-1}.$  (5.21)

Mean-Field 1 (mf1). If we replace $p(a_i, a_k)$ with the two independent marginals $p(a_i)p(a_k)$ in the denominator, recognizing $(a_k - a_i) \sim \mathcal{N}(\mu_k - \mu_i, s_i^2 + s_k^2)$, we get

$e_k \approx \Big(2 - K + \sum_{i \neq k} \frac{1}{\sigma\big(\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_i^2 + s_k^2)}}\big)}\Big)^{-1} = \Big(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_k^2 + s_i^2)}}\Big)\Big)^{-1}.$  (5.22)

Mean-Field 2 (mf2). Lastly, if we compute eq. (5.18) with the full covariance between $a_i$ and $a_k$, recognizing $(a_k - a_i) \sim \mathcal{N}(\mu_k - \mu_i, s_i^2 + s_k^2 - 2s_{ik})$, we get

$e_k \approx \Big(2 - K + \sum_{i \neq k} \frac{1}{\sigma\big(\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_i^2 + s_k^2 - 2s_{ik})}}\big)}\Big)^{-1} = \Big(\sum_i \exp\Big(-\frac{\mu_k - \mu_i}{\sqrt{1 + \lambda_0 (s_k^2 + s_i^2 - 2s_{ik})}}\Big)\Big)^{-1}.$  (5.23)

We note that [41] developed the approximation form of eq. (5.23) for computing $e_k$, though the author did not use it for uncertainty estimation.
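To make these approximations concrete, below is a minimal numpy sketch of mf0 and mf1 (eqs. (5.21) and (5.22)) for a single example with a diagonal logit covariance; the means, variances, and the choice $\lambda_0 = 3/\pi^2$ are illustrative stand-ins, and mf2 follows the same pattern with the covariance terms $s_{ik}$ added.

    import numpy as np

    LAMBDA0 = 3 / np.pi**2  # lambda_0; pi/8 is the other common choice

    def mf0(mu, s2):
        # eq. (5.21): keep only the variance of a_k; scale row k by sqrt(1 + l0 s_k^2)
        scale = np.sqrt(1.0 + LAMBDA0 * s2)                  # shape (K,)
        diff = (mu[:, None] - mu[None, :]) / scale[:, None]  # (mu_k - mu_i) / c_k
        return 1.0 / np.exp(-diff).sum(axis=1)

    def mf1(mu, s2):
        # eq. (5.22): independent marginals; scale by sqrt(1 + l0 (s_k^2 + s_i^2))
        scale = np.sqrt(1.0 + LAMBDA0 * (s2[:, None] + s2[None, :]))
        diff = (mu[:, None] - mu[None, :]) / scale
        return 1.0 / np.exp(-diff).sum(axis=1)

    mu = np.array([2.0, 0.5, -1.0])  # Gaussian means of the K logits
    s2 = np.array([0.3, 0.1, 0.2])   # their variances
    print(mf0(mu, s2), mf1(mu, s2))  # each e_k; entries sum to approximately 1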
5.5.2 Definitions of Evaluation Metrics

NLL is the negative log-likelihood of the label under the model's predictive distribution (equivalently, its expectation matches the KL divergence between the data distribution and the predictive distribution, up to the data entropy):

$\mathrm{NLL} = -\log p(y \mid x) = -\sum_{c=1}^{K} y_c \log p(y = c \mid x),$  (5.24)

where $y_c$ is the one-hot encoding of the label.

ECE measures the discrepancy between the predicted probabilities and the empirical accuracy of a classifier in terms of the $\ell_1$ distance. It is computed as the expected difference between per-bucket confidence and per-bucket accuracy, where all predictions $\{p(x_i)\}_{i=1}^{N}$ are binned into $S$ buckets such that $B_s = \{i \in [N] \mid p(x_i) \in I_s\}$ contains the predictions falling within the interval $I_s$. ECE is defined as

$\mathrm{ECE} = \sum_{s=1}^{S} \frac{|B_s|}{N} \big|\mathrm{conf}(B_s) - \mathrm{acc}(B_s)\big|,$  (5.25)

where $\mathrm{conf}(B_s) = \frac{\sum_{i \in B_s} p(x_i)}{|B_s|}$ and $\mathrm{acc}(B_s) = \frac{\sum_{i \in B_s} \mathbb{1}[\hat{y}_i = y_i]}{|B_s|}$. A short numerical sketch of this computation follows after Table 5.5.

5.5.3 Hyper-parameters of Main Experiments

Table 5.5 provides the key hyper-parameters used in training the deep neural networks on the different datasets. For the mfIJ method, we use $\lambda_0 = 3/\pi^2$ in our implementation. For the ENSEMBLE approach, we use $M = 5$ models on all datasets, as in [201]. For RUE, DROPOUT, BNN(VI) and BNN(KFAC), where sampling is applied at inference time, we use $M = 500$ Monte-Carlo samples on MNIST, and $M = 50$ on CIFAR-10, CIFAR-100 and ImageNet. We use $B = 10$ buckets when computing ECE on MNIST, CIFAR-10 and CIFAR-100, and $B = 15$ on ImageNet.

Table 5.5: Hyper-parameters of neural network training

Dataset               MNIST         CIFAR-10         CIFAR-100          ImageNet
Architecture          MLP           ResNet20         DenseNet-BC-121    ResNet50
Optimizer             Adam          Adam             SGD                SGD
Learning rate         0.001         0.001            0.1                0.1
Learning rate decay   exponential   staircase x0.1   staircase x0.1     staircase x0.1
                      0.998         at 80, 120, 160  at 150, 225        at 30, 60, 80
Weight decay          0             1e-4             5e-4               1e-4
Batch size            100           8                128                256
Epochs                100           200              300                90
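As an illustration of eq. (5.25), the following is a minimal sketch of ECE with equal-width confidence buckets, taking the confidence $p(x_i)$ to be the maximum predicted probability; the inputs are hypothetical.

    import numpy as np

    def ece(probs, labels, num_buckets=10):
        # eq. (5.25): bin predictions by confidence, then take the weighted
        # average of |per-bucket confidence - per-bucket accuracy|
        conf = probs.max(axis=1)
        correct = (probs.argmax(axis=1) == labels).astype(float)
        edges = np.linspace(0.0, 1.0, num_buckets + 1)
        n, total = len(labels), 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bucket = (conf > lo) & (conf <= hi)
            if in_bucket.any():
                gap = abs(conf[in_bucket].mean() - correct[in_bucket].mean())
                total += (in_bucket.sum() / n) * gap
        return total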
5.5.4 Applying the Mean-Field Approximation to Other Gaussian Posterior Inference Tasks

The mean-field approximation is interesting in its own right. In particular, it can be applied to approximate any Gaussian-softmax integral. In this section, we apply it to the SWA-Gaussian posterior (SWAG) [143], whose covariance is a low-rank matrix plus a diagonal matrix derived from the SGD iterates. We tune the ensemble and activation temperatures with SWAG and perform the uncertainty tasks on MNIST; the results are reported in Table 5.6. We ran SWAG on the softmax layer for 50 epochs, and collected models along the trajectory to form the posterior. We use the default values for the other hyper-parameters, like the SWAG learning rate and the rank of the low-rank component, as the main objective here is to combine the mean-field approximation with different Gaussian posteriors. As we can see from Table 5.6, SWAG, when using sampling for approximate inference, has a variance larger than expected.³ Nonetheless, within that variance, using the mean-field approximation instead of sampling performs similarly. The notable exception is the in-domain tasks, where the mean-field approximation consistently outperforms sampling. This suggests that the mean-field approximations can work with other Gaussian distributions as a replacement for sampling to reduce the computation cost.

³In particular, the variance is higher than the variance in the sampling results of the infinitesimal jackknife, cf. Table 5.2 in the main text.

Table 5.6: Uncertainty estimation with SWAG on MNIST

Approx. method            MNIST: in-domain (↓)        NotMNIST: OOD detection (↑)
                          Err.(%)   NLL    ECE        Acc.(%)   AUROC   AUPR (in : out)
sampling, M=20            1.55      0.06   0.60       85.78     91.15   91.91 : 90.34
sampling, M=100           1.55      0.06   0.60       88.33     93.19   86.95 : 95.72
sampling, M=500           1.55      0.06   0.60       82.68     88.80   82.51 : 91.23
mean-field (M=0), mf0     1.56      0.05   0.25       87.56     93.26   91.49 : 93.96
mean-field (M=0), mf1     1.55      0.05   0.46       85.28     90.68   85.15 : 92.39
mean-field (M=0), mf2     1.55      0.05   0.51       82.18     87.40   78.12 : 90.69

Table 5.7: Uncertainty estimation on CIFAR-10

Method       CIFAR-10 in-domain (↓)     LSUN: OOD detection (↑)      SVHN: OOD detection (↑)
             Err.(%)  NLL   ECE(%)      Acc.  AUROC  AUPR(in:out)    Acc.  AUROC  AUPR(in:out)
MLE          8.81     0.30  3.59        85.0  91.3   93.8 : 87.8     86.3  90.7   89.5 : 92.6
T. SCALE     8.81     0.26  0.52        89.0  95.3   96.5 : 93.4     88.9  94.0   92.7 : 95.0
ENSEMBLE     6.66     0.20  1.37        88.0  94.4   95.9 : 92.1     88.7  93.4   92.4 : 94.8
RUE          8.71     0.28  1.87        85.0  91.3   93.8 : 87.8     86.3  90.7   89.5 : 92.6
DROPOUT      8.83     0.26  0.58        81.8  88.6   91.7 : 84.3     86.0  91.6   90.1 : 94.0
BNN(VI)      11.09    0.33  1.57        79.9  87.3   90.5 : 83.3     85.7  91.4   89.5 : 93.9
BNN(LL-VI)   8.94     0.33  4.15        84.6  91.0   93.5 : 87.2     87.8  93.3   91.7 : 95.4
BNN(KFAC)    8.75     0.29  3.45        85.0  91.3   93.8 : 87.8     86.3  90.7   89.5 : 92.6
mfIJ         8.81     0.26  0.56        91.0  96.4   97.4 : 94.8     89.7  94.6   93.3 : 95.3

Table 5.8: Out-of-distribution detection on CIFAR-100

Method       SVHN: OOD detection (↑)
             Acc.    AUROC   AUPR (in : out)
MLE          73.90   80.69   74.03 : 86.93
T. SCALE     76.94   83.43   77.53 : 88.11
ENSEMBLE     75.99   83.48   77.75 : 88.85
RUE          73.90   80.69   74.03 : 86.93
DROPOUT      74.13   81.87   75.78 : 88.07
BNN(LL-VI)   71.87   79.55   72.08 : 87.27
BNN(KFAC)    74.13   80.97   74.46 : 87.06
mfIJ         81.38   88.04   84.59 : 91.23

5.5.5 More Results on CIFAR-10 and CIFAR-100

Tables 5.7 and 5.8 supplement the main text with additional experimental results on the CIFAR-10 dataset, with both in-domain and out-of-distribution detection tasks, and on CIFAR-100 with out-of-distribution detection against the SVHN dataset. BNN(LL-VI) refers to stochastic variational inference on the last layer only. The results support the same observations as in the main text: mfIJ performs similarly to other approaches on in-domain tasks, but noticeably outperforms them on out-of-distribution detection.

5.5.6 Regression Experiments

We conduct real-world regression experiments on UCI datasets. We follow the experimental setup in [60, 88, 123], where each dataset is split into 20 random train-test folds, and the average results with standard deviations are reported. We use the same architecture as previous works, with 1 hidden layer of 50 ReLU units. Since in regression tasks the output distribution can be computed analytically, we do not need the mean-field approximation in these experiments. Nevertheless, we still refer to our method as mfIJ to be consistent with the other datasets.

For the mfIJ approach, we use the pseudo-ensemble Gaussian distribution of the last-layer parameters to compute a Gaussian distribution of the network output, i.e., $a \sim \mathcal{N}(\mu(x), s^2(x))$ as in eq. (5.9) of the main text. We also estimate the variance of the observation noise $\sigma^2$ from the residuals on the training set [190]. Therefore, the predictive distribution of $f(x)$ is given by $\mathcal{N}(\mu(x), s^2(x) + \sigma^2)$. We can compute the negative log-likelihood (NLL) as

$\mathrm{NLL} = -\log p(y \mid x) = \frac{1}{2}\Big(\log\big(s^2(x) + \sigma^2\big) + \frac{(y - \mu(x))^2}{s^2(x) + \sigma^2} + \log(2\pi)\Big).$

We tune the ensemble temperature on the held-out sets, and fix the activation temperature to 1. A brief numerical sketch of this closed-form NLL follows.
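A minimal numerical sketch of the predictive NLL above, with hypothetical values for the network output mean $\mu(x)$, its variance $s^2(x)$, and the estimated observation noise $\sigma^2$:

    import numpy as np

    def gaussian_nll(y, mu, s2, sigma2):
        # NLL of y under the predictive distribution N(mu(x), s^2(x) + sigma^2)
        var = s2 + sigma2
        return 0.5 * (np.log(var) + (y - mu) ** 2 / var + np.log(2 * np.pi))

    print(gaussian_nll(y=1.3, mu=1.0, s2=0.05, sigma2=0.1))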
For the sampling-based methods, RUE and BNN(KFAC), the NLL from $M$ prediction samples $f_1(x), \ldots, f_M(x)$ can be computed as

$\mathrm{NLL} = -\log p(y \mid x) = \frac{1}{2}\Big(\log\big(s_m^2 + \sigma^2\big) + \frac{(y - \mu_m)^2}{s_m^2 + \sigma^2} + \log(2\pi)\Big),$

where $\mu_m = \frac{1}{M}\sum_m f_m(x)$ is the mean of the prediction samples, and $s_m^2 = \frac{1}{M}\sum_m (f_m(x) - \mu_m)^2$ is the variance of the prediction samples. The sampling is conducted on all network layers in both RUE and BNN(KFAC).

Table 5.9: NLLs and RMSEs of different methods on regression benchmark datasets

NLL (↓)
Dataset    ENSEMBLE†      RUE            DROPOUT†       BNN(PBP)†      BNN(KFAC)      mfIJ
Housing    2.41 ± 0.25    2.69 ± 0.44    2.46 ± 0.06    2.57 ± 0.09    2.74 ± 0.46    2.66 ± 0.39
Concrete   3.06 ± 0.18    3.21 ± 0.14    3.04 ± 0.02    3.16 ± 0.02    3.25 ± 0.18    3.15 ± 0.09
Energy     1.38 ± 0.22    1.48 ± 0.35    1.99 ± 0.02    2.04 ± 0.02    1.48 ± 0.35    1.47 ± 0.35
Kin8nm    -1.20 ± 0.02   -1.13 ± 0.03   -0.95 ± 0.01   -0.90 ± 0.01   -1.14 ± 0.03   -1.15 ± 0.02
Naval     -5.63 ± 0.05   -6.00 ± 0.22   -3.80 ± 0.01   -3.73 ± 0.01   -5.99 ± 0.22   -5.99 ± 0.22
Power      2.79 ± 0.04    2.86 ± 0.04    2.80 ± 0.01    2.84 ± 0.01    2.86 ± 0.04    2.86 ± 0.04
Wine       0.94 ± 0.12    0.99 ± 0.06    0.93 ± 0.01    0.97 ± 0.01    0.99 ± 0.08    0.99 ± 0.07
Yacht      1.18 ± 0.21    1.14 ± 0.42    1.55 ± 0.03    1.63 ± 0.02    1.66 ± 0.88    1.18 ± 0.53

RMSE (↓)
Housing    3.28 ± 1.00    3.34 ± 0.94    2.97 ± 0.19    3.01 ± 0.18    3.42 ± 1.09    3.24 ± 0.93
Concrete   6.03 ± 0.58    5.67 ± 0.65    5.23 ± 0.12    5.67 ± 0.09    5.65 ± 0.63    5.60 ± 0.61
Energy     2.09 ± 0.29    1.09 ± 0.35    1.66 ± 0.04    1.80 ± 0.05    1.08 ± 0.35    1.08 ± 0.35
Kin8nm     0.09 ± 0.00    0.08 ± 0.00    0.10 ± 0.00    0.10 ± 0.00    0.08 ± 0.00    0.08 ± 0.00
Naval      0.00 ± 0.00    0.00 ± 0.00    0.01 ± 0.00    0.01 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
Power      4.11 ± 0.17    4.21 ± 0.15    4.02 ± 0.04    4.12 ± 0.03    4.21 ± 0.15    4.21 ± 0.15
Wine       0.64 ± 0.04    0.65 ± 0.04    0.62 ± 0.01    0.64 ± 0.01    0.65 ± 0.04    0.65 ± 0.04
Yacht      1.58 ± 0.48    0.82 ± 0.26    1.11 ± 0.09    1.02 ± 0.05    1.19 ± 1.08    0.80 ± 0.25

†: numbers are cited from the original papers. Methods: ENSEMBLE [123], RUE [190], DROPOUT [60], BNN(PBP) [88], BNN(KFAC) [185].

Results. We report the NLL and RMSE on the test sets in Table 5.9 and compare with other approaches. RUE and BNN(KFAC) use $M = 50$ samples, and the ENSEMBLE method has $M = 5$ models in the ensemble. We achieve similar or better performance than RUE and BNN(KFAC) without the computational cost of explicit sampling. Compared to DROPOUT and ENSEMBLE, which are the state-of-the-art methods on these datasets, we also achieve competitive results. We highlight the best performing method on each dataset, and mfIJ when its result is within one standard deviation of the best method.

5.5.7 More Results on Comparing All Layers versus Just the Last Layer

We conduct an experiment similar to the top portion of Table 5.2 in the main text, to study the effect of restricting the parameter uncertainty to the last layer only. We use NotMNIST as the in-domain dataset and treat MNIST as the out-of-distribution dataset. We use a two-layer MLP with 256 ReLU hidden units. Table 5.10 supports the same observation as in the main text: restricting to the last layer improves the OOD task without a significant negative impact on the in-domain tasks.

Table 5.10: Performance of the infinitesimal jackknife pseudo-ensemble distribution on NotMNIST, comparing all layers versus just the last layer

which       approx.            NotMNIST: in-domain (↓)     MNIST: OOD detection (↑)
layer(s)    method             Err.(%)   NLL    ECE        Acc.(%)   AUROC   AUPR (in : out)
all         sampling, M=500    3.18      0.12   0.43       90.73     95.14   97.09 : 89.78
last        sampling, M=500    3.18      0.12   0.43       92.26     96.23   97.82 : 92.06

5.6 Conclusion

In this work, we propose a simple, efficient, and general-purpose confidence estimator, mfIJ, for deep neural networks. The main idea is to approximate the ensemble of an infinite number of infinitesimal jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to the classification predictions when the softmax layer is applied to a Gaussian. Empirically, mfIJ surpasses or is competitive with the state-of-the-art methods for uncertainty estimation while incurring lower computational cost and memory footprint. Solving the inverse Hessian-vector product approximately (Line 5 in Algorithm 3) using iterative algorithms for uncertainty estimation is future work.

Part III

Large-scale Kernel Learning

Chapter 6

RANDOM KITCHEN SINKS FOR LARGE-SCALE MACHINE LEARNING

In this chapter, we describe our work on large-scale kernel machines based on random feature approximations of the kernel. We propose techniques to scale up kernel methods to the large learning problems commonly found in speech recognition and computer vision. We show that the performance of those large kernel models matches or surpasses that of their deep neural network counterparts, which have been regarded as the state-of-the-art.

6.1 Introduction

Deep neural networks (DNNs) and other types of deep learning architectures have made significant advances [12, 13].
In both well-benchmarked tasks and real-world applications, such as automatic speech recognition [90, 153, 191] and image recognition [116, 208], deep learning architectures have achieved an unprecedented level of success and have generated major impact. Arguably, the most instrumental factors contributing to their success are: (1) learning highly complex models with millions to billions of parameters from huge amounts of training data; (2) adopting simple but effective optimization methods such as stochastic gradient descent; (3) combatting overfitting with new schemes such as drop-out [92]; and (4) computing with massive parallelism on GPUs. New techniques as well as "tricks of the trade" are frequently invented and added to the toolboxes of machine learning researchers and practitioners.

In stark contrast, there have been far fewer publicly known successful applications of kernel methods (such as support vector machines) to problems at a scale comparable to the speech and image recognition problems tackled by DNNs. This is a surprising chasm, noting that kernel methods have been extensively studied, both theoretically and empirically, for their power in modeling highly nonlinear data [189]. Moreover, the connection between kernel methods and (infinite) neural networks has long been noted [157, 220, 35].

Nonetheless, a common misconception is that it may be difficult, if not impossible, for kernel methods to catch up with deep learning methods in addressing large-scale learning problems. In particular, many kernel-based algorithms scale quadratically in the number of training samples. This barrier in computational complexity makes it especially challenging for kernel methods to reap the benefits of learning from very large amounts of data, while deep learning architectures are especially adept at it.

We contend that this skepticism can be sufficiently attenuated. Concretely, in this work, we investigate and propose new ideas tailored for kernel methods, with the aim of scaling them up to take on challenging problems in computer vision and automatic speech recognition.

To this end, we build on the work by [178] on approximating kernel functions with features derived from random projections. Our innovation, however, is to advance the state-of-the-art to a much larger scale. Concretely, we propose fast training methods for models with hundreds of millions of parameters; these methods are necessary for classifiers using hundreds of thousands of features to recognize thousands of categories. We also propose scalable methods for combining multiple kernels as ways of learning feature representations. Interestingly, we show that multiplicative combinations of kernels scale better than additive combinations.

We validate our approaches with extensive empirical studies. We contrast kernel models to DNNs on 4 large-scale benchmark datasets, some of which are often used to demonstrate the effectiveness of DNNs. We show that the performance of large-scale kernel models matches that of their deep learning counterparts, which are either exhaustively optimized by us or are well-accepted yardsticks in industry.

While providing a recipe to obtain state-of-the-art large-scale kernel models, another important contribution of our work is to shed light on new perspectives and opportunities for future study.
The techniques we have developed are easy to implement, readily reproducible, and incur much lower computational cost (for hyper-parameter tuning and model selection) than deep learning architectures. Thus, they are valuable tools, tested and verified to be effective for constructing comparative systems. Comparative studies enabled by such systems will, in our view, be indispensable in pursuing the higher goal of exploring and understanding how the two camps of methods differ, for instance in learning new representations of the original data.¹ As an example, we show that combining kernel models and DNNs improves over either individual model, suggesting that the two paradigms learn different yet complementary representations from the data. We believe that research in this line will offer deep insights, and broaden the theory and practice of designing alternative methods to both DNNs and kernel methods for large-scale learning.

The rest of the chapter is organized as follows. We briefly review related work in Section 6.2. In Section 6.3, we give a brief account of [178]. We describe our approaches in Section 6.4. In Section 6.5, we report extensive experiments comparing DNNs and kernel methods on problems in image and automatic speech recognition. We summarize the work in Section 6.6.

6.2 Related work

The computational complexity of kernel methods, such as support vector machines, depends quadratically on the number of training examples at training time and linearly on the number of training examples at testing time. Hence, scaling up kernel methods has been a long-standing and actively studied problem. [20] summarizes several earlier efforts in this vein. With clever implementation tricks such as computation caching (for example, keeping only a small portion of the very large kernel matrix in memory), earlier kernel machines could cope with hundreds of thousands of samples [199, 45]. [19] provides an excellent account of various design considerations.

To further reduce the dependency on the number of training samples, a more effective strategy is to actively select training samples [18]. An early version of this idea was reflected in the Sequential Minimal Optimization (SMO) algorithm [176]. With more sophistication, this technique was extended to enable training SVMs on 8 million samples [136]. Alternative approaches exploit the equivalence between SVMs and sparse greedy approximation and solve SVMs approximately with a smaller subset of examples called coresets² [209, 36]. Exploiting structures of the kernel matrix can scale kernel methods to 2 million to 50 million training samples [203]. Note that at the time of publication, none of the above-mentioned methods had been directly compared to DNNs.

Instead of reducing the number of training samples, we can reduce the dimensionality of the kernel features. In theory, those features are infinite-dimensional, but for any practical problem, the dimensionality is bounded above by the number of training samples. The main idea is then to directly use such features, after dimensionality reduction, to construct classifiers (i.e., solving the optimization problem of the SVM in the primal space). Thus far, approximating kernels with finite-dimensional features has been recognized as a promising way of scaling up kernel methods.

¹Note this inquiry would be informative only if both kernel methods and deep learning methods attain similar performance yet exploit different aspects of the data.
²We also experimented with those techniques, though we were not able to identify significant empirical success.
The most relevant to our work is the early observation by Rahimi and Recht that inner products between features derived from random projections can be used to approximate translation-invariant kernels, a direct result of the spectral analysis of positive-definite functions [14, 189, 178]. Their follow-up work using those random features, the weighted random kitchen sinks [179], is a major inspiration to our work. Since then, there has been a growing interest in using random projections to approximate different kernels [102, 75, 126, 213]. For example, [40] studied how to use random features for online learning. We note that the amount of time for such classifiers to make a prediction depends linearly on the number of training samples, which could be a concern when the number of training samples is large.

In spite of these advances, there have been only a few reported large-scale empirical studies of those techniques on challenging tasks from speech recognition and computer vision, on which DNNs have been highly effective. In the context of automatic speech recognition, examples of directly using kernel methods were reported [47, 31, 95]. However, the tasks were fairly small-scale (for instance, on the TIMIT dataset). Moreover, none of them explores kernel learning as a way of learning new representations. In contrast, one major aspect of our work is to use multiple kernel learning to arrive at new representations so as to reduce the gap between DNNs and kernel methods, cf. Section 6.4.2.

6.3 Kernel Methods and Random Features

In what follows, we describe the basic idea we have built upon to scale up kernel methods. Kernel methods, broadly speaking, are a set of machine learning techniques which either explicitly or implicitly map data from the input space $\mathcal{X}$ to some feature space $\mathcal{H}$, in which a linear model is learned. A "kernel function" $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is then defined³ as the function which takes as input $x, y \in \mathcal{X}$, and returns the dot product of the corresponding points in $\mathcal{H}$. If we let $\phi: \mathcal{X} \to \mathcal{H}$ denote the map into the feature space, then $K(x, y) = \langle \phi(x), \phi(y) \rangle$.

Standard kernel methods avoid inference in $\mathcal{H}$, because it is generally a very high-dimensional, or even infinite-dimensional, space. Instead, they solve the dual problem by using the $N$-by-$N$ kernel matrix, containing the values of the kernel function applied to all pairs of the $N$ training points. When $\dim(\mathcal{H})$ is far greater than $N$, this "kernel trick" provides a nice computational advantage. However, when $N$ is exceedingly large, the $O(N^2)$ size of the kernel matrix makes training impractical.

Rahimi and Recht [178] address this problem by leveraging Bochner's theorem, a classical result in harmonic analysis, in order to provide a fast way to approximate any positive-definite shift-invariant kernel $K$ with finite-dimensional features. A kernel $K(x, y)$ is shift-invariant if and only if $K(x, y) = \hat{K}(x - y)$ for some function $\hat{K}: \mathbb{R}^d \to \mathbb{R}$. We now present Bochner's theorem:

Theorem 2 (Bochner's theorem, adapted from [178]). A continuous shift-invariant kernel $K(x, y) = \hat{K}(x - y)$ on $\mathbb{R}^d$ is positive-definite if and only if $\hat{K}$ is the Fourier transform of a non-negative measure $\mu(\omega)$.

³It is also possible to define the kernel function prior to defining the feature map; then, for positive-definite kernel functions, Mercer's theorem guarantees that a corresponding feature map exists such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$.
Thus, for any positive-definite shift-invariant kernel $\hat{K}(\delta)$, we have that

$\hat{K}(\delta) = \int_{\mathbb{R}^d} \mu(\omega)\, e^{j\omega^T\delta}\, d\omega,$  (6.1)

where

$\mu(\omega) = (2\pi)^{-d} \int_{\mathbb{R}^d} \hat{K}(\delta)\, e^{-j\omega^T\delta}\, d\delta$  (6.2)

is the inverse Fourier transform⁴ of $\hat{K}(\delta)$, and where $j = \sqrt{-1}$. By Bochner's theorem, $\mu(\omega)$ is a non-negative measure. As a result, if we let $Z = \int_{\mathbb{R}^d} \mu(\omega)\, d\omega$, then $p(\omega) = \frac{1}{Z}\mu(\omega)$ is a proper probability distribution, and we get that

$\frac{1}{Z}\hat{K}(\delta) = \int_{\mathbb{R}^d} p(\omega)\, e^{j\omega^T\delta}\, d\omega.$

For simplicity, we will assume going forward that $\hat{K}$ is properly scaled, meaning that $Z = 1$. Now, the above equation allows us to rewrite this integral as an expectation:

$\hat{K}(\delta) = \hat{K}(x - y) = \int_{\mathbb{R}^d} p(\omega)\, e^{j\omega^T(x-y)}\, d\omega = \mathbb{E}_\omega\big[e^{j\omega^Tx}\, e^{-j\omega^Ty}\big].$  (6.3)

This can be further simplified as

$\hat{K}(x - y) = \mathbb{E}_{\omega, b}\big[\sqrt{2}\cos(\omega^Tx + b)\cdot\sqrt{2}\cos(\omega^Ty + b)\big],$

where $\omega$ is drawn from $p(\omega)$, and $b$ is drawn uniformly from $[0, 2\pi]$.

This motivates a sampling-based approach for approximating the kernel function. Concretely, we draw $\{\omega_1, \omega_2, \ldots, \omega_D\}$ independently from the distribution $p(\omega)$, and $\{b_1, b_2, \ldots, b_D\}$ independently from the uniform distribution on $[0, 2\pi]$, and then use these parameters to approximate the kernel, as follows:

$K(x, y) \approx \frac{1}{D}\sum_{i=1}^{D} \sqrt{2}\cos(\omega_i^Tx + b_i)\cdot\sqrt{2}\cos(\omega_i^Ty + b_i) = \hat{\phi}(x)^T\hat{\phi}(y),$  (6.4)

where $\hat{\phi}_i(x) = \sqrt{\frac{2}{D}}\cos(\omega_i^Tx + b_i)$ is the $i$-th element of the $D$-dimensional random vector $\hat{\phi}(x)$. In Table 6.1, we list two popular (properly-scaled) positive-definite kernels with their respective inverse Fourier transforms. A minimal code sketch of this recipe appears at the end of this subsection.

Table 6.1: Gaussian and Laplacian kernels, together with their sampling distributions $p(\omega)$

Kernel name   $K(x, y)$                            $p(\omega)$                                                                 Density name
Gaussian      $e^{-\|x-y\|_2^2/2\sigma^2}$         $(2\pi(1/\sigma^2))^{-d/2}\, e^{-\frac{\|\omega\|_2^2}{2(1/\sigma)^2}}$     Normal$(0_d, \sigma^{-2} I_d)$
Laplacian     $e^{-\lambda\|x-y\|_1}$              $\prod_{i=1}^{d} \frac{1}{\pi\lambda(1 + (\omega_i/\lambda)^2)}$            Cauchy$(0_d, \lambda)$

Using these random feature maps, in conjunction with linear learning algorithms, can yield huge gains in efficiency relative to standard kernel methods on large datasets. Learning with a representation $\hat{\phi}(\cdot) \in \mathbb{R}^D$ is relatively efficient provided that $D$ is far less than the number of training samples $N$. For example, in our experiments (see Section 6.5), we have 2 million to 16 million training samples, while $D \approx 25{,}000$ often leads to good performance.

Rahimi and Recht [178, 179] prove a number of important theoretical results about these random feature approximations. First, they show that if $D = \tilde{\Omega}(d/\epsilon^2)$, then with high probability $\hat{\phi}(x)^T\hat{\phi}(y)$ will be within $\epsilon$ of $K(x, y)$ for all $x, y$ in some compact subset $\mathcal{M} \subset \mathbb{R}^d$ of bounded diameter.⁵ See Claim 1 of Rahimi and Recht [178] for the more precise statement and proof of this result. In their follow-up paper [179], the authors prove a generalization bound for models learned using these random features. They show that with high probability, the excess risk⁶ assumed from using this approximation, relative to using the "oracle" kernel model (the exact kernel model with the lowest risk), is bounded by $O(\frac{1}{\sqrt{N}} + \frac{1}{\sqrt{D}})$ (see the main result of [179] for more details). Given that the generalization error of a model trained using exact kernel methods is known to be within $O(\frac{1}{\sqrt{N}})$ of the oracle model [10], this implies that in the worst case, $D = \Omega(N)$ random features may be required in order for the approximated model to achieve generalization performance comparable to the exact kernel model. Empirically, however, far fewer than $\Omega(N)$ features are often needed in order to attain strong performance [225].

⁴There are various ways of defining the Fourier transform and its inverse. We use the convention specified in Equations (6.1) and (6.2), which is consistent with Rahimi and Recht [178].
⁵We are using the $\tilde{\Omega}$ notation to hide logarithmic factors.
⁶The "risk" of a model is defined as its expected loss on unseen data.
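The recipe of eq. (6.4) and Table 6.1 is compact enough to state in a few lines of code. Below is a minimal numpy sketch, with hypothetical shapes and bandwidths, not tied to any particular experiment; the inner products of the resulting feature rows approximate the corresponding kernel values.

    import numpy as np

    def sample_omega(kernel, d, D, param, rng):
        # Table 6.1: the Gaussian kernel exp(-||x-y||^2 / (2 sigma^2)) has spectrum
        # Normal(0, sigma^-2 I); the Laplacian exp(-lambda ||x-y||_1) has Cauchy(0, lambda)
        if kernel == "gaussian":
            return rng.normal(0.0, 1.0 / param, size=(d, D))  # param = sigma
        if kernel == "laplacian":
            return param * rng.standard_cauchy(size=(d, D))   # param = lambda
        raise ValueError(kernel)

    def random_features(X, omega, b):
        # eq. (6.4): phi_i(x) = sqrt(2/D) cos(omega_i^T x + b_i)
        D = omega.shape[1]
        return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

    rng = np.random.default_rng(0)
    d, D = 360, 25_000
    omega = sample_omega("gaussian", d, D, param=10.0, rng=rng)
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    X = rng.normal(size=(5, d))
    Z = random_features(X, omega, b)  # Z @ Z.T approximates the kernel matrix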
6.3.1 Use Random Features in Classifiers

Just as standard kernel methods (SVMs or kernelized linear regression) can be seen as fitting data with linear models in kernel-induced feature spaces, we can plug the random feature vector $\hat{\phi}(x)$ into just about any (linear) model. In this work, we focus on using them to construct multinomial logistic regression models. Specifically, our model is a special instance of the weighted sum of random kitchen sinks [179]:

$p(y = c \mid x) = \frac{e^{w_c^T\hat{\phi}(x)}}{\sum_{c'} e^{w_{c'}^T\hat{\phi}(x)}},$  (6.5)

where the label $y$ can take any value from $\{1, 2, \ldots, C\}$. We use multinomial logistic regression mainly because it can deal with a large number of classes and provides posterior probability assignments, needed by the application task (i.e., the speech recognition systems, in order to combine with components such as language models).

6.4 Our Approach

To scale up kernel methods, we address two challenges: (1) how to train large-scale models of the form of Eq. 6.5; (2) how to learn optimal kernel functions adapted to the data. We tackle the former with a parallel optimization algorithm and the latter by extending the construction of random features initially proposed in [178].

6.4.1 Parallel Optimization for Large-scale Kernel Models

While random features and the weighted sum of random kitchen sinks have been investigated before, there are few reported cases of scaling up to the problems commonly seen in automatic speech recognition and other domains. For example, in our empirical studies of acoustic modeling (cf. Section 6.5.3), the number of classes is $C = 1000$ and we often use more than $D = 100{,}000$ random features to compose $\hat{\phi}(x)$. Thus, the linear model of Eq. 6.5 has a large number of parameters (about $CD = 10^8$). We have developed two major strategies to overcome this challenge.

First, we leverage the observation that fitting multinomial logistic regression is a convex optimization problem and adopt the method of stochastic averaged gradient (SAG) for its faster convergence, both theoretical and empirical, over stochastic gradient descent (SGD) [186]. Note that while SGD is widely applicable to both convex and non-convex optimization problems, SAG is specifically designed for convex optimization and is thus well-suited to our learning setting.

Secondly, we leverage the property that random projections are just random; that is, given a $D$-dimensional $\hat{\phi}(x)$, any random subset of it is still random. Our idea is then to train a model on each subset of features in parallel and then assemble them together to form a large model. Specifically, for large $D$ (say 100,000), we partition the $D$ features into $B$ blocks $\hat{\phi}_b(x)$, each block having size $D'$ (say 25,000). Note that each block corresponds to a different set of random projections sampled from the density $p(\omega)$. We train $B$ multinomial logistic regression models and obtain $B$ sets of parameters for each class, i.e., $\{w_c^b;\, c = 1, 2, \ldots, C\}_{b=1}^{B}$. To assemble them, we combine in the spirit of a geometric mean of the probabilities (or an arithmetic mean of the log-probabilities):

$p(y = c \mid x) \propto \exp\Big(\frac{1}{B}\sum_{b=1}^{B}\hat{\phi}_b(x)^Tw_c^b\Big) = \sqrt[B]{\prod_b \exp\big(\hat{\phi}_b(x)^Tw_c^b\big)}.$  (6.6)

Note that this assembled model can be seen as a $D$-dimensional model with parameters $\{\frac{1}{B}w_c^b;\, c = 1, 2, \ldots, C\}_{b=1}^{B}$.
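A minimal sketch of this assembly, with hypothetical shapes and randomly generated stand-ins for the trained block parameters: each block model contributes its pre-softmax activations, and eq. (6.6) simply averages them before the softmax.

    import numpy as np

    def assemble_logits(phi_blocks, W_blocks):
        # eq. (6.6): average the pre-softmax activations of the B block models;
        # equivalent to one D-dimensional model with parameters {w_c^b / B}
        B = len(W_blocks)
        return sum(phi @ W for phi, W in zip(phi_blocks, W_blocks)) / B

    rng = np.random.default_rng(0)
    B, Dp, C, n = 4, 25_000, 1000, 8  # B blocks of D' features each, C classes
    phi_blocks = [rng.normal(size=(n, Dp)) for _ in range(B)]       # phi_b(x)
    W_blocks = [1e-3 * rng.normal(size=(Dp, C)) for _ in range(B)]  # per-block weights
    logits = assemble_logits(phi_blocks, W_blocks)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)  # p(y = c | x) as in eq. (6.6)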
We sketch the main argument for the validity of this parallel training procedure, leaving a rigorous proof for future work. The parameters of the weighted sum of random kitchen sinks converge at rate $O(1/\sqrt{D})$ to the true risk minimizer [179]. Thus, for each model of size $D'$, the pre-softmax activations (i.e., the logits) converge at rate $O(1/\sqrt{D'})$. For $B$ such models, the arithmetic mean of the logits converges at rate $O(1/(\sqrt{B}\sqrt{D'}))$, thus matching the rate of the corresponding $D$-dimensional model. Our extensive empirical studies support this argument: in virtually all training settings, the assembled models cannot be improved further, attaining the optimum of the corresponding $D$-dimensional model.

6.4.2 Learning Kernel Features

Another advantage of using kernels is to sidestep the problem of feature engineering, i.e., how to select the best basis functions for the task at hand. Essentially, determining what kernel function to use implicitly specifies the basis functions. But then the question becomes: how to select the best kernel function? One popular paradigm to address the latter problem is multiple kernel learning (MKL) [125, 6, 38, 111]. That is, starting from a collection of base kernels, the algorithm identifies the best subset of them and combines them together to best adapt to the training data, analogous to designing the best features according to the data.

In the following, we show how a few common MKL ideas can benefit from the previously described large-scale learning techniques (cf. Section 6.3). While many MKL algorithms are formulated with kernel matrices (and thus are not easily scalable to large problems), we demonstrate how they can be efficiently implemented with the general recipe of random feature approximation. Among them, we show an interesting and novel result on combining kernels with Hadamard products, where the random feature approximation is especially computationally advantageous. In our empirical studies, we will show that MKL improves on methods using a single kernel, and eventually approaches the performance of deep neural networks. Thus, MKL presents an effective and computationally tractable alternative to DNNs, even for large-scale problems.

Additive Kernel Combination. Given a collection of base kernels $\{k_i(\cdot,\cdot);\, i = 1, 2, \ldots, L\}$, their non-negative combination

$k(\cdot,\cdot) = \sum_i \mu_i k_i(\cdot,\cdot)$  (6.7)

is also a kernel function, provided $\mu_i \geq 0$ for all $i$. Suppose each kernel $k_i(\cdot,\cdot)$ is approximated with a $D$-dimensional random feature vector $\hat{\phi}_i(\cdot)$, as in Eq. 6.4. Then, given the linearity of the combination, the kernel function $k(\cdot,\cdot)$ can be approximated by

$k(\cdot,\cdot) \approx \sum_i \mu_i\, \hat{\phi}_i(\cdot)^T\hat{\phi}_i(\cdot) = \hat{\phi}(\cdot)^T\hat{\phi}(\cdot),$  (6.8)

where $\hat{\phi}(\cdot)$ is just the concatenation of the $\sqrt{\mu_i}$-scaled $\hat{\phi}_i(\cdot)$ (see the sketch after this subsection). Note that the dimensionality of $\hat{\phi}(\cdot)$ would be $LD$. There are several ways to exploit this approximation. The first way is to straightforwardly plug $\hat{\phi}(\cdot)$ into the multinomial logistic regression of Eq. 6.5 and optimize over the $LD$ features. The second way is more scalable. For each $\hat{\phi}_i(\cdot)$, we learn an optimal model with parameters $w_c^i$ for each class $c$. We then learn a set of combination coefficients $\mu_i$ by optimizing the likelihood model

$p(y = c \mid x) \propto \exp\Big(\sum_i \sqrt{\mu_i}\, \hat{\phi}_i(x)^Tw_c^i\Big)$  (6.9)

while holding the other parameters fixed. This is a convex optimization with (presumably) a small set of parameters. While the first approach is more general, empirically we do not observe a strong difference, and we have adopted the second approach for its scalability.
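The concatenation view of eq. (6.8) is equally short in code; a minimal sketch, with hypothetical per-kernel features and weights:

    import numpy as np

    def additive_features(phi_list, mu):
        # eq. (6.8): concatenating sqrt(mu_i)-scaled per-kernel features makes the
        # inner product of two such vectors approximate sum_i mu_i k_i(x, y)
        return np.concatenate(
            [np.sqrt(m) * phi for m, phi in zip(mu, phi_list)], axis=1)

    rng = np.random.default_rng(0)
    phi_list = [rng.normal(size=(5, 1000)) for _ in range(3)]  # L = 3 base kernels
    mu = np.array([0.5, 0.3, 0.2])                             # non-negative weights
    phi = additive_features(phi_list, mu)                      # shape (5, 3000)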
Multiplicative Kernel Combination. Kernels can also be multiplicatively combined from base kernels:

$k(\cdot,\cdot) = \prod_{i=1}^{L} k_i(\cdot,\cdot).$  (6.10)

Note that this is a highly nonlinear combination [38]. Unlike the additive combination, to approximate the multiplicative combination of kernels, there does not exist a simple form (such as concatenation) of composing the approximate features of the individual kernels. Nonetheless, we have proved the following theorem as a way of constructing the approximate features for $k(\cdot,\cdot)$ efficiently.

Theorem 3. Suppose all $k_i(\cdot,\cdot)$ are translation-invariant kernels such that

$k_i(x - z) = \int_{\mathbb{R}^d} p_i(\omega)\, e^{j\omega^T(x-z)}\, d\omega.$  (6.11)

Then $k(\cdot,\cdot)$ is also translation-invariant, such that

$k(x - z) = \int_{\mathbb{R}^d} p(\omega)\, e^{j\omega^T(x-z)}\, d\omega,$  (6.12)

where the probability measure $p(\omega)$ is given by the convolution of all the $p_i(\omega)$:

$p(\omega) = p_1(\omega) * p_2(\omega) * \cdots * p_L(\omega).$  (6.13)

Moreover, let $\omega_i \sim p_i(\omega)$ be a random variable drawn from the corresponding distribution; then

$\omega = \sum_i \omega_i \sim p(\omega).$  (6.14)

Namely, to approximate $k(\cdot,\cdot)$, one needs only to draw random variables from each individual component kernel's corresponding density, and use the sum of those variables to compute the random features (see the sketch at the end of this subsection). The proof of the theorem is provided in Section 6.7. We note that $\omega$ and the $\omega_i$ have the same dimensionality. Thus, the number of approximating features is independent of the number of kernels, leading to a computational advantage over the additive combination.

Kernel composition. Kernels can also be composed. Specifically, if $k_2(x, z)$ is a kernel function that depends only on the inner products of its arguments, then $k = k_2 \circ k_1$ is also a kernel function. A concrete example is when $k_2$ is the Gaussian RBF kernel and $k_1(x, z) = \phi_1(x)^T\phi_1(z)$ for some mapping $\phi_1(\cdot)$:

$k(x, z) = \exp\{-\|\phi_1(x) - \phi_1(z)\|_2^2/\sigma^2\} = \exp\{-[k_1(x, x) + k_1(z, z) - 2k_1(x, z)]/\sigma^2\}.$

If we approximate both $k_1$ and $k_2$ using the random feature approximation of Eq. (6.4), the composition would be (graphically) equivalent to the following mapping,

$x \to p_1(\omega) \to \hat{\phi}_1(x) \to p_2(\omega) \to \hat{\phi}_2(\hat{\phi}_1(x)),$  (6.15)

namely, a one-hidden-layer neural network with the weight parameters in both layers being completely random. As before, the result of the composite mapping $\hat{\phi}_2 \circ \hat{\phi}_1$ can be used in any classifier as input features.

We also generalize this operation to introduce a linear projection to reduce dimensionality, serving as an information bottleneck: $\hat{\phi}_2 \circ P \circ \hat{\phi}_1$. We experimented with two choices. First, $P$ performs PCA (using the sample covariance matrix) on $\hat{\phi}_1(x)$. Note that this implies $P \circ \hat{\phi}_1$ is an approximate kernel PCA on the original feature space of $x$, using the kernel $k_1$. Secondly, $P$ performs supervised dimensionality reduction. One simple choice is to implement Fisher discriminant analysis (FDA) on $\hat{\phi}_1(x)$, which is equivalent to kernel FDA on $x$. In our experiments, we have used a different procedure in a similar spirit. In particular, we first use $\hat{\phi}_1(x)$ as input features to build a multinomial logistic regression to predict the labels. We then perform PCA on the log-posterior probabilities. Our choice here is largely due to the consideration of re-using computations, as we often need to estimate the performance of $k_1(\cdot,\cdot)$ alone; thus the multinomial classifier built with $k_1(\cdot,\cdot)$ is readily usable.
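A minimal sketch of Theorem 3 for the product of a Gaussian and a Laplacian kernel, with hypothetical dimensions and bandwidths: each $\omega$ is the sum of a Normal draw and a Cauchy draw, and the feature dimension stays $D$ regardless of the number of factor kernels.

    import numpy as np

    rng = np.random.default_rng(0)
    d, D = 10, 100_000
    sigma, lam = 2.0, 0.5

    # Theorem 3 / eq. (6.14): sample from the convolution p_1 * p_2 by summing
    # one draw from each component kernel's spectral density
    omega = (rng.normal(0.0, 1.0 / sigma, size=(d, D))
             + lam * rng.standard_cauchy(size=(d, D)))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    phi = lambda v: np.sqrt(2.0 / D) * np.cos(v @ omega + b)

    x, z = rng.normal(size=d), rng.normal(size=d)
    approx = phi(x) @ phi(z)
    exact = (np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))
             * np.exp(-lam * np.abs(x - z).sum()))
    print(approx, exact)  # close for large D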
6.5 Experiments

We validate our approaches to scaling up kernel methods on challenging problems in computer vision and automatic speech recognition (ASR). We conduct extensive empirical studies comparing kernel methods to deep neural networks (DNNs), which perform well in computer vision and are state-of-the-art in ASR. We show that kernel methods attain similarly competitive performance as DNNs; details are in Sections 6.5.2 and 6.5.3.

What can we learn from two very different, yet equally competitive, learning models? We report our initial findings on this question in Section 6.5.5. We show that kernel methods and DNNs learn different yet complementary representations of the data. As such, a direct application of this observation is to combine them to obtain better models than either independently.

6.5.1 General Setup

For all kernel-based models, we tune only three hyperparameters: the bandwidths of the Gaussian or Laplacian kernels, the number of random projections, and the step size of the (convex) optimization procedure (as adjusting it has a similar effect to early stopping). For all DNNs, we tune hyperparameters related to both the architecture and the optimization. This includes the number of layers, the number of hidden units in each layer, the learning rate, the rate decay, the momentum, regularization, etc. We also use unsupervised pre-training and tune hyperparameters for that phase too. Details about tuning those hyperparameters are described in Section 6.7, as they are often dataset-specific. In short, model selection for kernel models has a significantly lower computational cost. We give concrete measures to support this observation in Section 6.5.4.

6.5.2 Computer Vision Tasks

We experiment on two problems: handwritten digit recognition and object recognition.

Handwritten digit recognition. We extract a dataset, MNIST-6.7M, from the dataset MNIST-8M [136]. MNIST-8M is a transformed version of the original MNIST dataset [127]. Concretely, we randomly select 50,000 of the 60,000 images in MNIST's training set and extract the corresponding samples (the originals as well as the transformed/distorted ones) from MNIST-8M, resulting in 6.75 million samples in total as our training set. We use the remaining 10,000 images from the original training set as a validation set; we purposely avoid using any transformed versions of those 10,000 images in the validation set to avoid potential overfitting. We report the test error rate on the standard 10,000-image MNIST test set.

We also experimented with a data augmentation trick to increase the number of training samples during training. Whenever we encounter a training sample, we corrupt it with masking noise (randomly flipping 1 to 0 in the binary image). We crudely tune the mask-out rate, which is either 0.1, 0.2 or 0.3.

Table 6.2: Handwritten digit recognition error rates (%)

Model                   kernel        DNN
Augment training data   no     yes    no     yes
On validation           0.97   0.77   0.71   0.62
On test                 1.09   0.85   0.69   0.77

Table 6.2 compares the performance of the best single-kernel classifier to that of the best DNN. The kernel classifier uses a Gaussian kernel and 150,000 random projections. The best DNN has 4 hidden layers with 1000, 2000, 2000, and 2000 hidden units respectively. The difference between the kernel model and the DNN is small: about 16 (out of 10,000) misclassified images. Interestingly, on the test data, the kernel model benefits from the data augmentation trick while the DNN does not.
Possibly, the DNN overfits to the validation dataset.

Object recognition. For this task, we experiment on the CIFAR-10 database [115]. The dataset contains 50,000 training samples and 10,000 test samples. Each sample is an RGB image of 32x32 pixels in one of 10 object categories. We randomly picked 10,000 images from the training set for validation, keeping the remaining 40,000 images for training. We did not perform any preprocessing of the images, as we want to relate our findings to previously published results, which often do not preprocess data or do not report all specific details of the preprocessing. We also experimented with an augmented version of the dataset, injecting Gaussian noise into the images during training.

Table 6.3: Object recognition error (%)

Model                   kernel        DNN
Augment training data   no     yes    no     yes
On validation           43.2   41.4   42.9   43.2
On test                 43.9   42.2   43.3   44.0

Table 6.3 compares the performance of the best single-kernel classifier to that of the best DNN. The kernel classifier uses a Gaussian kernel and 4,000,000 random projections. The best DNN has 3 hidden layers with 4000, 2000, and 2000 hidden units respectively. The best kernel model performs slightly better than the DNN. They both outperform previously reported DNN results on this dataset, whose error rates are between 44.4% and 48.5% [184, 115, 180]. Convolutional neural nets (CNNs) can significantly outperform DNNs on this dataset. However, we do not compare to CNNs, as our kernel models (as well as the DNNs) do not construct feature extractors with prior knowledge, while CNNs are designed especially for object recognition.

6.5.3 Automatic Speech Recognition (ASR)

Task and evaluation metric. Deep neural nets (DNNs) have been very successfully applied to ASR, where they perform the task of acoustic modeling. Acoustic modeling is analogous to conventional multi-class classification: the goal is to learn a predictive model that assigns context-dependent phoneme state labels $y$ to short segments of speech, called frames, represented as acoustic feature vectors $x$. Acoustic feature vectors are extracted from frames and their context windows (i.e., neighboring frames in temporal proximity). Analogously, kernel-based multinomial logistic regression models, as described in Section 6.3.1, are also used as acoustic models and compared to DNNs.

Acoustic models are often evaluated in conjunction with the other components of an ASR system. In particular, speech recognition is inherently a sequence recognition problem. Thus, perplexity and classification accuracies, commonly used for conventional multi-class classification problems, provide only a proxy (and intermediate goals) for the sequence recognition error. To measure the latter, a full ASR pipeline is necessary, where the posterior probabilities of the phoneme states are combined with the probabilities of the language models (over the linguistic units of interest, such as words) to yield the most probable sequence of those units. The best alignment with the ground-truth sequence is computed, yielding token error rates (TER).

Given the inherent complexity, in what follows we summarize the empirical studies of applying both paradigms to the acoustic modeling task. We report TER on two different languages. We begin by describing the datasets, followed by a brief description of the various kernel and DNN models we have experimented with.

Datasets. We use two datasets: the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs.
Each pack contains a 20-hour training set and a 20-hour test set. We designate about 10% of the training data as a held-out set to be used for model selection and tuning.

We follow the same procedure to extract acoustic features from the raw audio data as in previous work using DNNs for ASR [108]. In particular, we have used IBM's proprietary system Attila, which is adapted for the above-mentioned Babel language packs. The acoustic features are 360-dimensional real-valued dense vectors. There are 1000 (non-overlapping) context-dependent phoneme state labels for each language pack. For Cantonese, there are about 7.5 million data points for training, 0.9 million for held-out, and 7.2 million for test; for Bengali, 7.7 million for training, 1.0 million for held-out and 7.1 million for test. For Bengali, the TER metric is the word error rate (WER), and for Cantonese, it is the character error rate (CER).

Models experimented with. IBM's Attila ASR system has a DNN acoustic model that contains five hidden layers, each of which contains 1,024 units with logistic nonlinearities. We refer to this system as ibm. We have developed another version of the DNN, following the original Restricted Boltzmann Machine (RBM)-based training procedure for learning DNNs [91]. Specifically, the pre-training is unsupervised. We have trained DNNs with 1, 2, 3 and 4 hidden layers, and 500, 1000, and 2000 hidden units per layer (thus, 12 architectures per language). We refer to this system as rbm.

For the kernel-based acoustic models, we used Gaussian RBF kernels, Laplacian kernels, or some forms of their combinations. The key hyper-parameters to tune are the kernel bandwidth, which ranges from 0.3 to 5 times the median of the pairwise distances in the data (typically, the median works well), and the number of random projections, ranging from 2,000 to 400,000 (though stable performance is often observed at 25,000 or above). For training with a very large number of features, we used the parallel training procedure described in Section 6.4.1. For optimization, we used the stochastic averaged gradient and tuned the step size loosely over the 4 values $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.

Table 6.4: Best token error rates on the test sets (%)

Model               Bengali   Cantonese
ibm                 70.4      67.3
rbm                 69.5      66.3
best kernel model   70.0      65.7

Results. Table 6.4 reports the best performing models measured in TER. The RBM-trained DNN (rbm), which has 4 hidden layers and 2000 units in each layer, performs the best on Bengali. But our best kernel model, which uses a Gaussian RBF kernel and 150,000 to 200,000 random projections, performs the best on Cantonese. Both perform better than IBM's DNN. On Cantonese, the improvement of the kernel model over ibm is noticeably substantial (a 1.6% absolute reduction).

6.5.4 Computational Efficiency

In contrast to DNNs, kernel models can be developed more efficiently. We illustrate this in two respects: the computational cost of training a single model, and the cost of model selection (i.e., hyperparameter tuning).

Cost of training a single model. The amount of training time depends on several factors, including the volume and the dimensionality of the dataset, the choice of hyperparameters and their effect on convergence, implementation details, etc. We give a rough picture after controlling those extraneous factors as much as possible. We implement both methods with highly optimized Matlab code (comparable to our CUDA C implementation) and utilize a single GPU (NVIDIA Tesla K20m). The timing results reported below are obtained from training acoustic models on the Bengali language dataset.
For a kernel model with 25,000 random projections (25 million model parameters), convergence is reached in fewer than 20 epochs, with an average of 15 minutes per epoch. In contrast, a competitive deep model with four hidden layers of 2,000 hidden units (15 million parameters), if initialized with pretrained parameters, reaches convergence in roughly 10 epochs, with an average of 28 minutes per epoch. (The pretraining requires an additional 12 hours.) Thus, the training time for a single kernel model is about the same as that for a DNN. This holds for a range of datasets and configurations of hyperparameters.

Cost of model selection. The number of kernel models to be tuned is smaller, by at least one order of magnitude, than the number of DNNs. There are only two hyperparameters to search over when selecting kernel models: the kernel bandwidth and the learning rate. Generally, the higher the number of random projections, the better the performance. For DNNs, the number of hyperparameters to be tuned is substantially larger. As previously mentioned, in our experiments we tuned those related to the network architecture and the optimization procedure, for both pre-training and fine-tuning. As such, it is fairly common to select the best DNN among hundreds to thousands of candidates.

Combining both factors, kernel models are especially appealing when they are used to tackle new problem settings where there is only weak knowledge about the optimal hyperparameters or their proper ranges. To develop DNNs in this scenario, one would be forced to combinatorially adjust many knobs, while the kernel approaches are simple and straightforward.

Table 6.5: Combining the best performing kernel and DNN models (error rates %)

Dataset       MNIST-6.7M   CIFAR-10   Bengali   Cantonese
Best single   0.69         42.2       69.5      65.7
Combined      0.61         40.3       69.1      64.9

[Figure 6.1: t-SNE embeddings of the data representation learnt by the kernel model (left) and by the DNN (right). The relative positioning of the well-separated clusters differs, suggesting that the two models learn to represent data in quite different ways. (The kernel embedding is computed using the DNN's embedding as initialization, to avoid the randomness in t-SNE.)]

6.5.5 Do Kernel and Deep Models Learn the Same Thing?

Given their matching performance, do kernel and DNN models learn the same knowledge from the data? We report our initial findings in the following. We first combine the best performing models from each paradigm, using a weighted sum of the pre-softmax activations (i.e., the logits); a minimal sketch follows below. Table 6.5 summarizes the results across the 4 tasks studied in the previous sections. In the row "Best single", the numbers for MNIST-6.7M and Bengali come from DNNs, while those for CIFAR-10 and Cantonese come from kernel models (the original table encodes the source in color). Clearly, combining the two improves over either one independently. These improvements suggest that, despite being close in error rates, the two models are still different enough.
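A minimal sketch of this combination, with hypothetical logit arrays; the weight alpha is a knob that would be tuned on the validation set.

    import numpy as np

    def combine_logits(logits_kernel, logits_dnn, alpha=0.5):
        # weighted sum of the two models' pre-softmax activations
        return alpha * logits_kernel + (1.0 - alpha) * logits_dnn

    rng = np.random.default_rng(0)
    lk = rng.normal(size=(4, 10))  # kernel model logits
    ld = rng.normal(size=(4, 10))  # DNN logits
    pred = combine_logits(lk, ld).argmax(axis=1)  # combined prediction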
We gain a more intuitive understanding by visualizing what is being learnt by each model. To this end, we project each model's pre-softmax activations onto the 2-D plane with t-SNE. Figure 6.1 contrasts the embeddings of 1000 samples from MNIST-6.7M's test set. Each data point is a dot, and the color encodes its label. Given the low classification error rates, it is not surprising to find 10 well-separated clusters, one for each of the 10 digits. However, the relative positioning of those clusters is noticeably different between the DNN and the kernel model. It is not clear that the embeddings can be transformed into each other with linear transformations. This seems to suggest that each method has its unique way of nonlinearly embedding the data. Elucidating this more precisely is a future research direction.

6.6 Conclusion

In this work, we propose techniques to scale up kernel methods to large learning problems that are commonly found in speech recognition and computer vision. We have shown that the performance of those large kernel models approaches or surpasses that of their deep neural network counterparts, which have been regarded as the state-of-the-art.

6.7 Details

In this section, we provide more details, experimental setups, results and analysis.

6.7.1 Proof of Theorem 3

Proof. Denote $\delta = x - z$. For a translation-invariant kernel, we have

$k_i(x, z) = k_i(\delta) = \int_{\omega_i} p_i(\omega_i)\, e^{j\omega_i^T\delta}\, d\omega_i.$

The product of the kernels is

$k(\cdot,\cdot) = \prod_{i=1}^{L} k_i(\cdot,\cdot) = \prod_{i=1}^{L} k_i(\delta) = k(\delta),$

which is also translation-invariant. Then

$k(\delta) = \prod_{i=1}^{L}\int_{\omega_i} p_i(\omega_i)\, e^{j\omega_i^T\delta}\, d\omega_i = \int_{\omega_1, \ldots, \omega_L} p_1(\omega_1)\cdots p_L(\omega_L)\, e^{j(\sum_i^L \omega_i)^T\delta}\, d\omega_1\cdots d\omega_L.$

Substituting $\tilde{\omega} = \sum_i^L \omega_i$,

$k(\delta) = \int_{\tilde{\omega}}\Big[\int_{\omega_1, \ldots, \omega_{L-1}} p_1(\omega_1)\cdots p_{L-1}(\omega_{L-1})\, p_L\big(\tilde{\omega} - \textstyle\sum_i^{L-1}\omega_i\big)\, d\omega_1\cdots d\omega_{L-1}\Big]\, e^{j\tilde{\omega}^T\delta}\, d\tilde{\omega}$
$= \int p_{\tilde{\omega}}(\tilde{\omega})\, e^{j\tilde{\omega}^T\delta}\, d\tilde{\omega} = \int p_{\tilde{\omega}}(\tilde{\omega})\, e^{j\tilde{\omega}^T(x-z)}\, d\tilde{\omega} = \mathbb{E}_{\tilde{\omega}}\big[e^{j\tilde{\omega}^Tx}\, e^{-j\tilde{\omega}^Tz}\big].$

We have used the fact (due to the convolution theorem) that

$\int_{\omega_1, \ldots, \omega_{L-1}} p_1(\omega_1)\cdots p_{L-1}(\omega_{L-1})\, p_L\big(\tilde{\omega} - \textstyle\sum_i^{L-1}\omega_i\big)\, d\omega_1\cdots d\omega_{L-1} = p_1(\omega_1) * p_2(\omega_2) * \cdots * p_L(\omega_L) = p_{\tilde{\omega}}(\tilde{\omega}).$

This means we have found a new distribution $p_{\tilde{\omega}}(\tilde{\omega})$ as the random-projection-generating distribution for the new kernel $k(\cdot,\cdot) = \prod_{i=1}^{L} k_i(\cdot,\cdot)$. From the definition of $\tilde{\omega}$, in order to sample from $p_{\tilde{\omega}}(\tilde{\omega})$, we can simply use the sum of independent samples from the $p_i(\omega_i)$.

In what follows, we provide more experimental details.

6.7.2 Image Recognition

We first provide details of our empirical studies on challenging problems in image recognition.

Handwritten digit recognition (MNIST)

Data preprocessing. We scale the inputs to $[0, 1)$ by dividing by 256.

Kernel. We use Gaussian RBF and Laplacian kernels, with the kernel bandwidth selected from $\{1, 1.5, 2, 2.5, 3\}$ times the median of the pairwise distances in the data. We select the learning rate from $\{5 \times 10^{-4}, 10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, 10^{-1}\}$. The random feature dimension we have used is 150,000. Performance with different dimensions is shown in Table 6.6.

DNN. We trained DNNs with 1, 2, 3, or 4 hidden layers, with 1000, 2000, 2000 and 2000 hidden units respectively. We first pre-trained 1 Gaussian-Bernoulli and 3 consecutive Bernoulli restricted Boltzmann machines (RBMs), all using stochastic gradient descent (SGD) with the Contrastive Divergence (CD-1) algorithm [89]. We select the learning rate from $\{10^{-1}, 1.5 \times 10^{-1}\}$ and the momentum from $\{0.5, 0.9\}$, and set the L2 regularization to $2 \times 10^{-4}$ for 2 epochs of pretraining. In fine-tuning, we tune SGD with the learning rate from $\{5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, 10^{-1}, 5 \times 10^{-1}\}$ and the momentum from $\{0.7, 0.9\}$. We decrease the learning rate by a factor of 0.99 every epoch, set the mini-batch size to 100, and set the L2 regularization to 0. We use early stopping to control overfitting. When training with data augmentation, we use a smaller learning rate and run for more epochs.

Data augmentation. We use mask-out noise with ratios $\{0.1, 0.2, 0.3\}$ for both the kernel methods and the DNN.
In what follows, we provide more experimental details.

6.7.2 Image Recognition

We first provide details on our empirical studies on challenging problems in image recognition.

Handwritten digit recognition (MNIST)

Data preprocessing  We scale the input to [0, 1) by dividing by 256.

Kernel  We use Gaussian RBF and Laplacian kernels, with the kernel bandwidth selected from {1, 1.5, 2, 2.5, 3} times the median of the pairwise distances in the data. We select the learning rate from {5×10^-4, 10^-3, 5×10^-3, 10^-2, 5×10^-2, 10^-1}. The random feature dimension we have used is 150,000. Performance with different dimensions is shown in Table 6.6.

DNN  We trained DNNs with 1, 2, 3, or 4 hidden layers, with 1000, 2000, 2000 and 2000 hidden units respectively. We first pre-trained 1 Gaussian-Bernoulli and 3 consecutive Bernoulli restricted Boltzmann machines (RBMs), all using Stochastic Gradient Descent (SGD) with the Contrastive Divergence (CD-1) algorithm [89]. We select the learning rate from {10^-1, 1.5×10^-1}, the momentum from {0.5, 0.9}, and set the L2 regularization to 2×10^-4 for 2 epochs of pretraining. In finetuning, we tune SGD with the learning rate from {5×10^-3, 10^-2, 5×10^-2, 10^-1, 5×10^-1} and the momentum from {0.7, 0.9}. We decrease the learning rate by a factor of 0.99 every epoch, set the mini-batch size to 100, and set the L2 regularization to 0. We use early stopping to control overfitting. When training with data augmentation, we use a smaller learning rate and run for more epochs.

Data Augmentation  We use mask-out noise with ratio {0.1, 0.2, 0.3} for both kernel methods and DNNs.

Results  Table 6.7 compares the performance of kernel methods to deep neural nets of different architectures. The best DNN result is from a 4-hidden-layer network. Deep nets generally have slightly smaller test errors. Kernel models benefit more from data augmentation and achieve similar error rates.

Table 6.6: Kernel Methods on MNIST-6.7M (error rates %, validation/test)
Kernel type  Data aug.  10K        50K        100K       150K
Gaussian     No         1.45/1.42  1.03/1.25  0.98/1.12  0.97/1.09
Laplacian    No         1.93/1.93  1.21/1.34  1.16/1.17  1.10/1.13
Gaussian     Yes        -          0.83/1.03  0.79/0.92  0.77/0.85

Table 6.7: DNN on MNIST-6.7M (error rates %)
             Original               Augmented
Model        Validation  Test       Validation  Test
kernel       0.97        1.09       0.77        0.85
4 hidden     0.71        0.69       0.64        0.80
3 hidden     0.78        0.73       0.74        0.77
2 hidden     0.76        0.71       0.64        0.79
1 hidden     0.84        0.95       0.79        0.76

Object Recognition (CIFAR-10)

Data Preprocessing  We scale the input to [0, 1) by dividing by 256. No other preprocessing is used, because we would like to relate to previously reported DNN results on this data, where preprocessing is not applied.

Kernel  The Gaussian kernel is used. We achieve the best performance with 4,000,000 random features, obtained by training single models of 200K random features each in parallel and then combining them. Table 6.8 shows the performance of the kernel with respect to different numbers of random features. Similarly, we select the bandwidth from {0.5, 1, 1.5, 2, 3} times the median distance and the learning rate from {5×10^-4, 10^-3, 5×10^-3, 10^-2, 5×10^-2, 10^-1}.
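For both datasets above, the bandwidth grid is expressed as a multiple of the median pairwise distance. Below is a minimal sketch of this median heuristic; subsampling random pairs is our own shortcut (with an illustrative num_pairs value) to avoid the O(n^2) cost of all pairwise distances on millions of examples.

```python
import numpy as np

def median_bandwidth(X, num_pairs=10000, seed=0):
    # Estimate the median pairwise Euclidean distance from random pairs.
    # The kernel bandwidth is then chosen as a multiple of this median.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, X.shape[0], size=num_pairs)
    j = rng.integers(0, X.shape[0], size=num_pairs)
    dists = np.linalg.norm(X[i] - X[j], axis=1)
    return np.median(dists)
```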
DNN  We trained DNNs with 1 to 4 hidden layers, with 2000 hidden units per layer. In pre-training, we use a Gaussian RBM for the input layer and three Bernoulli RBMs for the intermediate hidden layers, trained with the CD-1 algorithm. (We adopt the parameterization of the GRBM in [34], which shows better performance.) For the Gaussian RBM, we tune the learning rate from {5×10^-5, 10^-4, 2×10^-4}, the momentum from {0.2, 0.5, 0.9} (the momentum can be increased to another choice from the three after 5 epochs), and the L2 regularization from {2×10^-5, 2×10^-4}. For the Bernoulli RBMs, we tune the learning rate from {10^-2, 2.5×10^-2, 5×10^-2}, the momentum from {0.2, 0.5, 0.9}, and set the L2 regularization to 2×10^-5. Pre-training uses a constant learning rate throughout 30 epochs, with the momentum updated after 5 epochs. In finetuning, we tune SGD with the learning rate from {4×10^-2, 8×10^-2, 1.2×10^-1} and the momentum from {0.2, 0.5, 0.9}, decrease the learning rate by a factor of 0.9 every 20 or 50 epochs, set the mini-batch size to 50, and set the L2 regularization to 0. We use early stopping to control overfitting, and select the optimal model according to the classification accuracy on the validation set. When training with data augmentation, we use a smaller learning rate and run for more epochs. Overfitting was observed after we increased the model from 3 hidden layers to 4 hidden layers.

Data Augmentation  We apply additive Gaussian noise with standard deviation {0.1, 0.2, 0.3} on the raw pixels for both kernel methods and DNNs.

Results  Table 6.9 contrasts the results of kernel models and DNNs. We observe overfitting for the 4-hidden-layer network; the best DNN results come from a 3-hidden-layer architecture, and deeper models give worse validation and test performance. Kernel models achieve the best error when data augmentation is used.

Table 6.8: Kernel Methods on CIFAR-10 (error rate %)
               Original               Augmented
Gaussian r.f.  Validation  Test       Validation  Test
200K           43.74       44.48      42.15       43.13
1M             43.43       44.08      41.62       42.38
2M             43.26       44.04      41.47       42.26
4M             43.22       43.93      41.36       42.23

Table 6.9: DNN on CIFAR-10 (error rate %)
             Original               Augmented
Model        Validation  Test       Validation  Test
kernel       43.22       43.93      41.36       42.23
4 hidden     43.21       43.74      43.00       43.38
3 hidden     42.89       43.29      42.93       43.35
2 hidden     43.30       43.76      43.80       44.81
1 hidden     48.40       48.94      47.28       47.79

6.7.3 Automatic Speech Recognition

In what follows, we provide comprehensive details on our empirical studies on challenging problems in automatic speech recognition.

Tasks, datasets and evaluation metrics

Task  We have selected the task of acoustic modeling, a crucial component in automatic speech recognition. In its most basic form, acoustic modeling is analogous to conventional multi-class classification: the goal is to learn a predictive model that assigns phoneme context-dependent state labels to short segments of speech, called frames. While speech signals are highly non-stationary and context-sensitive, acoustic modeling addresses this issue by using acoustic features extracted from context windows (i.e., neighboring frames in temporal proximity) to capture the transient characteristics of the signals.

Data characteristics  To this end, we use two datasets: the IARPA Babel Program Cantonese (IARPA-babel101-v0.4c) and Bengali (IARPA-babel103b-v0.4b) limited language packs. Each pack contains a 20-hour training set and a 20-hour test set. We designate about 10% of the training data as a held-out set to be used for model selection and tuning (i.e., tuning hyperparameters etc.). The training, held-out, and test sets contain different speakers. The acoustic data is very challenging: it consists of two-person conversations between people who know each other well (family and friends), recorded over telephone channels (in most cases with mobile telephones) from speakers in a wide variety of acoustic environments, including moving vehicles and public places. As a result, it contains many natural phenomena such as mispronunciations, disfluencies, laughter, rapid speech, background noise, and channel variability. Compared to the more familiar TIMIT corpus, which contains about 4 hours of training data, the Babel data is substantially more challenging, because the TIMIT data is read speech recorded in a well-controlled, quiet studio environment.

As is standard in previous work using DNNs for speech recognition, the data is preprocessed using Gaussian mixture models to give alignments between phoneme state labels and 10-millisecond frames of speech [108]. The acoustic features are 360-dimensional real-valued dense vectors. There are 1000 (non-overlapping) phoneme context-dependent state labels for each language pack. For Cantonese, there are about 7.5 million data points for training, 0.9 million for held-out, and 7.2 million for test; for Bengali, 7.7 million for training, 1.0 million for held-out, and 7.1 million for test.

Evaluation metrics  We report 3 evaluation metrics, typically found in mainstream speech recognition research.

Perplexity  Given a set of examples $\{(x_i, y_i), i = 1 \ldots m\}$, the perplexity is defined as
\[
\text{perp} = \exp\left\{ -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i \mid x_i) \right\}.
\]
The perplexity is lower bounded by 1, attained when all predictions are perfect: $p(y_i \mid x_i) = 1$ for all samples. With random guessing, $p(y_i \mid x_i) = 1/C$, where $C$ is the number of classes, the perplexity attains $C$. We use the perplexity measure on the held-out set for model selection and tuning. This is because the perplexity is often found to be correlated with the next two performance measures.

Accuracy  The classification accuracy is defined as
\[
\text{acc} = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left[ y_i = \arg\max_{y \in \{1, 2, \ldots, C\}} p(y \mid x_i) \right].
\]
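A minimal NumPy sketch of these two metrics (the function names are ours); it assumes a matrix of predicted class posteriors with one row per example:

```python
import numpy as np

def perplexity(probs, labels):
    # perp = exp(-mean log p(y_i | x_i)); probs has shape (m, C),
    # rows are predicted class posteriors, labels are integer targets.
    m = len(labels)
    log_liks = np.log(probs[np.arange(m), labels])
    return np.exp(-log_liks.mean())

def accuracy(probs, labels):
    # Fraction of examples whose arg-max class matches the target.
    return (probs.argmax(axis=1) == np.asarray(labels)).mean()

# Sanity check: uniform predictions give perp = C (perfect ones give 1).
C = 1000
uniform = np.full((4, C), 1.0 / C)
assert np.isclose(perplexity(uniform, [0, 1, 2, 3]), C)
```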
Token Error Rate (TER)  Speech recognition is inherently a sequence recognition problem. Thus, perp and acc provide only proxies (and intermediate goals) for the sequence recognition error. To measure the latter, a full automatic speech recognition pipeline is necessary, where the posterior probabilities of the phoneme labels $p(y \mid x)$ are combined with the probabilities of the language models (over the linguistic units of interest, such as words) to yield the most probable sequence of those units. A best alignment with the ground-truth sequence is then computed, yielding token error rates. For Bengali, the token error rate is the word error rate (WER), and for Cantonese, it is the character error rate (CER).

Because it entails performing full speech recognition, obtaining TER is computationally costly; thus it is rarely used for model selection and tuning. Note also that the token error rates obtained on the Babel tasks are much higher than those reported for other conversational speech tasks such as Switchboard or Broadcast News. This is because we have much less training data for Babel than for the other tasks. This low-resource setting is an important one in the speech processing area, given that there are a large number of languages in the world for which speech and language models do not currently exist.
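For reference, the token-level alignment underlying WER and CER is the standard Levenshtein edit distance; a minimal sketch follows (our own illustrative function, not the scoring tool used in the experiments):

```python
def token_error_rate(ref, hyp):
    # Token error rate via Levenshtein alignment: the minimum number of
    # substitutions, deletions, and insertions turning hyp into ref,
    # divided by the reference length. ref and hyp are token sequences
    # (words for WER, characters for CER).
    m, n = len(ref), len(hyp)
    # dist[i][j]: edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[m][n] / max(m, 1)

# e.g. token_error_rate("abc", "axc") == 1/3 for CER on toy strings.
```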
Deep neural nets acoustic models

There are many variants of DNN techniques. We have chosen two flavors that learn from data in very different ways, in order to have a broader comparison. In either case, our model tuning is extensive.

IBM's DNN  We have used IBM's proprietary system Attila for conventional speech recognition, adapted for the above-mentioned Babel task. A detailed description appears in [108]. Attila contains a state-of-the-art acoustic model provided by IBM. It also powers our full ASR pipeline for computing the token error rate (TER). We have also used it to convert raw speech signals into acoustic features. Concretely, the features at a frame form a 40-dimensional speaker-adapted representation that has previously been shown to work well with DNN acoustic models [108]. Features at 8 neighboring contextual frames are concatenated, yielding 360-dimensional features. We have used the same features for our kernel methods.

IBM's DNN acoustic model contains five hidden layers, each of which contains 1,024 units with logistic nonlinearities. The output is a softmax nonlinearity with 1,000 targets that correspond to quinphone context-dependent HMM states clustered using decision trees. All layers in the DNN are fully connected. The training of the DNN occurs in two stages: first, a greedy layer-wise discriminative pretraining [192] sets the weights of each layer to a reasonable range; then, the cross-entropy criterion is minimized with respect to all parameters in the network, using stochastic gradient descent with a mini-batch size of 250 samples, without momentum, and with annealing of the learning rate based on the reduction in cross-entropy loss on a held-out set.

RBM-DNN  We have designed another version of DNN, following the original Restricted Boltzmann Machine (RBM)-based training procedure for learning DNNs [91]. Specifically, the pre-training is unsupervised. We have trained DNNs with 1, 2, 3 and 4 hidden layers, and 500, 1000, and 2000 hidden units per layer (thus, 12 architectures in total per language). The first hidden layer is a Gaussian RBM and the upper layers are Bernoulli RBMs. In pre-training, we use 5 epochs of SGD with the Contrastive Divergence (CD-1) algorithm on all training data. We tuned 3 hyper-parameters: the learning rate, the momentum, and the strength of an $\ell_2$ regularizer. For fine-tuning, we used error back-propagation, tuning the initial learning rate, the learning rate decay, the momentum, and the strength of another $\ell_2$ regularizer. The fine-tuning usually converges in 10 epochs.

Kernel acoustic models

The development of kernel acoustic models does not require combinatorial searching over many factors. We experimented with only two types of kernels: Gaussian RBF and Laplacian kernels. The only hyper-parameter to tune there is the kernel bandwidth, which ranges from 0.3 to 5 times the median of the pairwise distances in the data. (Typically, the median itself works well.) The random feature dimensions we have used range from 2,000 to 400,000, though stable performance is often observed at 25,000 or above. For training with a very large number of features, we used the parallel training procedure described in section 4.1 of the main text.

All kernel acoustic models are multinomial logistic regression models, and are thus optimized by convex optimization. As mentioned in section 4.1 of the main text, we use Stochastic Average Gradient (SAG), which efficiently leverages the convexity property (a minimal sketch of the update appears at the end of this subsection). We do tune the step size, selected from a loose range of 4 values {10^-4, 10^-3, 10^-2, 10^-1}.

Table 6.10: Best perplexity and accuracy by different models (held-out/test; see texts for descriptions of the models)
         Bengali               Cantonese
Model    perp       acc (%)    perp       acc (%)
ibm      3.4/3.5    71.5/71.2  6.8/6.16   56.8/58.5
rbm      3.3/3.4    72.1/71.6  6.2/5.7    58.3/59.3
1-k      3.7/3.8    70.1/69.7  6.8/6.2    57.0/58.3
a-2-k    3.6/3.8    70.3/70.0  6.7/6.0    57.1/58.5
m-2-k    3.7/3.8    70.3/69.9  6.7/6.1    57.1/58.4
c-2-k    3.5/3.6    71.0/70.4  6.5/5.7    57.3/58.8

Table 6.11: Performance of rbm acoustic models (held-out/test)
           Bengali               Cantonese
(h, L)     perp       acc (%)    perp       acc (%)
(1, 500)   3.9/3.9    69.2/69.3  7.1/6.4    55.8/57.4
(2, 500)   3.5/3.6    70.9/70.7  6.6/6.1    57.3/58.4
(3, 500)   3.5/3.5    71.2/70.9  6.4/5.9    57.7/58.6
(4, 500)   3.4/3.5    71.2/70.8  6.4/5.9    57.5/58.7
(1, 1000)  3.7/3.7    70.1/70.1  6.8/6.2    56.4/58.0
(2, 1000)  3.4/3.4    71.6/71.4  6.3/5.8    58.2/59.0
(3, 1000)  3.4/3.5    71.7/71.3  6.3/5.7    58.0/59.2
(4, 1000)  3.3/3.5    71.8/71.4  6.6/5.8    57.1/58.6
(1, 2000)  3.6/3.7    70.5/70.3  6.7/6.1    56.9/58.1
(2, 2000)  3.4/3.4    71.8/71.4  6.2/5.7    58.3/59.3
(3, 2000)  3.4/3.5    71.5/71.2  6.2/5.6    57.8/59.1
(4, 2000)  3.3/3.4    72.1/71.6  6.4/5.8    57.8/59.1

For additive and multiplicative kernel combinations, we combine only two kernels, one Gaussian and the other Laplacian. For additive combinations, we first train two models, one for each kernel; the combining coefficient is selected from 0.1, 0.2, ..., 0.9. For composite kernels, we compose the Gaussian with the Laplacian. We perform a supervised dimensionality reduction, as described in section 4.2 of the main text, with the reduced dimensionality chosen from 50, 100, or 360. The first kernel's bandwidth is greedily selected to be optimal as a single-kernel acoustic model; the other kernel's bandwidth is selected after composing the features.
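Below is a minimal sketch of the SAG update referenced above, specialized to multinomial logistic regression on precomputed random features. For a linear model the per-example gradient factorizes as $\phi_i (p_i - e_{y_i})^\top$, so it suffices to memorize a C-dimensional residual per example rather than a full gradient. Function and hyperparameter names (and their default values) are our own illustrative choices, not our production settings.

```python
import numpy as np

def sag_multinomial(Phi, y, num_classes, step_size=1e-2, num_passes=5, seed=0):
    # Stochastic Average Gradient for multinomial logistic regression on
    # precomputed random features Phi (n x D). At each step, refresh one
    # example's memorized gradient and descend along the running average.
    n, D = Phi.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((D, num_classes))
    resid_mem = np.zeros((n, num_classes))  # last residual seen per example
    grad_sum = np.zeros((D, num_classes))   # sum of memorized gradients
    for _ in range(num_passes * n):
        i = rng.integers(n)
        logits = Phi[i] @ W
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        p[y[i]] -= 1.0                      # residual p_i - e_{y_i}
        grad_sum += np.outer(Phi[i], p - resid_mem[i])
        resid_mem[i] = p
        W -= (step_size / n) * grad_sum     # average-gradient step
    return W
```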
Results on Perplexity and Accuracy

Table 6.10 concisely contrasts the best perplexity and accuracy attained by the various systems: ibm (IBM's DNN), rbm (RBM-trained DNN), 1-k (single-kernel model), a-2-k (additive combination of two kernels), m-2-k (multiplicative combination of two kernels) and c-2-k (composite of two kernels). We report the metrics on both the held-out and the test datasets (the numbers are separated by a /). In general, the metrics are consistent across both datasets, and perp correlates with acc reasonably well.

On Bengali, across all systems, rbm attains the best perplexity, outperforming ibm and suggesting that unsupervised pre-training is advantageous. The best performing kernel model is c-2-k, trailing slightly behind rbm and ibm. Similarly, on Cantonese, rbm performs the best, followed by c-2-k, both outperforming ibm.

As an illustrative example, we show in Table 6.11 the performance of rbm under different architectures (h is the number of hidden layers and L the number of hidden units per layer). Meanwhile, in Table 6.12, we show the performance of the single Laplacian kernel acoustic model with different numbers of random features. Contrasting these two tables, it is interesting to observe that kernel models use far more parameters than DNNs to achieve similar perplexity and accuracy. For instance, an rbm with (h = 1, L = 500) and a perplexity of 3.9 has 360×500 + 500×1000 = 0.68 million parameters. This is a fraction of that of a comparable kernel model: with Dim = 10k, 10k×1000 = 10 million parameters. In some way, this ratio provides an intuitive measure of the price of being convenient, i.e., of using random features in kernel models instead of adapting features to the data as in DNNs.

Table 6.12: Performance of single Laplacian kernel (held-out/test)
      Bengali               Cantonese
Dim   perp       acc (%)    perp       acc (%)
2k    4.4/4.4    66.5/66.8  8.5/7.4    52.7/54.8
5k    4.1/4.2    67.8/67.8  7.8/7.0    53.9/56.0
10k   4.0/4.1    68.4/68.3  7.5/6.7    54.9/56.6
25k   3.8/3.9    69.2/69.0  7.1/6.4    55.9/57.3
50k   3.8/3.9    69.7/69.4  6.9/6.2    56.5/57.9
100k  3.7/3.8    70.0/69.6  6.8/6.2    56.8/58.2
200k  3.7/3.8    70.1/69.7  6.8/6.2    57.0/58.3

Results on Token Error Rates

Table 6.13 reports the performance of the various models measured in TER, an important and more directly relevant metric for speech recognition.

Table 6.13: Best Token Error Rates on Test Set (%)
Model   Bengali  Cantonese
ibm     70.4     67.3
rbm     69.5     66.3
1-k     70.0     65.7
a-2-k   73.0     68.8
m-2-k   72.8     69.1
c-2-k   71.2     68.1

Note that the RBM-trained DNN (rbm) performs the best on Bengali, but our best kernel model performs the best on Cantonese. Both perform better than IBM's DNN. On Cantonese, the improvement of our kernel model over ibm is noticeably large (a 1.6% absolute reduction).

Table 6.14 highlights several interesting comparisons between rbm and kernel models. Concretely, it seems that DNNs need to be big enough in order to approach their best TER. On the other hand, the kernel models' performance plateaus rather quickly.

Table 6.14: Detailed Comparison on TER for Bengali
Model  Arch.             TER (%)
rbm    h = 2, L = 1000   73.1
rbm    h = 3, L = 1000   72.7
rbm    h = 2, L = 2000   72.4
rbm    h = 3, L = 2000   72.2
rbm    h = 4, L = 1000   69.8
rbm    h = 4, L = 2000   69.5
1-k    Dim = 25k         73.1
1-k    Dim = 50k         70.2
1-k    Dim = 100k        70.0
1-k    Dim = 200k        70.0

Table 6.15: Token Error Rates (%) for Combined Models
Model                          Bengali  Cantonese
BEST SINGLE SYSTEM             69.5     65.7
rbm (h = 3, L = 2000) + 1-k    69.7     65.3
rbm (h = 4, L = 1000) + 1-k    69.2     64.9
rbm (h = 4, L = 2000) + 1-k    69.1     64.9
This is the opposite of what we observed when comparing the two methods using perplexity and accuracy. One possible explanation is that the relationship between perplexity and TER is different for different models. This is certainly plausible, given that TER is highly complex to compute and that two different models might explore parameter spaces very differently. Another possible explanation is that the two models learn different representations that bias either toward perplexity or toward TER. Table 6.15 suggests that this might indeed be true: as we combine two different models, we see handsome gains in performance over each individual one.

DNN and kernels learn complementary representations

Inspired by what we observed in the previous section, we set out to analyze in what way the representations learnt by the two different models might be complementary. We have obtained preliminary results. We took a learned DNN (the best performing one in terms of TER) and computed its pre-activation to the output layer, which is a linear transformation of the last hidden layer's outputs. For the best performing single-kernel model, we computed the pre-activation similarly.

Figure 6.2: Contrast of the learned representations by kernel models (left plot) and DNNs (right plot): kernel models' representations tend to form clumps for data from the same class, while DNNs' are more spread out.

Note that since both models predict the same set of labels, the pre-activations from either model have the same dimensionality. We perform PCA on them independently and then visualize them in 2D. Figure 6.2 displays the two scatter plots, each with 1000 points representing the means of the learned representations for data points in each class. For ease of visualization, we color each point not by its phoneme state label; instead, we collapse the state labels into phone labels (of which there are considerably fewer, generally around 40 to 60).

An initial examination seems to suggest that kernel models' representations tend to form clumps for data from the same class. In the figure, the most obvious observation is the cluster in blue. On the other hand, those blue data points do not seem to form a large and tight cluster under the representations learned by the DNN; they are more spread out. The clumps seem to be indicative of the Gaussian kernels we have used. However, how important these clumps are, and in what way the more varied patterns of the DNNs' representations are advantageous, require more careful and detailed analysis. We hope our work has provided enough incentive and tools for that pursuit.

Bibliography

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization in linear time. stat, 1050:15, 2016.

[2] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient fisher scoring. arXiv preprint arXiv:1206.6380, 2012.

[3] Ahmed M. Alaa and Mihaela van der Schaar. The Discriminative Jackknife: Quantifying Predictive Uncertainty via Higher-order Influence Functions, 2019. URL https://openreview.net/forum?id=H1xauR4Kvr.

[4] Shun-ichi Amari. Neural learning in structured parameter spaces-natural riemannian gradient. In Advances in neural information processing systems, pages 127–133, 1997.

[5] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov.
Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020. [6] Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality, and the smo algorithm. In Proceedings of the twenty-first international conference on Machine learning, page 6, 2004. [7] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. In Advances in neural information processing systems, pages 395–401, 1998. [8] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Pre- dictive inference with the jackknife+. arXiv preprint arXiv:1905.02928, 2019. [9] Peter L Bartlett and Marten H Wegkamp. Classification with a reject option using a hinge loss. Journal of Machine Learning Research, 9(Aug):1823–1840, 2008. [10] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher com- plexities. In Proceedings of the 15th Annual Conference on Computational Learning The- ory, COLT ’02, pages 44–58, London, UK, UK, 2002. Springer-Verlag. ISBN 3-540- 43836-X. [11] Irwan Bello, Barret Zoph, Vijay Vasudevan, and Quoc V Le. Neural optimizer search with reinforcement learning. arXiv preprint arXiv:1709.07417, 2017. [12] Yoshua Bengio. Learning deep architectures for ai. Foundations and Trends in Machine Learning, 2(1):1–127, January 2009. 98 [13] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: a review and new perspectives. pami, 35(8):1798–1828, 2013. [14] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer, 1984. [15] Dimitri P Bertsekas, Dimitri P Bertsekas, Dimitri P Bertsekas, and Dimitri P Bertsekas. Dynamic programming and optimal control, volume 1. Athena scientific Belmont, MA, 1995. [16] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. [17] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight un- certainty in neural networks. arXiv preprint arXiv:1505.05424, 2015. [18] L´ eon Bottou. Personal communication, 2014. [19] L´ eon Bottou and Chih-Jen Lin. Support vector machine solvers. In Bottou et al. [20]. [20] L´ eon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors. Large Scale Kernel Machines. MIT Press, Cambridge, MA., 2007. [21] Craig Boutilier and Tyler Lu. Budget allocation using weakly coupled, constrained markov decision processes. In UAI, 2016. [22] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. [23] S´ ebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and non- stochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012. [24] Yaroslav Bulatov. Notmnist dataset. Google (Books/OCR), Tech. Rep.[Online]. Available: http://yaroslavvb. blogspot. it/2011/09/notmnist-dataset. html, 2, 2011. [25] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4): 335, 2008. [26] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pages 39–57. IEEE, 2017. [27] Nicolo Cesa-Bianchi and G´ abor Lugosi. Prediction, learning, and games. Cambridge university press, 2006. [28] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient hamiltonian monte carlo. 
In International Conference on Machine Learning, pages 1683–1691, 2014. [29] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. arXiv preprint arXiv:1610.08637, 2016. 99 [30] Yutian Chen, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Timothy P Lillicrap, and Nando de Freitas. Learning to learn for global optimization of black box functions. arXiv preprint arXiv:1611.03824, 2016. [31] Chih-Chieh Cheng and B. Kingsbury. Arccosine kernels: Acoustic modeling with infi- nite neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5200–5203, 2011. [32] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State- of-the-art speech recognition with sequence-to-sequence models. In ICASSP, pages 4774– 4778. IEEE, 2018. [33] Jaejin Cho, Raghavendra Pappagari, Purva Kulkarni, Jes´ us Villalba, Yishay Carmiel, and Najim Dehak. Deep neural networks for emotion recognition combining audio and tran- scripts. In Interspeech, pages 247–251, 2018. [34] K. Cho, A. Ilin, and T. Raiko. Improved learning of gaussian-bernoulli restricted boltz- mann machines. In Proceedings of the International Conference on Artificial Neural Net- works (ICANN 2011), pages 10–17, 2011. [35] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009. [36] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algo- rithm. ACM Trans. Algorithms, 6(4):63:1–63:30, 2010. [37] R Dennis Cook and Sanford Weisberg. Residuals and influence in regression. New York: Chapman and Hall, 1982. [38] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combi- nations of kernels. In Advances in neural information processing systems, pages 396–404, 2009. [39] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Interna- tional Conference on Algorithmic Learning Theory, pages 67–82. Springer, 2016. [40] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Informa- tion Processing Systems, pages 3041–3049, 2014. [41] Jean Daunizeau. Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables. arXiv preprint arXiv:1703.00091, 2017. [42] A Philip Dawid. The well-calibrated bayesian. Journal of the American Statistical Asso- ciation, 77(379):605–610, 1982. [43] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian q-learning. In AAAI/IAAI, pages 761–768, 1998. 100 [44] Richard Dearden, Nir Friedman, and David Andre. Model based bayesian exploration. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 150–159. Morgan Kaufmann Publishers Inc., 1999. [45] Dennis DeCoste and Bernhard Sch¨ olkopf. Training invariant support vector machines. Mach. Learn., 46:161–190, 2002. [46] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. [47] Li Deng, G¨ okhan T¨ ur, Xiaodong He, and Dilek Z. Hakkani-T¨ ur. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. 
In 2012 IEEE Spo- ken Language Technology Workshop (SLT), Miami, FL, USA, December 2-5, 2012, pages 210–215, 2012. [48] Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1192–1201, 2018. [49] Elvis Dohmatob. Limitations of adversarial robustness: strong no free lunch theorem. arXiv preprint arXiv:1810.04065, 2018. [50] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pages 569–593. Springer, 1992. [51] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics, pages 586–596, 1981. [52] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. arXiv preprint arXiv:1712.02779, 2017. [53] Sefik Emre Eskimez, Zhiyao Duan, and Wendi Heinzelman. Unsupervised learning ap- proach to feature analysis for automatic speech emotion recognition. In ICASSP, pages 5099–5103. IEEE, 2018. [54] Stefan Falkner, Aaron Klein, and Frank Hutter. Bohb: Robust and efficient hyperparameter optimization at scale. arXiv preprint arXiv:1807.01774, 2018. [55] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any clas- sifier. In Advances in Neural Information Processing Systems, pages 1178–1187, 2018. [56] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pages 2962–2970, 2015. [57] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. arXiv preprint arXiv:1703.01785, 2017. 101 [58] Luca Franceschi, Paolo Frasconi, Saverio Salzo, and Massimilano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018. [59] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 2016. [60] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016. [61] Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. In Advances in Neural Infor- mation Processing Systems, pages 3581–3590, 2017. [62] Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in neural information processing systems, pages 4885–4894, 2017. [63] Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Boosting uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018. [64] Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian re- inforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6): 359–483, 2015. [65] Ryan Giordano, Will Stephenson, Runjing Liu, Michael I Jordan, and Tamara Broderick. A swiss army infinitesimal jackknife. arXiv preprint arXiv:1806.00550, 2018. [66] Ryan Giordano, Michael I Jordan, and Tamara Broderick. A higher-order swiss army infinitesimal jackknife. arXiv preprint arXiv:1907.12116, 2019. [67] John J Godfrey and Edward Holliman. Switchboard-1 release 2. Linguistic Data Consor- tium, Philadelphia, 926:927, 1997. [68] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 
arXiv preprint arXiv:1412.6572, 2014. [69] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012. [70] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, pages 6645–6649. IEEE, 2013. [71] Yue Gu, Kangning Yang, Shiyu Fu, Shuhong Chen, Xinyu Li, and Ivan Marsic. Multi- modal affective analysis using hierarchical attention strategy with word-level alignment. In Proc. ACL (Volume 1: Long Papers), pages 2225–2235, 2018. [72] Arthur Guez, David Silver, and Peter Dayan. Efficient bayes-adaptive reinforcement learn- ing using sample-based search. In Advances in Neural Information Processing Systems, pages 1025–1033, 2012. [73] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017. 102 [74] Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018. [75] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In International Conference on Machine Learning, pages 19–27, 2014. [76] Leonard Hasenclever, Stefan Webb, Thibaut Lienart, Sebastian V ollmer, Balaji Laksh- minarayanan, Charles Blundell, and Yee Whye Teh. Distributed bayesian learning with stochastic natural gradient expectation propagation and the posterior server. The Journal of Machine Learning Research, 18(1):3744–3780, 2017. [77] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: a spectral approach. arXiv preprint arXiv:1706.00764, 2017. [78] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [79] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end- to-end speech recognition for mobile devices. In ICASSP, pages 6381–6385. IEEE, 2019. [80] Yotam Hechtlinger, Barnab´ as P´ oczos, and Larry Wasserman. Cautious deep learning. arXiv preprint arXiv:1805.09460, 2018. [81] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to com- mon corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019. [82] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of- distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016. [83] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. arXiv preprint arXiv:1901.09960, 2019. [84] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self- supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pages 15637–15648, 2019. [85] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lak- shminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019. [86] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019. [87] Jose Hernandez-Lobato, Yingzhen Li, Mark Rowland, Thang Bui, Daniel Hern´ andez- Lobato, and Richard Turner. Black-box alpha divergence minimization. 
In International Conference on Machine Learning, pages 1511–1520, 2016. 103 [88] Jos´ e Miguel Hern´ andez-Lobato and Ryan Adams. Probabilistic backpropagation for scal- able learning of bayesian neural networks. In International Conference on Machine Learn- ing, pages 1861–1869, 2015. [89] G. E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002. [90] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kings- bury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012. [91] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neual Comp., 18(7):1527–1554, 2006. [92] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detec- tors. arXiv preprint arXiv:1207.0580, 2012. [93] Ronald A Howard. Information value theory. IEEE Transactions on systems science and cybernetics, 2(1):22–26, 1966. [94] Jiri Hron, Alexander G de G Matthews, and Zoubin Ghahramani. Variational bayesian dropout: pitfalls and fixes. arXiv preprint arXiv:1807.01969, 2018. [95] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhad- ran. Kernel methods match deep neural networks on timit. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 205–209. IEEE, 2014. [96] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [97] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018. [98] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017. [99] Louis A Jaeckel. The infinitesimal jackknife. Bell Telephone Laboratories, 1972. [100] Kevin Jamieson and Ameet Talwalkar. Non-stochastic best arm identification and hyper- parameter optimization. In Artificial Intelligence and Statistics, pages 240–248, 2016. [101] Melih Kandemir, Manuel Haussmann, and Fred A Hamprecht. Sampling-free variational inference of bayesian neural nets. arXiv preprint arXiv:1805.07654, 2018. [102] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In Artificial Intelligence and Statistics, pages 583–591, 2012. 104 [103] Mohammad Emtiyaz Khan, Didrik Nielsen, V oot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018. [104] Eesung Kim and Jong Won Shin. Dnn-based emotion recognition based on bottleneck acoustic features and lexical features. In ICASSP, pages 6720–6724. IEEE, 2019. [105] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [106] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [107] Diederik P Kingma, Tim Salimans, and Max Welling. 
Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015. [108] Brian Kingsbury, Jia Cui, Xiaodong Cui, Mark JF Gales, Kate Knill, Jonathan Mamou, Lidia Mangu, David Nolden, Michael Picheny, Bhuvana Ramabhadran, et al. A high- performance cantonese keyword search system. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8277–8281. IEEE, 2013. [109] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artifi- cial Intelligence and Statistics, pages 528–536, 2017. [110] Aaron Klein, Stefan Falkner, Jost Tobias Springenberg, and Frank Hutter. Learning curve prediction with bayesian neural networks. Proc. of ICLR, 17, 2017. [111] Marius Kloft, Ulf Brefeld, S¨ oren Sonnenburg, and Alexander Zien. l p -norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, 2011. [112] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence func- tions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1885–1894. JMLR. org, 2017. [113] Zico Kolter and Aleksander Madry. Adversarial robustness: Theory and practice. In NeurIPS 2018 Tutorial, 2018. [114] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in relu networks. arXiv preprint arXiv:2002.10118, 2020. [115] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Tront, 2009. [116] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 105 [117] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. [118] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017. [119] V olodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. arXiv preprint arXiv:1807.00263, 2018. [120] Ananya Kumar, Percy S Liang, and Tengyu Ma. Verified uncertainty calibration. In Ad- vances in Neural Information Processing Systems, pages 3787–3798, 2019. [121] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learn- ing, pages 2810–2819, 2018. [122] Egor Lakomkin, Mohammad Ali Zamani, Cornelius Weber, Sven Magg, and Stefan Wermter. Incorporating end-to-end speech recognition models for sentiment analysis. arXiv preprint arXiv:1902.11245, 2019. [123] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Informa- tion Processing Systems, pages 6405–6416, 2017. [124] Remi Lam, Karen Willcox, and David H Wolpert. Bayesian optimization with a finite bud- get: An approximate dynamic programming approach. In Advances in Neural Information Processing Systems, pages 883–891, 2016. [125] Gert R. G. Lanckriet, Nello Cristianini, Peter L. Bartlett, Laurent El Ghaoui, and Michael I. Jordan. 
Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004. [126] Quoc Le, Tam´ as Sarl´ os, and Alex Smola. Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, vol- ume 85, 2013. [127] Y . LeCun and C. Cortes. The mnist database of handwritten digits, 1998. [128] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017. [129] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7165–7175, 2018. [130] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. arXiv preprint arXiv:1603.06560, 2016. 106 [131] Pengcheng Li, Yan Song, Ian Vince McLoughlin, Wu Guo, and Lirong Dai. An attention pooling based representation learning method for speech emotion recognition. In Inter- speech, pages 3087–3091, 2018. [132] Runnan Li, Zhiyong Wu, Jia Jia, Sheng Zhao, and Helen Meng. Dilated residual network with multi-head self-attention for speech emotion recognition. In ICASSP, pages 6675– 6679. IEEE, 2019. [133] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017. [134] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017. [135] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378– 2386, 2016. [136] Ga¨ elle Loosli, St´ ephane Canu, and L´ eon Bottou. Training invariant support vector ma- chines using selective sampling. In Bottou et al. [20]. [137] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016. [138] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017. [139] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015. [140] David JC MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992. [141] David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003. [142] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learn- ing, pages 2113–2122, 2015. [143] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pages 13132–13143, 2019. [144] David Madras, James Atwood, and Alex D’Amour. Detecting extrapolation with local ensembles. arXiv preprint arXiv:1910.09573, 2019. 107 [145] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 
Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. [146] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. arXiv preprint arXiv:1802.10501, 2018. [147] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. The Journal of Machine Learning Research, 18(1): 4873–4907, 2017. [148] John Markoff. A learning advance in artificial intelligence rivals human abilities. New York Times, 10, 2015. [149] James Martens. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010. [150] Dimitrios Milios, Raffaello Camoriano, Pietro Michiardi, Lorenzo Rosasco, and Maur- izio Filippone. Dirichlet-based gaussian processes for large-scale calibrated classification. arXiv preprint arXiv:1805.10915, 2018. [151] Rupert G Miller. The jackknife-a review. Biometrika, 61(1):1–15, 1974. [152] Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Moham- mad Emtiyaz Khan. Slang: Fast structured covariance approximations for bayesian deep learning with natural gradient. In Advances in Neural Information Processing Systems, pages 6245–6255, 2018. [153] Abdel-rahman Mohamed, George Dahl, , and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20 (1):14–22, 2012. [154] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017. [155] Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J V ollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient langevin dynam- ics. arXiv preprint arXiv:1706.02692, 2017. [156] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshmi- narayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018. [157] R. Neal. Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 1994. [158] Radford M Neal. Bayesian learning via stochastic dynamics. In Advances in neural infor- mation processing systems, pages 475–482, 1993. [159] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996. 108 [160] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 01 2011. [161] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with super- vised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632. ACM, 2005. [162] Robert Nishihara, David Lopez-Paz, and L´ eon Bottou. No regret bound for extreme ban- dits. In AISTATS, pages 259–267, 2016. [163] Jeremy Nixon, Mike Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Mea- suring calibration in deep learning. arXiv preprint arXiv:1904.01685, 2019. [164] Brendan O’Donoghue, Ian Osband, Remi Munos, and V olodymyr Mnih. The uncertainty bellman equation and exploration. arXiv preprint arXiv:1709.05380, 2017. [165] United States. Executive Office of the President and John Podesta. Big data: Seizing opportunities, preserving values. White House, Executive Office of the President, 2014. [166] Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. 
In Proceedings of the NIPS* 2016 Workshop on Bayesian Deep Learning, 2016. [167] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026– 4034, 2016. [168] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. arXiv preprint arXiv:1806.03335, 2018. [169] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pages 372–387. IEEE, 2016. [170] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017. [171] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019. [172] Nick Pawlowski, Andrew Brock, Matthew CH Lee, Martin Rajchl, and Ben Glocker. Im- plicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297, 2017. [173] Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1): 147–160, 1994. [174] Sundar Pichai. Making ai work for everyone. https://blog.google/ technology/ai/making-ai-work-for-everyone/, 2017. 109 [175] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999. [176] John C. Platt. Fast training of support vector machines using sequential minimal optimiza- tion. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998. [177] Rohit Prabhavalkar, Kanishka Rao, Tara N Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly. A comparison of sequence-to-sequence models for speech recognition. In Inter- speech, pages 939–943, 2017. [178] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008. [179] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing min- imization with randomization in learning. In Advances in neural information processing systems, pages 1313–1320, 2009. [180] T. Raiko, H. Valpola, and Y . LeCun. Deep learning made easier by linear transformations in perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932, 2012. [181] Kanishka Rao, Has ¸im Sak, and Rohit Prabhavalkar. Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 193–199. IEEE, 2017. [182] Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004. [183] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811, 2019. [184] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y . Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011. 
Abstract
Deep neural nets have achieved great success on a wide range of machine learning problems across domains such as speech, image, and text. Despite strong predictive performance, there are rising concerns about deploying these "in-the-lab" machine learning models widely in the wild. In this thesis, we study two of the main challenges in deep learning: efficiency, both computational and statistical, and robustness. We describe a set of techniques that address these challenges by intelligently using information from the training process, going beyond the common recipe of a single point estimate of the optimal model.

The first part of the thesis studies the efficiency challenge. We propose a budgeted hyper-parameter tuning algorithm to improve the computational efficiency of hyper-parameter tuning in deep learning. It estimates the trend of training curves and uses that trend to adaptively allocate resources for tuning, improving efficiency over state-of-the-art tuning algorithms (a toy sketch of curve-based budget allocation follows this abstract). We then study statistical efficiency on tasks with limited labeled data, focusing on speech sentiment analysis. We pre-train on automatic speech recognition data and solve sentiment analysis as a downstream task, which greatly improves the data efficiency of sentiment labels.

The second part of the thesis studies the robustness challenge. Motivated by resampling methods in statistics, we propose the mean-field infinitesimal jackknife (mfIJ): a simple, efficient, and general-purpose plug-in estimator for uncertainty estimation. The main idea is to approximate the ensemble of infinitely many infinitesimal-jackknife samples with a closed-form Gaussian distribution, and to derive an efficient mean-field approximation to the classification predictions when the softmax layer's inputs are sampled from that Gaussian (a numerical sketch of this approximation also appears below). mfIJ improves the model's robustness against invalid inputs by leveraging the curvature information of the training loss.
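To make the budgeted-tuning idea concrete, here is a minimal Python sketch, assuming a simple power-law model of training curves. The function names (predict_final_loss, allocate_budget), the power-law form, and the greedy one-epoch-at-a-time allocation rule are all illustrative assumptions for this sketch, not the algorithm developed in the thesis.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(t, a, b, c):
        # Model the loss at epoch t as a + b * t**(-c), a common
        # parametric form for training curves (an assumption here).
        return a + b * np.power(t, -c)

    def predict_final_loss(losses, horizon):
        # Fit the power law to the partial curve observed so far and
        # extrapolate to the full training horizon. Assumes the curve
        # already contains at least a few observed epochs.
        t = np.arange(1, len(losses) + 1, dtype=float)
        try:
            params, _ = curve_fit(power_law, t, losses,
                                  p0=(losses[-1], 1.0, 0.5), maxfev=5000)
            return power_law(horizon, *params)
        except RuntimeError:
            return losses[-1]  # fall back if the fit does not converge

    def allocate_budget(curves, train_one_epoch, budget, horizon):
        # Greedy allocation: repeatedly spend one epoch of budget on the
        # configuration whose extrapolated final loss looks most promising.
        # curves[i] is the list of losses observed so far for config i;
        # train_one_epoch(i) trains config i one more epoch and returns
        # its new loss.
        while budget > 0:
            preds = [predict_final_loss(c, horizon) for c in curves]
            best = int(np.argmin(preds))
            curves[best].append(train_one_epoch(best))
            budget -= 1
        # Return the config with the best loss observed so far.
        return int(np.argmin([c[-1] for c in curves]))

The point of the sketch is the interface: extrapolating partial curves lets the tuner stop spending epochs on configurations whose predicted endpoint is clearly dominated, rather than training every candidate to completion.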
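To illustrate the mean-field idea numerically, the following numpy sketch contrasts Monte Carlo averaging of softmax predictions, when the logits are drawn from a Gaussian, with a deterministic probit-style approximation. The pi/8 scaling is a standard approximation borrowed from the Gaussian-sigmoid literature and is an assumption of this sketch; the thesis derives its own mean-field form for the softmax, which may differ.

    import numpy as np

    def mc_softmax_expectation(mu, var, n_samples=1000, rng=None):
        # Monte Carlo baseline: draw logit vectors z ~ N(mu, diag(var)),
        # push each through the softmax, and average the probabilities.
        rng = np.random.default_rng() if rng is None else rng
        z = rng.normal(mu, np.sqrt(var), size=(n_samples, len(mu)))
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

    def mean_field_softmax(mu, var):
        # Deterministic alternative: shrink each logit toward zero in
        # proportion to its variance before a single softmax pass
        # (probit-style correction; an assumption, see the lead-in).
        scaled = mu / np.sqrt(1.0 + np.pi * var / 8.0)
        e = np.exp(scaled - scaled.max())
        return e / e.sum()

    # Example: three-class logits with per-class Gaussian uncertainty.
    mu = np.array([2.0, 0.5, -1.0])
    var = np.array([1.0, 4.0, 0.25])
    print(mc_softmax_expectation(mu, var, n_samples=100000))
    print(mean_field_softmax(mu, var))

Both functions map a Gaussian over the logits to a single predictive distribution, but the deterministic version avoids sampling entirely, which is what makes a mean-field-style approximation cheap enough to use as a plug-in at prediction time.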
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Modeling, learning, and leveraging similarity
Behavior understanding from speech under constrained conditions: exploring sparse networks, transfer and unsupervised learning
Deep learning models for temporal data in health care
3D deep learning for perception and modeling
Visual knowledge transfer with deep learning techniques
Labeling cost reduction techniques for deep learning: methodologies and applications
Robust and adaptive online reinforcement learning
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Identifying and leveraging structure in complex cooperative tasks for multi-agent reinforcement learning
Interactive learning: a general framework and various applications
Quickly solving new tasks, with meta-learning and without
Deep representations for shapes, structures and motion
Building straggler-resilient and private machine learning systems in the cloud
A function approximation view of database operations for efficient, accurate, privacy-preserving & robust query answering with theoretical guarantees
Invariant representation learning for robust and fair predictions
Improving machine learning algorithms via efficient data relevance discovery
Green learning for 3D point cloud data processing
Multimodal representation learning of affective behavior
Scalable optimization for trustworthy AI: robust and fair machine learning
Asset Metadata
Creator: Lu, Zhiyun (author)
Core Title: Leveraging training information for efficient and robust deep learning
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 08/08/2020
Defense Date: 05/20/2020
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tag: deep learning, hyper-parameter optimization, OAI-PMH Harvest, uncertainty estimation
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Sha, Fei (committee chair), Kuo, C.-C. Jay (committee member), Luo, Haipeng (committee member)
Creator Email: zhiyunlu.is.alive@gmail.com, zhiyunlu@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-363694
Unique identifier: UC11666357
Identifier: etd-LuZhiyun-8904.pdf (filename), usctheses-c89-363694 (legacy record id)
Legacy Identifier: etd-LuZhiyun-8904.pdf
Dmrecord: 363694
Document Type: Dissertation
Rights: Lu, Zhiyun
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA