ALGORITHMS AND FRAMEWORKS FOR GENERATING NEURAL NETWORK MODELS ADDRESSING ENERGY-EFFICIENCY, ROBUSTNESS, AND PRIVACY

by

Souvik Kundu

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Souvik Kundu

Epigraph

Success is not final, failure is not fatal, it is the courage to continue that counts.
– Winston Churchill

Dedicated to my family.

Acknowledgements

Pursuing a Ph.D. is a beautiful and adventurous journey, since it provides the opportunity to explore uncharted territory instead of being burdened by the constraints of banality. In this regard, I would like to thank my co-advisors – Professor Peter A. Beerel and Professor Massoud Pedram – without whose help, support, and guidance I would not have been able to explore the research directions that have led to the material in this dissertation.

I am grateful to past and present members of my research team who have helped and collaborated with me in my research. Special thanks to my primary collaborators – Gourav Datta, Saurav Prakash, Arash Fayyazi, Digbalay Bose, Sairam Sundaresan (Intel), Hesham Mostafa (Intel), and Mahdi Nazemi. I also acknowledge the contributions of secondary collaborators, a few of whom have since moved on – Arnab Sanyal, Sourya Dey, Yang Hu, Connor Imes, Qirui Sun, Shikai Wang, Guowei Yang, Yao Fu, Bill Ye, Shunlin Lu, Jacqueline Liu, Jiaqi Liu, and Ryan Feng – and current research team members whose constant feedback has been invaluable – Dr. Leana Golubchik, Dr. Ajey Jacob, Dr. Akhilesh R. Jaiswal, Zihan Yin, Haonan Wang, Dr. Stephen P. Crago, Dr. John Paul N. Walters, Marco Paolieri, Yuke Zhang, Matthew Conn, Robert Aviles, and Dr. Andrew Schmidt. I am also indebted to Dr. Salman Avestimehr, Dr. Yanzhi Wang (Northeastern University), Dr. Jonathan Gratch, Dr. Murali Annavaram, Dr. Keith M. Chugg, Anthony Sarah (Intel), Sharath Nittur Sridhar (Intel), Subhajit Dutta Chowdhury, and Rajrup Ghosh for help in specific efforts.

I would like to thank the agencies that have funded my research and helped to pay the bills – National Science Foundation Software and Hardware Foundations (NSF SHF) Grant #1763747, Defense Advanced Research Projects Agency (DARPA) In-Pixel Intelligent Processing (IP2) Artificial Intelligence Exploration (AIE) Opportunity contract number #HR00112190120, and Defense Threat Reduction Agency (DTRA) in association with the Scalable Acceleration Platform Integrating Reconfigurable Computing and Natural Language Processing Technologies (SAPIENT) team and the University of Southern California Information Sciences Institute (USC ISI). I would also like to thank the USC Graduate School for selecting me as one of the Annenberg fellows during my Ph.D. program. I am also indebted to Diane Demetras and Annie Yu for their help in administrative matters related to the progress of my Ph.D. and the presentation of my research to the outside world.

Heartfelt thanks to my family members back in India for their constant love and support, despite being located halfway around the world. I am eternally grateful to my lovely wife Chandani, who has been a beacon of light and has always stayed beside me, supported me, motivated me, inspired me, and strengthened me throughout this journey. Nothing would have been possible without them.

Finally, thanks to you, the reader, for picking up this dissertation detailing my research efforts since spring 2019.
I hope you have as much enjoyment reading it as I had writing it. Author’s note: This dissertation is being completed while the COVID-19 pan- demic and political instability in Europe have gripped the world. I wish a safe and healthy livelihood of all the readers and earth’s future generations. v This doctoral thesis has been examined by a committee of the Department of Electrical and Computer Engineering and Department of Computer Science. Dr. Peter A. Beerel Co-chair, thesis committee Ming Hsieh Department of Electrical and Computer Engineering University of Southern California Dr. Massoud Pedram Co-chair, thesis committee Ming Hsieh Department of Electrical and Computer Engineering University of Southern California Dr. Salman Avestimehr Member, thesis committee Ming Hsieh Department of Electrical and Computer Engineering University of Southern California Dr. Leana Golubchik Member, thesis committee Department of Computer Science University of Southern California vi Table of Contents Epigraph ii Acknowledgements iv List of Figures xi List of Tables xix Abstract xxii Chapter 1: Introduction 1 1.1 Deep Neural Networks Basics . . . . . . . . . . . . . . . . . . . . . 1 1.2 Thesis Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Thesis Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Part I: Methods to Generate Energy-Efficient Models 16 Chapter 2: Pre-Defined Sparsity in Convolution Operation 17 2.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 19 2.2.1 Pre-defined Computationally Limited Filters . . . . . . . . . 20 2.2.2 Sparse Matrix Storage Formats . . . . . . . . . . . . . . . . 21 2.3 pSConv: Pre-defined Sparse Kernels . . . . . . . . . . . . . . . . . . 23 2.4 CNN with 3D Filters having Periodically Repeating Patterns . . . . 24 2.5 FLOPs and Storage Analysis . . . . . . . . . . . . . . . . . . . . . . 27 2.5.1 FLOPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.5.2 Impact on Storage . . . . . . . . . . . . . . . . . . . . . . . 29 2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.6.1 Performance Comparison with ShuffleNet and MobileNetV2 44 2.6.2 Performance Evaluation on Networks Models with Scaled Down Width . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 vii Chapter 3: Layer Sensitivity Driven Mixed-Precision Quantization 49 3.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 49 3.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 51 3.2.1 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2.2 PACT Non-Linearity Function for Activation. . . . . . . . . 52 3.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.2 Loss Bit Gradient . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.3 ILP-Driven Iterative Bit-Width Assignment . . . . . . . . . 55 3.3.4 BMPQ Training . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 
58 3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.4.3 Comparison with Single-Shot Training . . . . . . . . . . . . 60 3.4.4 BMPQ Generated Models as Teachers . . . . . . . . . . . . 60 3.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Chapter 4: Model Compression for Spiking Neural Networks 64 4.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 66 4.2.1 SNN Fundamentals . . . . . . . . . . . . . . . . . . . . . . . 66 4.2.2 ANN-to-SNN Conversion . . . . . . . . . . . . . . . . . . . . 67 4.2.3 Model Compression in SNNs . . . . . . . . . . . . . . . . . . 68 4.3 Hybrid Sparse Learning (SL) of SNNs . . . . . . . . . . . . . . . . . 69 4.3.1 Attention-guided Compression (AGC) . . . . . . . . . . . . 71 4.3.2 Sparse Learning based SNN Training . . . . . . . . . . . . . 73 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.4.1 Results with AGC . . . . . . . . . . . . . . . . . . . . . . . 79 4.4.2 Analysis of Energy Consumption . . . . . . . . . . . . . . . 81 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Part II: Robustness of Compressed Energy-Efficient Models 86 Chapter 5: Efficient Training for Robust Yet Pruned Models 87 5.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 87 5.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 89 5.2.1 Adversarial Attacks on DNNs . . . . . . . . . . . . . . . . . 89 5.2.2 Model Compression . . . . . . . . . . . . . . . . . . . . . . . 90 5.3 Dynamic Network Rewiring (DNR) . . . . . . . . . . . . . . . . . . 92 5.3.1 Dynamic Regularizer . . . . . . . . . . . . . . . . . . . . . . 92 5.3.2 Hybrid Loss Function . . . . . . . . . . . . . . . . . . . . . . 93 5.3.3 Support for Channel Pruning . . . . . . . . . . . . . . . . . 94 viii 5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.4.1 Results with DNR . . . . . . . . . . . . . . . . . . . . . . . 98 5.4.2 Pruning to Classify Clean-only Images . . . . . . . . . . . . 101 5.4.3 Generalized Robustness Against PGD Attack of Different Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Chapter 6: A Fast Learnable Once-for-All Adversarial Training 105 6.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 105 6.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 108 6.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.2.2 Robust Model Training . . . . . . . . . . . . . . . . . . . . . 108 6.2.3 Conditional Learning . . . . . . . . . . . . . . . . . . . . . . 109 6.3 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.3.1 FLOAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.3.2 Extension to Model Compression via Pruning . . . . . . . . 113 6.4 Experimental Results and Analysis . . . . . . . . . . . . . . . . . . 116 6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 116 6.4.2 Performance of FLOAT . . . . . . . . . . . . . . . . . . . . 117 6.4.3 Comparison with OAT and PGD-AT . . . . . . . . . . . . . 118 6.4.4 Performance of FLOATS . . . . . . . . . . . . . . . . . . . . 
123 6.4.5 Generalization on Various Perturbation Techniques . . . . . 123 6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 Chapter 7: SpikingNeuralNetworkRobustness: AnalysisandIm- provement 126 7.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 126 7.2 Initial Study: SNN Robustness Analysis . . . . . . . . . . . . . . . 129 7.2.1 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . 130 7.3 HIRE-SNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . 133 7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 136 7.4.2 Performance Against WB and BB Attacks . . . . . . . . . . 137 7.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 7.4.4 Computation Energy . . . . . . . . . . . . . . . . . . . . . . 145 7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 Part III: Vulnerability and Opportunities in Private Inference 147 Chapter 8: Reality Check of Model Privacy under Compression through Distillation 148 8.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . 148 8.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 151 ix 8.2.1 Knowledge Distillation . . . . . . . . . . . . . . . . . . . . . 151 8.2.2 Model IP protection . . . . . . . . . . . . . . . . . . . . . . 152 8.2.3 Poisoning of Neural Network Models . . . . . . . . . . . . . 152 8.3 Motivational Case Study . . . . . . . . . . . . . . . . . . . . . . . . 153 8.3.1 Transferability of the Impact of Nasty Teachers . . . . . . . 153 8.3.2 TransferringKnowledgetoaShallowSubsectionoftheStudent154 8.4 Skeptical Students . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 159 8.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 159 8.5.2 Data-available Distillation . . . . . . . . . . . . . . . . . . . 160 8.5.3 Qualitative Analysis . . . . . . . . . . . . . . . . . . . . . . 164 8.5.4 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . 166 8.5.5 Limited Data and Data-Free Distillation . . . . . . . . . . . 167 8.5.6 Transferability of Nastiness on Skeptical Students . . . . . . 169 8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 Chapter 9: GeneratingModelsforClient-ServerPrivateInference Framework: A Path Towards Security and Efficiency 170 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 9.2 Preliminaries and Related Work . . . . . . . . . . . . . . . . . . . . 173 9.3 MotivationalStudy: RelationbetweenReLUimportanceandPrun- ing Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 9.4 SENet Training Methodology . . . . . . . . . . . . . . . . . . . . . 177 9.4.1 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . 178 9.4.2 ReLU Mask Identification . . . . . . . . . . . . . . . . . . . 179 9.4.3 Maximizing Activation Similarity via Distillation . . . . . . 180 9.4.4 SENet++: Support for Ordered Channel Dropping . . . . . 181 9.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 9.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 182 9.5.2 SENet Results . . . . . . . . . . . . . . . . . . . . . . . . . . 184 9.5.3 SENet++ Results . . . . . . . . . . . . . . . . . . . . . . . . 
184 9.5.4 Analysis of Linear and ReLU Inference Latency . . . . . . . 185 9.5.5 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . 186 9.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 Chapter 10:Conclusions 188 10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 10.2 Closing Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 10.3 Funding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Bibliography 190 x List of Figures Figure 1.1 ACNNusedforimageclassification. Pooling,batch-normalization andnon-linearitylayersarenotexplicitlyshownforsimplicity. 3 Figure 1.2 Illustration of iterative LIF. Spike communication between neurons and The update of membrane potential according to Eq. 1.9 -Eq. 1.10. . . . . . . . . . . . . . . . . . . . . . 6 Figure 1.3 The evolution of the number of parameters of the SOTA models alongside various types of AI accelerators memory capacity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Figure 1.4 Two major variants of pruning of the convolutional layer weight tensor: 1.irregular and 2. structured. . . . . . . . . 9 Figure 1.5 An overview of the thesis organized in three parts. . . . . 15 Figure 2.1 Four major variants of convolutions: (a) standard fully connected convolution (SFCC), (b) depth-wise convolution (DWC), (c) group-wise convolution (GWC), and (d) point- wise convolution (PWC). . . . . . . . . . . . . . . . . . . . 19 Figure 2.2 (a)Anexampleofpre-definedsparsekernelswith8different kernelvariantseachhavingKSSof2. Thecoloredlocations in each 2D kernel are allowed to have non-zero weight values. 23 Figure 2.3 Regularsparsekernelbased4Dweighttensor. Inthefigure the 4D weight tensor has 4 different types of 2D kernel i.e. 4 different KVs (colored differently). . . . . . . . . . . . . . 25 Figure 2.4 Periodic insertion of FC 2D kernels between sparse kernels. 26 Figure 2.5 A 3D illustration of the change in R mob and R shuf as a function of the C o and P. Here we assumed G, k, and n to be 16, 3, and 1, respectively. . . . . . . . . . . . . . . 29 Figure 2.6 Illustration of how periodicity in a filter leads to repeating rows of sub-matrices of the filter’s flattened weight matrix. 32 xi Figure 2.7 Comparison of storage requirements of (a) various existing storage formats and (b) dense, CSR, and CSR P formats at differentlevelsofdensityforamatrixofsize32 × 12(b v =8, b c =b r =4, b i =7, and b P =6). . . . . . . . . . . . . . . . 33 Figure 2.8 (a), and(b)showsthetestaccuracyvs. epochsfor CIFAR- 10 dataset in different variants of VGG16 and ResNet18 models, respectively; (c), and (d) are plots of top 5 error ratevs. epochs forTinyImageNetdatasetin different vari- ants of VGG16 and ResNet18 models, respectively. The KSS for all the variants is 1. . . . . . . . . . . . . . . . . . 39 Figure 2.9 Test accuracy vs. FLOPs count plots for different datasets on different architectures: CIFAR-10 on (a) VGG16, (b) ResNet18variants;TinyImageNeton(c)VGG16,(d)ResNet18 variants. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Figure 2.10 Performancecomparisonofourproposedarchitecturesthat havesimilarorfewerFLOPsthanShuffleNetandMobileNetV2 with comparable or better classification accuracy on (a-b) CIFAR-10 and (c-d) Tiny ImageNet. . . . . . . . . . . . . . 45 Figure 2.11 Comparison of the number of model parameters of the net- work models described in Fig 2.10 for (a) CIFAR-10 and (b) Tiny ImageNet. . . . . . . . . . . . . . . . . . . . . . . 
46 Figure 2.12 PerformancecomparisonintermsoftestaccuracyandFLOPs of different squeezed (width multiplier 0.5) ResNet18 vari- ant models with MobileNetV2 (MobV2) having width mul- tiplier 1.0 and 0.75 on (a-b) CIFAR-10, and (c-d) Tiny Im- ageNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Figure 3.1 (a) Illustration of the proposed BMPQ method. (b) Il- lustration of superior performance of BMPQ compared to BNNs. Both methods use training from scratch. . . . . . . 51 Figure 3.2 Step-wise description of layer bit-width evaluation. . . . . . 56 Figure 3.3 (a), (b) Layer sensitivities based on ENBG for VGG16 on CIFAR10, during early and late phase of the training, re- spectively. ep i indicates that the normalization was per- formed after i th epoch. . . . . . . . . . . . . . . . . . . . . 60 Figure 3.4 EnergyconsumedbytheCONVlayersofVGG16onCIFAR- 10. Here, we add the E mem and E MAC values to get E total . . 62 Figure 4.1 Histogram of the gradients for CONV layer 7 of VGG16 withtargetdensityof0.4atanearlystageoftraining(after 10 epochs) to classify CIFAR-10. . . . . . . . . . . . . . . . 65 xii Figure 4.2 Two major training stages of the proposed scheme: (a) ANN training using attention-guided compression (AGC), (b)Sparse-learningbasedSNNtrainingusingsurrogategradient- based training. . . . . . . . . . . . . . . . . . . . . . . . . . 70 Figure 4.3 PlotoftestaccuracyversusepochsforResNet12onCIFAR- 10 for model compressed using AGC with VGG9 chosen as Ψ m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 4.4 Input rate-coded spike equivalent images for different num- ber of time steps T. . . . . . . . . . . . . . . . . . . . . . . 76 Figure 4.5 A ResNet basic block layer for target density (a) d = 1.0 (left) and d = 0.1 (right). . . . . . . . . . . . . . . . . . . . 78 Figure 4.6 Plot of test accuracy versus epochs with different target densities for (a) ResNet12 on CIFAR-10, (b) VGG11 on CIFAR-100, and (c) VGG16 on Tiny-ImageNet. The SNN training is done at reduced time steps. . . . . . . . . . . . . 81 Figure 4.7 Average spiking activity generated at each layer of VGG16 while classifying over the test set of CIFAR-10 for model having parameter density (d) of 1.0 and 0.03. . . . . . . . . 82 xiii Figure 4.8 Comparison of ANN to SNN in terms of (a-c)FLOPs and (d-f)normalized compute energy for VGG16 with different parameterdensitytoclassify(a,d)CIFAR-10,(b,e)CIFAR- 100, and (c,f) Tiny-ImageNet. . . . . . . . . . . . . . . . . 83 Figure 5.1 (a) Weight distribution of the 14 th convolution layer of ResNet18modelfordifferenttrainingschemes: normal, ad- versarial [1], and noisy adversarial [2]. (b) An adversarially generatedimage(ˆ x)obtainedthroughFGSMattack,which is predicted to be the number 5 instead of 8 (x). . . . . . . 89 Figure 5.2 (a)Traininglossvs. epochsand(b)Pruningsensitivityper layer for VGG16 on CIFAR-10. . . . . . . . . . . . . . . . . 95 Figure 5.3 Model compression vs. accuracy (on both clean and adver- sariallygeneratedimages)forirregularandchannelpruning evaluated with VGG16 on CIFAR-10 (a-b) and ResNet18 on CIFAR-100 (c-d). (e-f) Comparison of channel pruning with irregular pruning in terms of % of channels present. Note that the % of channels present correlates with infer- ence time [3,4]. . . . . . . . . . . . . . . . . . . . . . . . . 
100 Figure 5.4 On CIFAR-10, the perturbed data accuracy of ResNet18 underPGDattackversusincreasing(a),(c)attackiteration and (b), (d) attack bound ϵ for irregular (5% density), and channel pruned (50% density) models, respectively. . . . . . 104 Figure 6.1 Normalized memory vs. Test accuracy for FLOAT and FLOAT with irregular sparsity (FLOATS-i) compared to theexistingstate-of-the-artOATfor(a)ResNet34,(b)WRN16- 8, and (c) WRN40-2, respectively. CA and RA repre- sent clean-image classification accuracy and robust accu- racy (accuracy on adversarial images), respectively. For each model we normalized the memory requirement with the maximum memory needed to store corresponding model.108 Figure 6.2 Impact of various training λ choices on the conditionally trained OAT. During testing we use S λ =[0,0.2,0.7,1.0]. . 110 Figure 6.3 Comparison of a conditional layer between existing FiLM based approach in OAT (left) and proposed approach in FLOAT (right). . . . . . . . . . . . . . . . . . . . . . . . . 111 Figure 6.4 Post-trainingmodelperformanceonbothcleanandgradient- based attack-generated adversarial images, with different noise re-scaling factor λ n . . . . . . . . . . . . . . . . . . . . 113 xiv Figure 6.5 (a) Comparison of channel density (weights plotted in abs. magnitude)forFLOATSirregularandchannel, forthe29 th CONV layer of WRN40-2 on STL10 while both are trained for d = 0.3. (b) Convolutional layer operation path for FLOATS slim. Note, the switchable BNs correspond to BNs for each SF.. . . . . . . . . . . . . . . . . . . . . . . . 115 Figure 6.6 Performance of FLOAT on (a) CIFAR-10, (b) STL10, (c) SVHN, (d) CIFAR-100, and (e) Tiny-ImageNet with var- ious λ n values sampled from S λ n for two different λ th for BN C to BN A switching. The numbers in the bracket cor- respondsto(CA,RA)fortheboundaryconditionsofλ =0 and λ = 1. λ n varies from largest to smallest value from top-left to bottom-right. . . . . . . . . . . . . . . . . . . . 119 Figure 6.7 Performance comparison of FLOAT with OAT and PGD- AT generated models on (a) CIFAR10, (b) SVHN, and (c) STL10. λ varies from largest to smallest value in S λ for the points from top-left to bottom-right. . . . . . . . . . . . . . 120 Figure 6.8 ComparisonofFLOATwithOATandPGD-ATintermsof (a) normalized training time per epoch and (b) model pa- rameterstorage(neglectingthestoragecostfortheBNand α ) (c) CONV layer compute delay on conventional ASIC (using the delay model of Eq. 7, 8, and 9) architecture [5] evaluated on ResNet34 for CIFAR-10. Note here, PGD- AT:1T yields 1 model for a specific λ choice. . . . . . . . . 121 Figure 6.9 PerformancecomparisonofFLOATslim, FLOATS(-i)slim with OAT slim. We used ResNet34 on CIFAR-10 to evalu- ate the performance.. . . . . . . . . . . . . . . . . . . . . . 122 xv Figure 6.10 PerformancecomparisonofFLOATwithOATon(a)PGD- 20and(b)FGSMattackgeneratedimages. (c)CA-RAplot of FLOAT vs. PGD-AT on autoattack. All evaluations are done with ResNet34 on CIFAR-10. λ varies from largest to smallest value in S λ for the points from top-left to bottom- right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 Figure 7.1 (a) Direct and rate-coded input variants of the original im- age. (b) Layer wise average spikes for VGG11. (c) Perfor- manceofdirect-inputVGG11SNNanditsequivalentANN undervariouswhite-box(WB)andblack-box(BB)attacks. Both the evaluations are done on CIFAR-100.. . . . . . . . 128 Figure 7.2 Per layer TASAs of VGG5 and VGG11 on CIFAR-10 and CIFAR-100, respectively. . . . . . . . . . . . . . . . . 
. . . 131 Figure 7.3 ClassificationperformanceofVGG11onCIFAR-100,under (a) white-box and (b) black-box attacks as number of time steps T varies. . . . . . . . . . . . . . . . . . . . . . . . . . 132 Figure 7.4 (a) PD vs LIF leak parameter for a fixed threshold (0.8) and latency (T = 10) averaged over two randomly chosen input images that are perturbed with PGD-1. (b) Interme- diatelayerspikePDforVGG5fedwitharandomly-selected CIFAR-10 clean image and its perturbed variant. . . . . . . 132 Figure 7.5 Traditional and proposed training schemes, respectively. Herethegreenandorangeblocksrepresentactivationmaps and the gradients that are generated after passing the in- put image. For the proposed training scheme we use two color variants deep and light, respectively, to highlight the sets of activation maps and gradients from an image and its noisy variant during two different periods. The yellow blocks represent the weight tensors that get updated from accumulated gradients. In proposed, we compute the in- put gradient with these updated weights to craft the noise. Here, we assumed T =4 andN =2. . . . . . . . . . . . . . 134 Figure 7.6 (a) Normalized GPU memory usage and (b) average train- ing time for a batch of 200 images for VGG5, VGG11, and ResNet12 when trained with the traditional and proposed approaches.. . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Figure 7.7 White-box PGD attack performance as a function of (a) boundϵ and(b)attackiterationsK withVGG5onCIFAR- 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 Figure 7.8 Comparison of traditional SNN vs. proposed training with both GN and crafted input noise. Training were performed with direct-input VGG11 on CIFAR-100. . . . . . . . . . . 142 xvi Figure 7.9 (a) Inference T steps for rate-coded vs direct input trained SNNs, (b-e) Accuracy vs. ϵ s plot for both clean and adver- sarially generated images (both with WB and BB attack settings) with VGG5 (b, c) and VGG11 (d, e) on CIFAR- 10 and CIFAR-100, respectively. . . . . . . . . . . . . . . . 143 Figure 7.10 Comparison of normalized compute energy computed as- suming (a) 32-bit FP and (b) 32-bit INT implementations. 145 Figure 8.1 Distillation from a nasty ResNet50 to (a) normal students, (b)proposedskepticalstudents,onCIFAR-100. Inparticu- lar,forMobileNetV2(MbV2)whichisareducedparameter model, the proposed distillation method can improve the accuracy by 59.49%. (c) Impact of transferring knowledge at various depth of a ResNet18 from a nasty teacher. BB represents a basic-block layer. . . . . . . . . . . . . . . . . 150 Figure 8.2 A ResNet18 student’s performance on CIFAR-100 dataset. 154 Figure 8.3 Skepticalstudentdistillationframework. Notethearrowof the distillation loss components are directed from teacher to student for the corresponding KL-divergence computation.155 Figure 8.4 Data-free distillation from a teacher to a skeptical student. 157 Figure 8.5 Logit response visualization after the softmax layer. Each rowcontainsanexampleimagefromCIFAR-10datasetand corresponding response for normal teacher, nasty teacher, normal student and skeptical student. We used ResNet50 and ResNet18 as teacher and student model, respectively. 163 Figure 8.6 Visualization of tSNE for normal and skeptical students (ResNet18) upon distillation from both normal and evasive teacher (ResNet50) on CIFAR-10. For the skeptical stu- dents we plot visualization both at the final classifier (C) and auxiliary classifier (AC). . . . . . . . . . . . . . . . . . 
165 Figure 8.7 Ablation study with a and τ for normal and skeptical stu- dents (ResNet18) upon distillation from both normal and nasty teacher (ResNet50) on CIFAR-100. . . . . . . . . . . 166 Figure 8.8 ResNet18 on CIFAR-10 dataset under different percentage oflimitedtrainingdataupondistillationfrom(a)nastyand (b) normal teachers. . . . . . . . . . . . . . . . . . . . . . . 168 Figure 9.1 Comparison of various methods in accuracy vs. #ReLU trade-off plot. SENet outperforms the existing approaches with an accuracy improvement of up to ∼ 4.5% for similar ReLU budget [6]. . . . . . . . . . . . . . . . . . . . . . . . 171 xvii Figure 9.2 Layer-wise pruning sensitivity (d = 0.1) vs. normalized ReLU importance. The later layers are less sensitive to pruning,thus,canaffordsignificantlymorezero-valuedweights asopposedtotheearlierones. Onthecontrary,laterReLU stages generally have more importance. . . . . . . . . . . . 176 Figure 9.3 Different stages of the proposed training methodology for efficient private inference that can support dynamic chan- nel reduction. For example, the model here supports two channel SFs, S 1 and S 2 . Note, similar to [7], for each SF support we use a separate batch-normalization (BN) layer to maintain a separate statistics. . . . . . . . . . . . . . . . 181 Figure 9.4 PerformanceofSENet++onthreedatasetsforvarious#ReLU budgets. The points labelled A, B, C, D corresponds to ex- periments of different target #ReLUs for the full model (d r = 1.0). For SENet++, note that a single training loop yields two points with the same label corresponding to the two different drop out rates . . . . . . . . . . . . . . . . . . 185 Figure 9.5 Performance comparison of SENet++ (with d r = 1.0 and 0.5)vs. existingalternatives(a)withVGG16andResNet18 in terms of ReLU latency. The labels A, B, C, D corre- spond to experiments of different target #ReLUs for the full model (d r = 1.0). For SENet++, note that a single training loop yields two points with the same label corre- sponding to the two different drop out rates. (b) Compar- ison between DeepReDuce and SENet++ for a target # ReLU budget of∼ 50k with ResNet18 on CIFAR-100. . . . 186 Figure 9.6 Ablation studies with different (a) λ and (b) β values for the loss term in Eq. 9.4. . . . . . . . . . . . . . . . . . . . 187 xviii List of Tables 2.1 Descriptions of tensor dimensions in a convolutional layer . . . . . . 20 2.2 ExpressionofFLOPscountforinferenceoperationwithvariouspre- defined computationally-limited filters . . . . . . . . . . . . . . . . 27 2.3 Summary of notation for matrix storage formats . . . . . . . . . . . 31 2.4 Storagerequirementofstoringamatrixusingdenseandsparsestor- age formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.5 Nomenclature of the network architectures used in simulation . . . 35 2.6 TestaccuracyofpSConvbasedVGG16,andResNet18onCIFAR-10 andTinyImageNet. HereweuseKSSof9, 4, 2, and1, respectively. Also, KSS of 9 means SFCC based CONVs and thus they are used as baseline to compare accuracy, and parameters. . . . . . . . . . . 37 2.7 Test accuracy of different variants of periodic sparse kernel based VGG16andResNet18onCIFAR-10andTinyImageNet. Thebase- line architectures of Table 2.6 are used as the reference for calculat- ing the reduction in parameters. . . . . . . . . . . . . . . . . . . . . 38 2.8 Test accuracy of different variants of VGG16, and ResNet18 on CIFAR-10 with periodic sparse kernels boosted through insertion of periodic FC kernels. . . . . . . . . . . . . . . . . . . . . 
. . . . . 41 2.9 TestaccuracyofdifferentvariantsofVGG16,andResNet18onTiny ImageNet with periodic sparse kernels boosted with periodic FC kernels.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.10 Parametersreductionandcorrespondingnormalizedstoragerequire- ment including indexing overhead for four VGG16 variants with both CSR P and CSR format of compressed storage. . . . . . . . . 43 2.11 Test accuracy of boosting as a general method to improve accuracy. Dataset used here is Tiny ImageNet. . . . . . . . . . . . . . . . . . 44 2.12 CONV layer channel width parameters with different α w values of the network models. . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1 PerformanceofBMPQgeneratedmodelscomparedtotherespective baseline full precision (FP-32) models. . . . . . . . . . . . . . . . . 59 3.2 Comparison with single-shot MPQ achieved through analysis of ac- tivationdensity(AD).Note,tohaveafaircomparisonwereportthe accuracy of our models after 120, 120, and 60 epochs for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. . . . . . . . . . . . . 60 xix 3.3 Boosting of BNN model performance through BMPQ and FP-32 trained models as teachers. . . . . . . . . . . . . . . . . . . . . . . . 61 3.4 Estimate of energy consumption. . . . . . . . . . . . . . . . . . . . 63 4.1 ModelperformanceswithAGCbasedtrainingonCIFAR-10,CIFAR- 100, and Tiny-ImageNet after a) ANN training, b) ANN-to-SNN conversion and c) SNN training. . . . . . . . . . . . . . . . . . . . . 79 4.2 Performance comparison of the proposed hybrid SL with state-of- the-art deep SNNs on CIFAR-10 and CIFAR-100. . . . . . . . . . . 80 4.3 Convolutional layer FLOPs for ANN and SNN models. . . . . . . . 82 4.4 Estimated energy costs for various operations in 45 nm CMOS pro- cess at 0.9 V [8]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.1 Results on VGG16 to classify Tiny-ImageNet. . . . . . . . . . . . . 99 5.2 ComparisonofDNR,ADMMbased,andL 1 lassobasedrobustprun- ing schemes on CIFAR-10. . . . . . . . . . . . . . . . . . . . . . . . 101 5.3 Comparison of DNR with and without the dynamic regularizer for CIFAR-10 classification. . . . . . . . . . . . . . . . . . . . . . . . . 102 5.4 Comparison with state-of-the-art non-iterative pruning schemes on CIFAR-10andcomparisonofdeviationfrombaselineonCIFAR-100 and Tiny-ImageNet. . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.1 PerformancecomparisonbetweendifferentcompressedFLOATvari- ants trained on CIFAR-10 with ResNet34. ✓✓,✓, and✗ indicate aggressive,non-aggressive,andnoreduction,respectively,compared to the baseline of FLOAT. . . . . . . . . . . . . . . . . . . . . . . . 123 7.1 Comparison of model performances under various white-box and black-box attacks on both CIFAR-10 and CIFAR-100. Note that italicized values are taken directly from the original paper. . . . . . 129 7.2 Performance comparison of SNN models generated using the pro- posed training scheme on clean and adversarially-generated images under a white-box attack. . . . . . . . . . . . . . . . . . . . . . . . 138 7.3 Performance comparison of SNN models generated using the pro- posed training scheme on clean and adversarially-generated images under a black-box attack. . . . . . . . . . . . . . . . . . . . . . . . 139 7.4 Checklist set of tests for characteristic behaviors caused by obfus- cated and masked gradients [9]. . . . . . . . . . . . . . . . . . . . . 
141
7.5 Performance comparison of proposed with traditional SNN training when threshold-leak parameters are frozen to their initialized values. 144
7.6 Estimated energy costs for various operations in a 45 nm CMOS process at 0.9 V [8]. 144
8.1 Performance of student (ResNet18) under transferability test on CIFAR-100. 153
8.2 Performance of normal vs. skeptical student when distilled from a nasty teacher. 161
8.3 Performance of normal vs. skeptical student when distilled from a normal teacher. 162
8.4 Performance of normal vs. skeptical student on data-free distillation [10] from a teacher. 167
8.5 Performance of a skeptical student (ResNet18) under transferability test on CIFAR-100. 168
9.1 Comparison between existing approaches in yielding efficient models to perform PI. Note, SENet++ can yield a model that can be switched to sub-models of reduced channel sizes. 172
9.2 Runtime and communication costs of linear and ReLU operations for 15-bit fixed-point model parameters/inputs and 31-bit ReLU operation [11]. 182
9.3 Performance of SENet and other methods on various datasets and models. 183
9.4 Performance of SENet and DeepReDuce on Tiny-ImageNet. 184
9.5 Importance of ReLU sensitivity. 186

Abstract

The super-linear increase in deep learning model size, together with the slow-down of Moore's law, has made the deployment of these models on resource-constrained devices exceedingly challenging. Compressing these large models has become a critical step in meeting the energy, memory, and I/O bandwidth constraints imposed by the device itself. However, training that yields efficient neural network models can be expensive in terms of compute complexity and the associated environmental impact. Moreover, these energy-efficient models must simultaneously address the increasingly important aspects of robustness and model privacy, particularly for safety-critical applications such as autonomous driving, healthcare, and military-grade robotics. This thesis presents low-complexity training methods that yield energy-efficient models, discloses methods that improve model robustness at reduced computational cost, and identifies some key challenges associated with achieving model privacy. Specifically, in Part I, we introduce efficient non-iterative pruning and quantization schemes that are able to generate compressed models with a negligible drop in inference accuracy. In Part II, we investigate the limitations of post-training compression for robust model generation under adversarial attacks and develop sparse-learning algorithms to train robust yet compressed models. We then present a training method that conditionally trains a novel class of models that simultaneously yield state-of-the-art performance on both clean and adversarial images. Finally, Part III of this thesis analyzes vulnerabilities and methods in protecting model privacy. More precisely, we first introduce the notion of a "skeptical student" that, using a novel hybrid distillation, can circumvent the model IP protection obtained from so-called "undistillable" models under both data-available and data-free scenarios.
We further develop training methodologies for yielding efficient models while protecting model IP via a form of client-server secure private inference framework. This thesis thus highlights the importance and potential benefits of expanding state-of-the-art training methods to not only consider model accuracy but also target energy efficiency, robustness, and privacy.

Thesis Supervisors:
Dr. Peter A. Beerel, ECE, University of Southern California.
Dr. Massoud Pedram, ECE, University of Southern California.

Publications

Please see the author's Google Scholar page for a full list of publications.

1. S. Kundu, Y. Fu, B. Ye, P. A. Beerel, M. Pedram, "Towards Adversary Aware Non-Iterative Model Pruning Through Dynamic Network Rewiring of DNNs", ACM Transactions on Embedded Computing Systems (TECS), Jan 2022. ACM: [paper]
2. S. Kundu, S. Wang, Q. Sun, P. A. Beerel, M. Pedram, "BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch", accepted as a conference paper at Design, Automation and Test in Europe (DATE), 2022. ACM: [paper]
3. S. Kundu, S. Sundaresan, M. Pedram, P. A. Beerel, "A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness", under review as a conference paper, 2022. arXiv: [paper]
4. S. Kundu, S. Lu, Y. Zhang, J. T. Liu, P. A. Beerel, "SENet: Towards Secure and Efficient Private Inference via Automated Non-Linearity Trimmed Network", under review as a conference paper, 2022.
5. S. Kundu, Q. Sun, Y. Fu, M. Pedram, P. A. Beerel, "Analyzing the Confidentiality of Undistillable Teachers in Knowledge Distillation", accepted as a conference paper at Neural Information Processing Systems (NeurIPS), 2021. OpenReview: [paper]
6. S. Kundu, M. Pedram, P. A. Beerel, "HIRE-SNN: Harnessing the Inherent Robustness of Energy-Efficient Deep Spiking Neural Networks by Training with Crafted Input Noise", accepted as a conference paper at the International Conference on Computer Vision (ICCV), 2021. CVF open access: [paper]
7. S. Kundu, S. Sundaresan, "AttentionLite: Towards Efficient Self-Attention Models for Vision", accepted as a conference paper at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021. IEEE: [paper]
8. S. Kundu, G. Datta, M. Pedram, P. A. Beerel, "Spike-Thrift: Towards Energy-Efficient Deep Spiking Neural Networks by Limiting Spiking Activity via Attention-Guided Compression", accepted as a conference paper at the Winter Conference on Applications of Computer Vision (WACV), 2021. CVF open access: [paper]
9. S. Kundu, M. Nazemi, P. A. Beerel, M. Pedram, "DNR: A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs", accepted as a conference paper at Proc. of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2021. ACM: [paper]
10. S. Kundu, M. Nazemi, M. Pedram, K. M. Chugg, P. A. Beerel, "Pre-defined Sparsity for Low-Complexity Convolutional Neural Networks", IEEE Transactions on Computers, July 2020. doi: 10.1109/TC.2020.2972520. IEEE: [paper]
11. S. Kundu*, S. Prakash*, H. Akrami, P. A. Beerel, K. M. Chugg, "pSConv: A Pre-defined Sparse Kernel Based Convolution for Deep CNNs", Allerton Conference, 2019. doi: 10.1109/ALLERTON.2019.8919683. IEEE: [paper]
12. A. Fayyazi*, S. Kundu*, S. Nazarian, P. A. Beerel, M. Pedram, "CSrram: Area-Efficient Low-Power Ex-Situ Training Framework for Memristive Neuromorphic Circuits Based on Clustered Sparsity", International Symposium on VLSI (ISVLSI), 2019. doi: 10.1109/ISVLSI.2019.00090.
IEEE: [paper]
(* = Authors have equal contribution.)

Software

Please see the author's Github page for additional information.

1. https://github.com/ksouvik52/Skeptical2021 – Skeptical student based hybrid distillation training
2. https://github.com/ksouvik52/hiresnn2021 – HIRE-SNN open-source resources
3. https://github.com/ksouvik52/DNR_ASP_DAC2021 – Dynamic network rewiring (DNR) open-source resources
4. https://github.com/ksouvik52/Pre-defined-sparseCNN – Pre-defined sparse CNN training open-source resources

Chapter 1
Introduction

1.1 Deep Neural Networks Basics

Deep neural networks (DNNs), in particular deep convolutional neural networks (CNNs), have become critical components in many real-world vision applications ranging from object recognition [12–16] and detection [17–20] to image segmentation [21]. With the demand for high classification accuracy, current state-of-the-art (SOTA) CNNs have evolved to have hundreds of layers [12–14,22–24], requiring millions of weights and billions of floating-point operations (FLOPs). Additionally, DNNs have become the de facto backbone of many other machine learning paradigms, including self-supervised learning [25] and reinforcement learning [26].

The inception of such artificial neural networks (ANNs) dates back to 1943, when an artificial neuron was presented by Warren S. McCulloch and Walter Pitts [27]. A McCulloch-Pitts neuron (a.k.a. the threshold logic unit) takes a number of binary excitatory inputs and a binary inhibitory input, compares the sum of the excitatory inputs with a threshold, and produces a binary output of one if the sum exceeds the threshold and the inhibitory input is not set. A perceptron [28] addresses some of the shortcomings of McCulloch-Pitts neurons by introducing weighted connectivity between the neurons, and its output can be formulated as a thresholded weighted sum, i.e., y = 0 if Σ_{i=0}^{n−1} θ_i x_i falls below a threshold, and y = 1 otherwise.

… > k²n, (2.6) and (2.7) can be approximated as

R_mob ≃ 1/n        (2.8)

R_shuf ≃ (k²/G + 1)/n        (2.9)

which shows that the complexity increase due to the periodic insertion of FC kernels is negligible for relatively wide networks with large periods. Fig. 2.5 shows a 3D illustration of the per-layer FLOP ratios (R_mob and R_shuf) as a function of C_o and P. Note that even though the per-layer ratio can be less than 1, the total parameter count for MobileNet- or ShuffleNet-like networks can be larger due to the presence of more layers.

2.5.2 Impact on Storage

Sparsity leads to savings in storage only when the overhead of storing the auxiliary vectors that manage the sparsity is negligible. This section presents a new sparse representation specifically tailored to periodic sparse kernels and compares it with existing formats. It also analyzes the storage requirements of different sparse representations analytically, allowing the study of the effectiveness of such formats at different levels of density. Furthermore, it explains how the proposed representation can be exploited in CNN accelerators.

CSR/CSC with a Periodic Column/Row Vector

The periodic pattern of kernels introduced in Section 2.4 allows reusing the column/row vector in the CSR/CSC format. For example, assume a convolutional layer with 3 × 3 kernels, 128 input channels, 128 output channels, and a period of four. The 4D weight tensor corresponding to this convolutional layer can be represented by a flattened weight matrix where each row corresponds to a flattened filter. As a result, the number of rows in the flattened weight matrix is equal to 128, while the number of columns is 3 × 3 × 128 = 1152.
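As a concrete illustration of the flattening described above, the following short sketch (ours, not code from the dissertation) reshapes a 4D convolutional weight tensor so that each row of the resulting matrix is one flattened filter; for the 128-input/128-output, 3 × 3 example this gives a 128 × 1152 matrix whose rows inherit any periodic kernel pattern imposed on the filters.

```python
import torch

def flatten_conv_weights(weight):
    """Flatten a (C_out, C_in, k, k) conv weight tensor into a 2D matrix
    with one row per filter, i.e., shape (C_out, C_in * k * k)."""
    c_out = weight.shape[0]
    return weight.reshape(c_out, -1)

# 3x3 kernels, 128 input channels, 128 output channels (the example in the text)
w = torch.randn(128, 128, 3, 3)
flat = flatten_conv_weights(w)
print(flat.shape)   # torch.Size([128, 1152]); with a filter period of 4,
                    # rows 0-3 would share their sparsity pattern with rows 4-7, etc.
```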
Because of the periodicity across filters, the structure of the rows of the flattened weight matrix will also repeat with a period of four. Therefore, one can simply store the column vector of the CSR format for the first four rows and reuse it for the subsequent rows. We refer to this new sparse storage format as CSR with a periodic column vector and denote it by CSR_P, where P denotes the period of repetition of the column vector. Similarly, because of the periodicity of kernels within a filter, the columns of the flattened matrix also repeat with a period of 4 × (3 × 3) = 36. As a result, one can choose to use the CSC format to represent the flattened sparse matrix and reuse the row vector for groups of 36 columns. We refer to this new format as CSC with a periodic row vector and denote it by CSC_P, where P here denotes the period of repetition of the row vector.

Table 2.3 summarizes the notation used for comparing the storage cost of different storage formats. Using the notation introduced here, Table 2.4 gives the storage requirements of the different storage formats.

Table 2.3: Summary of notation for matrix storage formats

H_F, W_F – height, width of a flattened weight matrix
ρ – density (0 ≤ ρ ≤ 1)
b_v – number of bits for representing data values
b_r, b_c – number of bits for representing row, column values
b_i – number of bits for representing index values
b_P – number of bits for representing the period

Table 2.4: Storage requirement of storing a matrix using dense and sparse storage formats (bits)

Dense: H_F · W_F · b_v
COO: ρ · H_F · W_F · (b_v + b_r + b_c)
CSR: ρ · H_F · W_F · (b_v + b_c) + (H_F + 1) · b_i
CSC: ρ · H_F · W_F · (b_v + b_r) + (W_F + 1) · b_i
CSR_P: ρ · H_F · W_F · b_v + ρ · P · W_F · b_c + (H_F + 1) · b_i + b_P
CSC_P: ρ · H_F · W_F · b_v + ρ · P · H_F · b_r + (W_F + 1) · b_i + b_P

Based on Table 2.4, the COO format is expected to have higher overhead than the CSR and CSC formats, which have similar storage overhead. Furthermore, it is evident that introducing periodicity into the CSR and CSC formats can significantly decrease the storage overhead.

Application to Weight Sub-Matrices

As noted above, a convolutional layer with periodic sparse kernels induces a flattened weight matrix that also has periodically repeating columns and rows. In a CNN accelerator, the processing of a convolutional layer is often broken down into smaller operations where subsets of the flattened weight matrix are processed across multiple PEs. This processing requires accessing a sub-matrix of the flattened weight matrix. If this sub-matrix is large enough, it will also have row or column vectors that are repeated periodically. For example, Fig.
2.6 demonstrates a subset of a flattened weight matrix that is used in a single processing element of an architecture like Eyeriss v2 [55] (the original flattened weight matrix is built using the first four kernel variants shown in Fig. 2.2(a)). This sub-matrix corresponds to processing the first (top) row of four kernels of 16 filters. Specifically, the sub-matrix consists of 16 rows corresponding to 16 filters and 12 columns corresponding to the top row of four kernels per filter. Note that in Fig. 2.6 the four kernels have been rotated as described in Section 2.4. Based on the periodic pattern across filters, the sub-matrix shown in Fig. 2.6 has repeating rows with a period of four and can be represented using CSR_4.

Figure 2.6: Illustration of how periodicity in a filter leads to repeating rows of sub-matrices of the filter's flattened weight matrix.

Because each PE in a CNN accelerator processes a small portion of the flattened weight matrix, b_c, b_r, and b_i have small ranges and can therefore be represented using a small number of bits. For example, assuming b_v = 8, b_c = b_r = 4, and b_i = 7, Fig. 2.7a compares the storage requirements of various existing storage formats at different levels of filter density. It is observed that the CSR and CSC formats yield lower total storage when the original matrix is at most 62% and 65% dense, respectively. Fig. 2.7b compares the storage requirements of the dense, CSR, and CSR_P formats for the same matrix shown in Fig. 2.7a, for different values of P and b_P = 6. It is observed that CSR_8 and CSR_16 yield lower total storage when the original matrix is at most 82% and 73% dense, respectively. Furthermore, at 62% density, CSR_8 and CSR_16 yield 23% and 16% lower total storage than CSR, respectively.³ This is equivalent to a 60.04% and 39.86% reduction in the overhead of storing auxiliary vectors for CSR_8 and CSR_16 compared to the CSR format, respectively.

³ Interestingly, CSR has similar storage requirements as RLC. In particular, as implemented in Eyeriss [54], at 62% density, RLC would lead to 0.14% more storage than CSR.

Figure 2.7: Comparison of storage requirements of (a) various existing storage formats and (b) dense, CSR, and CSR_P formats at different levels of density for a matrix of size 32 × 12 (b_v = 8, b_c = b_r = 4, b_i = 7, and b_P = 6).

Because the energy cost associated with transferring data from DRAM is well modeled as proportional to the number of bits read [102], the reduced storage requirements of CSR_P/CSC_P lead to a proportional reduction in the energy cost associated with DRAM access. For example, a 50% savings in storage will result in a ∼2× reduction in energy consumption related to DRAM access. For this reason, in the remainder of this work we focus on savings in storage requirements, with the energy savings being implicit.
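The analytic expressions in Table 2.4 are easy to check numerically. The short sketch below is our own illustration (not code from the dissertation); it evaluates the dense, CSR, and CSR_P costs for the 32 × 12 example of Fig. 2.7 at 62% density, and the resulting CSR_8 saving in particular lines up with the roughly 23% figure quoted above.

```python
def dense_bits(H, W, b_v):
    return H * W * b_v

def csr_bits(H, W, rho, b_v, b_c, b_i):
    # CSR: non-zero values, their column indices, and (H + 1) row pointers
    return rho * H * W * (b_v + b_c) + (H + 1) * b_i

def csr_p_bits(H, W, rho, P, b_v, b_c, b_i, b_P):
    # CSR_P: column indices are stored only for the first P rows and reused
    return rho * H * W * b_v + rho * P * W * b_c + (H + 1) * b_i + b_P

H, W, rho = 32, 12, 0.62                 # sub-matrix and density of Fig. 2.7
b_v, b_c, b_i, b_P = 8, 4, 7, 6
csr = csr_bits(H, W, rho, b_v, b_c, b_i)
print(f"dense = {dense_bits(H, W, b_v)} bits, CSR = {csr:.1f} bits")
for P in (2, 4, 8, 16):
    csr_p = csr_p_bits(H, W, rho, P, b_v, b_c, b_i, b_P)
    print(f"P={P:2d}: CSR_P = {csr_p:6.1f} bits "
          f"({(1 - csr_p / csr) * 100:4.1f}% below CSR)")
```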
Hardware Support for Periodic Sparsity

The low-complexity storage formats introduced earlier, i.e., CSR_P/CSC_P, cannot be integrated into existing accelerators without ensuring they can support the proposed periodic sparse format. For example, in Eyeriss v2, each weight value (i.e., data) is coupled with its corresponding index, and they are read as a whole from the main memory. On the other hand, CSR_P/CSC_P store the column/row vector separately from the data vector and read the auxiliary vectors once for all data values. This not only requires proper adjustment of the bus that transfers data from the DRAM to the chip but also may require a minor modification to either the control logic or the PEs.

One approach to making an accelerator like Eyeriss v2 compatible with periodic sparsity is to store the weights in DRAM using the proposed sparse periodic format and modify the system-level control logic to expand the column/row vector before storing it in the PE's scratchpad memory. In other words, the sparse column/row vector is read from the DRAM only once, but replicated before being written into the scratchpad memory corresponding to the column/row vector so that it adheres to the CSR/CSC format. In this manner, the scratchpad memory within each PE remains the same and stores bundled (data, index) pairs. Because DRAM accesses consume two orders of magnitude more energy than on-chip communication, we can thus achieve close to the optimal energy savings without requiring any change in the PE array or its associated control structures.

A more comprehensive approach to supporting periodic sparsity involves ensuring the PEs can use the column/row scratchpad memory as a configurable circular buffer, which, to support periodicity, is configured to have length P. This type of support may already exist because, in many cases, the size of the weight matrix processed within each PE is smaller than the size of the corresponding scratchpad memory and therefore only a portion of the scratchpad memory is used. In this approach, the periodic column/row vector is read from the DRAM once, written into the scratchpad memory, and accessed multiple times for different rows of the weight matrix. This reduces the required on-chip communication and thus may save more memory compared to storing the expanded column/row vectors in the scratchpad memory.

While the presented approaches enable compression of the column/row vectors, one may be able to compress the index vector as well, as suggested by the row periodicity illustrated in Fig. 2.6. However, this may require more complex hardware support to expand the index vector before storing it in the PEs, or adding support for the compressed index vectors within the PE.

2.6 Experimental Results

Table 2.5: Nomenclature of the network architectures used in simulation

aaa pSCn – aaa network with pre-defined sparse kernel based convolution, where each 2D kernel has n weights not pre-defined to be zero.
aaa pSCn Pm – aaa network in which every m-th kernel is FC and the rest are pre-defined sparse kernels having n weights not pre-defined to be zero.
aaa PSn Pm – aaa network with both periodicity and kernel-variant count equal to m, where each 2D kernel has n weights not pre-defined to be zero.
aaa PSDn Pm – aaa network with periodic kernel variants having periodicity m, where each period has m−1 sparse kernel variants, each with n weights not pre-defined to be zero, and one FC k×k kernel.

This section describes our simulation results and analysis. We first detail the datasets, architectures, and important hyperparameters used for our experiments, followed by the experimental results for our proposed pSConv approach, the introduction of periodicity, and our performance-boosting technique. Finally, we compare our modified network architectures with MobileNetV2 [57], a popular low-complexity CNN variant for image classification, in terms of FLOPs, model parameters, and accuracy. We used PyTorch [103] to design the models and trained/tested them on AWS EC2 P3.2xlarge instances that have an NVIDIA Tesla V100 GPU.

Datasets, Architectures, and Hyperparameters

To evaluate our models we used CIFAR-10 [104] and Tiny ImageNet [105], two widely popular image classification datasets. The input image dimensions of CIFAR-10 and Tiny ImageNet are (32 × 32 × 3) and (64 × 64 × 3), respectively, and the numbers of output classes are 10 and 200, respectively. We chose variants of VGG16 [13] and ResNet18 [15] as the base network models to apply our architectural modifications. The VGG16 architecture has thirteen 3 × 3 kernel based convolutional layers; the flattened output of the final CONV layer is fed to the fully connected part having three fully connected (FC) layers.⁴ The CONV part of the ResNet18 architecture consists of four layers, each containing two basic blocks, where each basic block has two convolutional layers along with a skip connection path. We used pre-defined sparse kernels in all k × k CONV layers with k > 1 but excluded the first layer, as it is connected to the primary inputs and is thus more sensitive to zero weights. Training was performed for 120 and 100 epochs for CIFAR-10 and Tiny ImageNet, respectively. The initial learning rate was set to 0.1 with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. The image datasets were augmented through random cropping and horizontal flips before being fed into the network in batches of 128 and 100 for CIFAR-10 and Tiny ImageNet, respectively. All results reported are the average over two training experiments. Table 2.5 provides the names of each network model variant and the corresponding architecture descriptions.

⁴ In VGG16 for the CIFAR-10 dataset, we used only one FC layer because the input image dimension is 4× smaller than Tiny ImageNet and multiple FC layers are not needed to achieve high accuracy.

Results for pSConv Based CNN

We analyzed three different variants of regular sparse kernel based CONVs with KSS values of 4, 2, and 1, alongside the baseline standard convolution based network. As stated earlier, in our choice of kernel patterns we ensure that each of the k² possible kernel entries is covered by at least one sparse kernel variant. Table 2.6 provides the results in terms of accuracy and parameter count⁵ with the KSS variants applied to the VGG16 and ResNet18 architectures. The ResNet18-based results show that even with a KSS of only 4, the test accuracy degradation is within ∼0.4% for the CIFAR-10 dataset and within ∼0.6% for Tiny ImageNet. The same results for VGG16 show a test accuracy degradation within ∼0.7% for the CIFAR-10 dataset and within ∼1.1% for Tiny ImageNet.
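The pre-defined sparse (pSConv) layers evaluated in Table 2.6 can be emulated in PyTorch by fixing a binary mask over each 2D kernel and applying it at every forward pass. The sketch below is an illustrative reimplementation of that idea, not the dissertation's actual code; the class name, the cyclic variant assignment, and the example variants are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreDefinedSparseConv2d(nn.Conv2d):
    """3x3 convolution whose 2D kernels follow fixed (pre-defined) sparsity
    patterns; a binary mask is registered as a buffer and applied to the
    weights in every forward pass, so the pre-defined zero positions never
    contribute to the output."""
    def __init__(self, in_ch, out_ch, kernel_variants, **kw):
        super().__init__(in_ch, out_ch, kernel_size=3, **kw)
        # Rotate through the kernel variants from one 2D kernel to the next;
        # the exact assignment used in the dissertation may differ.
        mask = torch.stack([kernel_variants[(o + i) % len(kernel_variants)]
                            for o in range(out_ch) for i in range(in_ch)])
        self.register_buffer("mask", mask.view(out_ch, in_ch, 3, 3).float())

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Four example KSS = 4 variants that together cover all nine 3x3 entries.
variants = [torch.tensor(v) for v in (
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
)]
layer = PreDefinedSparseConv2d(64, 64, variants, padding=1, bias=False)
out = layer(torch.randn(1, 64, 32, 32))   # effective per-kernel density = 4/9
```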
Also, KSS of 9 means SFCC based CONVs and thus they are used as baseline to compare accuracy, and parameters. Data Model Top 1 Top 5 Parameters Parameters (%) set acc (%) acc (%) reduction VGG16 pSC9 92.8 – 14.73 M — C VGG16 pSC4 92.0 – 6.55 M 55.56 I VGG16 pSC2 91.2 – 3.27 M 77.78 F VGG16 pSC1 89.5 – 1.64 M 88.89 A ResNet18 pSC9 92.9 – 11.17 M — R ResNet18 pSC4 92.5 – 5.06 M 54.65 10 ResNet18 pSC2 91.1 – 2.62 M 76.56 ResNet18 pSC1 89.4 – 1.39 M 87.50 VGG16 pSC9 57.2 78.9 14.73 M — VGG16 pSC4 56.1 79.1 6.55 M 55.56 Tiny VGG16 pSC2 54.2 78.2 3.27 M 77.78 Image VGG16 pSC1 52.5 76.7 1.64 M 88.89 Net ResNet18 pSC9 62.4 83.2 11.17 M — ResNet18 pSC4 61.7 83 5.06 M 54.65 ResNet18 pSC2 60.2 82.7 2.62 M 76.56 ResNet18 pSC1 59.0 82.2 1.39 M 87.50 5 We considered the convolution layer parameters only to report in the tables of this section without considering the overhead of indexing. 37 Results for pSConv with Periodicity The storage and energy advantage associated with periodically repeating kernels with some specific set of kernel variants, analysed in Section 2.5.2, motivated us to evaluate its performance in terms of test accuracy. We leveraged the obser- vation provided by [101] and kept the KVS = P small for different KSS based architectures. In particular, as KSS of 4 covers more kernel entries per variant, we chose a corresponding P = KVS = 4 and covered all possible kernel entries of the 3× 3 kernels. For similar reasons, we chose larger KVS for KSS of 2 and 1, respectively (6 and 9, respectively). We selected kernel variants as described in the previous subsection. Fig. 2.8 shows the learning curves for CIFAR-10 and Tiny ImageNet datasets with different variants of VGG16 and ResNet18 models with KSS of 1. 6 It is clear that the sparse variants learn at similar rates as the corresponding baselines. Table 2.7: Test accuracy of different variants of periodic sparse kernel based VGG16 and ResNet18 on CIFAR-10 and Tiny ImageNet. The baseline architectures of Table 2.6 are used as the reference for calculating the reduction in parameters. Data Model (KVS, P) Top 1 Top 5 Parameters Parameter set acc (%) acc (%) reduction (%) C VGG16 PS4 P4 (4, 4) 91.7 – 6.55 M 55.56 I VGG16 PS2 P6 (6, 6) 90.6 – 3.27 M 77.78 F VGG16 PS1 P9 (9, 9) 87.9 – 1.64 M 88.89 A ResNet18 PS4 P4 (4, 4) 92.9 – 5.06 M 54.65 R ResNet18 PS2 P6 (6, 6) 91.5 – 2.62 M 76.56 10 ResNet18 PS1 P9 (9, 9) 89.6 – 1.39 M 87.50 VGG16 PS4 P4 (4, 4) 56.9 79.9 6.55 M 55.56 Tiny VGG16 PS2 P6 (6, 6) 53.9 77.8 3.27 M 77.78 Image VGG16 PS1 P9 (9, 9) 51.8 76.7 1.64 M 88.89 Net ResNet18 PS4 P4 (4, 4) 61.9 83 5.06 M 54.65 ResNet18 PS2 P6 (6, 6) 60.7 82.9 2.62 M 76.56 ResNet18 PS1 P9 (9, 9) 58.9 81.8 1.39 M 87.50 Table 2.7 shows the impact of an added periodicity constraint on test accuracy with our proposed variants. Note that because of the overhead of storing auxiliary vectors,theoverallstoragereductionissmallerthantheonesreportedinTable2.7. 6 Similar trends is observed with KSS of 2 and 4 in VGG16 and ResNet18, and so we did not show in separate plots for brevity’s sake. 38 (a) (b) (c) (d) Figure 2.8: (a), and (b) shows the test accuracy vs. epochs for CIFAR-10 dataset in different variants of VGG16 and ResNet18 models, respectively; (c), and (d) are plots of top 5 error rate vs. epochs for Tiny ImageNet dataset in different variants of VGG16 and ResNet18 models, respectively. The KSS for all the variants is 1. 
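To make the pre-defined periodic sparse kernels described above concrete, the following is a minimal PyTorch sketch of a pSConv-style 3x3 layer with KSS = 4 and periodicity P = 4. It is an illustrative reconstruction rather than the code used for these experiments: the kernel-support variants are drawn at random here, whereas the experiments select variants so that all nine kernel entries are covered within a period, and the names PSConv2d and periodic_sparse_mask are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

def periodic_sparse_mask(c_out, c_in, k=3, kss=4, period=4, seed=0):
    # Build `period` kernel-support variants, each keeping `kss` of the k*k
    # positions, and repeat them cyclically over the flattened sequence of
    # 2D kernels.  (Here the variants are random; the experiments choose them
    # so that every kernel entry is covered within one period.)
    g = torch.Generator().manual_seed(seed)
    variants = []
    for _ in range(period):
        keep = torch.randperm(k * k, generator=g)[:kss]
        v = torch.zeros(k * k)
        v[keep] = 1.0
        variants.append(v.view(k, k))
    kernels = [variants[i % period] for i in range(c_out * c_in)]
    return torch.stack(kernels).view(c_out, c_in, k, k)

class PSConv2d(nn.Conv2d):
    """3x3 convolution whose weights are masked by a fixed periodic sparse pattern."""
    def __init__(self, c_in, c_out, kss=4, period=4):
        super().__init__(c_in, c_out, kernel_size=3, padding=1, bias=False)
        self.register_buffer("mask", periodic_sparse_mask(c_out, c_in, 3, kss, period))

    def forward(self, x):
        # Only the positions allowed by the mask ever contribute to the output.
        return F.conv2d(x, self.weight * self.mask, None, self.stride,
                        self.padding, self.dilation, self.groups)

layer = PSConv2d(64, 128, kss=4, period=4)
out = layer(torch.randn(1, 64, 32, 32))
print(out.shape, "kernel density:", layer.mask.mean().item())

Because the fixed mask is applied in the forward pass, the positions pre-defined to be zero receive no gradient from the loss and therefore stay zero throughout training.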
For example, for VGG16 PS4 P4, the reduction in the number of parameters is 55.6%, but including the storage of the auxiliary vectors in CSR 4 format, the reduction is approximately 44.6%. If CSR format is used, the reduction in overall storage requirements, relative to the baseline is approximately 25%. Results for Boosting The results without and with periodically repeating sparse kernel patterns as dis- cussed earlier, show considerable performance degradation at low KSS values such as 1. This subsection presents the performance of the network models with the 39 (a) (b) (c) (d) Figure 2.9: Test accuracy vs. FLOPs count plots for different datasets on different architectures: CIFAR-10 on (a) VGG16, (b) ResNet18 variants; Tiny ImageNet on (c) VGG16, (d) ResNet18 variants. proposed boosting methodin whichwe periodically incorporateFC kernels (k× k) in the 3D filter. 7 To evaluate the value of boosting, we measure its impact when periodicity P is set to 8 and 16 as well as when applied to the non-boosting configurations used in Table 2.7. We tested the same sparse kernel variants as those used in the previoussubsection. Thus,whenthenumberofuniquevariantsarelessthanP,we randomly chose some of the sparse kernel variants to repeat before placing the FC kernels. However, for simulation of aaa PSD1 P8 models we randomly choose 7 of 9 unique sparse kernel variants. Note that because each period will now contain 7 Here, we focus on results with one FC kernel per period, i.e., η = 1. However, we also evalu- ated performance with larger values of η . For example, η = 2 for P = 16, yields similar accuracy as η = 1 for P = 8. Both models have similar parameter counts but the latter has significantly lower storage costs, suggesting restricting our model space to have η =1 is reasonable. 40 one FC kernel, the proposed criteria of covering all kernel entries within a period is automatically satisfied. Table 2.8: Test accuracy of different variants of VGG16, and ResNet18 on CIFAR-10 with periodic sparse kernels boosted through insertion of periodic FC kernels. Model (KVS, P) Test Improvement Parameters Parameter acc (%) over periodic reduction (%) VGG16 PSD4 P8 (5, 8) 92.5 +0.87 7.57 M 48.61 VGG16 PSD4 P16 (5, 16) 92.0 +0.39 7.06 M 52.1 VGG16 PSD2 P8 (7, 8) 91.9 +1.32 4.71 M 68.1 VGG16 PSD2 P16 (7, 16) 91.3 +0.74 3.99 M 72.92 VGG16 PSD1 P8 (8, 8) 91 +3.14 3.27 M 77.78 VGG16 PSD1 P16 (10, 16) 89.8 +1.97 2.46 M 83.33 VGG16 PSD4 P4 (4, 4) 92.4 +0.77 8.59 M 41.67 VGG16 PSD2 P6 (6, 6) 92 +1.42 5.18 M 64.81 VGG16 PSD1 P9 (9, 9) 91.05 +3.22 3.09 M 79 ResNet18 PSD4 P8 (5, 8) 92.9 +0.00 5.82 M 47.83 ResNet18 PSD4 P16 (5, 16) 92.8 -0.15 5.43 M 51.26 ResNet18 PSD2 P8 (7, 8) 92.5 +1.09 3.68 M 67 ResNet18 PSD2 P16 (7, 16) 92.3 +0.81 3.15 M 71.78 ResNet18 PSD1 P8 (8, 8) 92.5 +2.84 2.62 M 76.56 ResNet18 PSD1 P16 (10, 16) 92.0 +2.4 2.01 M 82.02 ResNet18 PSD4 P4 (4, 4) 93.0 +0.1 6.58 M 41 ResNet18 PSD2 P6 (6, 6) 92.4 +0.9 4.04 M 63.8 ResNet18 PSD1 P9 (9, 9) 92.2 +2.6 2.48 M 77.77 Table 2.9: Test accuracy of different variants of VGG16, and ResNet18 on Tiny Ima- geNet with periodic sparse kernels boosted with periodic FC kernels. 
Model (KVS, P) Top 1 Improvement Parameters Parameter acc (%) over periodic reduction (%) VGG16 PSD4 P8 (5, 8) 57.3 +0.35 7.57 M 48.61 VGG16 PSD4 P16 (5, 16) 56.9 +0.0 7.06 M 52.1 VGG16 PSD2 P8 (7, 8) 55.9 +1.95 4.71 M 68.1 VGG16 PSD2 P16 (7, 16) 55.5 +1.55 3.99 M 72.92 VGG16 PSD1 P8 (8, 8) 55.3 +3.55 3.27 M 77.78 VGG16 PSD1 P16 (10, 16) 55.1 +3.3 2.46 M 83.33 VGG16 PSD4 P4 (4, 4) 57.3 +0.35 8.6 M 41.67 VGG16 PSD2 P6 (6, 6) 56.3 +2.35 5.18 M 64.81 VGG16 PSD1 P9 (9, 9) 55 +3.2 3.09 M 79 ResNet18 PSD4 P8 (5, 8) 61.8 -0.09 5.82 M 47.83 ResNet18 PSD4 P16 (5, 16) 61.7 -0.23 5.43 M 51.26 ResNet18 PSD2 P8 (7, 8) 60.6 -0.13 3.68 M 67 ResNet18 PSD2 P16 (7, 16) 60.2 -0.48 3.15 M 71.78 ResNet18 PSD1 P8 (8, 8) 60.0 +1.15 2.62 M 76.56 ResNet18 PSD1 P16 (10, 16) 59.0 +0.15 2.01 M 82.02 ResNet18 PSD4 P4 (4, 4) 62.9 +1.0 6.58 M 41 ResNet18 PSD2 P6 (6, 6) 60.5 -0.23 4.04 M 63.8 ResNet18 PSD1 P9 (9, 9) 59.6 +0.75 2.48 M 77.77 Table 2.8 and 2.9 show the classification accuracy improvement compared to their sparse periodic counterparts and parameter count reduction compared to 41 the corresponding baseline models. The results show that boosting yields an im- provement of up to 3.2% (3.6%) in classification accuracy for CIFAR-10 (Tiny ImageNet). With sparse KSS of 4, the average performance improvement com- pared to periodic sparse models is∼ 0.3%. This is quite intuitive as the potential improvement is lower when KSS is high. However, for low KSS the average im- provement is ∼ 2.3%. For example, ResNet18 with KSS of 1 and repeating FC kernels with a period of 8 on CIFAR-10 provides an accuracy degradation of only ∼ 0.4% compared to the baseline, which was earlier ∼ 3.3% without the FC ker- nels inserted. This motivates the use of boosted pre-defined kernels that are very sparse. We observed similar trends with Tiny ImageNet as well. The relative cost of the increase in parameters due to boosting is low and, as the periodicity of the fully connected kernel placement increases, it becomes negligible. Fig. 2.9 showstheaccuracyvs. FLOPs 8 relationfordifferentarchitecturevariants. Models whose points lie towards the top-left have better accuracy with fewer FLOPs. In particular, for VGG16 and ResNet18 variants on CIFAR-10 and VGG16 variants on Tiny ImageNet, boosting performs consistently well, whereas, as we can see from Fig. 2.9 (d), boosting is not as beneficial for Tiny ImageNet on ResNet18. In general,weseethat,withmodestcomputationoverhead,boostingconsistentlyim- proves accuracy for models with extremely low KSS and maintains high accuracy otherwise. It is important to emphasize that the overall parameter overhead is a function of both periodicity and KSS, as exemplified by the four sparse models described in Table 2.10 analyzed using the storage requirement formulas in Table 2.4. Com- paring models 1 and 2, which have the same sparse KSS, shows the impact of periodicity; as does comparing models 3 and 4. In contrast, comparing models 1 and 3 shows the impact of KSS for fixed periodicity; as does comparing models 2 8 We consider FLOPs associated with only the convolution layers because they generally rep- resent the vast majority of FLOPs. 42 and 4. The last two columns of the table represent the parameter counts normal- ized with respect to the baseline model. Averaging across the four examples, the table shows that CSR P reduces the overall parameter count compared to CSR, including the sparse matrix representation, by 22%. 
Perhaps more importantly, the results show that the CSR P format can reduce the overall parameter count by as much as 70% compared to the baseline model. To better evaluate the space and choice of KVs, we generated model variants with six different random seeds. We tested VGG16 and ResNet18 models with KSS of4and2toclassifyCIFAR-10andTinyImageNet. Weobserveddifferences oflessthan1%betweentheminimumandmaximumclassificationaccuracyacross the different seeds. In particular, for ResNet18 PSD2 P8 and ResNet18 PSD4 P8 the gaps between minimum and maximum accuracy are 0.55% and 0.44%, respec- tively,averagedoverthetwodatasets. ForVGG16 PSD2 P8andVGG16 PSD4 P8 these values are 0.65% and 0.65%, respectively. Table 2.10: Parameters reduction and corresponding normalized storage requirement including indexing overhead for four VGG16 variants with both CSR P andCSR format of compressed storage. No. Model Model param. Normalized param. Normalized param. reduction (%) count, using CSR P count, using CSR 1 VGG16 PSD4 P8 48.61 0.66 0.85 2 VGG16 PSD4 P16 52.10 0.69 0.81 3 VGG16 PSD1 P8 77.78 0.34 0.42 4 VGG16 PSD1 P16 83.33 0.30 0.35 Lastly, to demonstrate boosting has general benefits, Table 2.11 shows the re- sultsofboostingwithTinyImageNet 9 whentheFCkernelsareplacedperiodically, with period P D , in between sparse kernels with no pre-defined KVS or kernel vari- ants.Note, the lack of structure of non-periodic sparse kernel based CNNs makes the models have higher indexing overhead compared to the periodic models ana- lyzed here. 9 For the CIFAR-10 dataset we obtained similar results, with ResNet18 pSC4 P8 exceeding the baseline performance with an average test accuracy of 92.95%. 43 Table2.11: Testaccuracyofboostingasageneralmethodtoimproveaccuracy. Dataset used here is Tiny ImageNet. Model (KVS, P D ) Top 1 Parameters Parameter acc (%) reduction (%) VGG16 pSC4 P8 (–, 8) 56.6 7.57 M 48.61 VGG16 pSC4 P16 (–, 16) 56.2 7.06 M 52.1 VGG16 pSC2 P8 (–, 8) 56.6 4.71 M 68.1 VGG16 pSC2 P16 (–, 16) 56.4 3.99 M 72.92 VGG16 pSC1 P8 (–, 8) 55.5 3.27 M 77.78 VGG16 pSC1 P16 (–, 16) 54.8 2.46 M 83.33 ResNet18 pSC4 P8 (–, 8) 61.8 5.82 M 47.83 ResNet18 pSC4 P16 (–, 16) 62.3 5.43 M 51.26 ResNet18 pSC2 P8 (–, 8) 61.3 3.68 M 67 ResNet18 pSC2 P16 (–, 16) 60.5 3.15 M 71.78 ResNet18 pSC1 P8 (–, 8) 59.8 2.62 M 76.56 ResNet18 pSC1 P16 (–, 16) 59.2 2.01 M 82.02 2.6.1 Performance Comparison with ShuffleNet and Mo- bileNetV2 Because ShuffleNet [58] and MobileNetV2 [57] are two widely-accepted low- complexity CNN architectures, we compared them with our proposed pre-defined periodic sparse models that have similar or fewer FLOPs. 10 In particular, Fig. 2.10(a) shows that for CIFAR-10 the ResNet18 PSD1 P16 increases accuracy to 92% compared to the baseline MobileNetV2 (ShuffleNet) accuracy of 90.3% (∼ 89%). Notethatourobtainedaccuraciesarealsosuperiorthanreportedin[106] and only around 1% less than the accuracy reported in [107] which was trained for 180additionalepochs. Thepre-definedsparseCNNmodelVGG16 PSD1 P8With 0.073 G FLOPs, has approximately 1.24× (1.34× ) fewer computation complexity yet still outperforms MobileNetV2 (ShuffleNet) in terms of accuracy. For Tiny ImageNet, as shown in Fig. 2.10(b), our best classifying model provides an accu- racy improvement of 3.2% with only 4% (2.6%) increased complexity compared to MobileNetV2 (ShuffleNet). 10 Note that we kept the hyperparameters for MobileNetV2 training the same as ResNet18 except the weight decay which was set to 0 as recommended by the original papers [57]. 
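To illustrate how periodicity shrinks the indexing overhead behind these numbers, the sketch below compares a rough CSR-style storage estimate with a CSR_P-style estimate in which the auxiliary column vector is stored for one period only. The bit widths, pointer accounting, and the assumption of a uniform per-row density are illustrative simplifications and do not reproduce the exact formulas of Table 2.4.

def sparse_storage_bits(nnz, rows, period=None, data_bits=8, idx_bits=4, ptr_bits=16):
    """Rough storage estimate (in bits) for a sparse weight matrix.

    CSR   : data + one column index per non-zero + (rows + 1) row pointers.
    CSR_P : the column-index pattern repeats every `period` rows, so the
            auxiliary vectors are stored for one period only.
    All bit widths here are illustrative assumptions.
    """
    data = nnz * data_bits
    if period is None:
        return data + nnz * idx_bits + (rows + 1) * ptr_bits
    nnz_per_row = nnz / rows                      # uniform per-row density assumed
    return data + period * nnz_per_row * idx_bits + (period + 1) * ptr_bits

# Example: a flattened 512 x 1152 CONV weight matrix at 4/9 density (KSS = 4).
rows, cols = 512, 1152
nnz = int(rows * cols * 4 / 9)
csr = sparse_storage_bits(nnz, rows)
csr_p = sparse_storage_bits(nnz, rows, period=8)
print(f"CSR  : {csr / 8 / 1024:.1f} KiB")
print(f"CSR_P: {csr_p / 8 / 1024:.1f} KiB ({csr / csr_p:.2f}x smaller)")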
44 (a) (b) (c) (d) Figure2.10: Performancecomparisonofourproposedarchitecturesthathavesimilaror fewer FLOPs than ShuffleNet and MobileNetV2 with comparable or better classification accuracy on (a-b) CIFAR-10 and (c-d) Tiny ImageNet. Moreover, as we can see from Fig. 2.11(a), and (b), with 2.42× (1.08× ) fewer parameters our proposed models perform similar to ShuffleNet for Tiny ImageNet (CIFAR-10). Similarly, the parameter requirement of our proposed models with similar accuracy as MobileNetV2 are 1.15× , and 2.38× lower for CIFAR-10 and Tiny ImageNet, respectively. 11 11 These values can be translated to the normalized parameter count with the help of the formulas in Table 2.4. 45 (a) (b) Figure 2.11: Comparison of the number of model parameters of the network models described in Fig 2.10 for (a) CIFAR-10 and (b) Tiny ImageNet. 2.6.2 Performance Evaluation on Networks Models with Scaled Down Width Squeezingthenetworklayers, i.e. reducingthenumberofchannelsper3Dfilterby a factor of α w (<1.0), popularly known as the width multiplier, is another simple technique to reduce the network’s FLOPs and storage requirement [56,108,109]. Tofurtherestablishtheideaofthepre-definedperiodicsparsity, weapplyourpro- posed kernels in squeezed variant of the ResNet18 architecture with an α w of 0.5. TheimportantnetworkmodelparametersofthesqueezedvariantsofResNet18and MobileNetV2 models are described in Table 2.12. With the iso hyperparameter settings, the baseline accuracy for ResNet18 with α w =0.5 are 91.1%, and 59.1% for CIFAR-10, and Tiny ImageNet, respectively. We trained several variants of this squeezed model with KSS values of 4, 2, and 1, each with the fully connected kernel repeating after every 8 and 16 kernels. Fig. 2.12 shows our proposed vari- ants of squeezed ResNet18 consistently outperforms both MobileNetV2 0.75 and MobileNetV2 in classification accuracy, keeping the number of FLOPs similar or lower. Inparticular,Fig. 2.12(a)showsthatonCIFAR-10dataset,toprovidesim- ilar accuracy the squeezed ResNet18 with KSS of 2 and periodicity of 16 requires 46 2.36× fewer FLOPs compared to MobileNetV2. Also, the ResNet18 variant that requires the least number of FLOPs, provides∼ 1% improved accuracy with 2.6× fewer computations compared to MobileNetV2 0.75. A similar trend is observed for Tiny ImageNet, as shown in Fig. 2.12(b). Averaged over the two datasets, the proposed squeezed ResNet18 variants provides similar accuracy with 2.42× , and 2.37× fewer FLOPs compared to MobileNetV2 0.75 and MobileNetV2, re- spectively. On the same datasets, when we constrain the number FLOPs to be similar, pre-defined periodic sparsity can provide an average accuracy improve- ment of∼ 3.16% and∼ 2.48%, compared to MobileNetV2 with α w of 0.75 and 1.0, respectively. The model parameter reduction factors are proportional to the com- putation reduction and as the ResNet18 0.5 model has comparable parameters as MobileNetV2, advantageinstorageforthesparseversionsofResNet18 0.5isquite clear, and thus not discussed in details for brevity’s sake. Table 2.12: CONV layer channel width parameters with different α w values of the network models. 
Name α w Convolution layer different channel sizes ResNet18 1.0 [64, 128, 256, 512] ResNet18 0.5 0.5 [32, 64, 128, 256] MobileNetV2 1.0 [16, 24, 32, 64, 96, 160, 320] MobileNetV2 0.75 0.75 [12, 18, 24, 48, 72, 120, 240] 2.7 Conclusions This work showed that with pre-defined sparsity in convolutional kernels the net- work models can achieve significant model parameter reduction during both train- ing and inference without significant accuracy drops. However, managing sparsity requires matrix indexing overhead in terms of storage and energy efficiency. To address this shortcoming, we added periodicity to the sparsity, periodically using same sparse kernel patterns in the convolutional layers, significantly reduce the indexing overhead. 47 (a) (b) (c) (d) Figure2.12: PerformancecomparisonintermsoftestaccuracyandFLOPsofdifferent squeezed (width multiplier 0.5) ResNet18 variant models with MobileNetV2 (MobV2) having width multiplier 1.0 and 0.75 on (a-b) CIFAR-10, and (c-d) Tiny ImageNet. Furthermore,todealwiththeperformancedegradationduetopre-definedspar- sity,weintroducedalow-costnetworkarchitecturemodificationtechniqueinwhich FC kernels are periodically inserted in between sparse kernels. Experimental re- sults showed that, compared to the sparse-periodic variants, this boosting tech- nique improves average classification accuracy by up to ∼ 2.3%, averaged over two periodicity of 8, and 16 in ResNet18 and VGG16 architecture on CIFAR-10 and Tiny ImageNet. We also demonstrated the merits of the proposed architectures with squeezed variants of ResNet18 (width multiplier < 1.0) and have shown it to outperform MobileNetV2 by an average accuracy of∼ 2.8% with similar FLOPs. 48 Chapter 3 Layer Sensitivity Driven Mixed-Precision Quantization This chapter first provides the necessary introduction to and motivates for mixed- precision quantization in Section 3.1. Section 3.2 provides the the background and related works to perform quantization operation in weights and activation for a DNN model. Section 3.3 discloses the presented method to yield SOTA quantizedmodelswithmixed-precision. Section3.4providesdetailedexperimental evaluations of the method presented and the chapter concludes in Section 3.5. 3.1 Introduction and Motivation The introduction of binary neural networks (BNNs) [60] and XNOR-net [59] showed the potential benefits of quantization. Early works focused on homogeneous-precision (HPQ) models in which all layer weights/activations have the same bit widths. To address the issue of significant accuracy sacrifice of the HPQ models, more recent works have demonstrated the mixed-precision quantiza- tion (MPQ) in which different layers can be assigned different bit widths based on the layer significance evaluated through various metrics, including Hessian spec- trum [110,111], and principal component analysis (PCA) [112]. Mostofthesensitivity-drivenmethodsrequirethepresenceofabaselineFP-32 pre-trained model. Alternatively, neural architecture search (NAS) can be used to determine layer bit widths. Notable work in this domain includes DNAS [113] and 49 HAQ [114], which translate the model compression problem to a search problem of efficient bit-width assignment to different layers. However, these methods re- quire a compute-intensive search procedure which is added on top of the training. Lastly, reference [115] uses intermediate activation densities to estimate the layer sensitivity and assign quantization bit widths, but does not re-evaluate bit-width assignments, limiting performance. 
Moreover, their quantization method does not necessarily yield models that can satisfy a target hardware constraint.

Meanwhile, the increased demand for data privacy has increased the need for on-device training and fine-tuning [116]. This trend makes many of these quantization-aware training methods impossible on resource-constrained devices. An alternative can be to rent costly GPU clusters, use quantization on private data, transfer the quantized model to the resource-constrained device, and finally remove sensitive data from the server cluster before terminating the session. However, the prohibitively expensive training time may significantly increase the cluster cost, which is generally charged on an hourly basis. This motivates the development of efficient training solutions that can yield mixed-precision quantized models with no baseline pre-training or iterative training.

Our contribution. We present an efficient "during-training" MPQ method that does not require a pre-trained model. To effectively search the large design space of MPQ, we decompose the core problem into two sub-problems. First, we use a novel bit-gradient-analysis driven layer-sensitivity evaluation to rank the layers based on their significance. Second, using this information and a specific target hardware constraint, we formulate an integer linear program (ILP) to decide the bit-precision of each layer. This combination of steps becomes the core of our bit-gradient sensitivity driven MPQ method, which we denote as BMPQ. BMPQ can be integrated into most training methods without any significant increase in training time because these extra steps are relatively low cost and are performed at regular intervals separated by multiple epochs of normal training.

Figure 3.1: (a) Illustration of the proposed BMPQ method. (b) Illustration of the superior performance of BMPQ compared to BNNs. Both methods use training from scratch.

The dynamic change in the bit-width assignment of BMPQ and the resulting increase in accuracy are illustrated in Fig. 3.1(a) and (b), respectively.

3.2 Preliminaries and Related Work

3.2.1 Quantization

Early works [117] used rule-based strategies that required human expertise to quantize a model. Subsequent works focused on HPQ [59,60,118] but often yielded reduced accuracy compared to the FP-32 baseline. To address this issue, works including non-uniform quantization [119], channel-wise quantization [120], and progressive quantization-aware fine-tuning [61] were proposed.

Recently, researchers have proposed using layer significance to guide the quantization of different layers to different bit widths and thus boost the accuracy. However, the space of possible assignments is large. In particular, for a model of L layers with N_B bit-width choices, there can be (N_B)^L options to consider. To handle this issue, [114] converted the quantization problem into a reinforcement learning problem based on an actor-critic model. [113] proposed a differentiable neural architecture search. Others use sensitivity analysis metrics [110–112] to determine the layer importance and bit-width assignment but require an FP-32 pre-trained model or rely on iterative training [121]. Thus, despite all these efforts, a single-shot, during-training MPQ approach that can yield similar-to-baseline performance while achieving ultra-high compression has been largely missing.

3.2.2 PACT Non-Linearity Function for Activation

The unbounded nature of the ReLU non-linear function can introduce significant approximation error when quantizing activations, particularly for low bit-precision activations [122].
Alternatively, clipped ReLU activations can provide bounded output, but finding a global clipping factor that maintains the model accuracy is quite challenging [122]. A Parameterized Clipping Activation (PACT) function [123] with a per-layer parameterized clipping level α has proven effective in the low-precision domain. For an input a^i, PACT produces an output a^o as follows:

a^o = 0.5\left(|a^i| - |a^i - \alpha| + \alpha\right) =
\begin{cases}
0, & a^i \in (-\infty, 0) \\
a^i, & a^i \in [0, \alpha) \\
\alpha, & a^i \in [\alpha, +\infty)
\end{cases}    (3.1)

The output is then linearly quantized to provide a^o_q. The computation of a^o_q for a k-bit activation is

a^o_q = \mathrm{round}\!\left(a^o \cdot \frac{2^k - 1}{\alpha}\right) \cdot \frac{\alpha}{2^k - 1}    (3.2)

The resemblance of PACT to ReLU increases as α increases. Note that α is a trainable parameter and, during backpropagation, \partial a^o_q / \partial \alpha is computed by using a straight-through estimator (STE) [124].

3.3 Methodology

3.3.1 Notation

Consider an L-layer DNN Φ with function f_Φ(X, Θ), where each layer l is parameterized by θ^l. A quantized version of f_Φ is denoted as f^q_Φ(X_q; Θ_q), in which each layer l is parameterized by θ^l_{q_l} with input activation tensor A^l_{q_l}. We train our model to minimize a loss L such that f_Φ(.) can closely mimic f^q_Φ(.), and thus minimize any performance degradation. For HPQ, q_l is the same for all l, whereas for MPQ, q_l may vary with l.

Definition 1. Support bit widths: For mixed-precision quantization, we define the support bit widths S_q as the set of possible bit widths that can be assigned to the parameters of any layer l of the model, excluding the first and last layers that are fixed to 16 bits as in [115].

3.3.2 Loss Bit Gradient

The gradient of the loss with respect to a weight scalar, \partial L / \partial w, indicates the direction that reduces the output error at the highest rate. Moreover, larger magnitude gradients lead to more significant changes in the weights, and thus correlate well with larger weight significance [4]. Inspired by this observation, we extend the notion of loss gradients to the bit level for a quantized weight tensor and propose a layer sensitivity metric that is driven by normalized bit gradients (NBG). For the l-th layer of a DNN, the quantization of a floating-point tensor θ^l to a fixed-point (signed) tensor θ^l_{q_l} is

S_{\theta^l} = \max(|\theta^l|) / (2^{q_l - 1} - 1); \quad \theta^l \in \mathbb{R}^{d_l}    (3.3)

\theta^l_{q_l} = \mathrm{round}(\theta^l / S_{\theta^l}) \cdot S_{\theta^l},    (3.4)

where d_l is the tensor dimension and S_{\theta^l} is the quantization scaling factor. The quantized weights follow a staircase function, which is non-differentiable. To solve this problem we use the STE, similar to other quantization methods [110,111]. To reduce storage, it is recommended to store the fixed-point (\theta^l_{q_l} / S_{\theta^l}) \in \{-2^{q_l-1}, \ldots, 2^{q_l-1}\}^{d_l} instead of θ^l_{q_l}. As is typical, we convert the scaled quantized weights to their corresponding 2's complement representation [125] as given by

\theta^l_{q_l} / S_{\theta^l} = -2^{q_l - 1} \cdot b_{q_l - 1} + \sum_{i=0}^{q_l - 2} 2^i \cdot b_i.    (3.5)

The loss bit gradient for each bit position is then derived as

\nabla_b L = \left[\frac{\partial L}{\partial b_{q_l - 1}}, \frac{\partial L}{\partial b_{q_l - 2}}, \ldots, \frac{\partial L}{\partial b_0}\right]    (3.6)

where

\frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial \theta^l_{q_l}} \cdot \frac{\partial \theta^l_{q_l}}{\partial b_i}.    (3.7)

For a layer l with maximum support bit width q_max, we compute the loss bit gradient for each of the q_max bits, yielding a d_l × q_max floating-point matrix. We sum the absolute values of each row to produce a d_l × 1 vector. The layer NBG is set to be the average value of this vector. Finally, we compute the epoch-normalized bit gradient (ENBG) of a layer as the mean of its NBGs over i epochs.
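A minimal PyTorch sketch of the per-layer NBG and ENBG computation of Eqs. (3.3)-(3.7) follows. It assumes the gradient with respect to the quantized weights is obtained through the STE (so it equals the gradient left in weight.grad after the backward pass); the helper names layer_nbg and ENBGTracker are illustrative, not taken from the BMPQ implementation.

import torch

def layer_nbg(weight, weight_grad, q_max=4):
    """Normalized bit gradient (NBG) of one layer, following Eqs. (3.3)-(3.7)."""
    # Scaling factor of Eq. (3.3) for the maximum supported bit width.
    s = weight.abs().max() / (2 ** (q_max - 1) - 1)
    # From Eq. (3.5), d(theta_q)/d(b_i) = s * 2**i for every 2's-complement bit
    # position (the sign of the MSB term vanishes once absolute values are taken).
    bit_scales = s * (2.0 ** torch.arange(q_max, dtype=weight.dtype, device=weight.device))
    # |dL/db_i| for each weight and bit position: a (d_l x q_max) matrix (Eqs. 3.6-3.7).
    bit_grads = weight_grad.reshape(-1, 1).abs() * bit_scales
    # Row-wise sum of absolute values, then the mean over the d_l weights.
    return bit_grads.sum(dim=1).mean()

class ENBGTracker:
    """Running mean of a layer's NBG over the epochs of one epoch interval."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, weight, q_max=4):
        if weight.grad is not None:
            self.total += float(layer_nbg(weight.detach(), weight.grad, q_max))
            self.count += 1

    def value(self):
        return self.total / max(self.count, 1)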
First-order derivatives like gradients can sometimes fail to capture the importance of a weight value based on its magnitude. However, the amortized nature of the ENBG computation over all weights and epochs makes the probability of this phenomenon occurring vanishingly small.

Definition 2. Epoch Intervals (EI): We define the epoch intervals as the ranges of epochs over which we collect the NBG of each layer to calculate the ENBG. For training a model with periodic epoch intervals, the k-th interval starts with the (\sum_{i=0}^{k-1} ep^i_{int} + 1)-th epoch, where ep^i_{int} = ep_{int} for any i. For aperiodic epoch intervals, we can use different ep^i_{int} for different interval indices i. We use a periodic EI with ep_{int} = 20.

3.3.3 ILP-Driven Iterative Bit-Width Assignment

After each epoch interval, we re-assign the bit widths to maximize performance by formulating an ILP that maximizes the total layer sensitivity subject to a given hardware constraint C. In particular, to capture the bit-width assignment at the start of the k-th interval for a layer l, we introduce a constrained integer variable Ω^k_l, which is multiplied by the negative ENBG of that layer, g^l_{k-1}. Next we solve the following problem

Objective: minimize \sum_{l=0}^{L-1} \left(g^l_{k-1} \cdot \Omega^k_l\right),    (3.8)

Subject to: \sum_{l=0}^{L-1} \phi\!\left(\Psi(\Omega^k_l)\right) \le C.    (3.9)

Here, Ψ(.) is a function that translates the assignment variable Ω^k_l to the corresponding bit width q^k_l, and φ(.) is a function that translates the bit width to a cost associated with the layer. For example, if C is a memory constraint, then φ(.) converts the bit-width assignment to a measure of memory usage. Notice that, in reality, g^l_{k-1} is computed assuming the bit-width assignments of the other layers are fixed, despite the fact that they may change after the ILP. This approximation enables our ILP optimization to search over a large combinatorial space, and any error associated with the approximation is mitigated by the repeated re-evaluation of the ILP after each epoch interval.

Figure 3.2: Step-wise description of layer bit-width evaluation.

Recently, an iterative training approach [111] has also used an ILP to assign layer bit widths, but only as a post-training optimization.¹ Moreover, reference [111] computed a layer-variable coefficient based on an L2-norm of the difference between the FP-32 and quantized weights of that layer, which may further increase the storage overhead. In contrast, we use the negative ENBG values as the coefficients of the respective Ω^k_l, which does not require any L2-difference compute/storage cost.

3.3.4 BMPQ Training

For the initial ep_w warm-up epochs, we train the model with each layer quantized to max(N_1, N_2, ..., N_m) bits, where S_q = [N_1, ..., N_m]. Throughout the training, we follow the recommendation of [115] to quantize the activation of layer l with the same number of bits as that used for the weights of that layer. During the weight update, we first allow the weights to take their updated FP-32 values, and then quantize them to a specific bit-precision and compute the loss based on a forward-pass evaluation with the quantized weights/activations. We use the ReLU non-linearity for the last layer and the PACT non-linearity for all other intermediate

¹ We do not quantitatively compare our work with this approach because it uses an iterative (rather than a single-shot) training procedure and thus does not constitute a fair comparison to our work.
56 Algorithm 1: BMPQ Training Data: Input dataset X, parameter Θ , different support bit widths S q and a hardware constraintC, initial warm-up epochs ep w , total epochs ep, total layers L, interval epoch ep int . 1 X q ← X 2 [q 0 ,...,q L− 1 ]← initLayerBitWidths() 3 for i← 0 to to ep do 4 for j← 0 to n B do 5 for l← 0 to n B do 6 θ l q l ← quantizeWt(θ l ,q l ) 7 A l q l ← quantizeAct(A l ,q l ) 8 end 9 L← computeLoss(X Q B ,Θ Q ;Y B ) 10 ∇ b L← updateBitGrad(∇L) 11 updateWeight(Θ ,∇L) 12 end 13 if (i+1)%ep int ==0 and (i+1)≥ ep w then 14 evalEBGQ(Θ ) 15 I← ILPQuant(Θ ) 16 [q 0 ,...,q L− 1 ]← updateLayerBitWidths(I) 17 end 18 end layers with low bit-precision. For a bit width N i = 2, we use ternary weight quantization [126] to minimize the Euclidean distance between the FP-32 and quantized weights. The fact that we perform quantized training for ep int epochs after every bit-width assignment iteration helps us avoid post-training fine-tuning operationstomaintainaccuracy. Thestep-wisedetailsoftheevalENBG(.)function is presented in Fig. 3.2. Details of the training method is presented in Algorithm 1. 57 3.4 Experimental Results 3.4.1 Experimental Setup Models and Datasets. We selected three widely used datasets, CIFAR-10 [104] CIFAR-100 [104] and Tiny-ImageNet [105] and chose popular CNN models for im- ageclassification,VGG16[13]andResNet18[15]. Forallthedatasetsweusedstan- darddataaugmentations(horizontalflipandrandomcropwithreflectivepadding) to train the models with a batch size of 128. BMPQ training settings. For CIFAR-10 and CIFAR-100, we trained our models for 200 epochs with initial learning rate (LR) of 0.1 that decayed by 0.1 after 80 and 140 epochs. For Tiny-ImageNet, we used 100 training epochs and used decay epochs as 40 and 70 keeping other hyperparameters the same as that forCIFAR.Weusedsupportbitwidthsof4and2bitsforallourtrainingbutfixed the first and last layers to have 16-bits. For the ResNet models, we ensured the downsampling layers have the same bit-width assignment as its input layer [115]. We used the SGD optimizer with momentum of 0.9 and an weight decay of 3e − 4 . 3.4.2 Results Table 3.1 shows the performance of the BMPQ trained models compared to their respective full-precision counterparts. In particular, for CIFAR-10 dataset the BMPQ trained models can yield a model compression ratio 2 of up to 15.4× while sacrificing a drop in absolute accuracy of up to only 1 .14%. Similarly, for CIFAR- 100 and Tiny-ImageNet, the BMPQ generated models provide near full precision 2 We define compression ratio as the ratio of bits required to store a FP-32 model to that required by the BMPQ model of same architecture. 58 Table3.1: PerformanceofBMPQgeneratedmodelscomparedtotherespectivebaseline full precision (FP-32) models. 
Dataset Model layer-wise bit width Test Acc (%) Compression Compression ratio (r 32 M ) ratio (r 16 M ) C Full precision 93.9 1× – I VGG16 [16, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 4, 16] 93.56 10.5× 5.2× F [16, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 2, 4, 16] 93.37 13.2× 6.6× A [16, 4, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16] 93.21 15.4× 7.7× R Full precision 95.14 1× – 10 ResNet18 [16, 2, 2, 4, 2, 4, 4, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2, 16] 94.54 13.4× 6.7× [16, 2, 2, 4, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 16] 94.10 14.4× 7.2× C Full precision 73 1× – I VGG16 [16, 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 4, 16] 72.61 11.7× 5.8× F [16, 4, 4, 4, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 16] 72.2 14.6× 7.3× A [16, 4, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 16] 71.26 15.4× 7.7× R Full precision 77.5 1× – 100 ResNet18 [16, 2, 2, 4, 2, 4, 4, 4, 2, 4, 4, 2, 4, 4, 4, 4, 2, 16] 75.98 9.4× 4.7× [16, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 4, 2, 16] 75.72 10.1× 5× Tiny- VGG16 Full precision 60.82 1× – Image [16, 4, 4, 4, 4, 4, 4, 2, 4, 4, 2, 2, 4, 2, 4, 16] 59.29 10× 5× Net ResNet18 Full precision 64.15 1× – [16, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 16] 63.27 8.8× 4.4× performances while yielding a compression ratio of up to 15.4× and 10.1× , re- spectively. This clearly shows the efficacy of BMPQ generated models to retain baseline performance while requiring significantly lower model storage. Analysis of ENBG at various iterations. We analyzed the ENBG snap- shots of VGG16 (on CIFAR-10) at the end of different epoch values. In particular, we choose two early-stage training epochs 20 and 40 and two mid-level training epochs100and120. AsshowninFig. 3.3(a),theENBGrepresentedlayersensitiv- ity changes significantly between epoch 20 and 40, hinting at the need for iterative re-evaluation during training. Also, a similar trend is observed at the mid-level training epochs (Fig. 3.3(b)), that forces the ILP to re-assign the 10 th and 14 th layer from 2-b to 4-b and 4-b to 2-b, respectively. 59 (a) (b) Figure3.3: (a),(b)LayersensitivitiesbasedonENBGforVGG16onCIFAR10,during early and late phase of the training, respectively. ep i indicates that the normalization was performed after i th epoch. Table 3.2: Comparison with single-shot MPQ achieved through analysis of activation density(AD).Note,tohaveafaircomparisonwereporttheaccuracyofourmodelsafter 120, 120, and 60 epochs for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. Model Dataset AD [115] BMPQ Improved Acc (%) Acc (%) compression VGG16 CIFAR-10 91.62 92.28 2.1× ResNet18 CIFAR-100 71.51 73.96 2.2× ResNet18 Tiny-ImageNet 44 58.54 2.9× 3.4.3 Comparison with Single-Shot Training Table 3.2 presents the comparison of the presented BMPQ with the recently proposed single-shot MPQ method [115] on CIFAR-10 3 , CIFAR-100, and Tiny- ImageNet. In particular, we can see that BMPQ models provide an improved accuracy of up to 14.54% with up to 2.9× less parameter bits. 3.4.4 BMPQ Generated Models as Teachers Knowledge distillation [76] is a popular technique to transfer the useful knowledge from a trained model (usually a large teacher model) to another model (usually a small student model). However, a resource-constrained edge device may not always support the full storage of an FP-32 teacher model or the transfer costs of this model from the cloud. We thus propose an unique application of the BMPQ models as a replacement to high-cost baseline teachers. Table 3.3 shows that 3 The original paper used VGG19 compared to VGG16 of ours. 
60 Table 3.3: Boosting of BNN model performance through BMPQ and FP-32 trained models as teachers. Dataset: CIFAR-10, model: ResNet18 (both teacher and student) Teacher Teacher Student Student KD ∆ precision accuracy (%) precision accuracy (%) accuracy (%) accuracy (%) FP-32 95.14 BNN 88.90 89.72 +0.82 BMPQ 94.54 BNN 88.90 89.94 +1.04 the performance improvement of a BNN after distillation from a BMPQ model is similar to that obtained using distillation from a FP-32 teacher. 3.4.5 Discussion Memory saving for inference. Let layer l of an L-layer model have p l parame- ters. Represented using homogeneous FP-32, the total storage requirement (MB) of the model weights can be given by M fp32 =4∗ ( L− 1 X l=0 p l 2 20 ). (3.10) For a BMPQ generated model, the weight storage cost (MB) and corresponding compression ratio r 32 M (and r 16 M compared to a 16-b baseline) can be computed as M BMPQ =( 4 32 )∗ ( L− 1 X l=0 p l · q l 2 20 ) (3.11) r 32 M = M fp32 M BMPQ and r 16 M =0.5∗ r 32 M . (3.12) Notethat, foreachlayer, weneedtostoreonlyonescalingfactorinFP-32. Hence, itsoverheadisnegligibleandignored. Column5inTable3.1showsthecorrespond- ing r 32 M for our BMPQ models. Compute cost and MPQ hardware support. Let the l th CONV layer of a DNN with inputI∈R H l i × W l i × C l i and outputO∈R H l o × W l o × C l o have a weight tensor θ l ∈ R k l × k l × C l i × C l o represented with q l bits. We exclude the data and instruction 61 Figure 3.4: Energy consumed by the CONV layers of VGG16 on CIFAR-10. Here, we add the E mem and E MAC values to get E total . flow costs and similar to [112], for a q l -bit quantized layer with FP-32 scaling factors,weadoptthefollowingsimplifiedmodelofmemoryaccessandcomputation E l q =E l mem +E l MAC (3.13) where E l mem =C l o E FP− 32 mem +(H l i W l i C l i +(k l ) 2 C l i C l o )E q l mem (3.14) E l MAC =(H l o W l o C l o )E FP-32 MAC +(H l o W l o (k l ) 2 C l i C l o )E q l MAC . (3.15) It is clear from the above equations that, with sufficiently large ( k l ) 2 C l i , the en- ergy consumption is dominated by the quantized bit component. Note that, the above energy estimation is conservative for 1-bit quantization, because the 1-bit computation can reduce to a simple XNOR and popcount. Fig. 3.4 shows that a BMPQ model with accuracy drop of only 0.34% can have total energy benefit of 9.9× over the FP-32 counterpart, where operation energy costs are taken from Table 3.4 to compute E total . These estimated energy savings may be realized by a recently proposed hardware platform that supports mixed-precision quantiza- tion[127]. Similarsavingsmightalsobepossibleusingin-memorymixed-precision hardware proposed in [128]. 62 Table 3.4: Estimate of energy consumption. Operation Estimated energy (pJ) [115] 32-b MAC INT 3.2 32-b MAC FP 4.6 q l -b MAC [(3.1*q l )/32 + 0.1] q l -b memory access 2.5*q l 3.5 Conclusions Inthischapterwepresentedamixed-precisionquantizationtrainingmethoddriven by epoch-normalized bit-gradients without any pre-trained model requirement. Our proposed ENBGs capture the sensitivity of DNN layers and drive an ILP formulation that iteratively assigns MPQ bit widths after certain epochs during training. Our results demonstrated the efficacy of BMPQ models when compared tobothFP-32baselinesandexistingsingle-shotMPQschemes. Wefurtherdemon- strated the efficacy of BMPQ models as potential teachers that can boost the ac- curacy of ultra-low-precision BNN student models. 
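To make the storage and energy accounting of Section 3.4.5 concrete, the short sketch below evaluates Eqs. (3.10)-(3.12) together with the per-MAC energies of Table 3.4 for an illustrative list of layers; the parameter counts and the bit-width assignment are placeholders rather than values from any model reported in this chapter.

def storage_mb(params_per_layer, bits_per_layer):
    """Weight storage in MB; Eq. (3.11), and Eq. (3.10) when every entry is 32."""
    return sum(p * b for p, b in zip(params_per_layer, bits_per_layer)) / (8 * 2 ** 20)

def mac_energy_pj(q_bits):
    """q-bit MAC energy from Table 3.4: (3.1 * q) / 32 + 0.1 pJ."""
    return (3.1 * q_bits) / 32 + 0.1

# Placeholder per-layer parameter counts and a BMPQ-style bit-width assignment.
params = [1_728, 36_864, 73_728, 147_456, 589_824]
bits = [16, 4, 4, 2, 2]

m_fp32 = storage_mb(params, [32] * len(params))
m_bmpq = storage_mb(params, bits)
print(f"storage: {m_fp32:.3f} MB -> {m_bmpq:.3f} MB (r_32M = {m_fp32 / m_bmpq:.1f}x)")

# MAC energy relative to a 32-bit floating-point MAC (4.6 pJ, Table 3.4).
for q in sorted(set(bits)):
    print(f"{q:>2}-bit MAC: {mac_energy_pj(q):.2f} pJ "
          f"({4.6 / mac_energy_pj(q):.1f}x cheaper than FP-32)")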
With the growing demand of privacy-preserving, on-device training and inference, we believe this work will act as a foundation for energy-efficient, on-device quantization. Efficient inference ar- chitecturewithpreciseenergy-modelingandanalysisoftheseBMPQmodelsunder real-life adversarial attack are two interesting future research directions. 63 Chapter 4 Model Compression for Spiking Neural Networks Thischapterfirstprovidestheintroductionandmotivationbehindgeneratingcom- presseddeepSNNmodelsinSection4.1. Preliminariesandreviewsofrelatedwork are provided in Section 4.2. Section 4.3 details our model compression scheme for SNN training. Detailed experimental evaluations are provided in Section 4.4 and finally the chapter concludes in Section 4.5. 4.1 Introduction and Motivation As discussed in chapter 1, SNNs, with the support from suitable event-driven tar- get hardware, are promising low-power alternative to conventional ANNs. With the advent of efficient deep SNN training strategies, model parameters and com- putation energy have also increased rapidly. Model compression including prun- ing[4,66,129]andquantization[59,130]hasmitigatedthisissueinANNs. InSNNs, theactivationvaluesarerepresentedthroughbinaryspikes,thusleveragingbenefits of ultra-low activation quantization. However, unfortunately, application of model compressionviapruninghasremainedachallengeintheSNNparadigm. Forexam- ple,itisobservedthatthespikecodingofSNNsmakestheiraccuracyverysensitive to model compression [75]. Moreover, for approaches based on ANN-to-SNN con- version, the ANN models are recommended to not have batch-normalization (BN) 64 Figure4.1: HistogramofthegradientsforCONVlayer7ofVGG16withtargetdensity of 0.4 at an early stage of training (after 10 epochs) to classify CIFAR-10. layers [131]. This distinction is important because BN plays a key role in train- ing loss convergence [33] and its absence makes achieving significant compression without a large performance drop more difficult. Among the handful of works on SNN compression through pruning [38,132], most are limited to shallow networks on small datasets like MNIST. A recent effort [75] has combined spatio-temporal backpropagation(STBP)andalternatingdirectionmethodofmultipliers(ADMM) topruneSNNsduringspike-basedtraining. However,SNNtrainingproceduresare memoryintensiveandhavelongtrainingtimes[74]. Moreover,toachievehighper- formance with ADMM, hand-tuning of the per-layer target parameter density is required, which itself is a tedious iterative procedure that often requires expert insight into the model [133]. Recently, sparse-learning (SL) [4,134] has emerged as a promising compression solution for ANNs as it does not require per-layer target parameter density and can achieve high compression in a single training iteration with better accuracy than many other approaches [66,69,129]. However, this non- iterative strategy suffers from non-convergence in BN-less deep ANN compression that can be attributed to the explosion of gradients illustrated in Fig. 4.1. We propose a non-iterative attention-guided compression (AGC) technique for deep SNNs. In particular, our novel sparse-learning strategy uses attention-maps of an unpruned pre-trained meta model (Fig. 4.2) to mitigate non-convergence of the BN-less ANN and guide the compression. This approach is different from the idea of distillation [76,135], because the meta-model in our approach can be 65 of lower complexity than the model to be compressed and thus we do not use the KL-divergence loss between models. 
In our approach we first compress an ANN model specifically designed for SNN conversion, and then apply the ANN-to-SNN conversion technique [42]. To reduce the number of time steps required for inference, we extend the hybrid SNN training strategy by supporting SL-based SNN training. The proposed method only requires a global target parameter density, as opposed to ADMM, where we need to provide this density for each layer of the model as a hyperparameter.

4.2 Preliminaries and Related Work

4.2.1 SNN Fundamentals

The main distinction between ANN and SNN functionality lies in their notion of time. In ANNs, inference is performed based on a single feed-forward pass through the network. An SNN, on the contrary, consists of a network of neurons that communicate through a sequence of binary spikes over a certain number of time steps T, which is often referred to as the inference latency of the SNN. Every synaptic neuron of an SNN layer has spiking dynamics that are characterized with the Integrate-Fire (IF) [136] or Leaky-Integrate-Fire (LIF) [137] model. The iterative version of the LIF neuron dynamics can be modeled through the following differential equation,

u_i^{t+1} = \left(1 - \frac{dt}{\tau}\right) u_i^t + \frac{dt}{\tau} I    (4.1)

where u_i^{t+1} represents the membrane potential of the i-th neuron at time step t+1, τ is a time constant, and I is the input from a pre-synaptic neuron. However, for evaluation of the model in discrete time [138], the iterative model of Eq. 4.1 for a linear layer can be modified as

u_i^{t+1} = \lambda u_i^t + \sum_j w_{ij} O_j^t - v_{th} O_i^t    (4.2)

O_i^t = \begin{cases} 1, & \text{if } u_i^t > v_{th} \\ 0, & \text{otherwise} \end{cases}    (4.3)

where the decay factor (1 - dt/τ) of Eq. 4.1 is replaced by the term λ, which is set to 1 for IF and to less than 1 for LIF. Here, O_i^t and O_j^t represent the output spikes of the current neuron i and its pre-synaptic neuron j, respectively; w_{ij} represents the weight between the two, and v_{th} is the firing threshold of the current layer. Inference is performed by simply comparing the total number of spikes generated by each output neuron over T time steps. Training SNNs, on the other hand, is challenging because exact gradients for binary spike trains are undefined, forcing the use of approximate gradients, and the training complexity scales with the number of time steps T, which can be large.

4.2.2 ANN-to-SNN Conversion

The ANN-to-SNN conversion based training algorithm is applicable only to IF neuron models. It was originally introduced in [48] and recently extended in [131] to improve accuracy on deep models. In this method, a constrained ANN model (no bias, max-pool, or BN layer) with ReLU activation is first trained. The ANN weights are copied to an SNN model and the analog input training data of the ANN is converted into rate-coded input spikes through a Poisson event generation process over T time steps (detailed in Section 4.4). The firing threshold v_th of each layer is set to the maximum value of \sum_j w_{ij} O_j^t (the 2nd term in Eq. 4.2) evaluated over the T time steps, computed using a subset of the training images. This threshold tuning operation ensures that the IF neuron activity precisely mimics the ReLU function of the corresponding ANN. Even though this conversion technique largely mitigates the training complexity of deep SNNs and achieves state-of-the-art inference, the resulting SNNs generally have larger inference latency (T ≈ 2500) and this decreases their energy efficiency.
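A minimal discrete-time simulation of the IF/LIF dynamics of Eqs. (4.2)-(4.3) is sketched below. The layer sizes, weights, and input spike train are placeholders, and the soft reset (subtracting v_th when a spike fires) is applied within the same time step for simplicity.

import torch

def lif_step(u, in_spikes, weight, v_th=1.0, lam=1.0):
    """One time step of Eqs. (4.2)-(4.3); lam = 1 gives IF, lam < 1 gives LIF."""
    u = lam * u + in_spikes @ weight.t()       # leak + integrate weighted input spikes
    out = (u > v_th).float()                   # fire when the membrane crosses v_th
    u = u - v_th * out                         # soft reset by subtracting the threshold
    return u, out

# Toy two-layer SNN run for T time steps; inference = spike count per output neuron.
T, n_in, n_hid, n_out = 25, 100, 64, 10
w1 = torch.randn(n_hid, n_in) * 0.1
w2 = torch.randn(n_out, n_hid) * 0.1
u1, u2 = torch.zeros(1, n_hid), torch.zeros(1, n_out)
counts = torch.zeros(1, n_out)
for t in range(T):
    x_t = (torch.rand(1, n_in) < 0.3).float()  # placeholder Poisson-like input spikes
    u1, s1 = lif_step(u1, x_t, w1)
    u2, s2 = lif_step(u2, s1, w2)
    counts += s2
print("predicted class:", counts.argmax(dim=1).item())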
For our SNN models we adopt a time and memory efficient hybrid training strategy [74] where we use the SNN training for only few epochs using a linear surrogate-gradient [47] based SL, as will be detailed in Section 4.3.2. 4.2.3 Model Compression in SNNs To improve the energy-efficiency of the SNN models, [132] developed a pruning methodology that used the sparse firing characteristics of IF neurons in different layers to adjust the corresponding number of synaptic weight updates during SNN training. Other works relied on strategies like dynamic pruning by introducing multi-strength SNN (M-SNN) models [139] or considering the correlation between pre and post-neuron spike activity [38]. However, the authors of most of these works have evaluated their approaches on shallow architectures for small datasets like MNIST and DVS. Recently, [75] used ADMM to compress the models while performing STBP based SNN training. However, as mentioned earlier, ADMM requires added hyperparameters of per-layer target parameter density which de- mand iterative training [133] or complex procedures like reinforcement learning to be added to the training loop, making the already tedious SNN training more difficult. Moreover, these method fail to provide conventional ANN equivalent ac- curacy for the compressed models. Note that these SNN compression strategies are applied during SNN training which is memory intensive because of the need to perform back propagation through time. Recently few works have also focused on modelquantization[136,140],anotherwell-knownstrategytoyieldenergy-efficient 68 deepmodels. Althoughourproposedapproachfocusesonpruning, itcaneasilybe extended to support pruning on quantized models, as these are largely orthogonal techniques. To solve the above mentioned issues, we propose to prune the constrained ANN models (designed for SNN conversion) using attention-guided compression. This strategy, detailed in Section 4.3.1, requires only a global target parameter density and performs sparse weight updates (sparse-learning) to avoid requiring iterative training. The specific sparse-learning we adopt [4] is computationally more efficient than other similar strategies [141] and uses a more comprehensive approach of regrowing the weights based on the magnitude of their momentum andoutperformssimilarapproaches[141,142]. Ourhypothesisisthattheproposed approachcantargetANNsdesignedforSNNconversionandacceleratethecurrent tedious training techniques for SNN compression to yield superior performance. 4.3 Hybrid Sparse Learning (SL) of SNNs We now introduce our two-step hybrid sparse-learning strategy for SNNs. First, Section4.3.1detailsourANNtrainingmethodusingattention-guidedcompression (AGC) that targets conversion-friendly ANNs. Section 4.3.2 then presents our sparse-learning based supervised SNN training to finally generate the compressed SNN model. The approach is a form of sparse-learning because during the entire procedure we update only a fraction of the weights to be non-zero and always satisfy the cardinality constraints of non-zero weights for the compressed network. 69 (a) (b) Figure 4.2: Two major training stages of the proposed scheme: (a) ANN training using attention-guided compression (AGC), (b) Sparse-learning based SNN training using surrogate gradient-based training. 70 4.3.1 Attention-guided Compression (AGC) Let us assume a convolutional layer l activation tensor A l ∈ R H l i × W l i × C l i with C l i feature planes/channels and spatial dimension of H l i × W l i . 
An activation-based mapping F converts this 3D tensor to a flattened spatial attention map, i.e.,

F : \mathbb{R}^{H_i^l \times W_i^l \times C_i^l} \rightarrow \mathbb{R}^{H_i^l \times W_i^l}    (4.4)

One of the most widely used attention-map functions is F_p = \sum_{c=1}^{C_i^l} |A_c|^p, where p ≥ 1 is a parameter choice that determines the relative degree of emphasis given to the most discriminative parts of the feature map. Recently, several works have proposed to distill knowledge from a computation-heavy teacher to a less complex student by penalizing the student according to the difference between their associated attention maps [135]. This difference is added to the student model's loss function and helps train the student model to closely follow the teacher model's inference behavior.

Inspired by the above framework, we introduce a meta-model Ψ_m to guide the model compression of a BN-less ANN Ψ_c. In particular, we add to Ψ_c's loss function an activation-based attention-transfer loss term to minimize the differences between the meta and compressed models' activation maps. In our case, Ψ_m is either a low-complexity unpruned model or the unpruned variant of Ψ_c, in contrast to the computation-heavy teacher models used in distillation. Moreover, as the purpose of Ψ_m is to avoid the gradient explosion during the initial part of training, we remove the attention-guided (AG) loss component (1st term in Eq. 4.5) after a certain number of epochs ϵ. This allows Ψ_c to not be upper-bounded by the performance of Ψ_m.¹ More precisely, our proposed loss function for AGC is

L = \frac{\alpha}{2} \sum_{j \in I} \left\lVert \frac{Q_j^{\Psi_c}}{\lVert Q_j^{\Psi_c} \rVert_2} - \frac{Q_j^{\Psi_m}}{\lVert Q_j^{\Psi_m} \rVert_2} \right\rVert_2 + L_{CE}^{\Psi_c}(y, \tilde{y}),    (4.5)

where α is the scale factor for the AG loss that is set to zero after ϵ epochs, and the 2nd term is the standard cross-entropy (CE) loss, where ỹ and y are the Ψ_c output and the one-hot label, respectively. The terms Q_j^{Ψ_c} and Q_j^{Ψ_m} represent the j-th pair of vectorized attention maps F of specific layers of Ψ_c and Ψ_m, respectively. We take the difference of the l2-normalized attention maps, evaluate the l2-norm of the result, and accumulate over all layer pairs j ∈ I. In general, we choose pairs of layers where the spatial dimensions of Ψ_m and Ψ_c are similar. In particular, details of the pairs for the VGG and ResNet models are presented in the Supplementary Material. However, pairs of layers having different shapes can also be compared by matching their shapes through interpolation [135]. The very fact that Ψ_m can be a light model, as opposed to Ψ_c, reduces the computational complexity of pre-training compared to distillation.

¹ We note that some distillation approaches also penalize the network using a weighted KL-divergence between the probabilistic outputs of the two networks, particularly for a more complex teacher model. However, we empirically verified that adding KL-divergence to the loss worsens the performance of Ψ_c. Also, this added term reduces the importance of L_CE^{Ψ_c}, which is critical in sparse-learning.

Figure 4.3: Plot of test accuracy versus epochs for ResNet12 on CIFAR-10 for a model compressed using AGC with VGG9 chosen as Ψ_m.

We should also emphasize that we start the ANN training with initialized weights and a random prune-mask that satisfies the non-zero parameter budget associated with the user-given target parameter density d. Based on the loss of Eq. 4.5 we evaluate each layer's importance by computing the normalized momentum contributed by its non-zero weights during an epoch.
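The attention map of Eq. (4.4) and the AGC objective of Eq. (4.5) can be sketched in PyTorch as follows. The batch-mean reduction and the helper names attention_map, ag_loss, and agc_loss are implementation assumptions rather than the dissertation's code; in practice the activation pairs would be collected with forward hooks on the chosen layers of Ψ_c and Ψ_m.

import torch
import torch.nn.functional as F

def attention_map(act, p=2):
    """Eq. (4.4): collapse an (N, C, H, W) activation to an (N, H*W) map by
    summing |A_c|^p over the channel dimension and flattening spatially."""
    return act.abs().pow(p).sum(dim=1).flatten(start_dim=1)

def ag_loss(acts_c, acts_m, p=2):
    """Attention-guided term of Eq. (4.5): l2 distance between the
    l2-normalized attention maps of paired layers of Psi_c and Psi_m."""
    total = 0.0
    for a_c, a_m in zip(acts_c, acts_m):
        q_c = F.normalize(attention_map(a_c, p), dim=1)
        q_m = F.normalize(attention_map(a_m, p), dim=1)
        total = total + (q_c - q_m).norm(p=2, dim=1).mean()   # mean over the batch
    return total

def agc_loss(logits, labels, acts_c, acts_m, alpha=1.0, epoch=0, eps=100):
    """Full AGC objective of Eq. (4.5); the AG term is dropped after `eps` epochs."""
    ce = F.cross_entropy(logits, labels)
    if epoch >= eps:
        return ce
    return 0.5 * alpha * ag_loss(acts_c, acts_m) + ce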
This evaluation helps us decide which layers should have more non-zero weights under the given parameter budget and update the pruning mask accordingly. More precisely, we re-grow the weights with the highest momentum magnitude after pruning a fixed percentage of least-significant weights from each layer based on their magnitude, as suggested in [4]. Details of AGC are shown in Algorithm 2. Fig. 4.3 shows the successful compression of ResNet12 to a target density d = 0.1 using the proposed AGC framework and contrasts that with the significant accuracy drop observed when compressed using SL [4]. 4.3.2 Sparse Learning based SNN Training After the successful compression of the ANN model, we compute the threshold of each layer via the threshold generation algorithm proposed in [131]. We then perform SNN training for a few epochs (≈ 20) to reduce the inference time step. The standard supervised SNN training uses a surrogate gradient [47,143,144] to make backpropagation-based optimization feasible given the discontinuous nature 73 Algorithm 2: Detailed Algorithm for Attention-Guided compression. 1 Input: runEpochs, momentum µµµ l , prune rate p, initial Θ , initial Π , target density d, Ψ m , ϵ . Data: i=0..runEpochs, pruning rate p=p i=0 2 3 for l← 0 to L do 4 θ l ← init(θ l ) &π l ← createMaskForWeight(θ l ,d) 5 applyMaskToWeights(θ l ,π l ) 6 end 7 for i← 0 to runEpochs do 8 α =α ∗ Bool(i<ϵ ) 9 for j← 0 to numBatches do 10 L= α 2 ∗L AG +L CE 11 ∂L ∂θ =computeGradients(Θ ,L) 12 updateMomentumAndWeights( ∂L ∂θ ,µµµ ) 13 for l← 0 to L do 14 applyMaskToWeights(Θ l ,Π l ) 15 end 16 end 17 tM← getTotalMomentum(µµµ ) 18 pT← getTotalPrunedWeights(Θ ,p i ) 19 p i ← linearDecay(p i ) 20 for l← 0 to L do 21 µµµ l ← getMomentumContribution(θ l ,π l ,tM,pT) 22 Prune(θ l ,π l ,p i ,pT) 23 Regrow(θ l ,π l ,µµµ l · tM,pT) 24 applyMaskToWeights(θ l ,π l ) 25 end 26 end of the neuron spikes. The surrogate gradient is typically a pseudo-derivative in the form of a linear or exponential function of the membrane potential. However, the existing SNN training scheme must be adjusted in the context of our proposed compression framework. In particular, in our SNN training the neuron membrane dynamics are modeled as 74 u t+1 i =u t i + X j m ij ∗ w ij O t j − v th O t i (4.6) O t i = 1, if z t i >0, 0, otherwise (4.7) where z t i = ( u t i v th − 1) denotes the normalized membrane potential and m ij ∈{0,1} denotes the fixed prune-mask between a neuron i and its pre-synaptic neuron j achievedattheendofANNtraining,wherea0and1indicateabsenceandpresence of synaptic weights, respectively. Note that Eq. 4.7 does not model the leak part so that it can support IF training. Thus, during the forward propagation, the weighted sum of the pre-synaptic neuron spikes are accumulated in the membrane potential of the current layer neurons. At each synaptic neuron the IF model of the activation function compares the membrane potential and the threshold of that layer to generate an output spike. This is repeated for all layers until the last layer. For the last layer we accumulate the inputs over all time steps and pass them through a softmax layer to compute the multi-class probability. Duringbackpropagationwithalearningrateη thelinearlayerweightscanthen be updated as w ij =w ij − ηδw ij (4.8) δw ij =m ij ∗ X t ∂L ∂O t i ∂O t i ∂z t i ∂z t i ∂u t i ∂u t i ∂w t ij (4.9) where O t i is the thresholding function. 
The term ∂O t i ∂z t i requires a pseudo-derivative and we follow [47] to define this as ∂O t i ∂z t i =γ ∗ max{0,1−| z t i |} (4.10) 75 Figure4.4: Inputrate-codedspikeequivalentimagesfordifferentnumberoftimesteps T. where γ is known as a damping factor of the backpropagation error. Because we update only weights with corresponding mask value of 1, we term this as ‘sparse- learning’ and the whole approach as a form of hybrid SL. 4.4 Experiments This section first describes how we evaluate the effectiveness of the proposed com- pression scheme and then presents the compression results on CIFAR-10, CIFAR- 100andTiny-ImageNetwithVGGandResNetmodelvariants. Finally, todemon- stratetheenergy-efficiencyofthegeneratedmodels,thesectionpresentsadetailed evaluation of the FLOPs and compute energy for the compressed SNNs (SNN C ). Input Data TotrainourANNs,weusedthestandarddata-augmentedinputsetforeachmodel. However, for the ANN-to-SNN conversion and SNN training we used a rate-coded variant obtained through a Poisson generator function that produces a spike train with rate that is proportional to the input pixel value. In particular, it generates a random number at every time step for each pixel in the input image that is compared with the normalized pixel value. An output spike is generated if the random number is less than the pixel. As T increases, the rate-coded input spike train becomes a closer approximation to the analog input (Fig. 4.4). 76 Model and ANN Training For the ANN training with VGG and ResNet, we adopted several constraints that facilitate efficient SNN conversion [131]. In particular, our ANN models are de- signed without bias terms or BN layers. Also, our pooling operations use average pooling because for binary spike based activation layers max pooling incurs sig- nificant information loss. We used dropout to regularize both the ANN and SNN modelsfortheuncompressedbaselinetraining. However,asthecompressedmodels havesignificantlylesschanceofover-fitting,weremovedtheconvolutionaldropout layersduringtheentirehybridSLprocedure. 2 FortheResNet12modelwereplaced the initial convolution layer with a pre-processing block consisting of a series of three convolution layers of size 3× 3 with a stride of 1. After the pre-processing block of ResNet12 four basic block layers are used each of which has two 3× 3 CONV layers (Fig. 4.5). We performed the ANN training for 240 epochs with an initial learning rate (LR) of 0.01 that decayed by a factor of 0.1 after 150, 180, and 210 epochs. We hand tuned and set both α and ϵ epoch to be 100. We used a starting prune rate p of 0.5 that decays linearly every epoch. Unless stated otherwise, for the meta- model we used an unpruned VGG9 ANN designed and trained with the same constraints. Conversion and SNN Training For the first hidden layer, we compute the maximum input to a neuron over all its neurons across all T time steps for a set of input images and set this value as the layer threshold [131]. We sequentially compute the thresholds of the subsequent 2 In particular we removed dropout for d≤ 0.3 as we assumed any density lower than this as sufficientcompressionandempiricallyverifiedthatadditionofdropoutaddsnoaccuracybenefit. 77 Figure 4.5: A ResNet basic block layer for target density (a) d = 1.0 (left) and d = 0.1 (right). layers similarly taking the maximum across all neurons and time steps. 3 We con- sideredonly512inputimagestolimitconversiontimeandusedathresholdscaling factorof0.74forSNNtrainingandinference,followingtherecommendationin[74]. 
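The threshold-generation step just described can be summarized by the short sketch below, which Poisson-encodes a small calibration set and sequentially sets each layer's threshold to the maximum pre-activation observed over all neurons and all T time steps, before applying the 0.74 scaling factor. It is a simplified sketch, not the implementation of [131]: the helper names poisson_encode and calibrate_thresholds are introduced here, layers is assumed to be the ordered list of weight layers, and architectural details (pooling, the ResNet pre-processing block, prune masks) are omitted.

import torch

def poisson_encode(img, T):
    # Rate-coded spike train: emit a spike whenever a U(0,1) draw falls
    # below the normalized pixel value, independently at each time step.
    rand = torch.rand((T,) + img.shape, device=img.device)
    return (rand < img).float()                  # shape (T, N, C, H, W)

@torch.no_grad()
def calibrate_thresholds(layers, calib_imgs, T, scale=0.74):
    # Sequentially set each layer's threshold to the max pre-activation
    # seen over the calibration images (e.g., 512) and T time steps.
    thresholds = []
    for l, layer in enumerate(layers):
        v_max = 0.0
        for img in calib_imgs:
            spikes = poisson_encode(img, T)
            mem = [None] * l                     # membrane state of earlier layers
            for t in range(T):
                x = spikes[t]
                for i, (prev, v_th) in enumerate(zip(layers[:l], thresholds)):
                    u = prev(x) if mem[i] is None else mem[i] + prev(x)
                    out = (u > v_th).float()     # IF firing, no leak (Eq. 4.7)
                    mem[i] = u - v_th * out      # subtractive reset (Eq. 4.6)
                    x = out
                v_max = max(v_max, layer(x).max().item())
        thresholds.append(v_max)
    return [scale * v for v in thresholds]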
Initialized with these layer thresholds and the trained ANN weights, we per- formed our sparse-learning based SNN training for only 20 and 12 epochs for CIFAR and Tiny-ImageNet, respectively. We set γ = 0.3 [47] and used a starting LR of 10 − 4 which decays by a factor of 0.5 every 7 (5) epochs for CIFAR (Tiny- ImageNet). Due to resource and memory constraints we performed the 12 epochs of SNN training on Tiny-ImageNet with a subset of 20,000 images (1/5 th of the total training set) and evaluated on 5,000 (1/2 of the total test set) test images to report our final test accuracy. Note that the dropout units are implemented with element-wisemultiplicationwithrandomlygeneratedmasksthatarekeptconstant for the entire SNN training. 3 For the ResNet variant, the threshold evaluation is done only for the pre-processing block convolution layers [74]. 78 Table 4.1: Model performances with AGC based training on CIFAR-10, CIFAR-100, and Tiny-ImageNet after a) ANN training, b) ANN-to-SNN conversion and c) SNN training. Compre- a. b. Accuracy (%) with c. Accuracy (%) Architecture ssion ANN (%) ANN-to-SNN conversion after sparse ratio accuracy T = 2500 Reduced T SNN training Dataset : CIFAR-10 VGG11 1× 91.57 91.17 89.16 89.84 10× 91.10 90.64 86.16 90.45 VGG16 1× 92.55 92.01 84.79 91.13 2.5× 92.97 92.92 90.08 91.29 20× 91.85 91.39 79.08 90.74 33.4× 91.79 91.22 72.53 90.15 ResNet12 1× 91.37 90.87 88.98 90.41 10× 92.04 91.71 83.46 90.79 Dataset : CIFAR-100 VGG11 1× 66.30 64.18 62.49 64.37 4× 67.40 65.10 62.57 64.98 VGG16 1× 67.62 65.91 54.30 64.69 10× 67.45 65.84 51.63 64.63 ResNet12 1× 61.61 59.85 56.97 62.60 10× 63.52 61.43 52.66 63.02 Dataset : Tiny-ImageNet VGG16 1× 56.56 56.8 51.14 51.92 2.5× 57.00 56.06 51.9 52.7 4.4.1 Results with AGC Table 4.1 shows the performance of our proposed compression scheme for all three datasets. To evaluate models at reduced time steps we chose T as 100 (175), 120 (200) and 150 for VGG (ResNet) variant to classify on CIFAR-10, CIFAR-100 and Tiny-ImageNet, respectively. The results show that our sparse-learning based SNN training significantly improves the model performance for classification at reduced time steps. In particular, for VGG16 with a compression ratio of 33.4× , our hybrid SL can improve the accuracy by∼ 18% compared to what is achievable with original conversion-based models with the same reduced T. The SNN trained compressed models perform similar to their uncompressed counterparts with a compression ratio of up to 33.4× , 10× , and 2.5× for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. In particular, for lower compression ratios we obtain improved classification performance which may be due to better regularization of 79 Table 4.2: Performance comparison of the proposed hybrid SL with state-of-the-art deep SNNs on CIFAR-10 and CIFAR-100. Authors Training Architecture Compress- Accuracy Time type ion ratio (%) steps Dataset : CIFAR-10 Cao et al. ANN-SNN 3 CONV, 1× 77.43 400 (2015) [145] conversion 2 linear Sengupta et ANN-SNN VGG16 1× 91.55 2500 al. (2019) [131] conversion Wu et al. Surrogate 5 CONV, 1× 90.53 12 (2019) [146] gradient 2 linear Rathi et al. Hybrid VGG16 1× 91.13 100 (2020) [74] training 1× 92.02 200 Deng et al. STBP 11 layer 1× 89.53 8 (2020) [75] training CNN Deng et al. STBP 11 layer 4× 87.38 8 (2020) [75] training CNN This work Hybrid SL VGG16 2.5× 91.29 100 33.4× 90.15 100 Dataset : CIFAR-100 Deng et al. STBP 11 layer 2× 57.83 8 (2020) [75] training CNN This work Hybrid SL VGG11 4× 64.98 120 AGC. 
For example, the VGG11 model on CIFAR-100 with 4× compression ratio hasanincreasedclassificationperformanceof0 .61%comparedtotheuncompressed baseline which only uses dropout for regularization. To the best of our knowledge, we are the first to report successful compression results on Tiny-ImageNet. Table 4.2 provides a comparison of the performances of models generated through our hybrid SL with state-of-the-art deep SNNs. On CIFAR-10, our ap- proach outperforms the compressed models [75] with an increased classification performance of 2.77% and 8.35× better compression ratio. On CIFAR-100, our approachsimultaneouslyyields2× highercompressionand7.15%higheraccuracy. 80 (a) (b) (c) Figure 4.6: Plot of test accuracy versus epochs with different target densities for (a) ResNet12 on CIFAR-10, (b) VGG11 on CIFAR-100, and (c) VGG16 on Tiny-ImageNet. The SNN training is done at reduced time steps. 4.4.2 Analysis of Energy Consumption Spiking Activity To model energy consumption, we assume a generated SNN spike consumes a fixed amount of energy [147]. Based on this assumption, earlier works [74,131] have adopted the average spiking activity (also known as average spike count) of an SNN layer l, denoted ζ l , as a measure of compute-energy of the model. In particular, ζ l is computed as the ratio of the total spike count in T steps over all the neurons of the layer l to the total number of neurons in that layer. Thus lower the spiking activity the better is the energy efficiency. Fig. 4.7 shows the per-image average number of spikes for each layer with uncompressed and compressed (d = 0.03) VGG16 while classifying on CIFAR-10 over 100 time steps. As we can see, the spiking activity for all the layers reduces significantly with compression. For example, the average spike count of the 11 th convolutional layer of the uncompressed model is 11.7. For the compressed variant the value is only 0.44. In particular, the spiking activity of the uncompressed model can increase from 1.3× to 25.4× across different layers of the SNN. 81 Figure4.7: AveragespikingactivitygeneratedateachlayerofVGG16whileclassifying over the test set of CIFAR-10 for model having parameter density (d) of 1.0 and 0.03. Measure of FLOPs and Computation Energy Consideraconvolutionallayerlwithweighttensorθ l ∈R k l × k l × C l i × C l o takesaninput activation tensor A l ∈R H l i × W l i × C l i , where H l i ,W l i , k l , C l i and C l o are input height, width, filter height (or width), channel size, and number of filters, respectively. This section quantifies the energy associated with producing the corresponding output activation map O l ∈ R H l o × W l o × C l o for a standard ANN, an uncompressed SNN, and finally a compressed SNN. Table 4.3: Convolutional layer FLOPs for ANN and SNN models Model FLOPs of a CONV layer l Variable Value ANN FL l ANN (k l ) 2 × H l o × W l o × C l o × C l i SNN FL l SNN (k l ) 2 × H l o × W l o × C l o × C l i × ζ l SNN C FL l SNN C ( P H l o − 1 x=0 P W l o − 1 y=0 P C l o − 1 p=0 P C l i − 1 n=0 P k l − 1 i=0 P k l − 1 j=0 ζ l n,x+i,y+j × m l p,n,i,j ) ThenumberofFLOPSneededforlayer l ofastandardANN,denoted FL l ANN , iseasytocalculateandshowninrow1ofTable4.3[80]. Theformulacanbeeasily adjusted for an uncompressed SNN in which each neuron spike in layer l triggers 82 (a) (b) (c) (d) (e) (f) Figure 4.8: Comparison of ANN to SNN in terms of (a-c)FLOPs and (d-f)normalized compute energy for VGG16 with different parameter density to classify (a,d) CIFAR-10, (b,e) CIFAR-100, and (c,f) Tiny-ImageNet. 
a weight accumulation across each of its connected post-synaptic neurons in layer l+1, denoted as FL l SNN in Table 4.3. For a compressed SNN model, however, the calculation is complicated by the fact that the presence of spikes at a pre-synaptic neuron triggers accumulations in a subset of post-synaptic neurons due to sparsity. In particular, we assume that masked weights are not accumulated via inexpensive zero-gating logic and the resulting calculation for FL l SNN C is shown in row 3 of Table 2.2, assuming a stride value of 1. Here, ζ l n,x+i,y+j represents total spike count accumulated over T time steps at the (x+i,y+j) th input activation map element in the n th channel and m l p,n,i,j represents the mask for weight of location (i,j) in the n th channel of the p th filter. m l p,n,i,j =0, if w l p,n,i,j =0 and 1, otherwise. For ANNs, FLOPs are dominated by the total multiply accumulate (MAC) operation of the CONV and linear layers. On the other hand, for SNNs, the 83 FLOPs are limited to accumulates (ACs) as the spikes are binary and thus simply indicatewhichweightsneedtobeaccumulatedatthepost-synapticneurons. Thus theinferencecomputeenergyattheCONVlayersforthemodelscanbequantified as E ANN =( L X l=1 FL l ANN )· E MAC (4.11) E SNN =( L X l=1 FL l SNN )· E AC (4.12) E SNN C =( L X l=1 FL l SNN C )· E AC (4.13) where E ANN represents the energy for an ANN layer, and the energy for the uncompressed and compressed SNN layer is represented as E SNN and E SNN C , respectively. Here, E MAC and E AC are the energy consumption for a MAC and AC operation respectively. As we can see in Table 4.4 E AC is ∼ 32× lower than E MAC [8]. Fig. 4.8 illustrates the energy consumption and FLOPs for ANN and SNN models of VGG16 while classifying three datasets, where the energy is normalized to that of an equivalent uncompressed ANN. As we can see, the number of FLOPs for an SNN is larger than that for an ANN with similar number of parameters. However, becausetheACsconsumesignificantlylessenergythanMACs, asshown inTable4.4,SNNsaresignificantlymoreenergyefficient. Inparticular,forCIFAR- 10 a compressed SNN consumes 12.2× less compute energy than a comparable compressed ANN with similar parameters and 38.7× less compute energy than a comparable uncompressed ANN. 4 For CIFAR-100 and Tiny-ImageNet with SNN compression, the energy-efficiency can reach up to 10 .8× and 5.2× , respectively, as opposed to ANN models having similar parameters. 4 Here, we used layer spike counts averaged over 20 test input samples to evaluate the SNN FLOPs. 84 Table 4.4: Estimated energy costs for various operations in 45 nm CMOS process at 0.9 V [8]. Serial No. Operation Energy (pJ) 1. 32-bit multiplication int 3.1 2. 32-bit addition int 0.1 3. 32-bit MAC 3.2 (#1 + #2) 4. 32-bit AC 0.1 (#2) 4.5 Conclusions Thischapterproposedahybridsparse-learningapproachforgeneratingcompressed deep SNN models that have reduced spiking activity and thus high energy ef- ficiency. In particular, we first introduced a novel attention-guided ANN com- pression, then used the ANN-to-SNN conversion by sequentially fixing the firing threshold of each layer, and finally performed training of the SNN model using a sparse-learning based approach that started with the compressed ANN weights. Experimentalevaluationshowedthat,thegeneratedsparseSNNshavecompression ratios of up to 33.4× with negligible drop in accuracy. Moreover, the reduced time steps to perform inference further reduces the average spiking activity of the mod- els required for classification. 
Compared to unpruned and iso-parameter ANNs, our generated SNNs are up to 38.7× and 12.2× more energy efficient, respectively, with no significant drop in accuracy.

Part II: Robustness of Compressed Energy-Efficient Models

Chapter 5
Efficient Training for Robust Yet Pruned Models

This chapter first provides the introduction and motivation for generating compressed yet robust models in Section 5.1. Reviews of related work are provided in Section 5.2. Section 5.3 details our robust model compression scheme. Section 5.4 presents detailed experimental results demonstrating the benefits and efficacy of the approach in terms of accuracy as well as compute costs, and finally the chapter concludes in Section 5.5.

5.1 Introduction and Motivation

Despite the proliferation of deep learning-powered applications, machine learning models have raised significant security concerns due to their vulnerability to adversarial examples, i.e., maliciously generated images that are perceptually similar to clean ones yet fool classifier models into making wrong predictions [148,149]. Various recent works have proposed associated defense mechanisms including adversarial training [149], hiding gradients [150], adding noise to the weights [2], and several others [151].

Meanwhile, large model sizes have high inference latency, computation, and storage costs that represent significant challenges in deployment on IoT devices. Thus reduced-size models [56,57] and model compression techniques, e.g., pruning [4,65,152], have gained significant traction. In particular, earlier work showed
In particular, we introduce a hybrid loss function for robust compression which has three major components: a cleanimageclassificationloss, adynamic L 2 -regularizerterminspiredbyarelaxed 88 (a) (b) Figure 5.1: (a) Weight distribution of the 14 th convolution layer of ResNet18 model for different training schemes: normal, adversarial [1], and noisy adversarial [2]. (b) An adversarially generated image (ˆ x) obtained through FGSM attack, which is predicted to be the number 5 instead of 8 (x). version of ADMM [156], and an adversarial training loss. Inspired by sparse- learning-based training scheme of [4], we then propose a non-iterative training frameworktoachievearobustprunedDNNusingtheproposedloss. Inparticular, DNR dynamically arranges per layer pruning ratios using normalized momentum, maintaining the target pruning every epoch, without requiring any fine tuning. 5.2 Preliminaries and Related Work 5.2.1 Adversarial Attacks on DNNs Recently, various adversarial attacks have been proposed to find fake images, i,e., adversarialexamples,whichhavebarely-visibleperturbationsfromrealimagesbut still manage to fool a trained DNN. One of the most common attacks is the fast gradient sign method (FGSM) [149]. Given a vectorized input x of the real image and corresponding label t, FGSM perturbs each element x in x along the sign of the associated element of the gradient of the inference loss w.r.t. x as shown in Eq. 5.1 and illustrated in Fig. 5.1(b). Another common attack is the projected gradient descent (PGD) [1]. The PGD attack is a multi-step variant of FGSM 89 where ˆ x k=1 = x and the iterative update of the perturbed data ˆ x in k th step is given in Eq. 5.2. ˆ x=x+ϵ × sign(∇ x L(f Φ (x,Θ ;t))) (5.1) ˆ x k =Proj Pϵ (x) (ˆ x k− 1 +α × sign(∇ x L(f Φ (ˆ x k− 1 ,Θ ;t))) (5.2) Here, the scalar ϵ corresponds to the perturbation constraint that determines the severity of the perturbation. f Φ (x,Θ ;t) generates the output of the DNN, pa- rameterized by Θ . Here, Proj projects the updated adversarial sample onto the projection space P ϵ (x) which is the ϵ -L ∞ neighbourhood of the benign sample 1 x. α is the attack step size. NotethatthesetwostrategiesassumetheattackerknowsthedetailsoftheDNN andarethustermedaswhite-boxattacks. Wewillevaluatethemeritofourtraining scheme by measuring the robustness of our trained models to the fake images generated by these attacks. We argue that this evaluation is more comprehensive than using images generated by attacks that assume limited knowledge of the DNN [157]. Moreover, we note that PGD is one of the strongest L ∞ adversarial example generation algorithms [1] and use it as part of our proposed framework. 5.2.2 Model Compression ADMMisapowerfuloptimizationmethodusedtosolveproblemswithnon-convex, combinatorial constraints [158]. It decomposes the original optimization problem into two sub-problems and solves the sub-problems iteratively until convergence. Pruningconvolutionalneuralnetworks(CNNs)canbemodeledasanoptimization 1 It is noteworthy that the generated ˆ x are clipped to a valid range which for our experiments is [0,1]. 90 problem where the cardinality of each layer’s weight tensor is bounded by its pre- specifiedpruningratio. IntheADMMframework,suchconstraintsaretransformed toonesrepresentedwithindicatorfunctions, suchasI θ (z)=0for|z|≤ nand+∞ otherwise. Here,z denotes the duplicate variable [158] and n represents the target number of non-zero weights determined by pre-specified pruning ratios. 
Next, the original optimization problem is reformulated as: L ρ (f Φ (Θ ,z,λ ;t))=L(f Φ (ˆ x k− 1 ,Θ ;t))+I θ (z)+⟨λ, θ − z⟩+ ρ 2 ||θ − z|| 2 2 (5.3) where λ is the Lagrangian multiplier and ρ is the penalization factor when param- etersθ andzdiffer. Eq. (5.3)isbrokenintotwosub-problemswhichsolve θ andz iteratively until convergence [155]. The first sub-problem uses stochastic gradient descent (SGD) to update θ while the second sub-problem applies projection to find the assignment of z that is closest toθ yet satisfies the cardinality constraint, effectively pruning weights with small magnitudes. Not only can ADMM prune a model’s weight tensors but it also has as a dy- namic regularizer. Such adaptive regularization is one of the main reasons behind the success of its use in pruning. However, ADMM-based pruning has several drawbacks. First, ADMMrequirespriorknowledgeoftheper-layerpruningratios. Second, ADMM does not guarantee the pruning ratio will be met, and therefore, an additional round of hard pruning is required after ADMM completes. Third, not all problems solved with ADMM are guaranteed to converge. Fourth, to im- prove the convergence, ρ needs to be progressively increased across several rounds of training, which increases training time [158]. Sparse learning [4] addresses the shortcomings of ADMM by leveraging ex- ponentially smoothed gradients (momentum) to prune weights. It redistributes pruned weights across layers according to their mean momentum contribution. 91 The weights that will be removed and transferred to other layers are chosen ac- cording to their magnitudes while the weights that are brought back (reactivated) are selected based on their momentum values. On the other hand, a major short- coming of sparse learning compared to ADMM is that it does not benefit from a dynamic regularizer and thus often yields lower levels of accuracy. Furthermore, existing sparse-learning schemes only support irregular forms of pruning, limiting speed-up on many compute platforms. Finally, sparse-learning, to the best of our knowledge, has not previously been extended to robust model compression. 5.3 Dynamic Network Rewiring (DNR) To tackle the shortcomings of ADMM and sparse-learning this section introduces a dynamic L 2 regularizer that enables non-iterative training to achieve high accu- racy with compressed models. We then describe a hybrid loss function to provide robustnesstothecompressedmodelsandanextensiontosupportstructuredprun- ing. 5.3.1 Dynamic Regularizer ForaDNNparameterizedbyΘ withLlayers,weletθ l representtheweighttensor of layer l. In our sparse-learning approach, these weight tensors are element-wise multiplied (⊙ ) by corresponding binary mask tensors (π l ) to retain only a fraction of non-zero weights, thereby meeting a target pruning ratio. We update each layer mask in every epoch similar to [4]. The number of non-zeros is updated based on the layer’s normalized momentum and the specific non-zero entries are set to favor large magnitude weights. We incorporate an ADMM dynamic L 2 regularizer [155] into this framework by introducing duplicate variable z for the non-zeroweights,whichisinturnupdatedatthestartofeveryepoch. Unlike[155], we only penalize differences between the masked weights ( θ l ⊙ π l ) of a layer l and 92 their corresponding duplicate variable z l . Because the total cardinality constraint of the masked parameters is satisfied, i.e. 
P L l=1 card(θ l ⊙ π l ) ≤ n, the indicator penalty factor is redundant and the loss function may be simplified as L ρ (f Φ (x,Θ ,z,Π ;t))=L(f Φ (x,Θ ,Π ;t))+ ρ 2 L X l=1 ||θ l ⊙ π l − z l || 2 2 (5.4) where, ρ is the dynamic L 2 penalizing factor. This simplification is particularly important because the indicator function used in Eq. 5.3 is non-differentiable and its removal in Eq. 5.4 enables the loss function to be minimized without decom- position into two sub-problems. 2 Moreover, SGD with this loss function converges similarly to the SGD withL(f Φ (x,Θ ,π ;t)) and more reliably than ADMM. Intu- itively, the key role of the dynamic regularizer in this simplified loss function is to encourage the DNN to not change values of the weights that have large magnitude unlessthecorrespondinglossislarge, similartowhatthedynamicregularizerdoes in ADMM-based pruning. 5.3.2 Hybrid Loss Function For a given input image x, adversarial training can be viewed as a min-max opti- mization problem that finds the model parameters θ that minimize the loss asso- ciated with the corresponding adversarial sample ˆ x, as shown below: arg min Θ {arg max ˆ x∈Pϵ (x) L(f Φ (x,Θ ,Π ;t))} (5.5) In our framework, we use SGD for loss minimization and PGD to generate ad- versarial images. More specifically, to boost classification robustness on perturbed 2 Note this simplified loss function also drops the term ⟨λ, θ − z⟩ becausez is updated withθ at the beginning of each epoch, forcing the Lagrangian multiplier λ and its contribution to the loss function to be always 0. 93 dataweproposeusingahybridlossfunctionthatcombinestheproposedsimplified loss function in Eq. 5.4 with adversarial image loss, i.e., L tot =(1− λ )L ρ (f Φ (x,Θ ,z,Π ;t))+λ L(f Φ (ˆ x;Θ ,Π ;t)) (5.6) λ provides a tunable trade-off between the two loss components. Observation 1. A DNN only having a fraction of weights active throughout the training can be trained with the proposed hybrid loss to finally converge similar to that of the unpruned model (mask Π = 1) to provide a robust yet compressed model. This is exemplified in Fig. 5.2(a) which shows similar convergence trends for bothprunedandunprunedmodels, simultaneouslyachievingboththetargetcom- pression and robustness while also mitigating the requirement of multiple training iterations. 5.3.3 Support for Channel Pruning Let the weight tensor of a convolutional layer l be denoted as θ l ∈R k l × k l × C l i × C l o , wherek l representsthe height (andwidth) ofthe convolutional kernel, and C l o and C l i represent the number of filters and channels per filter, respectively. We convert this tensor to a 2D weight matrix, with C l o and C l i (k l ) 2 being the number of rows and columns, respectively. We then partition this matrix into C l i sub-matrices of C l o rows and (k l ) 2 columns, one for each channel. To compute the importance of a channelc, wefindtheFrobeniusnorm(F-norm)ofcorrespondingsub-matrix, thus effectively compute f l c =|θ l :,c,:,: | 2 F . Based on the fraction of non-zero weights that need to be rewired during an epoch t, denoted by the pruning rate p t , we compute the number of channels that must be pruned from each layer, c l pt , and prune the c l pt channels with the lowest F-norms. We then compute each layer’s importance based on the normalized momentum contributed by its non-zero channels. These 94 (a) (b) Figure 5.2: (a) Training loss vs. epochs and (b) Pruning sensitivity per layer for VGG16 on CIFAR-10. 
importance measures are used to determine the number of zero-F-norm channels r l t ≥ 0 that should be re-grown for each layer l. More precisely, we re-grow the r l t zero-F-norm channels with the highest Frobenius norms of their momentum. We note that this approach can easily be extended to enable various other forms of structured pruning. Moreover, despite supporting pruning of both convolution and linear layers, this work focuses on reducing the computational complexity of a DNN. We thus experiment with pruning only convolutional layers because they dominate the computational complexity [80]. The detailed pseudo-code of the proposed training framework is shown in Algorithm 3. ItisnoteworthythatDNR’sabilitytoarrangeper-layerpruningratioforrobust compression successfully avoids the tedious task of hand-tuning the pruning-ratio based on layer sensitivity. To illustrate this, we follow [65] to quantify the sensi- tivity of a layer by measuring the percentage reduction in classification accuracy on both clean and adversarial images caused by pruning that layer by x% without pruning other layers. Observation 2. DNN layers’ sensitivity towards clean and perturbed images are not necessarily equal, thus determining layer pruning ratios for robust models is particularly challenging. 95 Algorithm 3: DNR Training. Data: weightθ l , momentum µ l , binary mask π l ,l =0..k Data: density d, i=0..numEpochs, pruning rate p=p i=0 pT: irregular or channel 1 for l← 0 to k do 2 θ l ← init(θ l ) 3 π l ← createMaskForWeight(θ l ,d) 4 applyMaskToWeights(θ l ,π l ) 5 z l ← θ l ⊙ π l 6 end 7 for t← 0 to numEpochs do 8 for j← 0 to numBatches do 9 L=computeCleanLoss(x x x)+updateDynmicRegularizr(θ ,z) 10 L adv =computePerturbedLoss(ˆ x x x) 11 L tot =updateRobustLoss(L,L adv ) 12 ∂Ltot ∂θ =computeGradients(θ ,batch) 13 updateMomentumAndWeights( ∂Ltot ∂θ ,µ ) 14 for l← 0 to k do 15 applyMaskToWeights(θ l ,π l ) 16 end 17 end 18 tM← getTotalMomentum(µ ) 19 pT← getTotalPrunedWeights(θ ,p t ) 20 p t ← linearDecay(p t ) 21 for l← 0 to k do 22 µ l ← getMomentumContribution(θ l ,π l ,tM,pT) 23 Prune(θ l ,π l ,p t ,pT) 24 Regrow(θ l ,π l ,µ l · tM,pT) 25 applyMaskToWeights(θ l ,π l ) 26 z l ← θ l ⊙ π l 27 end 28 end As exemplified in Fig. 5.2(b), for x = 95% there is significant difference in the sensitivity of the layers for clean and perturbed image classification. DNR, on the contrary, automatically finds per-layer pruning ratios (overlaid as pruning sensitivityasin[65])thatserveswellforbothtypesofimageclassificationtargeting a global pruning of 95%. 96 5.4 Experimental Results In this section, we first describe the experimental setup we used to evaluate the ef- fectiveness of the proposed robust training scheme. We then compare our method against other state-of-the-art robust pruning techniques based on ADMM [72] and L 1 lasso [71]. We also evaluate the merit of DNR as a clean-image pruning scheme andshowthatitconsistentlyoutperformscontemporarynon-iterativemodelprun- ing techniques [4,65,68,152]. We finally present an ablation study to empirically evaluatetheimportanceofthedynamicregularizerintheDNR’slossfunction. We used Pytorch [103] to write the models and trained/tested on AWS P3.2x large instances that have an NVIDIA Tesla V100 GPU. Models and Datasets Weselectedthreewidelyuseddatasets,CIFAR-10[104]CIFAR-100[104]andTiny- ImageNet[105]andpickedtwowellknownCNNmodels,VGG16[13]andResNet18 [15]. BothCIFAR-10andCIFAR-100datasetshave50Ktrainingsamplesand10K test samples with an input image size of 32× 32× 3. 
Training and test data size for Tiny-ImageNet are 100k and 10k, respectively where each image size is of 64× 64× 3. For all the datasets we used standard data augmentations (horizontal flipandrandomcropwithreflectivepadding)totrainthemodelswithabatchsize of 128. Adversarial Attack and DNR Training Settings For PGD, we set ϵ to 8/255, the attack step size α = 0.01, and the number of attack iterations to 7, the same values as in [2]. For FGSM, we choose the same ϵ value as above. We performed DNR based training for 200/170/60 epochs for CIFAR- 10/CIFAR-100/Tiny-ImageNet, with a starting learning rate of 0.1, momentum 97 value of 0.9, and weight decay value of 5e − 4 . For CIFAR-10 and CIFAR-100 the learning rate (LR) was reduced by a factor of 0.2 after 80, 120, and 160 epochs. For Tiny-ImageNet we reduced the LR value after 30 and 50 epochs. In addition, we hand-tuned ρ to 10 − 4 and set the pruning rate p = 0.5. We linearly decreased the pruning rate every epoch by p totalepochs . Finally, to balance between the clean and adversarial loss, we set β to 0.5. Lastly, note that we performed warm-up sparse learning [4] for the first 5 epochs with only the clean image loss function before using the hybrid loss function with dynamic regularization (see Eq. 5.6) for robust compression for the remaining epochs. 5.4.1 Results with DNR We analyzed the impact of our robust training framework on both clean and ad- versarially generated images with various target compression ratios in the range [0.01,1.0], where model compression is computed as the ratio of total weights of themodeltothenon-zeroweightsintheprunedmodel. AsshowninFigs.5.3(a-b) DNRcaneffectivelyfindarobustmodelwithhighcompressionandnegligiblecom- promise in accuracy. In particular, for irregular pruning our method can compress up to∼ 20× with negligible drop in accuracy on clean as well as PGD and FGSM basedperturbedimages, comparedtothebaselinenon-prunedmodels, testedwith VGG16 on CIFAR-10 and ResNet18 on CIFAR-100. 3 Observation 3 As the target compression ratio increases, channel pruning degrades adversarial robustness more significantly than irregular pruning . AswecanseeinFig. 5.3(a-b),theachievablemodelcompressionwithnegligible accuracy loss for structured (channel) pruned models is ∼ 10× lower than that achievable through irregular pruning. This trendmatches withthat of the model’s performanceoncleanimage. However, aswecanseeinFig. 5.3(c), thepercentage 3 A similar trend is observed for VGG16 on CIFAR-100 and ResNet18 on CIFAR-10. 98 Table 5.1: Results on VGG16 to classify Tiny-ImageNet. Pruning Compression % Channel Accuracy (%) type -ratio present Clean FGSM PGD Unpruned-baseline 1× 100 50.91 18.19 13.87 Irregular 20.63× 98.52 51.71 18.21 14.46 Channel 1.45× 74 51.09 17.92 13.54 of channels present in our channel-pruned models can be up to ∼ 10× lower than its irregular counterparts, implying a similarly large speedup in inference time on a large range of compute platforms [4]. As shown in Table 5.1, DNR can compress the model up to 20.63× without any compromise in performance for both clean and perturbed image classification. Itisalsonoteworthythatallouraccuracyresultsforbothcleanandadversarial images correspond to models that provide the best test accuracy on clean images. This is because robustness gains are typically more relevant on models in which the performance on clean images is least affected. 
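For reference, the PGD perturbations used throughout this section (Eq. 5.2 with ϵ = 8/255, step size α = 0.01, and 7 iterations, clipped to the valid pixel range [0,1]) can be generated with a routine along the following lines. The function name pgd_attack and its exact structure are illustrative; FGSM corresponds to the single-step special case with step size ϵ.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=0.01, steps=7):
    # L_inf PGD (Eq. 5.2): step along the sign of the input gradient and
    # project back onto the eps-ball around the clean input x.
    x_adv = x.clone().detach()                   # start from the clean image
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # projection
            x_adv = x_adv.clamp(0.0, 1.0)        # keep a valid pixel range
    return x_adv.detach()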
Comparison with State-of-the-art Here, were compare the performance of DNR with ADMM [72] and L 1 lasso based [71] robust pruning. For ADMM based robust pruning we followed a three stage compression technique namely pre-training, ADMM based pruning, and masked retraining, performing pruning for 30 epochs with ρ admm = 10 − 3 as described in[72]. L 1 lassobasedpruningaddsaL 1 regularizertoitslossfunctiontopenalize the weight magnitudes, where the regularizer coefficient determines the penalty factor. Table 5.2 shows that our proposed method outperforms both ADMM and L 1 Lasso based approaches by a considerable margin, retaining advantages of both worlds 4 . In particular, compared to ADMM, with VGG16 (ResNet18) model on CIFAR-10, DNR provides up to 3.4% (0.78%) increased classification accuracy on perturbed images with 1.24× (1.48× ) higher compression. Compared to L 1 Lasso, we achieve 10.38× (3.15× ) higher compression and up to 2.6% (0.55%), and 3.5% 99 (a) (b) (c) (d) (e) (f) Figure5.3: Modelcompressionvs. accuracy(onbothcleanandadversariallygenerated images) for irregular and channel pruning evaluated with VGG16 on CIFAR-10 (a-b) and ResNet18 on CIFAR-100 (c-d). (e-f) Comparison of channel pruning with irregular pruningintermsof%ofchannelspresent. Notethatthe%ofchannelspresentcorrelates with inference time [3,4]. (1.4%)increasedaccuracyonperturbedandcleanimages,respectively,forVGG16 (ResNet18) on CIFAR-10 classification. 4 Romanized numbers in the table are results of our experiments, and italicized values are directly taken from the respective original papers. 100 Table 5.2: Comparison of DNR, ADMM based, and L 1 lasso based robust pruning schemes on CIFAR-10. No pre- Per-layer Targe t Pruning Compre- Accuracy (%) Model Method trained sparsity pruning type ssion model knowledge met ratio Clean FGSM PGD not-needed ADMM [72] × × ✓ Irregular 16.78× 86.34 49.52 40.62 VGG16 ADMM naive × ✓ ✓ 19.74× 83.87 42.46 32.87 L 1 Lasso [71] ✓ ✓ × 2.01× 83.24 50.32 42.01 DNR ✓ ✓ ✓ 20.85× 86.74 52.92 43.21 ADMM [72] × × ✓ Irregular 14.6× 87.15 54.65 46.57 ResNet18 ADMM naive × ✓ ✓ 19.74× 86.10 50.49 42.24 L 1 Lasso [71] ✓ ✓ × 6.84× 85.92 55.20 46.80 DNR ✓ ✓ ✓ 21.57× 87.32 55.13 47.35 Observation 4 Naively tuned per-layer pruning ratio degrades both robustness and clean-image classification performance of a model. For this, we evaluated robust compression using naive ADMM, i.e. using naively tuned per-layer pruning-ratio (all but the 1st layer ∼ x% for a x% total sparsity). As shown in Table 5.2, this clearly degrades the performance, implying layer-sparsity tuning is necessary for ADMM to perform well. Ablation Study To understand the performance of the proposed hybrid loss function with a dy- namic L 2 -regularizer, we performed ablation with both VGG16 and ResNet18 on CIFAR-10 for a target parameter density of 5% and 50% using irregular and chan- nel pruning, respectively. As shown in Table 5.3, using the dynamic regularizer improves the adversarial classification accuracy by up to 2 .83% for VGG16 and ∼ 3% for ResNet18 with similar clean-image classification performance. 5.4.2 Pruning to Classify Clean-only Images To evaluate the merit of DNR as a clean-image only pruning scheme (DNR-C), we trainedusingDNRwiththesamelossfunctionminustheadversariallossterm(by 101 Table5.3: ComparisonofDNRwithandwithoutthedynamicregularizerforCIFAR-10 classification. 
Accuracy (%) with Accuracy (%) with Model Method: DNR irregular pruning channel pruning Clean FGSM PGD Clean FGSM PGD VGG16 Without dynamic L 2 87.01 50.09 40.62 86.28 49.49 41.25 With dynamic L 2 86.74 52.92 43.21 85.83 51.03 42.36 ResNet18 Without dynamic L 2 87.45 53.52 45.33 87.97 53.10 45.91 With dynamic L 2 87.32 55.13 47.35 87.49 56.09 48.33 setting β = 1.0 in Eq. 5.6) to reach a target pruning ratio. Table 5.4 shows that our approach consistently outperforms other state-of-the-art non-iterative pruning approaches based on momentum information [4,65], reinforcement-learning driven auto-compression (AMC) [152], and connection-sensitivity [68] 4 . The δ value in the seventh column represents the error difference from corresponding non-pruned baseline models. We also present performance on CIFAR-100 for VGG16 and ResNet18 and Tiny-ImageNet for VGG16. 6 In particular, we can achieve up to 34.57× (12.61× ) compression on CIFAR-10 dataset with irregular (channel) prun- ing maintaining accuracy similar to the baseline. On CIFAR-100 compression of up to 22.45× (5.57× ) yields no significant accuracy drop (less than 2 .7% in top-1 accuracy)withirregular(channel)pruning. Moreover, ourevaluationshowsapos- siblepracticalspeedupofupto6.06× forCIFAR-10and2.41× forCIFAR-100can be achieved through channel pruning using DNR-C. For Tiny-ImageNet, DNR-C can provide compression and speed-up of up to 11.55× and 1.53× , respectively with negligible accuracy drop. 6 To have an “apple to apple” comparison we provide results on ResNet50 model for classifi- cation on CIFAR-10. All other simulations are done on only the ResNet18 variant of ResNet. 102 Table 5.4: Comparison with state-of-the-art non-iterative pruning schemes on CIFAR- 10 and comparison of deviation from baseline on CIFAR-100 and Tiny-ImageNet. Dataset Model Method Pruning Compress- Error (%) δ from Speedup type ion ratio top-1 baseline VGG16 SNIP [68] Irregular 32.33× 8.00 -0.26 – Sparse-learning [4] 32.33× 7.00 -0.5 – DNR-C 34.57× 6.50 -0.09 1.29× DNR-C Channel 12.61× 8.00 -1.5 6.06× CIFAR ResNet50 GSM [65] Irregular 10× 6.20 -0.25 – -10 AMC [152] 2.5× 6.45 +0.02 – DNR-C 20× 4.8 -0.07 1.75× ResNet18 DNR-C Irregular 20.32× 5.19 -0.10 1.31× Channel 5.67× 5.36 -0.27 2.43× VGG16 DNR-C Irregular 20× 27.14 -1.04 1.07× CIFAR Channel 2.76× 28.78 -2.68 2.06× -100 ResNet18 DNR-C Irregular 22.45× 24.9 -1.17 1.13× Channel 5.57× 25.28 -1.55 2.41× Tiny VGG16 DNR-C Irregular 11.55× 40.96 +0.36 1.01× ImageNet Channel 1.74× 42.61 -1.28 1.53× 5.4.3 GeneralizedRobustnessAgainstPGDAttackofDif- ferent Strengths Fig. 5.4 presents the performance of the pruned models as a function of the PGD attack iteration and the attack bound ϵ . In particular, we can see that, for both irregular and channel pruned models, the accuracy degrades with higher number of attack iterations. When ϵ increases, the accuracy drop is similar in both the pruningschemes. Thesetrendssuggestthatourrobustnessisnotachievedthrough gradient obfuscation [71]. 5.5 Conclusions Inthischapterweaddresstheopenproblemofachievingultra-highcompressionof DNN models while maintaining their robustness through a non-iterative training approach. 
In particular, the proposed DNR method leverages a novel sparse-learning strategy with a hybrid loss function that has a dynamic regularizer to achieve better trade-offs between accuracy, model size, and robustness. Furthermore, our extension to support channel pruning shows that compressed models produced by DNR can have a practical inference speed-up of up to ∼10×.

Figure 5.4: On CIFAR-10, the perturbed-data accuracy of ResNet18 under PGD attack versus increasing (a), (c) attack iteration and (b), (d) attack bound ϵ for irregular (5% density) and channel-pruned (50% density) models, respectively.

Chapter 6
A Fast Learnable Once-for-All Adversarial Training

This chapter first provides the introduction and motivation behind the need for conditional learning to yield SOTA models for both clean and robust performance in Section 6.1. Reviews and limitations of the related work are provided in Section 6.2. Section 6.3 provides the necessary details of our proposed fast learnable once-for-all adversarial training (FLOAT). Section 6.4 presents detailed experimental results demonstrating the superiority of FLOAT, and finally the chapter concludes in Section 6.5.

6.1 Introduction and Motivation

With the growing usage of DNNs in safety-critical and sensitive applications including autonomous driving [159] and medical image analysis [160], it has become crucial that they have high classification accuracy on both clean and adversarially-perturbed images [52]. However, most robustness improvement techniques [2,150,151], including the popular adversarial training [1,161], often come at various costs. Firstly, most of these methods suffer from increased training times due to the additional back-propagation overhead caused by generating perturbed images. Secondly, adversarial defenses sometimes cause a significant drop in clean-image accuracy [162], highlighting an accuracy-robustness trade-off that has been explored both theoretically and experimentally [162–164]. Moreover, the defenses rely on several hyperparameters whose settings force the model to work at a specific point along this trade-off. This is disadvantageous in applications in which the desired trade-off depends on context [52].

A naive solution to this problem is to use multiple networks trained with different priorities between clean and adversarial images. This, however, comes with the heavy cost of both increased training time and inference memory. Alternatively, recent work has proposed training a once-for-all adversarial network (OAT) that supports conditional learning [52], enabling the network to adjust to different input distributions. In particular, after each batch-normalization (BN) layer, they add a feature-wise linear modulation (FiLM) module [165] whose weights are controlled by a parameter λ. For inference, the user sets λ to enable an in-situ trade-off between accuracy and robustness. The disadvantage of this approach is that the added FiLM modules increase the parameter count, training time, and network latency, limiting applicability in resource-constrained, real-time applications. Moreover, our investigation shows that the CA-RA performance of OAT is heavily dependent on the choice of training hyperparameters λ. For example, the accuracy with ResNet34 on CIFAR-10 varies by up to 21.97%.

To resolve the issues mentioned above we provide a two-fold solution. First, in view of the above concerns, we present a fast learnable once-for-all adversarial training (FLOAT).
In FLOAT, we train a model using a novel mechanism wherein eachweighttensorofthemodelistransformedbyconditionallyaddinganoiseten- sor based on a binary parameter λ , yielding state-of-the-art (SOTA) test accuracy for clean and adversarial images by in-situ setting λ = 0.0 and 1.0, respectively. For inference, we further show that model robustness can be correlated to the strength of the noise-tensor scaling factor. This motivates a simple yet effective noise re-scaling approach controlled by an user-provided floating-point parameter that can help the user to have a practical accuracy-robustness trade-off. Because FLOAT does not require additional layers to perform conditioning, it incurs no 106 increase in latency and causes only a negligible increase in parameter count com- pared to the baseline models. Moreover, compared to OAT, FLOAT training is up to 1.43× faster, attributable to the fact that FLOAT does not require training with intermediate fine-grained values of λ s. Secondly, for efficient deployment of the models to resource-limited edge de- vices, we present FLOAT sparse (FLOATS), an extension of FLOAT, that not only provides adaptive tuning between RA and CA, but also facilitates high levels ofmodelcompression(viapruning)withoutincurringanyadditionaltrainingtime. In particular, we propose and empirically evaluate the efficacy of FLOATS with bothirregularandstructuredchannelpruning,namelyFLOATS-iandFLOATS-c, respectively. However, despite the potential speed-up on underlying hardware [3], channel pruning often costs classification performance [84] because of its strictly constrained form of sparsity. We thus extend FLOATS to propose a globally- structured locally-irregular hybrid sparsity. In particular, we perform channel reduction through network slimming [166] reducing latency and memory usage, and use irregular pruning in conjunction with this to further reduce memory cost. These new models not only provide compression, but enable an in-situ inference trade-off across accuracy, robustness, and complexity. ToevaluatethemeritsofFLOAT,weconductextensiveexperimentsonCIFAR- 10, CIFAR-100, Tiny-ImageNet, SVHN, and STL10 with ResNet34 (on both CI- FAR and Tiny-ImageNet datasets), WRN16-8, WRN40-2, respectively. As shown in Fig. 6.1, compared to OAT, FLOAT can provide improved accuracies of up to ∼ 6%, and ∼ 10%, on clean and perturbed images, respectively, with reduced parameter budgets of up to 1.47× . FLOATS can yield even further parameter- efficiency of up to 2 .69× with similar CA-RA benefits. 107 Figure6.1: Normalizedmemoryvs. TestaccuracyforFLOATandFLOATwithirregu- larsparsity(FLOATS-i)comparedtotheexistingstate-of-the-artOATfor(a)ResNet34, (b) WRN16-8, and (c) WRN40-2, respectively. CA and RA represent clean-image clas- sification accuracy and robust accuracy (accuracy on adversarial images), respectively. For each model we normalized the memory requirement with the maximum memory needed to store corresponding model. 6.2 Preliminaries and Related Work 6.2.1 Notation ConsideramodelΦwith LlayersparameterizedbyΘ thatlearnsafunctionf Φ (.). For a classification task on dataset X with distribution D, the model parameters Θ are learned by minimizing the empirical risk (ERM) as follows L(f Φ (x,Θ ;t)), (6.1) where t is the ground-truth class label, x is the vectorized input drawn from X, andL is the cross-entropy loss function. 6.2.2 Robust Model Training Several forms of adversarial training (AT) have been proposed to improve robust- ness [1], [167], [168]. 
They use clean as well as adversarially-perturbed images to train a model. As discussed in earlier chapter, projected gradient descent (PGD) 108 attack, recognized as one of the strongest L ∞ adversarial example generation al- gorithms [1], is typically used to create adversarial images during training. For PGD-AT, the model parameters are then learned by the following ERM [(1− λ )L(f Φ (x,Θ ;t)) | {z } L C +λ L(f Φ (ˆ x,Θ ;t)) | {z } L A ], (6.2) whereL C andL A correspond to the clean and adversarial image classification loss components,respectively,weightedbythescalarλ . Hence,forafixed λ andadver- sarialstrength, themodellearnsafixedtradeoffbetweenaccuracyandrobustness. For example, an AT with λ value of 1 will allow the model to completely focus on perturbed images, resulting in a significant drop in clean-image classification accuracy. Another strategy to improve model robustness is through the addition of noise to the model weight tensors. For example, [2] introduced the idea of noisy weight tensors with a learnable noise scaling factor and improved robustness against gradient-based attacks. However, this strategy also incurs a significant drop in clean image classification accuracy. 6.2.3 Conditional Learning Conditional learning involves training a model with multiple computational paths that can be selectively enabled during inference [169]. For example, [170–172] enhancedaDNNmodelwithmultipleearlyexitbranchesatdifferentarchitectural depths to allow early predictions of various inputs. [166] introduced switchable BNs that enable the network to adjust the channel widths dynamically, providing an in-situ efficient trade-off between complexity and accuracy. Recently, [173] used switchable BNs to support runtime bit-width selection of a mixed-precision network. Another conditional learning approach used feature transformation to modulate intermediate DNN features [52,174–176]. In particular, [52] used FiLM [165] to adaptively perform a channel-wise affine transformation after each BN 109 Figure 6.2: Impact of various training λ choices on the conditionally trained OAT. During testing we use S λ =[0,0.2,0.7,1.0]. stagethatiscontrolledbythehyperparameterλ ofEquation6.2. Suchconditional training that is able to yield models that can provide SOTA CA-RA trade-off on various λ choices during inference are popularly known as Once-for-all adversarial training (OAT) [52]. Limitations of FiLM-based model conditioning. Each FiLM module in OAT is composed of two fully-connected (FC) layers with leaky ReLU activation functions and dimensions that are integer multiples of the output feature-map channel size. Despite requiring a relatively small number of additional FLOPs, the FiLM module can significantly increase the number of model parameters and associated memory access cost [8]. Moreover, the increased number of layers can significantly increase training time and inference latency [177], thus potentially prohibiting its use in real-time applications. Additionally, we investigated OAT’s performance on the choice of the training λ set (S λ ), as shown in Fig. 6.2. Interestingly, the CA and RA can vary up to 11.03% and 21.97%, respectively. This implies that,OAT’s performance may vary significantly based on both the size and specific values in S λ . In particular, the choice of S λ can significantly impact the robustness at λ = 0, sometimes leading to no robustness. 
110 Figure 6.3: Comparison of a conditional layer between existing FiLM based approach in OAT (left) and proposed approach in FLOAT (right). This implies that to obtain models that yield near optimal CA-RA trade-offs, S λ must be carefully chosen, implying the need for an additional compute-heavy hyperparameter search or prior user expertise. 6.3 Proposed Approach 6.3.1 FLOAT This section details our FLOAT training strategy. We refer to the conditions for a model being trained on either clean or adversarial images as the two training boundary conditions. During training, we use a binary conditioning parameter λ to force the model to focus on either of these two conditions, removing the need to search a more fine-grained set of λ choices. Toformalizeourapproach,consideraL-layerDNNparameterizedbyΘ andlet θ l ∈R k l × k l × C l i × C l o represent the layer l weight tensor, where C l o and C l i represent the number of filters and channels per filter, respectively, and k l represents the kernel height/width. We transform each parameter ofθ l , by adding a noise tensor η l ∈R k l × k l × C l i × C l o scaled by a parameter α l and conditioned by λ , as follows, ˆ θ l =θ l +λ · α l · η l ; η l ∼N (0,(σ l ) 2 ). (6.3) 111 Note that the standard deviation σ l of the noise matches that of its weight tensor. λ =0 and 1 generate the original weight tensor and its noisy variant, respectively. As illustrated in Algorithm 4, we train our models by partitioning an image batchB intotwoequalsub-batchesB 1 andB 2 , onewithclean(IFM C )imagesand the other with perturbed variants (IFM A ) (lines 5 and 7 in Algorithm 4). We use the PGD-7 attack to generate perturbations on the image batchB 2 . As illustrated in Fig. 6.3(b), the original and noisy weight tensors are convolved only with clean and perturbed variants, respectively. Note that the noise scaling factor α l (line 10) is trainable and its magnitude can be different in each layer to minimize the total training loss. The post-convolution feature maps for clean and adversarial inputs can differ significantly in their respective mean and variances [34,178]. Therefore, theuse ofa single BNto learnboth distributions may limitthe model’s performance [52]. To solve this problem, we extend the λ -conditioning to choose between two BNs, BN C and BN A , dedicated for IFM C and IFM A , respectively. Our approach differs from previous efforts in several ways. Earlier research performednoise-injectionviaregularization[179,180]andperturbedweighttensors [2] to boost model robustness at the cost of a significant accuracy drop on clean images. In contrast, we use noise tensors to transform a shared weight tensor and yield a model that can be configured in-situ to provide SOTA accuracy on either clean or perturbed images. Our approach is similar to λ -conditioning used by [52]. However, instead of transforming activations using added FiLM-based layers trained with multiple values of λ [52], we transform weight tensors using added noise conditioned by binary λ . Compared to [52], we thus require models with significantly fewer parameters and training scenarios, yielding faster training (up to 1.43× ). FLOAT generalization with noise re-scaling. One limitation of the FLOAT as proposed above is that it allows the user to choose between two bound- ary conditions only. 
This limits applicability when the user is not confident about 112 Figure 6.4: Post-training model performance on both clean and gradient-based attack- generated adversarial images, with different noise re-scaling factor λ n . which condition to use during inference. To motivate more continuous in-situ con- ditioning, we analyze a ResNet20 model with noisy weight tensors trained with PGD-AT on CIFAR-10 [2]. Post-training, we re-scaled α l for each layer l, using a new floating-point parameter λ n to yield λ n · α l . Interestingly, as shown in Fig. 6.4, as the re-scaling factor decreases, the model robustness decreases and the clean-image accuracy increases. Based on this observation, we introduce a practical means of post-training in- situ calibration by adding a re-scaling parameter λ n to the inference model 1 . This allows us to enable a practical accuracy-robustness trade-off in FLOAT during inference. We also define a threshold λ th such that for λ n > λ th we select BN A to perform inference and select BN C otherwise. [52] selected BN C and BN A when λ =0 and λ> 0, respectively. We follow a similar approach by setting λ th =0. 6.3.2 Extension to Model Compression via Pruning Pruningisaparticularformofmodelcompressionthathasbeeneffectiveinreduc- ing model size and compute complexity for large DNNs for resource-constrained deployment [3,69,181–184]. Motivated by these results, we incorporate a form 1 Note that λ n is a continuous variable between 0 and 1 where as λ is binary. λ n = 0 and λ n =1matchesthetrainingboundaryconditions. OAT,ontheotherhand,usesasinglevariable λ that can be any floating point value in [0 ,1] during both training and inference. 113 Algorithm 4: FLOATS Algorithm Data: Training set X∼ D, model parameters Θ , trainable noise scaling factor α , binary conditioning parameter λ , mini-batch sizeB, global parameter density d, initial mask Π , prune type (irregular/channel) t p . 1 , Output: trained model parameters with density d. 2 Θ ← applyMask(Θ ,Π ) 3 for i← 0 to to ep do 4 for j← 0 to n B do 5 B/2 (X 0:B/2 ,Y 0:B/2 )∼ D 6 L C ← computeLoss(X 0:B/2 ,Θ ,λ =0,α ;Y 0:B/2 ) 7 ˆ X B/2:B ← createAdv(X B/2:B ,Y B/2:B ) 8 L A ← computeLoss( ˆ X B/2:B ,Θ ,λ =1,α ;Y B/2:B ) 9 L← 0.5∗L C +0.5∗L A 10 updateParam(Θ ,α ,∇ L ,Π ) 11 end 12 updateLayerMomentum(µ ) 13 pruneRegrow(Θ ,Π ,µ ,d)// Prune fixed % of active and regrow fraction of 14 inactive weights 15 Π ← updateMask(Π ,t p ,µ ) 16 end of pruning called sparse learning 2 [182] into FLOAT, which we refer to FLOAT sparse-irregular (FLOATS-i). The resulting approach not only provides a CA- RA trade-off, but also meets a target global parameter density d. In particular, FLOATS ranks every layer based on the normalized momentum of its non-zero parameters. Based on this ranking, FLOATS dynamically allocates more weights tolayerthathavelargermomentumandfewerweightstootherlayers, whilemain- taining the global density constraint. To be more precise, let the binary pruning maskbeparameterizedbythesetΠ withelementsπ l representingthemasktensor for layer l. The fraction of 1s inπ l is proportional to its relative layer importance evaluated through momentum. 2 Every update of the model happens sparsely, meaning only a fraction of the weights are updated, while others remain as zero. 114 (a) (b) Figure 6.5: (a) Comparison of channel density (weights plotted in abs. magnitude) for FLOATS irregular and channel, for the 29 th CONV layer of WRN40-2 on STL10 while both are trained for d = 0.3. 
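As a concrete illustration of the conditioning path of Fig. 6.3 (right), the following PyTorch-style sketch shows a FLOATS convolutional layer that applies the noisy weight transformation of Eq. 6.3, a fixed binary prune mask, and λ-switched dual BNs. The class name FloatConv2d and its interface are assumptions made for illustration, and details such as how often the noise tensor is resampled and how the mask is updated during sparse learning are omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FloatConv2d(nn.Module):
    # Conv layer with conditional noisy weights (Eq. 6.3), a binary prune
    # mask pi^l (FLOATS), and dual BNs for clean/adversarial inputs.
    def __init__(self, c_in, c_out, k, stride=1, padding=0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(c_out, c_in, k, k))
        nn.init.kaiming_normal_(self.weight)
        self.alpha = nn.Parameter(torch.tensor(0.25))   # trainable noise scale
        self.register_buffer("mask", torch.ones_like(self.weight))
        self.bn_c = nn.BatchNorm2d(c_out)               # BN_C for clean inputs
        self.bn_a = nn.BatchNorm2d(c_out)               # BN_A for adversarial inputs
        self.stride, self.padding = stride, padding

    def forward(self, x, lam, lam_th=0.0):
        w = self.weight * self.mask                     # sparse shared weights
        if lam > 0:
            # eta ~ N(0, sigma^2), with sigma matching the weight tensor
            noise = torch.randn_like(w) * w.std().detach()
            w = w + lam * self.alpha * noise
        out = F.conv2d(x, w, stride=self.stride, padding=self.padding)
        bn = self.bn_a if lam > lam_th else self.bn_c   # BN switching
        return bn(out)

During training, lam is the binary conditioning parameter (0 for the clean sub-batch, 1 for the PGD-perturbed sub-batch); at inference, the user-provided re-scaling factor λ_n plays the same role.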
(b) Convolutional layer operation path for FLOATS slim. Note, the switchable BNs correspond to BNs for each SF. To further ensure that the pruned models have structure and enable speed-up on a wide range of existing hardware [3], we propose FLOATS-c that performs channel pruning. In FLOATS-c, for a layer l, we convert the 4Dθ l to a 2D weight matrix with C l o rows and (k l ) 2 C l i columns that is further partitioned in to C l i sub- matrices of C l o rows and (k l ) 2 columns. To evaluate the channel importance, we compute the Frobenius norm (F-norm) of each sub-matrix c by computing f l c = ||θ l :,c,:,: || 2 F . Wethenkeeporremoveachannelbasedontherankingoff l c ’s,enabling pruningatthechannellevel. AsdepictedinFig. 6.5(a),theweightheatmapsshow that for the same layer FLOATS-c can yield only 20.3% non-zero channels, while FLOATS-i retains all the channels. In fact for the same target d, the channel density can be 10× lower for some layers as compared to that in FLOATS-i. We note that this large scale channel reduction sometimes comes at a non-negligible accuracy drop as shown in Table 6.1. A globally structured locally irregular pruning. To simultaneously ben- efitfromaggressiveparameterreductionviairregularpruningandwidthreduction viachannelpruning, whilemaintaininghighaccuracy, weproposeaformofhybrid compression called FLOATS slim. FLOATS slim leverages the idea of slimmable networks [166] to train a model with channel widths that are scaled by a global channel slimming-factor (SF). On top of this, we use FLOATS-i to yield a locally 115 irregular model with even fewer parameters for a specific SF. We perform both of these optimizations simultaneously, training with multiple SFs, including SF =1 (Algorithm detailed in the supplementary material). Note, unlike FLOATS-c, where different layers might have different SFs, FLOATS slim yields uniform SFs for all layers. However, in FLOATS slim, a model with SF< 1.0 is trained as a shared-weight sub-network of the model with SF = 1.0, contrasting FLOATS-c, where only one model of a specific d is trained. Fig. 6.5(b) depicts the weight conditioned convolution operation in FLOATS slim. 6.4 Experimental Results and Analysis 6.4.1 Experimental Setup Models and datasets. To evaluate the efficacy of the presented algorithms, we performed detailed experiments on five popular datasets, CIFAR-10, CIFAR- 100 [104], Tiny-ImageNet [105] with ResNet34 [15], SVHN [185] with WRN16- 8 [186], and STL10 [187] with WRN40-2 [186]. Hyperparameters and training settings. In order to facilitate a fair com- parison,forCIFAR-10,SVHN,andSTL10weusedsimilarhyperparametersettings as [52] 3 . For CIFAR-100, we followed same hyperparameter settings as that with CIFAR-10. ForTiny-ImageNetwetrainedthemodelfor120epochswithaninitial learning rate of 0.1 an used cosine decay. For adversarial image generation during training, we used the PGD-k attack with ϵ and k set to 8/255 and 7, respectively. We initialized the noise scaling-factor α l for layer l to 0.25 as described in [2]. We used the PyTorch API [103] to implement our models and trained them on a Nvidia GTX Titan XP GPU. 3 We followed the official repository https://github.com/VITA-Group/Once-for-All- Adversarial-Training 116 Evaluationmetrics. Clean(standard)accuracy(CA):classificationaccuracy on the original clean test images. Robust Accuracy (RA): classification accuracy onadversariallyperturbedimagesgeneratedfromtheoriginaltestset. WeuseRA as the measure of robustness of a model. 
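The adversarial images referenced above are generated with the PGD attack; a minimal sketch of an L∞ PGD attack is given below, where ϵ = 8/255 and the 7 steps follow the values used during training, while the step size and the random start are illustrative assumptions.

# Minimal sketch of an L-infinity PGD attack; eps = 8/255 and 7 steps follow
# the text, while step size and random start are illustrative assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, steps=7, step_size=2/255):
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()                # ascend the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project to eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

During training, such a routine would supply the perturbed sub-batch B2; for RA, it perturbs the held-out test images.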
To directly measure the robustness vs accuracy trade-off, we evaluated the clean and robust accuracy values of models generated through FLOAT at various λ values and compared with those yielded through OAT and PGD-AT. We used the average of the best CA and RA values over three different runs with varying random seeds, for each λ value to report in our results. 6.4.2 Performance of FLOAT Sampling λ n . Unless stated otherwise, to evaluate the performance of FLOAT during validation we chose a set of λ n s as S λ n = {0.0,0.2,0.7,1.0}. Note that setting λ n to 0.0 or 1.0 corresponds to the values of λ used during training. Also, we measure the accuracy of FLOAT using two different settings of λ th , 0.0 (similar to OAT) and 0.5. For λ th = 0.5, we update the noise scaling factor by using the following simple equation α l new = α l · 2· λ n ; if λ n ≤ 0.5 α l · 2· (λ n − 0.5); if 0.5<λ n ≤ 1.0 (6.4) As depicted in Fig. 6.6 (a)-(e), the FLOAT models generalize well to yield a semi- continuous accuracy-robustness trade-off. Also, across all the datasets, λ th = 0.5 yields a more gradual transition between the two boundary conditions. Consider the setting where λ n = 0.2. With λ th = 0.5, we observe a 4.76% improvement in CAandareductioninRAof15.9%onaverageoverallfivedatasetswhencompared with λ th = 0.0. The improvement in clean accuracy here can be attributed to the use of BN C . However, this configuration shows a drop in CA and an improvement 117 inRAwhencomparedtotheconfigurationwhere λ n =0.0. Thiscanbeattributed to the use of noisy weights (refer to Eq. 6.4) during inference. Thus, it can be concluded that a user who cares more about clean image performance than adversarial robustness, should set λ th >0.0 to see a less abrupt drop in CA. Note that, because the generation of adversarial images is noisy, it is not always true that increasing λ will always significantly improve robustness. Consequently, in some cases, we obtain improved clean image performance without a significant drop in robustness. 6.4.3 Comparison with OAT and PGD-AT We trained the benchmark models following OAT and PGD-AT with λ s sampled from a set S λ =S λ n on three datasets, CIFAR-10, SVHN, and STL10. Discussion on CA-RA trade-off. Fig. 6.7(a)-(c) show the comparison of FLOAT with OAT and PGD-AT in terms of CA-RA trade-offs. The FLOAT modelsshowsimilarorsuperiorperformanceattheboundaryconditionsaswellas at intermediate sampled values of λ . In particular, compared to OAT and PGD- AT models, FLOAT models can provide an improved RA of up to 14.5% (STL10, λ = 0.2) and 22.52% (CIFAR-10, λ = 0.0), respectively. FLOAT also provides improved CA of up to 6.5% (STL10, λ = 1.0) and 6.96% (STL10, λ = 1.0), compared to OAT and PGD-AT generated models, respectively. Interestingly, for bothFLOATandOAT,inalltheplotswegenerallyseeasharpdropinrobustness whilemovingfromtop-lefttobottom-right. Thiscanbeattributedtototheswitch from BN A to BN C based on the λ th , in the forward pass of the inference model. 118 (a) (b) (c) (d) (e) Figure 6.6: Performance of FLOAT on (a) CIFAR-10, (b) STL10, (c) SVHN, (d) CIFAR-100, and (e) Tiny-ImageNet with variousλ n valuessampledfromS λ n fortwodifferent λ th forBN C toBN A switching. Thenumbersinthebracketcorrespondsto (CA,RA)fortheboundaryconditionsofλ =0andλ =1. λ n variesfromlargesttosmallestvaluefromtop-lefttobottom-right. 119 (a) (b) (c) Figure 6.7: Performance comparison of FLOAT with OAT and PGD-AT generated models on (a) CIFAR10, (b) SVHN, and (c) STL10. 
λ varies from largest to smallest value in S λ for the points from top-left to bottom-right. Discussion on training time and inference latency. Due to the presence of the additional FiLM modules, OAT requires more time than standard PGD-AT to train. However, a single PGD-AT training can only provide a fixed accuracy- robustness trade-off. For example, to have trade-off with 4 different λ s PGD-AT training time increases proportionally by a factor of 4. FLOAT, on the contrary, due to absence of additional layers, trains faster than OAT. In particular, Fig. 6.8(a)showsthenormalizedper-epochtrainingtime(averagedover200epochs)of OAT and PGD-AT are, respectively, up to 1.43× and 1.37× slower than FLOAT. Network latency increases with the increase in the number of layers for both standard and mobile GPUs [188], [177], primarily because layers are operated on sequentially[177]. TheadditionalFiLMmodulesinOATsignificantlyincreasethe layercount. Forexample, foreachbottlenecklayerinResNet34, OATrequirestwo FiLMmodules,yieldingatotaloffouradditionalFCsperbottleneck. Ontheother hand, FLOAT, similar to a single PGD-AT trained model, requires no additional layers or associated latency, making it more attractive for real-time applications. Discussion on model parameter storage cost. Unlike OAT, where the FiLM layer FCs significantly increase the parameter count, the additional BN layers and scaling factors of FLOAT represent a negligible increase in parameter count. In particular, assuming parameters are represented with 8-bits, a FLOAT 120 (a) (b) (c) Figure 6.8: Comparison of FLOAT with OAT and PGD-AT in terms of (a) normalized trainingtimeperepochand(b)modelparameterstorage(neglectingthestoragecostfor the BN and α ) (c) CONV layer compute delay on conventional ASIC (using the delay model of Eq. 7, 8, and 9) architecture [5] evaluated on ResNet34 for CIFAR-10. Note here, PGD-AT:1T yields 1 model for a specific λ choice. ResNet34 has only 21.28 MB memory cost compared to 31.4MB for OAT. Fig. 6.8(b) shows that FLOAT models, similar to PGD-AT:1T, can yield up to 1.47× lower memory. Discussion on FLOPs. Compared to the standard PGD-AT, FLOAT incurs additional compute cost of addition of noise with the weight ten- sor during forward pass. For example, for ResNet34 with ∼ 21.28 M parameters, FLOAT needs similar number of additions for noisy weight transformation. How- ever, compared to the total operations of ∼ 1.165 GFLOPs, the transformation adds on 1.182% additional computation. Moreover, as a single addition can be up to 32× cheaper than a single FLOP [8], we can gracefully ignore such transforma- tion cost in terms of FLOPs. OAT, on the other hand, also incurs negligibly less FLOPs overhead of up to only∼ 1.7% [52]. Discussion on compute delay in Von-Neumann ASIC hardware. A neural network deployed on a conventional Von-Neumann hardware has two dom- inant operation types: memory read and Multiply-accumulate (MAC). Based on the assumption that these operations are sequential, as in [5,189], the convolution layer delay to compute C l o output-features can be estimated τ conv ≈⌈ (k l ) 2 C l i C l o (B IO /B W )N bank ⌉τ read +⌈ (k l ) 2 C l i C l o N Mult ⌉H l o W l o τ mult . (6.5) 121 Figure6.9: PerformancecomparisonofFLOATslim,FLOATS(-i)slimwithOATslim. We used ResNet34 on CIFAR-10 to evaluate the performance. where B IO is the memory input-output (IO) bandwidth and B W is the bit-width of each weight stored in memory. 
N bank and N mult corresponds to the number of hardware memory banks and multiply units. Similar to earlier literature [5,189], for a standard hardware we assume the values of B IO , B W , N bank , and N mult to be 16, 8, 4, and 175, respectively. A single memory read and multiply operation time is denoted by τ read and τ mult , respectively. Their values for a 65nm CMOS process technology are 9ns and 4ns, respectively [189]. Based on similar assumptions, the delay model for modified CONV layer l for FLOAT (τ F conv ) and OAT (τ O conv ) can be estimated as τ F conv ≈⌈ (k l ) 2 C l i C l o (B IO /B W )N bank ⌉τ read +⌈ (k l ) 2 C l i C l o N Mult ⌉(1+H l o W l o )τ mult , (6.6) τ O conv ≈⌈ (k l ) 2 C l i C l o +2C l o +4(C l o ) 2 (B IO /B W )N bank ⌉τ read +⌈ (k l ) 2 C l i C l o N Mult ⌉H l o W l o τ mult + ⌈ 2C l o +4(C l o ) 2 N Mult ⌉τ mult . (6.7) Here, the first term corresponds to the read delay and remaining term(s) cor- respond to the delay associated with the multiplications. We ignore the energy associated with reading α l because it is negligible compared to the read energy for the other model parameters. Based on these Eqs, Fig. 6.8(c) shows the minimum, maximum, and average normalized delays with respect to the τ conv . In particular, conditional CONV layer delay of FLOAT can be up to 1.66× faster compared to that of OAT, illustrating its efficacy on conventional architecture. 122 Table 6.1: Performance comparison between different compressed FLOAT variants trainedonCIFAR-10withResNet34. ✓✓,✓, and✗ indicateaggressive, non-aggressive, and no reduction, respectively, compared to the baseline of FLOAT. Algorithm Acc. % (λ =0.0) Acc. % (λ =1.0) CR↑ CRF↑ Reduced Potential CA RA CA RA storage speed-up FLOAT 94.83 22.52 89.1 56.71 1× 1× ✗ ✗ FLOATS-i 94.12 18.7 88.6 55.92 10× 1× ✓✓ ✗ FLOATS-c 93.84 17.2 86.87 53.2 2.94× 1.54× ✓ ✓ FLOATS slim 94.26 19.1 88.9 55.44 4.76× 2× ✓✓ ✓ 6.4.4 Performance of FLOATS Table 6.1 shows the performance of FLOATS with irregular, channel, and slimmable compression. The FLOATS slim model was trained with two represen- tativeSFsof1.0and0.5withaglobaltargetdensityd=0.3. Wereportitsperfor- mance with SF= 0.5. Here, compression ratio (CR) and channel reduction factor (CRF) are computed as 1 d and 1 % of total channels present , respectively. Compared to FLOATS-c,FLOATSslimrequires1.62× lessstorage,resultsinupto2× speed-up, and yields 2.24% higher classification accuracy. Moreover, FLOATS slim provides uswithauniquethree-waytrade-offbetweenrobustness,accuracy,andcomplexity, requiring only single training pass. Fig. 6.9illustratestheefficacyofFLOATslimcomparedtoOATslim. FLOAT slim provides significantly improved performance for all tested values of λ for both the SFs. In particular, FLOAT slim yields up to 3.6% higher accuracy. Adding sparsity, FLOATS slim yields similar accuracy improvement with up to 2.95× less parameters. Moreover, GPU hardware measurements show that our slimmable networks trains up to 1.90× faster compared to OAT slim. 6.4.5 Generalization on Various Perturbation Techniques TodemonstratethegeneralizationofFLOATmodelsondifferentattacks, weshow their performance on images adversarially-perturbed through PGD-20 and FGSM 123 (a) (b) (c) Figure 6.10: Performance comparison of FLOAT with OAT on (a) PGD-20 and (b) FGSMattackgeneratedimages. (c)CA-RAplotofFLOATvs. PGD-ATonautoattack. All evaluations are done with ResNet34 on CIFAR-10. λ varies from largest to smallest value in S λ for the points from top-left to bottom-right. attacks. 
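Returning briefly to the delay comparison of Fig. 6.8(c): once the hardware constants are fixed, the estimates of Eqs. (6.5)-(6.7) reduce to a few integer ceilings. A minimal sketch is given below, using the 65 nm constants quoted above; the example layer shape is an arbitrary illustration rather than a specific layer of ResNet34.

# Minimal sketch of the CONV-layer delay estimates of Eqs. (6.5)-(6.7) using
# the 65 nm constants quoted in the text; the example layer is illustrative.
from math import ceil

B_IO, B_W, N_BANK, N_MULT = 16, 8, 4, 175
TAU_READ, TAU_MULT = 9e-9, 4e-9          # seconds

def conv_delays(k, c_in, c_out, h_out, w_out):
    reads = ceil(k * k * c_in * c_out / ((B_IO / B_W) * N_BANK))
    mults = ceil(k * k * c_in * c_out / N_MULT)
    tau_conv = reads * TAU_READ + mults * h_out * w_out * TAU_MULT          # Eq. (6.5)
    tau_float = reads * TAU_READ + mults * (1 + h_out * w_out) * TAU_MULT   # Eq. (6.6)
    film = 2 * c_out + 4 * c_out ** 2                                       # extra FiLM FC terms in OAT
    tau_oat = (ceil((k * k * c_in * c_out + film) / ((B_IO / B_W) * N_BANK)) * TAU_READ
               + mults * h_out * w_out * TAU_MULT
               + ceil(film / N_MULT) * TAU_MULT)                            # Eq. (6.7)
    return tau_conv, tau_float, tau_oat

# Example: a 3x3, 256-to-256 CONV layer with 8x8 output feature maps.
print(conv_delays(k=3, c_in=256, c_out=256, h_out=8, w_out=8))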
We follow [52] to generate the PGD-20 perturbations and set the number of steps to 20, keeping other hyperparameters the same as PGD-7. For FGSM, we make ϵ = 8/255 following [52]. As shown in Fig. 6.10(a)-(b), under both the attacks, FLOAT can achieve in-situ accuracy-robustness trade-offs similar to that of OAT. Moreover, we have analyzed FLOAT’s robustness with an ensemble of parameter-free attacks, namely the ‘random’ variant of autoattack [190] 4 . Details of the autoattack hyperparameters are provided in the Supplementary Materials. As depicted in 6.10(c), compared to the PGD-AT yielded models, FLOAT consis- tently yields better RA with similar or improved CA. 6.5 Conclusions This chapter addressed the largely unexplored problem of enabling an in-situ in- ference trade-off between accuracy, robustness, and complexity. We proposed a fast learnable once-for-all adversarial training (FLOAT) which uses model condi- tioning to capture the different feature-map distributions corresponding to clean and adversarial images. FLOAT transforms its weights using conditionally added 4 We have followed the official repo https://github.com/fra31/auto-attack to generate the at- tack. 124 scaled noise and dual batch normalization structures to distinguish between clean and adversarial images. The approach avoids increasing the layer count, unlike other state of the art alternatives, and thus does not suffer from increased net- work latency. We then extended FLOAT to include sparsity to further reduced complexity and latency providing an in-situ trade-off including model complexity. Extensive experiments showed FLOAT’s superiority in terms of improved CA-RA performance, reduced parameter count, and faster training time. 125 Chapter 7 Spiking Neural Network Robustness: Analysis and Improvement This chapter first provides the introduction and motivation behind generating in- herently robust deep SNN models in Section 7.1. Initial robustness analysis of the SNNs yielded via traditional training is provided in Section 7.2. Section 7.3 details our SNN training method to yield inherently robust models that can have improved robustness compared to the traditionally trained SNN as well as the ANN counterparts. Detailed experimental evaluations are provided in Section 7.4 and finally the chapter concludes in Section 7.5. 7.1 Introduction and Motivation Well crafted adversarial images with small, often unnoticeable perturbations can fool a well trained ANN to make incorrect and possibly dangerous decisions [9, 191,192], despite their otherwise impressive performance on clean images. To improve the model performance of ANNs against attacks, training with various adversarially generated images [1,84] has proven to be very effective. Few other prior art references [193,194] have applied noisy inputs to train robust models. 126 However,allthesetrainingschemesincurnon-negligiblecleanimageaccuracydrop and require significant additional training time. Brain-inspired [195] deep spiking neural networks (SNNs) have also gained sig- nificanttractionduetotheirpotentialforloweringtherequiredpowerconsumption of machine learning applications [196,197]. The underlying SNN hardware can use binary spike-based sparse processing via accumulate (AC) operations over a fixed number of time steps 1 T which consume much lower power than the traditional energy-hungrymultiply-accumulate(MAC)operationsthatdominateANNs[198]. 
Recent advances in SNN training by using approximate gradient [47] and hybrid direct-input-coded ANN-SNN training with joint threshold, leak, and weight op- timization [199] have improved the SNN accuracy while simultaneously reducing the number of required time steps. This has lowered both their computation cost, which is reflected in their average spike count as shown in Fig. 7.1(b), and infer- ence latency. However, the trustworthiness of these state-of-the-art (SOTA) SNNs under various adversarial attacks is yet to be fully explored. Some earlier works have claimed that SNNs may have inherent robustness against popular gradient-based adversarial attacks [200–202]. In particular, Sharmin et al.[201]observedthatrate-codedinput-driven(Fig. 7.1(a))SNNshave inherent robustness, which the authors primarily attributed to the highly sparse spiking activity of the model. However, these explorations are mostly limited to small datasets on shallow SNN models, and more importantly, these techniques give rise to high inference latency. This chapter extends this analysis, asking two key questions. 1. To what degree does SOTA low-latency deep SNNs retain their inherent robustnessunderbothblack-boxandwhite-boxadversarial-attackgeneratedimages? 1 Here, a time step is the unit of time taken by each input image to be processed through all layers of the model. 127 (a) (b) (c) Figure 7.1: (a) Direct and rate-coded input variants of the original image. (b) Layer wise average spikes for VGG11. (c) Performance of direct-input VGG11 SNN and its equivalent ANN under various white-box (WB) and black-box (BB) attacks. Both the evaluations are done on CIFAR-100. 2. Can computationally-efficient training algorithms improve the robustness of low-latency deep SNNs while retaining their high clean-image classification accu- racy? In view of the above concerns we make a two-fold contribution. We first em- pirically study and provide detailed observations on inherent robustness claims about deep SNN models when the SNN inputs are directly coded. Interestingly, we observe that despite significant reductions in the average spike count, deep direct-inputSNNshavelowerclassificationaccuracycomparedtotheirANNcoun- terparts on various white-box and black-box attack generated adversarial images, as exemplified in Fig. 7.1(c). Based on these observations, we present HIRE-SNN, a spike timing dependent backpropagation (STDB) based SNN training algorithm to better harness SNN’s inherent robustness. In particular, we optimize the model trainable parameters using images whose pixel values are perturbed using crafted noise across the time steps. More precisely, we partition the training time steps T intoN equal-length 128 Table 7.1: Comparison of model performances under various white-box and black-box attacksonbothCIFAR-10andCIFAR-100. Notethat italicized valuesaretakendirectly from the original paper. 
Accuracy (%) with Accuracy (%) with Accuracy (%) with Model- ANN high latency SNN-BP [201] low latency SNN-BP Attack category Clean FGSM PGD Clean FGSM PGD Clean FGSM PGD Dataset : CIFAR-10 VGG5-WB 90.2 13.3 2.0 89.3 15.0 3.8 87.9 35.5 5.3 VGG5-BB 90.2 24.0 6.4 89.3 21.5 16 87.9 38.3 6.7 ResNet12-WB 92.6 19.9 2.0 – – – 91.9 21.1 0.2 ResNet12-BB 92.6 28.6 4.3 – – – 91.9 24.7 0.6 Dataset : CIFAR-100 VGG11-WB 69.5 16.9 8.2 64.4 15.5 6.3 65.6 16.4 2.9 VGG11-BB 69.5 23.5 15.3 64.4 21.4 16.5 65.6 19.0 6.2 ResNet12-WB 61.5 13.5 2.8 – – – 61.9 10.5 0.6 ResNet12-BB 61.5 23.2 12.0 – – – 61.9 14.1 2.0 periodsoflength⌊T/N⌋andtraineachimage-batchovereachperiod,addinginput noiseaftereachperiod. Thekeyfeatureofourapproachisthat,insteadofshowing the same image repeatedly, we efficiently use the time steps of SNN training to input different noisy variants of the same image. This avoids extra training time and,becauseweupdatetheweightsaftereachperiod,requireslessmemoryforthe storage of intermediate gradients compared to traditional SNN training methods. To demonstrate the efficacy of our scheme we conduct extensive evaluations with both VGG [13] and ResNet [15] SNN model variants on both CIFAR-10 and CIFAR-100 [104] datasets. 7.2 Initial Study: SNN Robustness Analysis To motivate our novel training algorithm to harness robustness, this section de- scribes an empirical analysis into the robustness of traditionally-trained SNNs on gradient-based adversarial attacks. We performed traditional SNN training with 129 the initial weights and thresholds set to that of a trained equivalent ANN and generated through the conversion process, respectively 2 . 7.2.1 Performance Analysis We first performed SNN training with direct-coded inputs and evaluated the ro- bustness of the trained models under various white-box and black-box attacks. Interestingly, as shown in Table 7.1, the generated deep SNNs, i.e., VGG11 and ResNet12, consistently provide inferior performance against various black-box at- tacks compared to their ANN counterparts. For example, we observe that the VGG11 SNN provides only 6.2% accuracy on the PGD black-box attack, while its ANN equivalent provides an accuracy of 15.25%. These results imply that tradi- tional SNN training appears to be insufficient to harness the inherent robustness of low-latency direct-input deep SNNs. It is important to note that [201] observed that, for rate-coded SNNs, spike based sparse activation maps correlates with adversarial success. To extend this analysistodirect-inputSNNs,weexaminetwodistinctmetricsoftheSNN’sspiking activity, as defined below. Definition 1. Spiking activity (SA). We define a layer’s spiking activity as the ratio of number of spikes produced over all the neurons accumulated across all time units T of a layer to the total number of neurons present in that layer. We alsodefinealayer’sSAdividedby T asthe time averaged spiking activity (TASA). Observation 1. Compared to rate-coded SNNs, deep SNNs with direct-coded inputs and lower latency generally exhibit lower SA but higher TASA, especially in the initial convolution layers. Particularly, this distinction can be seen for VGG11 SNN in Figs. 7.2(b) and 7.1(b)andsuggeststhattheobservationthatsparseSAcorrelateswellwithsuccess 2 The description of our training hyperparameter settings is given in Section 7.4.1 for all the experiments in this section. 130 (a) (b) Figure 7.2: Per layer TASAs of VGG5 and VGG11 on CIFAR-10 and CIFAR-100, respectively. 
againstadversarialimages[201]canextendtolow-latencydirect-codedSNNsifthe spiking activity is quantified using TASA. VGG5 also shows lower sparsity level of spikes at the early layers (Fig. 7.2(a)) compared to its rate-coded counterpart. Observation 2. Direct-input coded SNNs yield lower clean accuracy and no significant improvement in adversarial image classification accuracy as latency T is reduced. Earlier research has shown that robustness to adversarial images of SNNs trained on rate-coded inputs improves with the reduction in training time steps [201]. Motivated by this we performed a similar analysis on VGG11 using direct input CIFAR-100. Interestingly, as shown in Fig. 7.3, as T reduces the classi- fication performance on both black box and white box attack generated images does not improve. Intuitively, these attacks are more effective on direct-coded inputs because of the lack of approximation at the inputs, unlike for Poisson gen- erated rate-coded inputs. However, SNNs with rate-coded inputs generally require larger training time and memory footprint [199] to reach competitive accuracy. In Fig. 7.3 we relate the reduction in network performance on clean images to the aggressive reduction in the number of training time steps. Definition 2. Perturbation distance (PD). We define perturbation distance as the L 2 -norm of the absolute difference of pixel values between a real image and 131 (a) (b) Figure 7.3: Classification performance of VGG11 on CIFAR-100, under (a) white-box and (b) black-box attacks as number of time steps T varies. (a) (b) Figure 7.4: (a) PD vs LIF leak parameter for a fixed threshold (0.8) and latency (T =10)averagedovertworandomlychoseninputimagesthatareperturbedwithPGD- 1. (b) Intermediate layer spike PD for VGG5 fed with a randomly-selected CIFAR-10 clean image and its perturbed variant. its adversarially-perturbed variant. Similarly, for an intermediate layer with spike based activation maps [82,83], we define spike PD as the L 2 -norm of the abso- lute difference of the normalized spike-based activation maps generated at a layer output when fed with an original and its perturbed variant, respectively. Observation 3. Leaky integrate and fire (LIF) non-linearity applying layers contribute to the inherent robustness of rate-coded input driven SNNs by diminish- ing the perturbation distance [201]. Unfortunately, this observation does not gen- erally hold for direct-coded SNNs in which the LIF layers may increase or degrade the perturbation distance, suggesting that the impact of the leak parameter must be considered jointly with other factors, including related weights and thresholds. The LIF operation in SNNs yields non-linear dynamics that can be contrasted to the 132 piecewise linear ReLU operation in traditional ANNs. To analyze their impact on image perturbation distance, we fed an LIF layer the clean images taken from a digit classification dataset [203] along with their perturbed variants, sweeping the leak parameter value and measuring the impact on the perturbation distance. As depictedinFig. 7.4(a), theleakfactorhelpsreducetheperturbationdistanceonly if its value falls in a certain range. To further study the impact of LIF layers, we analyzed the spike PD of the models. In particular, we fed two VGG5 SNN models trained with two different seeds (M1 and M2) with a randomly-sampled CIFAR-10 clean image x C and its black-boxattackgeneratedvariantx P andcomputedthecorrespondingintermedi- ate layer spike PDs. 
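The PD and spike PD measurements of Fig. 7.4 follow Definition 2; a minimal sketch of how these distances might be computed is given below, where the normalization of the accumulated spike maps and the tensor names are illustrative assumptions.

# Minimal sketch of Definition 2: input-space perturbation distance (PD) and
# layer-wise spike PD between a clean image and its perturbed variant.
# The max-based normalization of the spike maps is an illustrative assumption.
import torch

def perturbation_distance(x_clean, x_pert):
    # L2 norm of the absolute pixel-wise difference.
    return (x_clean - x_pert).abs().flatten().norm(p=2)

def spike_pd(spikes_clean, spikes_pert):
    # spikes_*: spike counts of one layer accumulated over the T time steps,
    # normalized here to [0, 1] before taking the L2 distance.
    s_c = spikes_clean / spikes_clean.max().clamp(min=1e-8)
    s_p = spikes_pert / spikes_pert.max().clamp(min=1e-8)
    return (s_c - s_p).abs().flatten().norm(p=2)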
Both M1 and M2 classified x C correctly, however, M2 failed to correctly classifyx P . Interestingly, as shown in Fig. 7.4(b) despite the presence of the LIF layers, the spike PD values do not always reduce as we progress from layertolayerthroughthenetwork. Moreover,thisdegreeofunpredictabilityseems to be irrespective of whether the model classifies the image correctly. We conclude that despite LIF’s promise to reduce input perturbation, its impact is also a func- tionofotherparameters,includingthetrainableweights,leak,threshold,andtime steps. Based on these empirical observations, we assert that the majority of the rea- sons that make rate-coded SNN inherently robust are either absent or need careful tuning for direct-input SNN models, as presented in the next section. 7.3 HIRE-SNN Training This section presents our training algorithm for robust SNNs. As shown in Eq. 4.3 the LIF neuron functional output at each time step recursively depends on its stateinprevioustimesteps[39]. EachinputpixelintraditionalSNNtrainingusing direct-coded inputs, is fed into the network as a multi-bit value that is fixed over 133 Figure 7.5: Traditional and proposed training schemes, respectively. Here the green and orange blocks represent activation maps and the gradients that are generated after passingtheinputimage. Fortheproposedtrainingschemeweusetwocolorvariantsdeep and light, respectively, to highlight the sets of activation maps and gradients from an imageanditsnoisyvariantduringtwodifferentperiods. Theyellowblocksrepresentthe weight tensors that get updated from accumulated gradients. In proposed, we compute the input gradient with these updated weights to craft the noise. Here, we assumed T =4 andN =2. the T time steps and yield an order of magnitude reduction in latency compared to rate-coded alternatives. However, our approach is different than direct coding because we partition the training time steps T into N equal-length periods and feed in a different perturbed variant of the image during each period of ⌊T/N⌋ steps. To be more precise, consider an SNN model defined by the function g(x,y;T) implicitlyparameterizedbyθ . AssumeaninputbatchB ofsizeH i × W i × C i × n B , where H i ,W i , and C i represent spatial height, width, and channel length of an image, respectively, with n B as the number of images in the batch. In contrast to traditional approaches, where weight update happens only after T steps, we 134 allow different perturbed image variants generation and weight update to happen atsmallintervalof⌊T/N⌋stepswithinthewindowofT, foranimagebatch. This important modification allows us to train the model with different adversarial image variants without costing any additional training time. We compute the gradient of the loss with respect to each input pixel x to craft the perturbation for next period. Through an abuse of notation, we define ϵ s and ϵ t as the pixel noise step and bound, respectively, and generate perturbation scalar for each of the H i × W i × C i pixels of an image as κ =clip[κ +ϵ s × sign(∇ x L),− ϵ t ,+ϵ t ] (7.1) where κ represents the perturbation for an input pixel x of a batchB computed at the p th period. Note that for current batch, we initialize κ in the first period with the perturbation computed at the last period of the previous batch. In contrast, the computation of the perturbation of other periods is based on the computed perturbation from the corresponding previous period. 
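A minimal sketch of one such training period, combining the parameter update with the perturbation update of Eq. (7.1), is given below; the SNN forward interface, the optimizer handling of the weights, thresholds, and leaks, and the tensor names are illustrative assumptions (the full procedure is given in Algorithm 5).

# Minimal sketch of one HIRE-SNN period (Eq. 7.1): the batch plus the running
# perturbation kappa is run for T/N time steps, the trainable parameters are
# updated, and kappa is refreshed from the input gradient for the next period.
# The snn model interface and optimizer setup are illustrative assumptions.
import torch
import torch.nn.functional as F

def hire_snn_period(model, optimizer, x, y, kappa, eps_s, eps_t, steps_per_period):
    x_noisy = (x + kappa).clamp(0.0, 1.0).requires_grad_(True)
    out = model(x_noisy, time_steps=steps_per_period)   # STDB forward over T/N steps
    loss = F.cross_entropy(out, y)

    optimizer.zero_grad()
    loss.backward()              # gradients w.r.t. weights, thresholds, leaks, and input
    grad_x = x_noisy.grad.detach()
    optimizer.step()             # update the trainable parameters Theta

    # Eq. (7.1): craft the perturbation for the next period.
    kappa = (kappa + eps_s * grad_x.sign()).clamp(-eps_t, eps_t)
    return kappa.detach()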
It is noteworthy that ϵ s is not necessarily the same as ϵ of the FGSM or PGD attacks, and we generally choose ϵ s to be sufficiently small to not lose significant classification accuracy on clean images. We include weights θ , threshold v t and leak l k parameters in the trainable parameters Θ to retain clean image accuracy at low latencies [199]. Our detailed training algorithm called HIRE-SNN is presented in Algorithm 5. It is noteworthy that, apart from noise crafted inputs, our training framework can easily be extended to support various input encoding [201,204] as well as image augmentation techniques [205] that can improve classification performance. 135 Algorithm 5: HIRE-SNN Training Algorithm 1 Input: Training examples (X, Y), noise bound [-ϵ t , ϵ t ], noise step ϵ s , learning rate η , SNN training t-steps T, total training epochs N ep , iterationN. 2 // Initialize parameters 3 κ ← 0 4 for l← 1 to L do 5 θ l ← ANN trained θ l 6 v l t ← initThreshold(θ l ,X) 7 l l k ← 1.0 8 end 9 for n← 1 to N ep do 10 for each batch B⊂ (X,Y) do 11 for p← 1 to N do 12 // Compute gradients through STDB 13 δ θ ← E (x,y)∈B [∇ W L(g(x+κ,y ; T N ))] 14 δ vt ← E (x,y)∈B [∇ vt L(g(x+κ,y ; T N ))] 15 δ l k ← E (x,y)∈B [∇ l k L(g(x+κ,y ; T N ))] 16 // Compute perturbation 17 δ x ← [∇ x L(g(x+κ,y ; T N ))] 18 κ ← clip(κ +ϵ s ∗ sign(δ x ),− ϵ t ,ϵ t ) 19 // Update trainable parameters 20 θ ← θ − η ∗ δ θ 21 v t ← v t − η ∗ δ vt 22 l k ← l k − η ∗ δ l k 23 end 24 end 25 end 7.4 Experiments 7.4.1 Experimental Setup Dataset and ANN training. For our experiments we used two widely accepted image classification datasets, namely CIFAR-10 and CIFAR-100. For both ANN and direct-input SNN training, we use the standard data-augmented (horizontal flip and random crop with reflective padding) input. For rate-coded input based SNN training, we produce a spike train with rate proportional to the input pixel via a Poisson generator function [82]. We performed ANN training for 240 epochs with an initial learning rate (LR) of 0.01 that decayed by a factor of 0.1 after 150, 180, and 210 epochs. 136 ANN-SNN conversion and SNN training. We performed the ANN-SNN conversion as recommended in [199] to generate initial thresholds for the SNN training. We then train the converted SNN for only 30 epochs with batch-size of 32 starting with the trained ANN weights. We set starting LR to 10 − 4 and decay it by a factor of 5 after 60%, 80%, and 90% completion of the total training epochs. Unless stated otherwise, we used training time steps T of 6, 8, and 10 for VGG5, VGG11, and ResNet12, respectively. To avoid overfitting and perform regularization we used a dropout of 0.2 to train the models. The ϵ s is chosen to be 0.013 and 0.025 (apart from the ϵ s sweep test) to train with VGG5 and VGG11, respectively, with ϵ t equal to ϵ s . For ResNet12 we chose ϵ s to be 0.008 and 0.015 on CIFAR-10 and CIFAR100, respectively. Also, N is set to 2 unless otherwise mentioned. The basic motivation to pick hyperparametersN, ϵ s , and ϵ t is to ensure there is only an insignificant drop in the clean image accuracy while still improving the adversarial performance. We conducted all the experiments on a NVIDIA 2080 Ti GPU having 11 GB memory with the models implemented using PyTorch [103]. Further training and model details along with analysis on the hyperparameters are provided in the supplementary material. Adversarial test setup. 
For PGD, we set ϵ for the L ∞ neighborhood to 8/255, the attack step size α = 0.01, and the number of attack iterations K to 7, the same values as in [201]. For FGSM, we choose the same ϵ value as above. 7.4.2 Performance Against WB and BB Attacks To perform this evaluation, for each model variant we use three differently trained networks: ANN equivalent Φ ANN , hybrid traditionally trained SNN Φ T SNN , and SNNtrainedwithproposedtechniqueΦ P SNN , alltrainedtohavecomparableclean- image classification accuracy. We compute ∆ d as the difference in clean-image 137 Table7.2: PerformancecomparisonofSNNmodelsgeneratedusingtheproposedtrain- ing scheme on clean and adversarially-generated images under a white-box attack. Accuracy (%) with ∆ a over traditional ∆ a over ANN Model proposed SNN training SNN training equivalent Clean(∆ d ) FGSM PGD FGSM PGD FGSM PGD Dataset : CIFAR-10 VGG5 87.5 (-0.4) 38.0 9.1 +2.5 +3.8 +25 +7.1 ResNet12 90.3 (-1.6) 33.3 3.8 +12.2 +3.5 +13.4 +1.8 Dataset : CIFAR-100 VGG11 65.1 (-0.4) 22.0 7.5 +5.7 +4.6 +5.1 -0.7 ResNet12 58.9 (-3.0) 19.3 5.3 +8.8 +4.7 +5.8 +2.5 classification performance between Φ P SNN and Φ T SNN . We define ∆ a 3 as the ac- curacy difference between Φ P SNN and either of Φ T SNN or Φ ANN while classifying on perturbed image. Note, both Φ P SNN and Φ T SNN are trained with direct inputs. Table 7.2 shows the absolute and relative performances of the models generated through our training framework on white-box attack generated images using both FGSM and PGD attack techniques. In particular, we observe that with negligible performance compromise on clean images, Φ P SNN consistently outperforms Φ T SNN forallthemodelsonbothdatasets. Specifically,weobservethattheperturbedim- age classification can have an improved performance of up to 12 .2% and 8.8%, on CIFAR-10 and CIFAR-100 respectively. Compared to Φ ANN we observe improved performance of up to 25% on WB attacks. Table 7.3 shows the model performances and comparisons on black-box attack generated images using both FGSM and PGD. For this evaluation, for each model variant we used the same model trained with a different seed to generate the perturbedimages. ForallthemodelsonboththedatasetsweobserveΦ P SNN yields higher accuracy on the perturbed images generated through BB attack compared to those generated through WB attack, primarily because of BB attacks yield weaker perturbations [9,206]. Importantly, we observe superior performance of Φ P SNN over both Φ T SNN and Φ T SNN under this weaker form of attack. In particular, 3 ∆ a between model M1 and M2 is Acc M1 %− Acc M2 %. 138 Table7.3: PerformancecomparisonofSNNmodelsgeneratedusingtheproposedtrain- ing scheme on clean and adversarially-generated images under a black-box attack. Accuracy (%) with ∆ a over traditional ∆ a over ANN Model proposed SNN training SNN training equivalent Clean FGSM PGD FGSM PGD FGSM PGD Dataset : CIFAR-10 VGG5 87.5 42.1 14.9 +3.9 +8.3 +18.1 +8.5 ResNet12 90.3 38.4 7.8 +13.7 +7.2 +9.7 +3.5 Dataset : CIFAR-100 VGG11 65.1 29.1 16.1 +10.0 +9.9 +5.6 +0.9 ResNet12 58.9 24.5 12.1 +10.4 +10.1 +1.3 ∼ 0 (a) (b) Figure 7.6: (a) Normalized GPU memory usage and (b) average training time for a batchof200imagesforVGG5, VGG11, andResNet12whentrainedwiththetraditional and proposed approaches. Φ P SNN provides an improvement ∆ a of up to 13.7% and 10.4% on CIFAR-10 and CIFAR-100, compared to Φ T SNN . Fig. 
7.6 shows the normalized random access memory (RAM) memory and average training time for 200 batches for both the traditional and presented SNN training. Interestingly, due to the shorter update interval the proposed approach require less memory by up to∼ 25% while incurring no extra GPU training time. 7.4.3 Discussion Here, we evaluate the potential presence of obfuscated gradients through ex- periments with the HIRE-SNN trained models under different attack strengths. We then study the efficacy of noise crafting and performance under no trainable 139 threshold-leak condition. Finally, we evaluate the impact of the new knob ϵ s in trading off clean and perturbed image accuracy. Gradientobfuscationanalysis. Weconductedseveralexperimentstoverify whether the inherent robustness of the presented HIRE-SNNs come from an incor- rect approximation of the true gradient based on a single sample. In particular, theperformanceofgeneratedmodelswascheckedagainstthefivetests(Table7.4) proposed in [9] that can identify potential gradient obfuscation. AsshowninTable7.2and7.3,forallthemodelsonbothdatasetsthesingle-step FGSM performs poorly compared to its iterative counterpart PGD. This certifies the success of Test (i), as listed in Table 7.4. Test (ii) passes because our black- box generated perturbations in Table 7.3 yield weaker attacks 4 than their white- box counterparts shown in Table 7.2. To verify Tests (iii) and (iv) we analyzed VGG5 on CIFAR-10 with increasing attack bound ϵ . As shown in Fig. 7.7(a), the classification accuracy decreases as we increase ϵ and finally reaches an accuracy of∼ 0%. Test (v) can fail only if gradient based attacks cannot provide adversarial examples for the model to misclassify. It is clear from our experiments, however, that FGSM and PGD, both variants of gradient based attacks, can sometimes fool the network despite our training. We also evaluated the VGG5 performance with increased attack strength by increasing the number of iterations K of PGD and found that the model’s robust- ness decreases with increasing K. However, as Fig. 7.7(b) shows, after K = 40, the robustness of the model nears an asymptote. In contrast, if the success of the HIRE-SNNs arose from the incorrect gradient of a single sample, increasing the attack iterations would have broken the defense completely [2]. 4 Note that here we say an attack is weaker than other when the classification accuracy on that attack-generated images is higher compared to the images generated through the other. 140 (a) (b) Figure 7.7: White-box PGD attack performance as a function of (a) bound ϵ and (b) attack iterations K with VGG5 on CIFAR-10. Thus, based on these evaluations we conclude that even if the models include obfuscatedgradients,theyarenotsignificantsourceoftherobustnessfortheHIRE- SNNs. Importance of careful noise crafting. To evaluate the merits of the pre- sented noise crafting technique, we also trained VGG11 with a version of our training algorithm with the perturbation introduced via Gaussian noise. In par- ticular, we pertubed the image pixels using Gaussian noise with zero mean and standard deviation equal to ϵ s . It is clear from Fig. 7.8 that compared to the traditional training, the proposed training with perturbation generated through Gaussiannoise(GN)failstoprovideanynoticeableimprovementontheadversary- generated images both under white-box and black-box attacks. 
In contrast, train- ing with carefully crafted noise significantly improves the performance over that withGNagainstadversarybyupto6.5%and9.7%,onWBandBBattack-created images, respectively. Table 7.4: Checklist set of tests for characteristic behaviors caused by obfuscated and masked gradients [9]. Checks to identify gradient obfuscation Fail Pass i) Single-step attack performs better compared to iterative attacks ✓ ii) Black-box attacks performs better compared to white-box attacks ✓ iii) Increasing perturbation bound can’t increase attack strength ✓ iv) Unbounded attacks can’t reach∼ 100% success ✓ v) Adversarial example can be found through random sampling ✓ 141 Figure 7.8: Comparison of traditional SNN vs. proposed training with both GN and crafted input noise. Training were performed with direct-input VGG11 on CIFAR-100. Efficacy of proposed training when threshold and leak parameters are not trainable. To further evaluate the efficacy of proposed training scheme, we trained VGG5 on CIFAR-10 using our technique but with threshold and leak parameters fixed to their initialized values. As shown in Table 7.5, our gener- ated models still consistently outperform traditionally trained models under both white-box and black-box attacks with negligible drop in clean image accuracy. In- terestingly, fixing the threshold and leak parameters yields higher robustness at the cost of lower clean-image accuracy. This may be attributed to the difference in adversarial strength of the perturbed images and is a useful topic of future research. 142 Figure 7.9: (a) Inference T steps for rate-coded vs direct input trained SNNs, (b-e) Accuracy vs. ϵ s plot for both clean and adversarially generated images (both with WB and BB attack settings) with VGG5 (b, c) and VGG11 (d, e) on CIFAR-10 and CIFAR-100, respectively. 143 Impact of the noise-step knob ϵ s . To analyze the impact of the introduced hyperparameter ϵ s , we performed experiments with VGG5 and VGG11, training the models with various ϵ s ∈[0.01,0.03]. As depicted in Fig. 7.9 with increased ϵ s themodelsshowaconsistentimprovementonbothwhite-boxandblack-boxattack generated perturbed images with only a small drop in clean image performance of up to∼ 2%. Note, here ϵ s = 0 corresponds to traditional SNN training. With the optimal choice of ϵ s our models outperform the state-of-the-art inherently robust SNNs trained on rate-coded inputs [201] maintaining similar clean image accuracy with an improved inference latency of up to 25× as shown in Fig. 7.9. Table 7.5: Performance comparison of proposed with traditional SNN training when threshold-leak parameters are frozen to their initialized values. Model Dataset Training Clean Acc. % on WB Acc. % on BB Method Acc. (%) FGSM PGD FGSM PGD VGG5 CIFAR-10 Traditional 87.2 33.0 4.5 40.4 8.8 Proposed 86.8 40.5 13.6 46.2 21.9 Table 7.6: Estimated energy costs for various operations in a 45 nm CMOS process at 0.9 V [8]. Serial Operation Energy (pJ) No. 32-b INT 32-b FP 1. 32-bit multiplication 3.1 3.7 2. 32-bit addition 0.1 0.9 3. 32-bit MAC (#1 + #2) 3.2 4.6 4. 32-bit AC (#2) 0.1 0.9 144 (a) (b) Figure 7.10: Comparison of normalized compute energy computed assuming (a) 32-bit FP and (b) 32-bit INT implementations. 
7.4.4 Computation Energy For an L-layer SNN with rate-coded and direct inputs, the inference computation energy is, E rate SNN =( L X l=1 FL l SNN )· E AC (7.2) E direct SNN =FL 1 SNN · E MAC +( L X l=2 FL l SNN )· E AC (7.3) where E AC and E MAC represent the energy cost of AC and MAC operation, re- spectively. For our evaluation we use their values as shown in Table 7.6. In partic- ular, as exemplified in Fig. 7.10(a), the computation energy benefit of HIRE-SNN VGG11overitsinherentlyrobustrate-codedSNNandANNcounterpartisashigh as4.6× and10× , respectively, considering32-bfloatingpoint(FP)representation. For a 32-b integer (INT) implementation, this advantage is as much as 3.9× and 53× , respectively (Fig. 7.10(b)). 7.5 Conclusions In this chapter we first analyzed the inherent robustness of low-latency SNNs trained with direct inputs to provide insightful observations. Motivated by these 145 observations we then present a training algorithm that harnesses the inherent ro- bustness of low-latency SNNs without incurring any additional training time cost. Weconductedextensiveexperimentalanalysistoevaluatetheefficacyofourtrain- ing along with experiments to understand the contribution of the carefully crafted noise. Particularly, compared to traditionally trained direct input SNNs, the gen- erated SNNs can yield accuracy improvement of up to 13.7% on black-box FGSM attack generated images. Compared to the SOTA inherently robust VGG11 SNN trained on rate-coded inputs (CIFAR-100) our models perform similarly or better on clean and perturbed image classification performance while providing an im- proved performance of up to 25× and ∼ 4.6× , in terms of inference latency and computation energy, respectively. We believe that this study is a step in making deep SNNs a practical energy-efficient solution for safety-critical inference appli- cations where robustness is a need. 146 Part III: Vulnerability and Opportunities in Private Inference 147 Chapter 8 Reality Check of Model Privacy under Compression through Distillation This chapter first provides the introduction and motivation behind the potential negativeusageofknowledgedistillation(KD)asatoolforstealingmodelIPSection 8.1. Preliminaries of KD and related work on model IP protection is presented in Section 8.2. Section 8.3 provides motivational case studies to understand the limitations of the existing model IP protection methods in protecting against KD- based stealing. Leveraging the observations from these studies, Section 8.4 then presents the idea of skeptical students and details experimental results in Section 8.5 to show its efficacy in successfully stealing model IP both under data-available and data-free scenarios. Finally the chapter concludes in Section 8.6. 8.1 Introduction and Motivation KD [76] aims to transfer the useful knowledge of a trained model (the teacher) to another model (the student). KD has found success in various applications [77,182,207,208]andisparticularlyusefulforresource-constrainedIoTapplications where the compute budget is limited and compute-efficient models are required. Generally, KD requires the student model to be trained over the same data-set 148 that is used to train the teacher. However, recent research [10,209,210] has shown the efficacy of KD even under the “data-free” scenario where the training data may not be available for the student to get trained. 
Overthepastfewyearsvariousformsofdistillationhavebeenproposed,includ- ing distillation from the student itself [211] and via an ensemble of students [212]. However, recently, reference [89] has highlighted the fact that many forms of dis- tillation may unintentionally leak the IP of a teacher model. In particular, some teacher models contain significant IP associated with the arduous effort of both data collection and model training, which motivates their release as “black-box” executable pieces of software. Moreover, these trained models may even enable safeguarding the training data (for example, sensitive medical images [213] and company proprietary information [214,215]) as well as their performance on the secure data. Use of KD under these circumstances, sometimes under the data-free condition, may enable training of an unauthorized student to yield comparable performance as the teacher. To mitigate this issue, reference [89] has proposed the idea of a nasty teacher that prevents knowledge leaking to a student and thereby reduces the student’s classification performance. Fig. 8.1(a) depicts the thesuccessofanastyResNet50indegradingvariousstudentmodels’classification performances compared to their respective baseline performances 1 . Earlier evalu- ations [89] have shown that nasty teachers can retain their efficacy under various settings of two key hyperparameters, namely the weight of the distillation loss (a) andthesoftmaxtemperature(τ )ofthedistillationloss. However,acomprehensive evaluation of the efficacy of undistillable nasty teachers has yet to be completed. Towards this goal, we investigate the performance of KD on distillation at differ- ent depths of the student model, and in particular, how the teacher’s influence changes when transferring knowledge to an intermediate shallow section of the student through an auxiliary classifier (AC). We find that the impact of the nasty 1 The baseline models are trained with only cross-entropy (CE) loss. 149 (a) (b) (c) Figure 8.1: Distillation from a nasty ResNet50 to (a) normal students, (b) proposed skeptical students, on CIFAR-100. In particular, for MobileNetV2 (MbV2) which is a reduced parameter model, the proposed distillation method can improve the accuracy by 59.49%. (c) Impact of transferring knowledge at various depth of a ResNet18 from a nasty teacher. BB represents a basic-block layer. teacher drastically reduces as we transfer knowledge to a shallow subsection of the student (Fig. 8.1(c)). Basedonthesefindings,wepresenta skepticalstudent thatusesanintermediate shallow auxiliary classifier to transfer the information derived through the soft probabilities of the teacher’s output classes. We further propose a novel hybrid distillation scheme to improve learnability of the student by distilling both from a teacher and the student itself. Our approach has some similarity with self- distillation (SD) [211] because both approaches use an auxiliary classifier for the knowledge transfer. However, the goal of SD is to show the efficacy of a model distilling from itself, contrasting our goal of analyzing the possible presence of a potential model stealer who can extract knowledge even from an undistillable nasty teacher. More importantly, the proposed hybrid distillation is effective in stealing a model’s IP even under a “data-free” scenario, contrasting SD which is only applicable for students who have access to training data. 
We conduct extensive experiments using both standard KD with available training data on CIFAR-10, CIFAR-100, and Tiny-ImageNet, and data-free KD on CIFAR-10 testset. Experimental results show that compared to normal ones, skeptical students exhibit improved performance of up to ∼ 59.5% and∼ 5.8% for 150 data-available and data-free KD, respectively, when distilled from nasty teachers. Thisexposesasignificantlimitationofnastyteachersattemptingtoprotectmodel IP. Moreover, our proposed students perform similar to normal student models while distilled from normal teachers, demonstrating their efficacy irrespective of the teacher being nasty or not. 8.2 Preliminaries and Related Work 8.2.1 Knowledge Distillation Knowledgedistillation,similartothegoalofvariousmodelcompressiontechniques, embedstherichinformationofacompute-heavymodelintoamodelthatgenerally requiresfewercomputations. ThetraditionalKD[76]reliesoninformationtransfer through a Kullback–Leibler (KL) divergence measure between the soft logits at the output classifier layers of the teacher and the student. Apart from this, over the past few years, various other efficient distillation methods have been proposed, includingdistillingfromhintsprovidedbytheteacher[216],distillingviaattention transfer [135], and other approaches [217,218]. To reduce the distillation-based training time and avoid the requirement of a separate teacher, reference [211] has proposed self-distillation. The authors partition the student model into several shallowsections,eachhavingitsownauxiliaryclassifier,towhichthefinalclassifier transfers the soft-logits to enhance the model’s classification performance. Several recent studies [219,220] have also shown the efficacy of KD to a student from its own pre-trained variant as its teacher. Similar to [89], in this work, we assume to have no access to the teacher model’s intermediate features. Rather, we focus on the information leaking to an unauthorized outsider and thus rely on standard KD. 151 8.2.2 Model IP protection Protection of model IP has drawn significant interest primarily due to the mas- sive human resource and financial costs required for large model training. Ear- lier works explored various defense strategies, including adaptive misinformation against model performance stealing [221] and passport-based defense [222]. In most of these cases, the stealer is assumed to have access to only synthetically generated data. However, also of interest is when a portion of training data is unintentionally leaked. 8.2.3 Poisoning of Neural Network Models An attacker can degrade the performance of a neural network by simply inject- ing poisoned data [191] into the training set. Adversarial-attack generated im- ages[149],[1],[87]haveproventobeeffectiveindegradingamodel’sperformance. Backdoor attacks [223] insert crafted malicious data into the training set that ap- parently trains the model to perform well until such time that the attacker sets up a signal that degrades the model performance drastically. The Bit flip attack [125] corrupts selective bits of the trained DNN weights to lower its performance. As mentioned earlier, reference [89] has recently proposed to poison a neural network model through training, such that the model retains its classification performance, but loses its ability to be used as an efficient teacher, thus referred to as a nasty teacher. These nasty teachers are believed to protect the model IP 2 . 
This work analyzes the degree of IP protection that such teachers provide and, in particular, presents a skeptical-student-based distillation technique that diminishes the effect of their nastiness, as detailed in the next section. 2 Insofar as a teacher is trying to protect its private IP from inquiring or even intrusive students, we believe a better phrase to characterize him/her is as a "defensive teacher" or, in the worst case, as a "secretive teacher". In our view, there is nothing "nasty" about what the teacher is attempting to do. However, because the prior art has consistently used the term "nasty teacher", we also use that phrase in this chapter.
Table 8.1: Performance of the student (ResNet18) under the transferability test on CIFAR-100.
Teacher    Teacher type    Teacher Acc. (%)    Student Acc. (%)    ∆ base
ResNet50   Nasty           76.57               72.47               -5.08
ResNet18   Distilled       72.47               70.99               -6.56
ResNet50   Normal          78.04               79.39               +1.84
ResNet18   Distilled       79.39               79.47               +1.92
8.3 Motivational Case Study
To motivate our skeptical students, this section presents an empirical analysis that explores the efficacy of nasty teacher models under two distinct KD scenarios 3.
8.3.1 Transferability of the Impact of Nasty Teachers
Definition 1. Secondary student: We define a secondary student as a model that is trained via KD from a trained model which was itself earlier trained via distillation from a teacher model. In this context, we refer to the student that is distilled from the original teacher model as the primary student. We measure the transferability impact (TI) of a teacher as the performance improvement (or degradation) of a secondary student with respect to its baseline (∆ base). A negative ∆ base signifies the success of a teacher's privacy- or confidentiality-preserving effort [89].
We first trained both normal as well as nasty variants of a ResNet50 on CIFAR-100. We then used two ResNet18 students to distill knowledge from these two teachers. Finally, to test TI, we used the distilled ResNet18 models as teachers to secondary students with identical model architectures.
Observation 1. The TI of a nasty teacher is negative, meaning the nastiness of a teacher transfers to its student.
3 The description of training hyperparameters is given in Section 8.5.1 for all the experiments in this section.
Figure 8.2: A ResNet18 student's performance on the CIFAR-100 dataset.
Table 8.1 shows that the classification performance of a secondary student is 6.56% lower than the baseline while using a distilled ResNet18 as the teacher. Interestingly, this implies that the false sense of generalization that the primary student inherits makes it become a nasty teacher itself. Thus, simple redistillation from a primary to a secondary student will not evade the nastiness of a teacher 4.
8.3.2 Transferring Knowledge to a Shallow Subsection of the Student
We computed the weighted KL-divergence-based distillation loss at an auxiliary classifier (AC) of the student, and used its final classifier (C) to compute the weighted CE loss only. We evaluated the impact of the auxiliary classifier placed at different depths of the student model. In particular, for the ResNet18 student we performed four experiments by placing the AC after the n-th BB, with n ranging from 1 to 4, where the model's final classifier is located at the end of the 4th block. We follow a similar procedure as [211] to design the AC branches.
Observation 2. The influence of a teacher (both nasty and normal) on a student's performance reduces as we distill knowledge to a shallow subsection of the student model.
Footnote 4: It may also be interesting to note that, as shown in Table 8.1, the TI of a normal teacher follows a similar trend and remains positive.

Figure 8.3: Skeptical student distillation framework. Note that the arrows of the distillation loss components are directed from teacher to student for the corresponding KL-divergence computation.

Fig. 8.2 shows that the accuracy of the student model approaches the baseline accuracy as we distill from the teacher at shallower depths (n = 1). Interestingly, this trend can be observed for both nasty and normal teachers. In particular, from a nasty ResNet50, the student ResNet18 achieves a test accuracy of up to 77.19% when distilled at AC 1, in contrast to an accuracy of 72.47% when distilled at the final classifier. On the other hand, ResNet18 can suffer a classification accuracy reduction of as much as 1.56% when distilled at shallow depths. We leverage these two observations in our hybrid distillation for the skeptical students to diminish the transfer of teachers' nastiness, as presented in the next section.

8.4 Skeptical Students

Let us consider a student model Φ_S that distills knowledge from a pre-trained teacher Φ_T, where g_{Φ_S}(·) and g_{Φ_T}(·) are the functions describing the student and teacher models, respectively. Let (x, y) be the vectorized pairs of inputs and corresponding output labels used to train these models. For teacher-based traditional KD [76], the training loss for the student may be written as

$$\mathcal{L}_{KD} = (1-a)\,\mathcal{L}_{CE}\!\left(\sigma(g_{\Phi_S}(x,y))\right) + a\,\tau^{2}\,\mathcal{L}_{KL}\!\left(\sigma(g_{\Phi_S}(x,y),\tau),\, \sigma(g_{\Phi_T}(x,y),\tau)\right) \qquad (8.1)$$

where L_CE represents the student's cross-entropy (CE) loss and L_KL represents the KL-divergence loss aimed at transferring knowledge. Here, σ(·) is the softmax function and τ is the softmax temperature, both of which are used to compute soft probabilities; τ is set to 1 for the CE loss. The hyperparameter a acts as a balancing factor between the two loss terms.

Based on the observations in Section 8.3, we propose a skeptical student that can largely diminish the teacher's nastiness in the very first round of distillation. In particular, skeptical students are models that always receive the teacher's KL-divergence-driven knowledge at a shallow section (Φ'_S), as depicted in Fig. 8.3. Thus the teacher-driven loss L_T can be formulated the same as L_KD with Φ'_S replacing Φ_S in both the CE and KD loss components. To train the complete student model Φ_S we rely on the CE loss applied at the final classifier. However, due to the reduced influence of the teacher, such students can hardly get any benefit of KD from a normal teacher. Hence, to improve these models' performance, motivated by the idea of self-distillation (SD), we introduce a third loss term that allows distillation to shallow ACs from the student's final classifier,

$$\mathcal{L}_{SD} = \sum_{j\in J}\left[(1-\beta)\,\mathcal{L}_{CE}\!\left(\sigma(g_{\Phi_S^{j}}(x,y))\right) + \beta\,\mathcal{L}_{KL}\!\left(\sigma(g_{\Phi_S^{j}}(x,y),\tau),\, \sigma(g_{\Phi_S}(x,y),\tau)\right)\right] \qquad (8.2)$$

Figure 8.4: Data-free distillation from a teacher to a skeptical student.

Note that J ∈ AC_i where N > i > i_{Φ'_S}⁵. Here N and i_{Φ'_S} represent the total number of sub-blocks in the model and the sub-block index at which the AC of Φ'_S is placed, respectively. Finally, our hybrid distillation loss is given by

$$\mathcal{L}_{S} = \gamma_{1}\,\mathcal{L}_{T} + \gamma_{2}\,\mathcal{L}_{SD} + \gamma_{3}\,\mathcal{L}_{CE}\!\left(\sigma(g_{\Phi_S}(x,y))\right) \qquad (8.3)$$

where the last term corresponds to the CE loss of the complete student Φ_S. Here, γ_1, γ_2, and γ_3 are hyperparameters that balance the KD loss of the auxiliary classifier, the self-distillation loss, and the CE loss of the student.
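A minimal PyTorch sketch of the hybrid objective in Eqs. 8.1–8.3 is given below. It assumes the student's forward pass already exposes logits at the teacher-facing AC (Φ'_S), at one self-distillation AC, and at the final classifier; the function names, the detaching of targets, and the default hyperparameter values are illustrative rather than taken from a released implementation.

```python
import torch
import torch.nn.functional as F

def soft_kl(student_logits, target_logits, tau):
    """KL divergence between temperature-softened output distributions."""
    return F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(target_logits / tau, dim=1),
                    reduction="batchmean")

def skeptical_student_loss(ac_t_logits, ac_sd_logits, final_logits, teacher_logits,
                           labels, tau=4.0, a=0.9, beta=0.7, gammas=(1.0, 1.0, 1.0)):
    # Eq. 8.1 applied at the shallow section Phi'_S: the teacher-driven loss L_T.
    l_t = (1 - a) * F.cross_entropy(ac_t_logits, labels) \
          + a * tau ** 2 * soft_kl(ac_t_logits, teacher_logits.detach(), tau)
    # Eq. 8.2 with a single self-distillation AC: the final classifier provides
    # the target (detaching it is one reasonable design choice).
    l_sd = (1 - beta) * F.cross_entropy(ac_sd_logits, labels) \
           + beta * soft_kl(ac_sd_logits, final_logits.detach(), tau)
    # Eq. 8.3: hybrid loss combining L_T, L_SD, and the CE loss of the full student.
    g1, g2, g3 = gammas
    return g1 * l_t + g2 * l_sd + g3 * F.cross_entropy(final_logits, labels)
```

With γ_1 = γ_2 = γ_3 = 1.0, as used in the experiments later in this chapter, the three components contribute with equal weight.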
It is noteworthy that we use the auxiliary classifier sections during training and that these auxiliary sections may be removed during inference, nullifying any extra inference parameter cost. Skeptical students for data-free KD. As described earlier, the skeptical stu- dentsprimarilyuseanintermediateauxiliaryclassifiertodistillknowledge. There- fore,teachingtheremainingpartofthenetwork(Φ S -Φ ′ S )is,inparticular,difficult for data-free KD because there is no CE-loss to train the whole network Φ S . To mitigatethisissueweproposean auxiliary self distillation losstodistillknowledge to the final classifier from the intermediate auxiliary classifier. Note that, as de- picted in Fig. 8.4, the same auxiliary classifier works as a student to learn from a teacher under the data-free scenario. To evaluate this, we use recently proposed 5 To minimize the auxiliary layer computation overhead during training, in this chapter we use 1 AC to transfer the teacher knowledge and 1 AC for SD. 157 zero-shot knowledge transfer [10] with the skeptical students using a loss function enhanced by the auxiliary self KD, L S DF =L KL σ (g Φ ′ S (x,y),τ ),σ (g Φ T (x,y),τ ) +L KL σ (g Φ S (x,y),τ ),σ (g Φ ′ S (x,y),τ ) +γ at L AT (8.4) The first term takes care of knowledge transfer from the teacher, while the second term helps train the final classifier. Similar to the original paper [10], we also use an attention-transfer lossL AT . Nastyteachertraining. Asproposedin[89],weuseself-underminingknowledge distillationtodesignthenastyteacherΦ T . Specifically,wetrainΦ T viadistillation from a pre-trained model Φ A with the same network architecture, minimizing the following loss L N =L CE σ (g Φ T (x,y)) − α N ∗ τ 2 N ∗L KL σ (g Φ T (x,y),τ N ),σ (g Φ A (x,y),τ N ) (8.5) where the CE loss terms helps retain Φ T ’s classification performance. The second termmaximizestheKLdivergencebetweenΦ A andΦ T allowingΦ T tolearnafalse form of generalization that plays a key role in its undistillability [89]. Here, τ N is the softmax temperature, similar to traditional KD, and α N controls the severity of the self-undermining distillation of the nasty teacher. 158 8.5 Experimental Results 8.5.1 Experimental Setup Models and Datasets. To evaluate the efficacy of our hybrid distillation ap- proach, we performed detailed experiments on three popular datasets, CIFAR-10, CIFAR-100[104],andTiny-ImageNet[105]withResNet18,ResNet50[15]andMo- bileNetV2 [57] models. We used PyTorch API to define and train our models on an Nvidia RTX 2080 Ti GPU. Training hyperparameters. We used standard data augmentation techniques (horizontal flip and random crop with reflective padding) and the SGD optimizer for all training. To create a nasty teacher, we first trained a network Φ A for 160 epochs on CIFAR-10 and 200 epochs for CIFAR-100 and Tiny-ImageNet with an initial learning rate (LR) of 0.1 for all. For CIFAR-10, we reduced the LR by a factor of 0.1 after 80 and 120 epochs. For CIFAR-100 and Tiny-ImageNet the LR decayed at 60, 120, and 160 epochs by a factor of 0.2. We then trained the nastyΦ T ofsamearchitecturewiththesameepochsandLRhyperparameters. We choseα N as0.04,0.005and0.005,forCIFAR-10,CIFAR-100,andTiny-ImageNet, respectively [89]. Similar to [89], we chose τ N to be 4, 20, and 20 for the three datasets. ForthedistillationtrainingtoΦ S (bothnormalandskeptical),wetrained for 180 epochs with a starting LR of 0.05 that decays by a factor of 0.1 after 120, 150, and 170 epochs. 
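For reference, the self-undermining objective of Eq. 8.5 used above to create the nasty teachers can be sketched as below. The function name is illustrative, and the defaults correspond to the α_N = 0.005, τ_N = 20 settings listed earlier for CIFAR-100 and Tiny-ImageNet.

```python
import torch
import torch.nn.functional as F

def nasty_teacher_loss(t_logits, a_logits, labels, alpha_n=0.005, tau_n=20.0):
    """L_N of Eq. 8.5: keep Phi_T accurate (CE term) while *maximizing* the KL
    divergence to the pre-trained adversarial network Phi_A (note the minus sign)."""
    ce = F.cross_entropy(t_logits, labels)
    kl = F.kl_div(F.log_softmax(t_logits / tau_n, dim=1),
                  F.softmax(a_logits.detach() / tau_n, dim=1),
                  reduction="batchmean")
    return ce - alpha_n * tau_n ** 2 * kl
```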
Unless stated otherwise, we kept τ the same as τ N and chose α and β to be 0.9 and 0.7, respectively. We placed the skeptical students’ auxiliary classifiers after the 2 nd (Φ ′ S for KD from the teacher) and 3 rd (for SD) BB of a total of 4 ResNet blocks. To give equal weight to the loss components of Eq. 8.3, we chose γ 1 = γ 2 = γ 3 =1.0, for all the experiments. We performed all the experiments with two different seeds and report the average accuracy with std deviation (in bracket) in the tables. 159 8.5.2 Data-available Distillation To evaluate model performance, we conducted two types of distillation: dis- till to self (DtoS) [219], where both teacher and student architectures are the same, and KD from a compute heavy teacher to a reduced-parameter student (for example,Φ T : ResNet50, Φ S : ResNet18, MobileNetV2). For DtoS, we also per- formed distillation with both the assumption of the model being heavy (Φ T/S : ResNet50) and lite (Φ T/S : ResNet18). Table 8.2 shows the corresponding perfor- mance when distilled from a nasty teacher. Skeptical students always outper- form their normal counterparts providing better accuracy with improvements of up to ∼ 59.49%. These results clearly show the efficacy of skeptical students in mitigating the undistillability of a nasty teacher. We also measure the classifica- tion performance by ensembling the ACs and final classifier outputs and denote that as ‘Skeptical-E’. However, the ensemble performance is always inferior to the final classifier, which is primarily due to inferior performance of the AC that dis- tills knowledge from the nasty teacher. Table 8.3 shows the performance of both skeptical and normal students when distilled from a normal teacher. As we can see, the ensemble output of skeptical students perform better than their normal counterparts. These results motivate the use of skeptical students for distillation irrespective of whether the teacher is nasty or not. In both the tables ∆ acc is the accuracy difference between a skeptical and corresponding normal student when both are trained via distillation from a teacher, i.e. ∆ acc = {max(acc s ,acc se ) - acc n }. 160 Table 8.2: Performance of normal vs. skeptical student when distilled from a nasty teacher. Dataset Φ T Φ T Φ S Φ S Base- Student Acc. (%) ∆ acc Acc. (%) line Acc. (%) Normal (accn) Skeptical (accs) Skeptical-E (accse ) ResNet18 94.67 ResNet18 95.15 94.13(± 0.18) 95.09(± 0.15) 94.77(± 0.05) +0.96 MobileNetV2 90.12 88.13(± 0.13) 90.37(± 0.25) 90.21(± 0.18) +2.24 CIFAR 94.28 ResNet18 95.15 94.38(± 0.18) 95.16(± 0.01) 95.02(± 0.01) +0.78 -10 ResNet50 ResNet50 94.9 94.21(± 0.04) 95.48(± 0.14) 95.48(± 0.14) +1.27 MobileNetV2 90.12 88.76(± 0.14) 91.02(± 0.09) 90.88(± 0.23) +2.26 ResNet18 77.55 ResNet18 77.55 75.00(± 0.14) 77.33(± 0.21) 76.38(± 0.1) +2.33 MobileNetV2 69.24 7.13(± 0.71) 66.62(± 0.30) 64.26(± 0.64) +59.49 CIFAR 76.57 ResNet18 77.55 72.28(± 0.27) 77.25(± 0.25) 75.48(± 0.54) +4.97 -100 ResNet50 ResNet50 78.04 74.14(± 0.85) 78.65(± 0.29) 77.61(± 0.1) +4.52 MobileNetV2 69.24 7.72(± 1.57) 66.38(± 0.50) 62.93(± 0.75) +58.66 Tiny- ResNet18 62.08 ResNet18 63.07 53.60(± 0.04) 65.76(± 0.83) 60.63(± 0.07) +12.16 ImageNet MobileNetV2 57.01 4.81(± 0.19) 54.74(± 0.84) 54.27(± 2.94) +49.93 161 Table 8.3: Performance of normal vs. skeptical student when distilled from a normal teacher. Dataset Φ T Φ T Φ S Φ S Base- Student Acc. (%) ∆ acc Acc. (%) line Acc. 
(%) Normal (accn) Skeptical (accs) Skeptical-E (accse ) ResNet18 95.15 ResNet18 95.15 95.38 (± 0.10) 95.45(± 0.10) 95.42(± 0.09) +0.07 MobileNetV2 90.12 91.36(± 0.17) 91.81(± 0.15) 92.00(± 0.28) +0.64 CIFAR ResNet18 95.15 95.43(± 0.11) 95.31(± 0.01) 95.27(± 0.04) -0.12 -10 ResNet50 94.9 ResNet50 94.9 95.15(± 0.13) 95.85(± 0.05) 96.09(± 0.01) +0.94 MobileNetV2 90.12 91.71(± 0.06) 91.71(± 0.18) 91.95(± 0.16) +0.24 ResNet18 77.55 ResNet18 77.55 78.96(± 0.12) 78.79(± 0.42) 79.68(± 0.52) +0.72 MobileNetV2 69.24 75.12(± 0.08) 71.63(± 0.19) 75.45(± 0.06) +0.33 CIFAR 78.04 ResNet18 77.55 79.21(± 0.24) 78.51(± 0.44) 79.86(± 0.01) +0.65 -100 ResNet50 ResNet50 78.04 79.56(± 0.13) 80.66(± 0.52) 81.96(± 0.52) +2.4 MobileNetV2 69.24 75.28(± 0.04) 71.76(± 0.16) 76.32(± 0.34) +1.04 Tiny- ResNet18 63.07 ResNet18 63.07 67.35(± 0.18) 66.49(± 0.30) 67.43(± 0.47) +0.08 ImageNet MobileNetV2 57.01 64.99(± 0.51) 59.37(± 0.01) 65.38(± 0.01) +0.39 162 Figure 8.5: Logit response visualization after the softmax layer. Each row contains an example image from CIFAR-10 dataset and corresponding response for normal teacher, nasty teacher, normal student and skeptical student. We used ResNet50 and ResNet18 as teacher and student model, respectively. 163 8.5.3 Qualitative Analysis We now present qualitative behavioral analysis of both normal and skeptical stu- dents upon distillation from both normal and nasty teachers. Fig. 8.5 shows that the nasty teacher has multiple non-negligible peaks at its final softmax logit re- sponse, in contrast to the normal teacher having mainly one high valued peak. As mentioned in [89], this can create a false sense of generalization to a normal student causing the student to misclassify, as shown by its logit response. Our skeptical students, on the other hand, not only classify correctly, but also largely mitigate the issue of multi peak logit response of a normal student. We present visualizations of the t-distributed stochastic neighbor embedding (t-SNE) for output logits in Fig. 8.6. It shows that the inter class cluster distance is shifted and reduced for certain classes of the nasty teachers. A similar shift of class clusters is also observed for the normal students and even in the AC of the skeptical student where teacher knowledge is transferred. However, the final classifier of the skeptical students has a similar class clustering distribution as a normal teacher. This demonstrates that the remaining sections of the student model (Φ S - Φ ′ S ) indeed remain free from the impact of the nasty teacher. 164 Figure8.6: VisualizationoftSNEfornormalandskepticalstudents(ResNet18)upondistillationfrombothnormalandevasive teacher (ResNet50) on CIFAR-10. For the skeptical students we plot visualization both at the final classifier (C) and auxiliary classifier (AC). 165 (a) (b) (c) (d) Figure 8.7: Ablation study with a and τ for normal and skeptical students (ResNet18) upon distillation from both normal and nasty teacher (ResNet50) on CIFAR-100. 8.5.4 Ablation Studies Ablation study with the Temperature τ . To further evaluate the influence of the hyperparameter τ on the student distillation, we performed ablation with τ ∈ [2,5,10,15,20]. As depicted in Fig. 8.7(a), when distilling from a nasty teacher,theskepticalstudentsmaintaintheirsuperioritycomparedtotheirnormal counterparts at all different values of τ . While distilling from a normal teacher, even at reduced τ , both the normal and skeptical student variants retain higher classification accuracy compared to their baseline (Fig. 
8.7(c)). Ablationstudywiththebalancingterm a. Todeterminetheinfluenceofthe undistillable teacher on the performance of the presented models, we conducted distillationwitha∈[0.2,0.4,0.6,0.8,0.9]. Asareduces,theinfluenceoftheteacher 166 Table 8.4: Performance of normal vs. skeptical student on data-free distillation [10] from a teacher. Dataset Φ T Φ T Φ T Φ S Student Acc. (%) ∆ acc type Acc. (%) Normal Skeptical With AT loss (grey-box) ResNet34 Nasty 94.81 ResNet18 87.7(± 1.20) 91.76(± 0.30) +4.06 CIFAR Normal 95.3 93.41(± 0.21) 93.52(± 0.06) +0.11 -10 ResNet50 Nasty 94.28 80.34(± 1.19) 86.14(± 0.01) +5.80 Normal 94.9 90.54(± 1.16) 91.93(± 0.04) +1.39 Without AT loss (black-box) CIFAR ResNet50 Nasty 94.28 ResNet18 20.95(± 0.91) 79.93(± 0.28) +58.93 -10 Normal 94.9 22.08(± 0.56) 80.71(± 0.6) +58.63 is reduced and we see an obvious improvement in student performance. Interest- ingly,aswecanseeinFig. 8.7(b),even at reduceda the skeptical students maintain improved performance compared to their normal counterparts. In distillation from a normal teacher, similar to the previous ablation, the skeptical students do not suffer from any significant performance drop compared to normal students (Fig. 8.7(d)). 8.5.5 Limited Data and Data-Free Distillation Insteadofhavingfullaccesstoalltrainingsamples,KDwithlimitedornoaccessto training sample is considered a more realistic scenario for model stealing. Fig. 8.8 showsthestudents’performanceupondistillationfromateacher(bothnormaland nasty)whenonlyafractionofthetotaltrainingdataisavailable. Inparticular,the figure shows under different % of training data availability the skeptical student performs consistently better than its normal variant upon distillation from a nasty teacher. When distilled from a normal teacher, the skeptical student perform similar to its normal counterpart. To demonstrate skeptical student’s performance under data-free scenario, we leverage the idea of zero shot knowledge transfer [10], a state-of-the-art data-free distillation technique. For this evaluation we used ResNet34 and ResNet50 as 167 (a) (b) Figure8.8: ResNet18onCIFAR-10datasetunderdifferentpercentageoflimitedtrain- ing data upon distillation from (a) nasty and (b) normal teachers. Table 8.5: Performance of a skeptical student (ResNet18) under transferability test on CIFAR-100. Teacher Teacher type Teacher Acc % Student Acc % ∆ base ResNet50 Nasty 76.57 77.43 -0.12 ResNet18 Nasty-distilled 77.43 79.22 +1.67 ResNet50 Normal 78.04 78.90 +1.35 ResNet18 Normal-distilled 78.90 79.92 +2.37 teacher models with ResNet18 as the student for both, on CIFAR-10. We used the same training hyperparameters as in [10] with the proposed loss introduced in Eq. 8.4 and evaluated the performance when the teacher is both grey and white boxed. In particular, for the grey-box and black-box assumptions, we computed the final loss with and without the attention-transfer (AT) loss from the teacher, respectively. Table 8.4 shows the skeptical students always yield higher classifica- tion accuracy compared to their normal counterparts upon distillation from both normal and nasty teachers. In particular, while distilling from a nasty teacher the student’s performance can improve up to 5.8% and 58.93%, with grey-box and black-box teacher assumptions, respectively. These results clearly show that skep- tical students can largely diminish the KD-immunity of a nasty model under even the data-free scenario. 
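A minimal sketch of the data-free objective of Eq. 8.4 used in these experiments is shown below. It assumes that logits from the teacher, the shallow section Φ'_S, and the full student Φ_S are available for each batch of synthesized images (the pseudo-sample generator of [10] is omitted), and the attention-transfer formulation and the weight γ_at are illustrative choices rather than the exact settings of [10].

```python
import torch
import torch.nn.functional as F

def kl_soft(student_logits, target_logits, tau=1.0):
    return F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(target_logits / tau, dim=1),
                    reduction="batchmean")

def attention_map(feat):
    """Channel-averaged squared activations, flattened and l2-normalized per sample."""
    return F.normalize(feat.pow(2).mean(1).flatten(1), dim=1)

def data_free_skeptical_loss(ac_logits, final_logits, teacher_logits, tau=1.0,
                             student_feats=None, teacher_feats=None, gamma_at=250.0):
    # First term of Eq. 8.4: the shallow section Phi'_S learns from the teacher.
    loss = kl_soft(ac_logits, teacher_logits.detach(), tau)
    # Second term: auxiliary self-distillation trains the final classifier of
    # Phi_S from Phi'_S, since no CE loss is available without labels.
    loss = loss + kl_soft(final_logits, ac_logits.detach(), tau)
    # Attention-transfer term, used only under the grey-box assumption.
    if student_feats is not None and teacher_feats is not None:
        at = sum((attention_map(fs) - attention_map(ft.detach())).pow(2).sum(dim=1).mean()
                 for fs, ft in zip(student_feats, teacher_feats))
        loss = loss + gamma_at * at
    return loss
```

Dropping the last term (by passing no feature maps) corresponds to the black-box teacher assumption evaluated in Table 8.4.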
168 8.5.6 Transferability of Nastiness on Skeptical Students Similar to the transferability test on normal students (see Table 8.1), we also explored the transferability of a nasty teacher to a skeptical student. For this experiment, we use a skeptical student trained from a nasty teacher (ResNet50) as ateacherforasecondarystudentonCIFAR-100. Here,weusedanormalResNet18 as the secondary student model. Interestingly, Table 8.5 shows the performance of the secondary student improves by 1.67% compared to the baseline ResNet18, following the same trend as a student distilled from a normal teacher. From these results, we conclude that, a skeptical student not only reduces the nastiness of a teacher on its own performance, but also breaks the chain of transferability of nastiness to a secondary student. 8.6 Conclusions In this chapter we presented a skeptical student model that leverages a simple yet effective hybrid distillation strategy to diminish the effect of a nasty teacher and largelyretainitsclassificationperformance. Inparticular,ourexperimentalresults showed that, when distilling from a nasty teacher, the performance of skeptical students is up to ∼ 59.5% higher than that of normal students. Our models also retain a similar performance as the normal student when distilled from a normal teacher, showing the general efficacy of the proposed KD under both nasty and normal teacher scenarios. 169 Chapter 9 Generating Models for Client-Server Private Inference Framework: A Path Towards Security and Efficiency 9.1 Introduction With the recent proliferation of several AI-driven client-server applications includ- ing image analysis [224], object detection, speech recognition [225], and voice as- sistance services, the demand for machine learning inference as a service (MLaaS) has grown significantly. Simultaneously, the emergence of privacy concerns from boththeusersandmodeldevelopershasmade private inference (PI)animportant aspect of MLaaS. In PI the service provider retains the proprietary models in the cloudwheretheinferenceisperformedontheclient’sencrypteddata(ciphertexts), thus preserving both model privacy [88] and data-privacy [209]. Existing PI methods rely on various cryptographic protocols, including homo- morphic encryption (HE) [226,227] and additive secret sharing (ASS) [228] for the linear operations in the convolutional and fully connected (FC) layers. For exam- ple, popular methods like Gazelle [229], DELPHI [11], and Cheetah [230] use HE while MiniONN [231] and CryptoNAS [232] use SS. For performing the non-linear 170 Figure 9.1: Comparison of various methods in accuracy vs. #ReLU trade-off plot. SENet outperforms the existing approaches with an accuracy improvement of up to ∼ 4.5% for similar ReLU budget [6]. ReLUoperations,thePImethodsgenerallyuseYao’sGarbledCircuits(GC)[233]. However, GCs demands orders of magnitude higher latency and communication than the PI of linear operations, making latency efficient PI an exceedingly dif- ficult task. In contrast, standard inference latency is dominated by the linear operations [86] and is significantly lower than that of PI. This has motivated the unique problem of reducing the number of ReLU non- linearity operations to reduce the communication and latency overhead of PI. In particular,recentliteraturehasleveragedneuralarchitecturesearch(NAS)toopti- mizeboththenumberandplacementofReLUs[232,234]. However,thesemethods oftencostsignificantaccuracydrop,particularlywhentheReLUbudgetislow. 
For example,withaReLUbudgetof86k,CryptoNAScosts∼ 9%accuracycomparedto the model with all ReLUs (AR) present. To mitigate this issue DeepReDuce [235] used a careful multi-stage optimization and provided reduced accuracy drop of ∼ 3%atsimilarReLUbudgets. However, DeepReDuceheavilyreliesonmanualef- fort for precise removal of ReLU layers, making this strategy exceedingly difficult, particularly, for models with many layers. A portion of these accuracy drops can be attributed to the fact that these approaches are constrained to remove ReLUs at a higher granularity of layers and channels rather than at the pixel level. 171 Table 9.1: Comparison between existing approaches in yielding efficient models to perform PI. Note, SENet++ can yield a model that can be switched to sub models of reduced channel sizes. Name Method Reduced Granularity Reduce modelSupports dynamic used non-linearity dimension channel dropping Irregular pruning Various ✗ Scalar weight ✗ ✗ Structured pruning Various ✓ Channel, filter ✓ ✗ Sphynx [234] NAS ✓ Layer-block ✗ ✗ CryptoNAS [232] NAS ✓ Layer-block ✗ ✗ DELPHI [11] NAS + PA ✓ Layer-block ✗ ✗ SAFENet [236] NAS + PA ✓ Channel ✗ ✗ DeepReDuce [235] Manual + HE ✓ Layer-block ✗ ✗ SENet (ours) Automated ✓ Pixel ✗ ✗ SENet++ (ours) Automated ✓ Channel, pixel ✓ ✓ Our contributions. Our contribution is three-fold. We first empirically demonstrate the relation between a layer’s sensitivity towards pruning and its as- sociated input ReLU sensitivity. Based on our observations, we introduce SENet, an automated layer-wise ReLU sensitivity evaluation strategy and propose a three stage training process to yield secure and efficient networks for PI. In particular, for a given global ReLU budget we first determine a sensitivity-driven layer-wise non-linearity(ReLU)unitbudget. Giventhisbudget, wethenpresentalayer-wise ReLU allocation mask search. For each layer, we train a binary mask tensor the sizeofthelayer’sactivationmapforwhicha1or0signifiesthepresenceorabsence of a ReLU unit associated to each pixel location. Finally, we use the trained mask to create a partial ReLU (PR) model with ReLU present only at fixed parts of the non-linearity layers, and fine-tune it via distillation from an iso-architecture trained AR model. We further extend our approach to SENet++, allowing reduction of both the linear (MAC) and ReLU operations. SENet++ uses a single training loop to train amodelofdifferentchanneldropoutrates(DRs) d r (d r ≤ 1.0),whereeachd r yields asub-modelwithaMAC-ReLUbudgetsmallerthanorsameasthatoftheoriginal one. In particular, we leveragethe idea ofordered dropout [7] totrain a PRmodel with multiple dropout rates [7], where each dropout rate corresponds to a scaled channel sub model having number of channels per layer ∝ the d r . Additionally, 172 SENet++ enables an efficient PI-time trade-off between compute cost (typically dominated by the # ReLUs) and accuracy. Table 9.1 compares the important characteristics of our methods with existing alternatives. We conduct experiments on various models including variants of ResNet (ResNet18, ResNet34), Wide Residual Networks (WRN22-8), and VGG (VGG16) on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Experimental results showthat, comparedtotheexistingalternatives, SENetcanyieldanimprovedac- curacy of up to∼ 4.5% for similar ReLU budgets, evaluated on CIFAR-100. Also, forsimilaraccuracySENetrequiresupto∼ 129klessReLUs,potentiallymakingits PI models up to∼ 2.3× faster. 
SENet++ (d r =0.5) can further improve the MAC and ReLU cost of SENet, with additional saving of 4× and∼ 2× , respectively. 9.2 Preliminaries and Related Work Cryptographicprimitives. Webrieflydescribetherelevantcryptographicprim- itives in this section. Additive secret sharing. Given an element x, an ASS of x is the pair (⟨x⟩ 1 ,⟨x⟩ 2 )=(x− r,r), where r is a random element and x=⟨x⟩ 1 +⟨x⟩ 2 . Since r is random, the value x cannot be revealed by a single share, so that the value x is hidden. Homomorphic encryption. HE [227] is a public key encryption scheme that supports homomorphic operations on the ciphertexts. Given a public key pk and corresponding secret key sk, an encryption function E generates the ciphertext t of a plaintext m through t = E(m,pk), and a decryption function D obtains the plaintext m via m = D(t,sk). In PI, the results of linear operations can be obtained homomorphically through m 1 ◦ m 2 =D(t 1 ⋆t 2 ,sk), where◦ represents a linear operation, e.g., convolution, ⋆ is its corresponding homomorphic operation, t 1 and t 2 are the ciphertexts of m 1 and m 2 , respectively. 173 Garbled circuits. GC [233] allows two parties to jointly compute a Boolean function f over their private inputs without revealing their inputs to each other. TheBooleanfunctionf isrepresentedasaBooleancircuitC. Onepartyactingasa garblergeneratestheencodedBooleancircuit ˜ C andasetoflabelscorrespondingto theinputsthroughtheprocedureGarble(C). Thegarblersends ˜ C andthelabelsto the other party who acts as an evaluator. The evaluator obtains the output labels through the evaluation procedure Eval( ˜ C), and then sends the output labels back to the garbler. Finally, the garbler decrypts the labels to get the plain results and shares the results with the evaluator. Private inference. In this chapter, we focus on a client-server PI scenario, whereaclient,holdingprivatedata,wanttoperforminferenceonaserver,holding a private model, each without revealing their data and model, respectively. While mostofthepriorPImodeltargetedasemi-honestthreatmodelwherebothparties aresemi-honest[11,229,230,237],arecentworkMuse[238]enhancedthedefenseby defending against a malicious client. Specifically, the semi-honest parties strictly follow the protocol but try to reveal their collaborator’s private data by inspecting the information they received. On the other hand, a malicious client could deviate from the protocol. In defending against various threat models, popular PI methods [11,232,236] widely adopt HE or SS for linear operations, and GC for ReLU operations. Most of the existing cryptographic inference frameworks benefit from the offline-online topology, where the cryptographic primitives irrelevant to the client’s input are moved offline to shorten online latency [11,237,238]. Typically, the offline-online topology moves offline the circuit garbling procedure Garble(C) in GC and leaves the evaluation procedure Eval( ˜ C) online. For the linear operations, DELPHI [11] andMiniONN[231]movetheheavyprimitivesinHEandSStoofflineenablingfast linear operations during PI online stage. However, the compute heavy Eval( ˜ C) stage of GC keeps the ReLU cost for PI high even at the online stage. 174 ReLU reduction for efficient PI. Existing works use model designing with reduced ReLU counts via either search for efficient models [11,232,234,236] or manual re-design from an existing model [235]. In particular, DELPHI [11] uses layer-wise search and replacement of ReLUs with low cost quadratic polynomial approximation. 
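The additive-secret-sharing primitive recalled in Section 9.2 can be illustrated with a toy example: because convolutions and fully connected layers are linear, they distribute over the two additive shares, which is what lets the linear portion of PI be evaluated without either party seeing the plaintext input. The snippet below is only a plaintext-integer illustration of that algebra (no HE, garbled circuits, or fixed-point encoding), and the modulus and variable names are arbitrary.

```python
import numpy as np

P = 2**31 - 1                                  # toy prime modulus
rng = np.random.default_rng(0)

def share(x):
    """Additive secret sharing: (<x>_1, <x>_2) = (x - r, r) for random r."""
    r = rng.integers(0, P, size=x.shape)
    return (x - r) % P, r

x = rng.integers(0, 256, size=(4,))            # client-held "input"
W = rng.integers(0, 16, size=(3, 4))           # server-held "model" weights
x1, x2 = share(x)

# Each share is processed by the linear map independently; summing the partial
# results modulo P recovers W @ x without reconstructing x in one place.
y = ((W @ x1) % P + (W @ x2) % P) % P
assert np.array_equal(y, (W @ x) % P)
```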
SAFENet [236], on the other hand, enables more fine-grained channel-wise substitution and mixed-precision activation approximation. Cryp- toNAS [232] re-designs the neural architectures through evolutionary NAS tech- niques to minimize ReLU operations. Sphynx [234] further improves the search byleveragingdifferentiablemacro-searchNAS[239]inyieldingefficientPImodels. Finally, DeepReDuce [235] yields SOTA reduced ReLU models via a manual effort of finding and dropping redundant ReLU layers starting from an existing model. However, these approaches either suffer from lack of automated ReLU sensi- tivity evaluation incurring significant accuracy drop or require manual efforts to precisely evaluate the sensitivity, which can become exceedingly hard for deep models with many ReLU layers. 9.3 Motivational Study: Relation between ReLU importance and Pruning Sensitivity ExistingworktofindimportanceofaReLUlayer[235], requiresmanualeffortand is extremely time consuming. In contrast, earlier works [68,84] leveraged various metrics to efficiently identify a layer’s sensitivity towards a target pruning ratio. In particular, a layer’s pruning sensitivity can be quantitatively defined as the accuracy reduction caused by pruning a certain ratio of parameters from it [65]. However, instead of iteratively computing the sensitivity sequentially, earlier liter- atureleveragedsparselearning[65,84]andusedatrainedsparsemodeltoevaluate thesensitivityofalayer l (η θ l)astheratio total # of non-zero layer parameters total # layer parameters . We 175 Figure 9.2: Layer-wise pruning sensitivity (d=0.1) vs. normalized ReLU importance. The later layers are less sensitive to pruning, thus, can afford significantly more zero- valued weights as opposed to the earlier ones. On the contrary, later ReLU stages generally have more importance. hypothesize that there maybe a correlation between a layer’s pruning sensitiv- ity [84] and the importance of ReLU and have run to experiments to explore this approach. Let us assume an L-layer DNN model Φ parameterized by Θ ∈R m that learns a function f Φ , where m represents the total number of model parameters. The goal of neural network pruning is to identify and remove the unimportant pa- rameters from a DNN and yield a reduced parameter model that has comparable performance to the baseline unpruned model. As part of the pruning process for a givenparameterdensityd,eachparameterisassociatedwithanauxiliaryindicator variable c belonging to a mask tensor c∈{0,1} m such that only those θ remain non-zero whose corresponding c = 1. With these notations, we can formulate the training optimization as minL(f Φ (Θ ⊙ c)), s.t. ||c|| 0 ≤ d× m (9.1) WhereL(.) represents the loss function. In general, for image classification tasks, we define L(.) to be the cross-entropy (CE) loss. Weusedasparselearningframework[84]totrainaResNet18onCIFAR-100for a target d=0.1 and computed the pruning sensitivity of each layer. In particular, as shown in Fig. 9.2, earlier layers have higher pruning sensitivity than later ones. This means that to achieve close to baseline performance, the model trains later layers’ parameters towards zero more than those of earlier layers. 176 We then compared this trend with that of the importance of different ReLU layers as defined in [235]. In particular, for each basic block stage, we created a different ResNet18 keeping ReLUs associated with that stage while replacing all other ReLUs with identity layers, trained the network, and measured the resulting test accuracy. 
We performed the same operation to evaluate the accuracy with ReLU only after the first CONV layer. The ReLU stages of higher importance are thosewhichyieldedthehighestaccuracy. Wethennormalizedtheimportanceofa ReLU stage with accuracy Acc as the ratio (Acc− Acc min )/(Acc max − Acc min ). Here Acc max and Acc min corresponds to the accuracy of models with all and no ReLUs, respectively. As depicted in Figure 9.2, the results show that the ReLU importance and pruning sensitivity of a layer are inversely correlated. This inverse correlation may imply that a pruned layer can afford to have more zero-valued weights when the associated ReLU layer forces most of the computed activation values to zero. 9.4 SENet Training Methodology As highlighted earlier, for large number of ReLU layers L r , the manual evaluation andanalysisofthecandidatearchitecturesbecomeinefficientandtimeconsuming. Moreover, the manual assignment of ReLU at the pixel level, becomes even more intractablebecausethenumberofpixelsthatmustbeconsidered,explode. Tothat end,wenowpresentSENet,athree-stageautomatedReLUtrimmingstrategythat can yield models for a given reduced ReLU budget. 177 Algorithm 6: Layer-wise #ReLU Allocation Algorithm Data: Global ReLU budget r, model parameters Θ , model parameter proxy density d, 1 number of ReLU layers L r , active ReLU indicator a∈{1} Lr 2 Output: Per-layer # ReLU count. 3 η α ← evalActSens(Θ ,d) 4 for l← 0 to L r do 5 η α l ← η α l P L i=0 η α i× a i 6 end 7 initVals(r remain ,r total ,r final ) 8 while r total <r do 9 for l← 0 to L do 10 r l cur ← assignReluProportion(r remain ,η α l,a) 11 r l final ,r total ← assignUpdateRelu(r l final ,r l cur ,r total ) 12 end 13 end 14 r remove ← r total − r 15 while r remove >0 do 16 for l← 0 to L do 17 r l cur ← removeReluProportion(r del ,η α l,a) 18 r l final ,r remove ← removeUpdateRelu(r l final ,r l cur ,r remove ) 19 end 20 end 21 return r final 9.4.1 Sensitivity Analysis Inspired by our observations in Section 9.3, we define the ReLU sensitivity of a layer l as η α l =(1− η θ l) (9.2) It is important to emphasize that unlike ReLU importance, ReLU sensitivity does notrequiretrainingmanycandidatemodels. However,η θ l canonlybeevaluatedfor a specific d. We empirically observe that d>0.3 tends to yield uniform sensitivity across layers due to a large parameter budget. In contrast, ultra low density d < 0.1, costs non-negligible accuracy drops [3,85]. Based on these observations, we propose to quantify ReLU sensitivity with a proxy density of d=0.1. 178 Moreover, to avoid the compute heavy pruning process, we leverage the idea of sensitivity evaluation before training [68]. In particular, on a sampled mini batch from training data D, the sensitivity of the j th connection with associated indication variable and vector as c j ande j , can be evaluated as, ∆ L j (f Φ (Θ ;D))=g j (f Φ (Θ ;D))= ∂L(f Φ (c⊙ Θ ;D)) ∂c j c=1 (9.3) = lim δ →0 L(f Φ (c⊙ Θ ;D))−L (f Φ ((c− δ e j )⊙ Θ ;D)) δ c=1 The ∂L ∂c j is an infinitesimal version of ∆ L j that can be efficiently computed using one forward pass for all j at once. We normalize the connection sensitivities, rank them, and identify the top d-fraction of connections. We then define the layer sensitivity η Θ l as the fraction of connections of each layer that are in the top d- faction. For a given global ReLU budget r, we then assign the # ReLU for each layer proportional to its normalized ReLU sensitivity. The details are shown in Algorithm 6 and highlighted in Fig. 9.3 as point 1 ○. 
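A minimal sketch of the sensitivity-driven budgeting of Eqs. 9.2–9.3 and the proportional step of Algorithm 6 is shown below. It assumes a model in which each ReLU layer follows exactly one Conv2d/Linear layer, uses the single-minibatch connection saliency |θ_j ∂L/∂θ_j| (which equals |∂L/∂c_j| at c = 1) for Eq. 9.3, and omits Algorithm 6's iterative re-assignment of leftover or excess ReLUs; function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def relu_sensitivity(model, images, labels, proxy_density=0.1):
    """Per-layer ReLU sensitivity eta_alpha = 1 - eta_theta (Eq. 9.2), where
    eta_theta is the fraction of a layer's weights whose connection saliency
    (Eq. 9.3) falls in the global top-d set for proxy density d."""
    for p in model.parameters():
        p.grad = None
    F.cross_entropy(model(images), labels).backward()
    layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    scores = [(m.weight.grad.detach() * m.weight.detach()).abs().flatten()
              for m in layers]
    all_scores = torch.cat(scores)
    k = max(1, int(proxy_density * all_scores.numel()))
    threshold = torch.topk(all_scores, k).values.min()
    eta_theta = torch.tensor([(s >= threshold).float().mean().item() for s in scores])
    return 1.0 - eta_theta

def allocate_relus(eta_alpha, relu_map_sizes, global_budget):
    """Proportional #ReLU assignment (without Algorithm 6's re-balancing loop)."""
    frac = eta_alpha / eta_alpha.sum()
    alloc = (frac * global_budget).round().long()
    return torch.minimum(alloc, torch.tensor(relu_map_sizes, dtype=torch.long))
```

In practice, any layer whose proportional share exceeds its activation-map size would have the excess redistributed to the remaining active layers, which is what the while-loops of Algorithm 6 handle.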
9.4.2 ReLU Mask Identification After layer-wise #ReLU allocation, we next identify the ReLU locations in each layer’s activation map.In particular, for a non-linear layer l, we assign a mask tensor M l ∈{0,1} h l × w l × c l , where h l ,w l , and c l represents the height, width, and the number of channels in the activation map. For a layer l, we initialize M with r l final assigned 1’s with random locations. Then we perform a distillation-based training of the PR model performing ReLU ops only at the locations of the masks with 1, while distilling knowledge from an AR model of the same architecture (see Fig. 9.3, point 2 ○). At the end of each epoch, for each layer l, we rank the top-r l final locations based on the highest absolute difference between the PR and AR model’s post ReLU activation output (averaged over all the mini-batches) for that layer, and update the M l with 1’s at these locations. This, on average, 179 de-emphasizes the locations where the post ReLU activations in both the PR and AR models are positive. We finalize the mask once the ReLU mask 1 evaluation reaches the maximum mask training epochs or when the normalized hamming distance between masks generated after two consecutive epochs is below a certain pre-defined ϵ value. 9.4.3 Maximizing Activation Similarity via Distillation Oncethemaskforeachlayerisfrozen,westartourfinaltrainingphaseinwhichwe maximize the similarity between activation functions of our PR and AR models, see Fig. 9.3, point 3 ○. In particular, we initialize a PR model with the weights of a trained AR and with its ReLU mask frozen to the final mask of stage 2. To train the PR model, we then add a KL-divergence loss [76] to the original CE- loss enabling distillation from the AR model. Moreover, we introduce an AR-PR post-ReLU activation mismatch (PRAM) penalty into the loss function. This loss drives the PR model to have activation maps that are similar to that of the AR model. More formally, let Ψ m pr and Ψ m ar represent the m th pair of vectorized post-ReLU activation maps of same layer for Φ pr and Φ ar , respectively. Our loss function for the fine-tuning phase is given as L=(1− λ )L pr (y,y pr ) | {z } CE loss +λ L KL σ z ar ρ ,σ z pr ρ | {z } KL-div. loss + β 2 X m∈I Ψ m pr ∥Ψ m pr ∥ 2 − Ψ m ar ∥Ψ m ar ∥ 2 2 | {z } PRAM loss (9.4) where σ represents the softmax function with ρ being its temperature. λ balances the importance between the CE and KL divergence loss components, and β is the 1 The identified mask tensor has non-zeros irregularly placed. This can be easily extended to generation of structured mask, by allowing the assignment and removal of mask values at the granularity of channels instead of activation scalar [84]. 180 Figure 9.3: Different stages of the proposed training methodology for efficient private inference that can support dynamic channel reduction. For example, the model here supports two channel SFs, S 1 and S 2 . Note, similar to [7], for each SF support we use a separate batch-normalization (BN) layer to maintain a separate statistics. weight for the PRAM loss. Similar to [135], we use the l 2 -norm of the normalized activation maps to compute this loss. 9.4.4 SENet++: Support for Ordered Channel Dropping To yield further compute-communication benefits, we now present an extension of SENet, namely SENet++, that can perform the ReLU reduction while also supporting inference with reduced model sizes. 
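Before detailing the channel-dropping extension, the two ingredients introduced above can be summarized in code: a partial-ReLU layer driven by a frozen binary mask M^l (Section 9.4.2) and the fine-tuning objective of Eq. 9.4 with its PRAM term. This is a minimal sketch; the module and function names, the detached AR targets, the default temperature ρ, and the per-sample normalization of the PRAM term follow one reasonable reading of Eq. 9.4 rather than a released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialReLU(nn.Module):
    """Apply ReLU only where mask == 1; pass values through (identity) elsewhere."""
    def __init__(self, mask):
        super().__init__()
        self.register_buffer("mask", mask.float())      # shape: (C, H, W)

    def forward(self, x):
        return self.mask * F.relu(x) + (1.0 - self.mask) * x

def pram_loss(acts_pr, acts_ar):
    """Mismatch between l2-normalized, vectorized post-ReLU maps of PR and AR models."""
    loss = 0.0
    for a_pr, a_ar in zip(acts_pr, acts_ar):
        v_pr = F.normalize(a_pr.flatten(1), dim=1)
        v_ar = F.normalize(a_ar.detach().flatten(1), dim=1)
        loss = loss + (v_pr - v_ar).pow(2).sum(dim=1).mean()
    return loss

def senet_finetune_loss(pr_logits, ar_logits, labels, acts_pr, acts_ar,
                        lam=0.5, rho=4.0, beta=1000.0):
    ce = F.cross_entropy(pr_logits, labels)
    kl = F.kl_div(F.log_softmax(pr_logits / rho, dim=1),
                  F.softmax(ar_logits.detach() / rho, dim=1),
                  reduction="batchmean")
    return (1 - lam) * ce + lam * kl + (beta / 2) * pram_loss(acts_pr, acts_ar)
```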
In particular, we leverage the idea of ordered dropout (OD) [7] to simultaneously train multiple sub models with different fractions of channels. The OD method is parameterized by a candidate dropout setD r with dropout rate values d r ∈(0,1]. At a selected d r for any layer l, the model uses a d r -sub model with only the channels with indices{0,1,...,⌈d r · C l ⌉− 1} active, effectively pruning the remaining {⌈d r · C l ⌉,...,C l − 1} channels. Hence, during training, the selection of a d r -sub model with d r < 1.0 ∈ D r , is a form of channel pruning, while d r =1.0 trains the full model. For each minibatch of data, we perform a forward pass once for each value of d r in D r , accumulating the loss. We then perform a backward pass in which the model parameters are updated based on the gradients computed on the accu- mulated loss. We first train an AR model with a dropout set D r . For the ReLU budgetevaluation,weconsideronlythemodelwithd r =1.0,andfinalizethemask by following the methods in Sections 9.4.1 and 9.4.2. During the maximizing of 181 activation similarity stage, we fine tune the PR model supporting the same set D r as that of the AR model. In particular, the loss function for the fine tuning is the same as 9.4, for d r = 1.0. For d r < 1.0, we exclude the PRAM loss because we empirically observed that adding the PRAM loss for each sub model on average does not improve accuracy. During inference, SENet++ yielded models can be dynamically switched to support reduced channel widths, allowing both ReLUs and MACs to reduce compared to the baseline full model. 9.5 Experiments Table 9.2: Runtime and communication costs of linear and ReLU operations for 15-bit fixed-point model parameters/inputs and 31-bit ReLU operation [11]. Operation Offline Online GC size (KB) Runtime (µs ) Comm. cost (KB) Runtime (µs ) Comm. cost (KB) Linear* 32.6 0.095 0.248 0.000563 - ReLU 154.9 17.5 85.3 2.048 17.5 *Correspond to 1 multiplication and accumulation. The values are calculated by averaging the runtime and communication costs reported in DELPHI’s Table 1 [11]. 9.5.1 Experimental Setup Models and Datasets. To evaluate the efficacy of the SENet yielded models, we performed extensive experiments on three popular datasets, CIFAR-10, CIFAR- 100 [104], and Tiny-ImageNet [105] with three different model variants, namely ResNet (ResNet18, ResNet34) [15], wide residual network (WRN22-8) [186], and VGG (VGG16) [13]. We used PyTorch API to define and train our models on an Nvidia RTX 2080 Ti GPU. Training Hyperparameters. We used standard data augmentation tech- niques (horizontal flip and random crop with reflective padding) and the SGD optimizer for all training. We trained the baseline all-ReLU model for 240 epochs and 120 epochs for CIFAR and Tiny-ImageNet, respectively, with a starting learn- ing rate (LR) of 0.05 that decays by a factor of 0.1 at the 62.5%, 75%, and 87.5% 182 Table 9.3: Performance of SENet and other methods on various datasets and models. min≤ r≤ max Model Baseline #ReLU (k) Method Test Acc%/ Comm. 
Acc% Acc% #1k ReLU Savings Dataset: CIFAR-10 VGG16 93.8 12.5 91.6 7.33 23.6× 49.2 SENet(ours) 93.16 1.89 6.0× 0≤ r≤ 100k ResNet18 95.2 49.1 93.60 1.9 11.3× 82 93.75 1.14 6.8× VGG16 93.8 36.8 DeepReDuce [235] 88.9 3.32 – VGG16 93.8 126 SENet(ours) 93.42 0.74 2.3× ResNet18 95.2 150 95.02 0.63 3.7× 100k≤ r≤ 500k VGG16 93.8 126 DeepReDuce [235] 92.5 0.73 2.3× VGG16 93.8 126 SAFENet [236] 88.9 0.7 2.3× Custom Net 95.0 100 CryptoNAS [232] 92.18 0.92 – 500 94.41 0.19 – Dataset: CIFAR-100 ResNet18 78.05 25.6 70 2.73 21.8× 49.6 74.12 1.89 11.2× 100 SENet(ours) 77.18 0.77 5.6× 0≤ r≤ 100k ResNet34 78.42 50.1 75.1 1.5 19.3× 80 75.55 0.94 12.1× ResNet18 78.05 28.7 DeepReDuce [235] 68.6 2.39 19.4× 49.2 69.5 1.41 11.3× Custom Net 74.93 51 Sphynx [234] 69.57 1.36 – ResNet18 78.05 150 78.32 0.52 3.7× ResNet34 78.425 200 SENet(ours) 78.8 0.4 4.8× WRN22-8 80.82 240 79.81 0.33 5.8× 300 80.26 0.27 4.6× 100k≤ r≤ 500k ResNet18 78.05 229.4 DeepReDuce [235] 76.22 0.33 2.4× Custom Net 74.93 102 Sphynx [234] 72.9 0.714 – 230 74.93 0.32 – Custom Net 79.07 100 CryptoNAS [232] 68.67 0.69 – 500 77.69 0.16 – training epochs completion points. For all the training we used an weight decay coefficient of 5 × 10 − 4 . For a target ReLU budget, we performed the mask eval- uation for 150 epochs with the ϵ set to 0.05, meaning the training prematurely terminates when less than 5% of the total #ReLU masks change their positions. Finally, we performed the post-ReLU activation similarity improvement for 100 and 80 epochs, for CIFAR and Tiny-ImageNet, respectively. Notably, for this fine tuningstage,weinitializethestudentPRmodelweightsinitializedwiththatofthe AR model that acts as the teacher. Also, unless stated otherwise, we use λ =0.5, and β = 1000 for the loss described in Eq. 9.4. Further details of our training hyper-parameter choices are provided in the Supplementary materials. 183 9.5.2 SENet Results As shown in Table 9.3, SENet yields models that have higher accuracy than ex- isting alternatives by a significant margin while often requiring fewer ReLUs. For example, at a small ReLU budget of≤ 100k, Table 9.4: Performance of SENet and DeepReDuce on Tiny-ImageNet. Model Baseline#ReLU Method Test Acc%/ Comm. Acc% (k) Acc% #1k ReLUSavings 142 SENet 58.9 0.414 × 15.7 ResNet18 66.1 298 64.96 0.218 × 7.5 393 DeepReDuce [235] 61.65 0.157 × 5.7 917 64.66 0.071 × 2.4 our models yield up to 4.85% and 7.64% higher accuracy, on CIFAR-10 and CIFAR-100, respectively. At a ReLU budget of ≤ 500k, our improvement is up to 0.61% and 2.57%, respectively, on the two datasets. We further evaluate the communication saving due to the non-linearity reduction by taking the per ReLU communication cost mentioned in Table 9.2. In particular, the communication saving reported in the 8 th column of Table 9.3 is computed as the ratio of com- munication costs associated with an AR model to that of the corresponding PR model with reduced ReLUs. We did not report any saving for the custom models, as they do not have any corresponding AR baseline model. On Tiny-ImageNet, SENet models can provide up to 0.3% higher performance while requiring 3.08× fewer ReLUs (Table 9.4). 9.5.3 SENet++ Results ForSENet++,weperformedexperimentswithD r =[0.5,1.0],meaningeachtrain- ing loop can yield models with two different channel dropout rates. The 0 .5-sub model enjoys a ∼ 4× MACs reduction compared to the full model. Moreover, as showninFig. 9.4,the0.5-submodelalsorequiressignificantlyless#ReLUsdueto reduced model size. 
In particular, the smaller models have #ReLUs reduced by a factor of 2.05× , 2.08× , and 1.88× on CIFAR-10, CIFAR-100, and Tiny-ImageNet, 184 (a) (b) (c) Figure 9.4: Performance of SENet++ on three datasets for various #ReLU budgets. The points labelled A, B, C, D corresponds to experiments of different target #ReLUs for the full model (d r = 1.0). For SENet++, note that a single training loop yields two points with the same label corresponding to the two different drop out rates respectively, comparedtothePRfullmodels, averagedoverfourexperimentswith different ReLU budgets for each dataset. Lastly, the similar performance of the SENet and SENet++ models at d r = 1.0 with similar ReLU budgets, clearly de- picts the ability of SENet++ to yield multiple sub models without sacrificing any accuracy for the full model. 9.5.4 Analysis of Linear and ReLU Inference Latency Table 9.2 shows the GC-based online ReLU operation latency is ∼ 343× higher than one linear operation (multiply and accumulate), making the ReLU opera- tion latency the dominant latency component. Inspired by this observation, we quantify the online PI latency as that of the N ReLU operations for a model with ReLU budget of N. In particular, based on this evaluation, Fig. 9.5(a) shows the superiorityofSENet++ofupto∼ 9.6× (∼ 1.92× )reducedonlineReLUlatencyon CIFAR-10 (CIFAR-100). With negligibly less accuracy this latency improvement canbeupto∼ 21× . Furthermore, asford r <1.0, SENet++requiresfewerMACs, the linear operation latency can also be reduced significantly as demonstrated in Fig. 9.5(b). 185 (a) (b) Figure 9.5: Performance comparison of SENet++ (with d r =1.0 and 0.5) vs. existing alternatives(a)withVGG16andResNet18intermsofReLUlatency. ThelabelsA,B,C, D correspond to experiments of different target #ReLUs for the full model ( d r = 1.0). For SENet++, note that a single training loop yields two points with the same label correspondingtothetwodifferentdropoutrates. (b)ComparisonbetweenDeepReDuce and SENet++ for a target # ReLU budget of∼ 50k with ResNet18 on CIFAR-100. 9.5.5 Ablation Studies Importance of ReLU sensitivity. For a given ReLU budget, to understand the importance of layer-wise ReLU sensitivity evaluations, we conducted experiments with evenly allocated ReLUs. Specifically, for ResNet18, for a ReLU budget of 25% as that of the original model, we randomly removed 75% ReLUs from each PR layer with identity elements to create the ReLU mask, and trained the PR model with this mask. We further trained two other PR ResNet18 with simi- lar and lower # ReLU budgets with the per-layer ReLUs assigned following the proposed sensitivity. As shown in Table 9.5, the sensitivity driven PR models can yield significantly improved performance of ∼ 5.76% for similar ReLU budget, demonstrating the importance of proposed ReLU sensitivity. Table 9.5: Importance of ReLU sensitivity. Model Baseline #ReLU ReLU Test Acc%/ Comm. Acc% (k) Sensitivity Acc% #1k ReLU Savings 139.2 ✗ 70.12 0.503 × 4 ResNet18 78.05 135 ✓ 75.88 0.56 × 4.12 70.4 ✓ 73.03 1.03 × 7.9 Choice of the hyperparameter λ and β . To determine the influence of the AR teacher’s influence on the PR model’s learning, we conducted the final stage distillation with λ ∈ [0.1,0.3,0.5,0.7,0.9] and β ∈ [100,300,500,700,1000]. As 186 (a) (b) Figure 9.6: Ablation studies with different (a) λ and (b) β values for the loss term in Eq. 9.4. shown in Fig. 9.6, the performance of the student PR model improves with the increasinginfluenceoftheteacherbothintermsofhigh λ 2 andβ values. 
However, we also observe, the performance improvement tend to saturate at β ≈ 1000. 9.6 Conclusions In this chapter, we introduced the notion of ReLU sensitivity for non-linear layers of a DNN model. Based on this notion, we present an automated ReLU allocation andtrainingalgorithmformodelswithlimitedReLUbudgetstargetinglatencyand communication efficient PI. The resulting networks can achieve similar to SOTA accuracy while significantly reducing the # ReLUs by up to 9 .6× on CIFAR-10, enabling dramatic reduction of the latency and communication costs of PI. 2 Though higher λ generally yields improved performance, we presented results with λ = 0.5, to match hyperparameter settings of [235]. 187 Chapter 10 Conclusions 10.1 Summary In this thesis, we present multiple avenues towards yielding efficient deep neural networks. We take a holistic approach in the sense that we initially design al- gorithms and frameworks to design energy-efficient neural networks both in the conventionalANNsandbrain-inspiredSNNs. Wethentakeadeepdiveintoinves- tigatingthemodelrobustnessoftheenergy-efficientmodelsanddevelopalgorithms andproposeconditionallytrainablemodelsthatcanimprovethemodelrobustness yet maintain reduced complexity through compression. We finally develop novel training framework of distillation to not only improve the performance of distilled models, but also show its efficacy in evading the model privacy under even “undis- tillable” scenarios. Our novel set of algorithms are deeply routed in fundamental understand- ing of the neural network training and functionality of each module of a DNN model. Moreover, our sensitivity-driven compression methods to yield energy- efficient models can also help as guiding tool for better design space exploration for novel DNN architectures in yielding robustness and accuracy. Finally our analysis of model IP vulnerability through distillation opens up a new avenue of challenges and opportunities in maintaining the privacy of model performance to protect machine learning as a service (MLaaS) business model. As a potential 188 solution to this, we encourage the reader towards a privacy preserving inference service primarily via latency and compute efficient PI models [6]. 10.2 Closing Remarks Withthepushtowardsglobaltechnologicaladvancements,wehaveestablishedthe foundationsforharnessing,processingandleveragingbigdataacrossnumerousas- pects of our personal and professional lives. Machine learning and, in particular, deep learning methods are powerful techniques that are increasingly becoming key players of innovation to turn these enormous amounts of data into consequential insights and meaning information and so far we have only scratched the surface of what may be achievable in the future. So far we have seen that, in many cases the algorithmicdevelopmentsarelargelyinfluencedbythehardwarecapabilities. Thus toexploreandunleashthetruecapabilitiesAI,weneedtobringalgorithmicdevel- opment and architectural innovation in closer proximity to yield energy-efficient, robust and sustainable solutions. We believe this thesis to act as a foundational effort to contribute an impactful and valuable set of techniques towards realizing this vision. 10.3 Funding ThisresearchwasfundedinpartbytheUnitedStatesNationalScienceFoundation grant number #1763747, DARPA grant number #HR00112190120, University of Southern California Annenberg Fellowship. This research was further supported in part by Intel AI Labs. 189 Bibliography [1] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. 
Abstract
The super-linear growth in deep learning model size, coupled with the slow-down of Moore’s law, has made deploying these models on resource-constrained devices exceedingly challenging. Compressing these large models has become a critical step in meeting the energy, memory, and I/O bandwidth constraints imposed by the device itself. However, training that yields efficient neural network models can be expensive in terms of compute complexity and the associated environmental impact. Moreover, these energy-efficient models must simultaneously address the increasingly important aspects of robustness and model privacy, particularly for safety-critical applications such as autonomous driving, healthcare, and military-grade robotics. This thesis presents low-complexity training methods that yield energy-efficient models, discloses methods that improve model robustness with reduced computational cost, and identifies some key challenges associated with achieving model privacy. Specifically, in Part I, we introduce efficient non-iterative pruning and quantization schemes that generate compressed models with a negligible drop in inference accuracy. In Part II, we investigate the limitations of post-training compression for robust model generation under adversarial attacks and develop sparse-learning algorithms to train robust yet compressed models. We then present a training method that conditionally trains a novel class of models that simultaneously yield state-of-the-art performance on both clean and adversarial images. Finally, Part III of this thesis analyzes vulnerabilities and methods for protecting model privacy. More precisely, we first introduce the notion of a “skeptical student” that, using a novel hybrid distillation, can circumvent the model IP protection offered by so-called “undistillable” models under both data-available and data-free scenarios. We further develop training methodologies that yield efficient models while protecting model IP via a client-server secure private inference framework. This thesis thus highlights the importance and potential benefits of expanding state-of-the-art training methods to not only consider model accuracy, but also target energy efficiency, robustness, and privacy.
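To make the combination of compression and adversarial robustness summarized above concrete, the minimal PyTorch sketch below pairs a generic one-shot magnitude-pruning mask with standard PGD adversarial training. It is only an illustration of how the two objectives interact, not the dissertation’s DNR, conditional-learning, or any other proposed algorithm; the helper functions and every hyperparameter here (sparsity, eps, alpha, steps) are assumptions of this sketch.

# Illustrative sketch only -- NOT the dissertation's algorithm. It combines a generic
# one-shot magnitude-pruning mask with standard PGD adversarial training; every
# hyperparameter below (sparsity, eps, alpha, steps) is an arbitrary placeholder.
import torch
import torch.nn.functional as F

def magnitude_masks(model, sparsity=0.9):
    """Build binary masks that zero the smallest-magnitude weights of each weight tensor."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                      # skip biases and BatchNorm parameters
            k = max(1, int(sparsity * p.numel()))
            thresh = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() > thresh).float()
    return masks

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Generate L-infinity PGD adversarial examples for inputs normalized to [0, 1]."""
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-eps, eps), 0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(x + torch.clamp(x_adv - x, -eps, eps), 0.0, 1.0)
    return x_adv.detach()

def robust_sparse_step(model, optimizer, masks, x, y):
    """One training step on adversarial inputs, re-applying the fixed sparsity masks."""
    model.train()
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                    # keep pruned weights exactly at zero
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
    return loss.item()

In this sketch the masks would be built once before the training loop (masks = magnitude_masks(model)) and reused every step. The thesis itself argues that such post-training-style masking is limited for robustness and instead learns the sparse structure during robust training; the sketch is meant only to make the two objectives concrete.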
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Enhancing privacy, security, and efficiency in federated learning: theoretical advances and algorithmic developments
Striking the balance: optimizing privacy, utility, and complexity in private machine learning
Taming heterogeneity, the ubiquitous beast in cloud computing and decentralized learning
Practice-inspired trust models and mechanisms for differential privacy
Efficient learning: exploring computational and data-driven techniques for efficient training of deep learning models
Efficiency in privacy-preserving computation via domain knowledge
Ultra-low-latency deep neural network inference through custom combinational logic
Modeling and optimization of energy-efficient and delay-constrained video sharing servers
Efficient data collection in wireless sensor networks: modeling and algorithms
Coding centric approaches for efficient, scalable, and privacy-preserving machine learning in large-scale distributed systems
Models and algorithms for energy efficient wireless sensor networks
Scalable optimization for trustworthy AI: robust and fair machine learning
Compiler and runtime support for hybrid arithmetic and logic processing of neural networks
Clocking solutions for SFQ circuits
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
A variation aware resilient framework for post-silicon delay validation of high performance circuits
Deep learning for subsurface characterization and forecasting
Edge-cloud collaboration for enhanced artificial intelligence
Controlling information in neural networks for fairness and privacy
Neural representation learning for robust and fair speaker recognition
Asset Metadata
Creator
Kundu, Souvik (author)
Core Title
Algorithms and frameworks for generating neural network models addressing energy-efficiency, robustness, and privacy
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2022-08
Publication Date
12/23/2022
Defense Date
04/06/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
adversarial attacks, cryptographic inference, DNN pruning, DNN quantization, efficient deep neural networks, efficient spiking neural networks, knowledge distillation, model privacy, model robustness, OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Beerel, Peter A. (committee chair), Pedram, Massoud (committee chair), Avestimehr, Salman (committee member), Golubchik, Leana (committee member)
Creator Email
ksouvik52@gmail.com, souvikku@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111345526
Unique identifier
UC111345526
Legacy Identifier
etd-KunduSouvi-10790
Document Type
Dissertation
Rights
Kundu, Souvik
Type
texts
Source
20220623-usctheses-batch949 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu