TOWARDS EFFICIENT EDGE INTELLIGENCE WITH IN-SENSOR AND NEUROMORPHIC COMPUTING: ALGORITHM-HARDWARE CO-DESIGN

by

Gourav Datta

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2023

Copyright 2023 Gourav Datta

Dedication

Dedicated to my mother, my late father, and Ipsita

Acknowledgements

Completing a PhD has been a rewarding and adventurous experience, and I sincerely owe it all to the people who supported me and made it possible. Firstly, I would like to give my heartfelt thanks to my advisor, Prof. Peter A. Beerel, for being a great mentor, always believing in me, and setting me up for success. He taught me how to write papers and proposals, give talks, and, foremost, how to conduct research. A remarkable quality in Peter is his dedication and commitment towards his students, which I have truly reaped the benefits of during my tenure here by engaging in regular technical and non-technical discussions. Overall, I will forever be indebted to Peter for his immense support and guidance, which has brought me here today.

I would like to extend my gratitude to Prof. Massoud Pedram, Prof. Joshua Yang, and Prof. Aiichiro Nakano for their support and encouragement as part of my dissertation committee. Thank you to all of my teachers at USC – Prof. Pierluigi Nuzzo, Prof. Xiang Ren, Prof. Xuehai Qian (now at Purdue University), Prof. Joshua Yang, Prof. Michael Neely, Dr. Moe Tabar, Prof. Gandhi Puvvada, and Prof. Victor Adamchik – for providing me with many resources to learn various topics throughout the entire computing stack, from devices and circuits to architectures and algorithms. Thanks to all the professors and senior researchers I had the opportunity to collaborate with – Prof. Akhilesh R. Jaiswal, Prof. Massoud Pedram, Prof. Rehan Kapadia, Prof. Joshua Yang, Prof. Wael Abd-Almageed, Dr. Ajey Jacob, Dr. Andrew Schmidt, Joe Mathai, Dr. John Paul Walters, and Prof. Nitin Chandrachoodan (IIT Madras). Thanks for your fruitful collaborations and discussions in the past few years.

I am thoroughly grateful to all the student members of E2S2C for creating such an amazing work environment. None of my publications would have been possible without all of your collaborative efforts. It has been my pleasure to work with Dr. Souvik Kundu, Dr. Dake Chen, Dr. Huimei Cheng, Yuke Zhang, Yue Hu, Robert Aviles, Sreetama Sarkar, and Dorothy Qiu. I am thankful to all of you for spending time with me both inside and outside our offices and labs throughout my PhD. I would also like to thank the student members of other research groups, including Dr. Bo Zhang, Mingye Li, Dr. Haolin Cong, Md Kaiser, and Zihan Yin, who collaborated with me and helped me achieve specific research goals. I am also very grateful to the wonderful cohort of Masters directed research students I had the pleasure of mentoring. Thanks to Zeyu Liu, Fang Chen, Zixu Wang, Mulin Tian, Shunlin Lu, Shidie Lin, Adesh Singh Sudheer, and Pranav Hindupur Srinivas for being instrumental to the culmination of several of my papers. During my PhD, I also had the unique opportunity of working with a few extremely talented Bachelors' students, including Shashank Nag, Haoqin Deng, and Riad Alasadi.
I would like to thank the agencies that have funded my research and helped to pay the bills – National Science Foundation Software and Hardware Foundations (NSF SHF) Grant #1763747, Defense Advanced Research Projects Agency (DARPA) In-Pixel Intelligent Processing (IP2) Artificial Intelligence Exploration (AIE) Opportunity contract number #HR00112190120, Samsung, Amazon Inc., and the University of Southern California Information Sciences Institute (USC ISI). I would also like to thank the USC graduate school for selecting me as one of the Annenberg fellows during my Ph.D. program and providing travel support to attend the VLSI-SoC'22 conference. I am also indebted to Diane Demetras and Annie Yu for their help in administrative matters related to the progress of my Ph.D. and the presentation of my research to the outside world.

The best part of my PhD was the friendship with my peers at USC, who made tough days bearable and good days even better. I would like to mention Souvik, Subhajit, Chandani, Arnab, Matt, Anik, and Sreetama for being there for me personally and professionally. Heartfelt thanks to my mother, Soma, who has been the torch bearer of light and always stayed beside me, supported me, motivated me, inspired me, and strengthened me throughout this journey. I had lost my father just a few months before arriving in the United States for my PhD. My mother constantly supported me (and my decision to start my PhD) and helped me navigate the tumultuous waters, even in her period of extreme grief. I would also like to thank my partner, Ipsita, who has had a great influence on both my professional and personal life in the short span of time she has been with me. Nothing would have been possible without them.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Related Publications with links
Chapter 1: Introduction
1.1 Neural Network Basics
1.1.1 Deep Neural Networks
1.1.2 Spiking Neural Networks
1.2 Efficient Inference with SNNs
1.3 Efficient Inference with In-Pixel Computing
1.4 Dissertation Contributions
1.5 Dissertation Organization
Chapter 2: Single-Spike Hybrid-Input Encoding to reduce SNN spiking activity
2.1 Introduction and Motivation
2.2 SNN Training Techniques
2.2.1 ANN-SNN Conversion
2.2.2 STDB
2.2.3 Hybrid Training
2.3 Hybrid Spike Encoding
2.4 Proposed Training Scheme
2.5 Experiments
2.5.1 Experimental Setup
2.5.2 Classification Accuracy & Latency
2.6 Improvement in Energy-efficiency
2.6.1 Reduction in Spiking Activity
2.6.2 Reduction in FLOPs and Compute Energy
2.7 Conclusions
Chapter 3: Optimal ANN-to-SNN conversion to reduce SNN latency
3.1 Introduction and Motivation
3.2 Preliminaries and Related Work
3.2.1 Threshold ReLU activation in DNNs
3.2.2 DNN-to-SNN Conversion
3.3 Proposed Training Framework
3.3.1 Why Does Conversion Fail for Ultra Low Latencies?
3.3.2 Conversion & Fine-tuning for Ultra Low-Latency SNNs
3.4 Experimental Results
3.4.1 Experimental Setup
3.4.2 Classification Accuracy & Latency
3.5 Simulation Time & Memory Requirements
3.6 Energy Consumption During Inference
3.6.1 Spiking Activity
3.6.2 Floating Point Operations (FLOPs) & Compute Energy
3.7 Conclusions
Chapter 4: SNNs for 3D Image Recognition
4.1 Introduction and Motivation
4.2 Proposed Quantized SNN Training Method
4.2.1 Study of Quantization Choice
4.2.2 Q-STDB based Training
4.3 SRAM-based PIM Acceleration
4.4 Proposed CNN Architectures, Datasets, and Training Details
4.4.1 Model Architectures
4.4.2 Datasets
4.4.3 ANN Training and SNN Conversion Procedures
4.5 Experimental Results and Analysis
4.5.1 ANN & SNN Inference Results
4.5.2 Spiking Activity
4.5.3 Energy Consumption and Delay
4.5.4 Training Time and Memory Requirements
4.5.5 Ablation Studies
4.6 Conclusions and Broader Impact
Chapter 5: Hoyer Regularized Training for One-Time-Step SNNs
5.1 Introduction & Related Work
5.2 Preliminaries on Hoyer Regularizers
5.3 Proposed Training Framework
5.3.1 Hoyer spike layer
5.3.2 Hoyer Regularized Training
5.3.3 Network Structure
5.3.4 Possible Training strategies
5.4 Experimental Results
5.5 Discussions & Future Impact
Chapter 6: Efficient Spiking LSTMs for Streaming Workloads
6.1 Introduction & Related Work
6.2 Proposed Training Framework
6.2.1 Non-spiking LSTM
6.2.2 Conversion from non-spiking to spiking LSTMs
6.2.3 SNN Training with LIF Activation Shifts
6.2.4 Selective conversion of LSTM activation functions
6.3 Pipelined Parallel SNN Processing
6.4 Experimental Results
6.4.1 Inference Accuracy
6.4.2 Inference Energy Efficiency
6.4.3 Inference Latency
6.5 Conclusions & Broader Impact
Chapter 7: In-Pixel Computing for several CV applications
7.1 Introduction & Motivation
7.2 P2M-constrained Algorithm-Circuit Co-Design
7.2.1 Custom Convolution for the First Layer Modeling Circuit Non-Idealities
7.2.2 Circuit-Algorithm Co-optimization of CNN Backbone subject to P2M Constraints
7.2.3 Quantification of bandwidth reduction
7.3 Experimental Results
7.3.1 Benchmarking Dataset & Model
7.3.2 Classification Accuracy
7.3.3 EDP Estimation
7.4 Conclusions
Chapter 8: ISP-less CV for P2M
8.1 Introduction & Motivation
8.2 Related Works
8.2.1 ISP Reversal & Removal
8.2.2 Few-Shot Object Detection
8.3 Inverting ISP Pipeline
8.4 Proposed Demosaicing Technique
8.5 Few-Shot Learning
8.6 Experimental Setup
8.6.1 Implementation Details
8.6.2 Dataset Details
8.7 Experimental Results
8.7.1 VWW Results
8.7.2 COCO raw Results
8.7.3 PASCALRAW Results
8.7.4 Comparison with Prior Works
8.7.5 Bandwidth & Energy Benefits
8.8 Discussions
Chapter 9: Self-Attentive Pooling for Aggressive Compression in P2M
9.1 Introduction & Motivation
9.2 Related Work
9.2.1 Pooling Techniques
9.2.2 Model Compression
9.2.3 Low-Power Attention-based Models
9.3 Background
9.4 Proposed Method
9.4.1 Non-Local Self-Attentive Pooling
9.4.2 Optimizing with Channel Pruning
9.5 Self-Attentive Pooling in CNN Backbones
9.6 Experiments
9.6.1 Experimental Setup
9.6.2 Accuracy & mAP Analysis
9.6.3 Qualitative Results & Visualization
9.6.4 Compute & Memory Efficiency
9.6.5 Ablation Study
9.7 Conclusion & Societal Implications
Chapter 10: Future Work
10.1 Future Work in SNNs
10.1.1 SNNs for beyond-classification tasks
10.1.2 SNNs with efficient backbones and ViTs
10.1.3 SNNs for dynamic vision sensing (DVS)
10.2 Future work in in-sensor computing
10.2.1 Distributed Computing and Sensor Fusion
10.2.2 Frame Skipping
10.2.3 Low-level CV tasks
10.3 Conclusions
Bibliography

List of Tables

2.1 Model performances with single-spike hybrid encoded SNN training on CIFAR-10 and CIFAR-100 after a) ANN training, b) ANN-to-SNN conversion, and c) SNN training.
2.2 Performance comparison of the proposed single-spike hybrid encoded SNN with state-of-the-art deep SNNs on CIFAR-10 and CIFAR-100. TTFS denotes time-to-first-spike coding.
2.3 Convolutional and fully-connected layer FLOPs for ANN and SNN models.
2.4 Estimated energy costs for MAC and AC operations in 45nm CMOS process at 0.9 V [1].
3.1 Model performances with the proposed training framework after a) DNN training, b) DNN-to-SNN conversion, and c) SNN training.
3.2 Performance comparison of the proposed training framework with state-of-the-art deep SNNs on CIFAR-10 and CIFAR-100.
4.1 Model architectures employed for CNN-3D and CNN-32H in classifying the IP dataset. Every convolutional and linear layer is followed by a ReLU non-linearity. The last classifier layer is not shown. The size of the activation map of a 3D CNN is written as (H,W,D,C), where H, W, D, and C represent the height, width, depth of the input feature map and the number of channels. Since the 2D CNN layer does not have the depth dimension, its feature map size is represented as (H,W,C).
4.2 Model performances with Q-STDB based training on IP, PU, SS, and HyRANK datasets for CNN-3D and CNN-32H after a) ANN training, b) ANN-to-SNN conversion, c) 32-bit SNN training, d) 4-bit SNN training, e) 5-bit SNN training, and f) 6-bit SNN training, with only 5 time steps.
4.3 Inference accuracy (OA, AA, and Kappa) comparison of our proposed SNN models obtained from CNN-3D and CNN-32H with state-of-the-art deep ANNs on IP, PU, SS, and HyRANK datasets.
4.4 Notations and their values used in energy, delay, and EDP equations for ANN and 6-bit SNNs.
4.5 Loss in accuracy associated with use of scale quantization during inference. Evaluated using the CNN-3D model on the IP dataset.
4.6 Comparison between model performances for Q-STDB from scratch, proposed hybrid training, and ANN-SNN conversion alone. All cases are for 5 time steps and 6 bits.
5.1 Accuracies from different strategies to train one-step SNNs on CIFAR10.
5.2 Comparison of the test accuracy of our one-time-step SNN models with the non-spiking DNN models for object recognition. Model* indicates that we remove the first max pooling layer, and SA denotes spiking activity.
5.3 Comparison of our one-time-step SNN models with non-spiking DNN, BNN, and multi-step SNN counterparts on the VOC2007 test dataset.
5.4 Comparison of our one-time-step SNN models to existing low-latency counterparts. SGD and hybrid denote surrogate gradient descent and pre-trained DNN followed by SNN fine-tuning, respectively. (qC, dL) denotes an architecture with q convolutional and d linear layers.
5.5 Ablation study of the different methods in our proposed training framework on CIFAR10.
5.6 Accuracies of weight quantized one-time-step SNN models based on VGG16 on CIFAR10, where FP is 32-bit floating point. CE denotes compute energy.
5.7 Comparison of our one-time-step SNN models to AddNNs and BNNs that also incur AC-only operations for improved energy-efficiency. CE denotes compute energy.
5.8 Test accuracy obtained by our approach with multiple time steps on CIFAR10.
5.9 Comparison of our one- and multi-time-step SNN models to existing SNN models on the DVS-CIFAR10 dataset.
6.1 Test accuracy on temporal MNIST, GSC, and UCI datasets obtained by the proposed approaches with direct encoding for 2 time steps. S and NS denote the spiking and non-spiking LSTM variants, respectively. On the other hand, P and NP denote the accuracies with and without a pre-trained non-spiking LSTM model, respectively.
6.2 Test accuracy on GSC and UCI datasets obtained by the proposed approaches with direct encoding for 4 time steps on bi-directional and stacked LSTMs. 'St.' denotes a two-layer stacked LSTM, with both layers having 128 nodes each. 'Bi-St.' denotes a two-layer LSTM, with the first layer being bi-directional having 128 nodes.
6.3 Accuracy comparison of the best performing models obtained by our training framework with SOTA spiking and non-spiking LSTM models on different datasets.
7.1 Model hyperparameters and their values to enable bandwidth reduction in the in-pixel layer.
7.2 Test accuracies, number of MAdds, and peak memory usage of baseline and P2M custom compressed model while classifying on the VWW dataset for different input image resolutions.
7.3 Performance comparison of the proposed P2M-compatible models with state-of-the-art deep CNNs on the VWW dataset.
7.4 Energy estimates for different hardware components. The energy values are measured for designs in 22nm CMOS technology. For the e_mac, we convert the corresponding value in 45nm to that of 22nm by following a standard scaling strategy [2].
7.5 The description and values of the notations used for computation of delay. Note that we calculated the delay in 22nm technology for 32-bit read and MAdd operations by applying standard technology scaling rules to initial values in 65nm technology [3]. We directly evaluated T_read and T_adc through circuit simulations in the 22nm technology node.
8.1 Evaluation of our approach on ISP-less CV systems with MobileNetV2-0.35x on the VWW dataset. Demosaiced¹ denotes traditional demosaicing, while demosaiced² denotes our in-pixel demosaicing. WB, GC, and IPC denote white balance, gamma correction, and in-pixel computing. Also, note that models trained on mosaiced images can only be tested with mosaiced images.
8.2 mAP on different versions of the COCO raw dataset to emulate ISP-less CV systems using a Faster R-CNN framework with ResNet101 backbone.
8.3 Comparison of our proposed approach on the PASCALRAW dataset.
9.1 Hyperparameter settings of different pooling techniques.
9.2 Comparison of different pooling methods for different CNN backbones on the STL10 dataset.
9.3 Comparison of different pooling methods for MobileNetV2-0.35X on the VWW dataset.
9.4 Comparison on the COCO dataset.
9.5 Comparison of different pooling methods for MobileNetV2-0.35x on the ImageNet dataset.
9.6 Comparison of the total FLOPs count of the whole CNN backbone with different pooling methods on the STL10 dataset.
9.7 Ablation study of our proposed pooling technique.

List of Figures

Figure 1.1 A CNN used for image classification. Pooling, batch-normalization and non-linearity layers are not explicitly shown for simplicity.
Figure 1.2 Feedforward fully-connected SNN architecture with integrate and fire (IF) spiking dynamics.
Figure 2.1 (a) Hybrid coded input to the SNN, (b) Mapping between the pixel intensity of images and the firing time of individual neurons, where ⌊·⌉ denotes the nearest integer function.
Figure 2.2 Comparison of average spiking activity per layer for VGG-16 on CIFAR-10 and CIFAR-100 with both direct and hybrid input encoding.
Figure 2.3 Effect of Poisson rate encoding, DCT encoding, direct encoding, and single-spike hybrid input encoding on the average spike rate and latency for the VGG-16 architecture on the CIFAR-10 dataset.
Figure 2.4 Comparison of normalized compute cost on CIFAR-10 and CIFAR-100 for VGG-16 of ANN and SNN with direct and hybrid input encoding.
Figure 3.1 (a) Comparison between DNN (threshold ReLU) and SNN (both original and bias-added) activation functions, the distribution of DNN and SNN (T = 2) pre-activation values, and the variation of h(T,µ) (see Eq. 3.4) with T (≤ 5) for the 2nd layer of the VGG-16 architecture on CIFAR-10, and (b) Proposed scaling of the threshold and output of the SNN post-activation values.
Figure 3.2 Effect of the number of SNN time steps on the test accuracy of VGG and ResNet architectures on CIFAR-10 with DNN-to-SNN conversion based on both threshold ReLU and the maximum pre-activation value used in [4].
Figure 3.3 Comparison between our proposed hybrid training technique for 2 and 3 time steps and baseline direct encoded training for 5 time steps [5] based on (a) simulation time per epoch, and (b) memory consumption, for the VGG-16 architecture over CIFAR-10 and CIFAR-100 datasets.
Figure 3.4 Comparison between our proposed hybrid training technique for 2 and 3 time steps, baseline direct encoded training for 5 time steps [5], and the optimal DNN-to-SNN conversion technique [4] for 16 time steps, based on (a) average spike count, (b) total number of FLOPs, and (c) compute energy, for the VGG-16 architecture over CIFAR-10 and CIFAR-100 datasets. An iso-architecture DNN is also included for comparison of FLOP count and compute energy.
Figure 4.1 (a) Proposed SNN training framework details with 3D convolutions, and (b) Fake quantization forward and backward pass with straight through estimator (STE) approximation.
Figure 4.2 PIM architecture in the first layer to process MAC operations for the first layer of direct coded SNNs. Other layers of the SNN are processed with a highly parallel programmable architecture using simpler accumulate operations.
Figure 4.3 Architectural differences between (a) ANN and (b) SNN for near-lossless ANN-SNN conversion.
Figure 4.4 (i) False color-map and (ii) ground truth images of different HSI datasets used in our work, namely (a) Indian Pines, (b) Pavia University, and (c) Salinas Scene.
Figure 4.5 Confusion matrix for HSI test performance of ANN and proposed 6-bit SNN over the IP dataset for both CNN-3D and CNN-32H. The ANN and SNN confusion matrices look similar for both network architectures. CNN-32H incurs a little drop in accuracy compared to CNN-3D due to its shallow architecture.
Figure 4.6 Layerwise spiking activity plots for (a) CNN-3D and (b) CNN-32H on Indian Pines, Salinas Scene, and Pavia University datasets.
Figure 4.7 Energy, delay, and EDP of layers of (a) CNN-3D and (b) CNN-32H architectures, comparing 6-bit ANNs and SNN (obtained via Q-STDB) models while classifying IP.
Figure 4.8 (a) Test accuracies for different quantization techniques during the forward path of training and inference with a 6-bit CNN-3D model on the IP dataset with 5 timesteps, (b) Test accuracies with 6, 9, and 12-bit weight precisions for post-training quantization with a CNN-32H model on the IP dataset with 5 timesteps.
Figure 4.9 Weight shift (∆) in each layer of CNN-3D for (a) 4, (b) 5, and (c) 6-bit quantization, while classifying the IP dataset.
Figure 4.10 Comparison between our baseline SOTA ANNs and proposed SNNs with 5 time steps based on (a) training time per epoch, and (b) memory usage during training. Variation of (a) and (b) with the number of time steps for the IP dataset and CNN-32H architecture are shown in (c).
Figure 5.1 (a) Comparison of our Hoyer spike activation function with existing activation functions, where the blue distribution denotes the shifting of the membrane potential away from the threshold using Hoyer regularized training, (b) Proposed derivative of our Hoyer activation function.
Figure 5.2 Spiking network architectures corresponding to (a) VGG and (b) ResNet based models.
Figure 5.3 Layerwise spiking activities for a VGG16 across time steps ranging from 5 to 1 (average spiking activity denoted as S in parentheses) representing existing low-latency SNNs including our work on (a) CIFAR10, (b) ImageNet, (c) Comparison of the total energy consumption between SNNs with different time steps and non-spiking DNNs.
Figure 5.4 Normalized training and inference time per epoch with iso-batch (256) and hardware (RTX 3090 with 24 GB memory) conditions for (a) CIFAR10 and (b) ImageNet with VGG16.
Figure 6.1 Spiking (hard) and non-spiking activation (IF and LIF) functions corresponding to (a) sigmoid and (b) tanh activation for T = 4, and V_th^sig = 4, V_th^tanh+ = 3, V_th^tanh- = -2. We show the proposed bias shifts for IF activations. The green and red dotted lines show the continuous versions of the discrete LIF activation functions.
Figure 6.2 LIF activation function corresponding to the (a) sigmoid and (b) tanh activation function used in the spiking LSTM architecture, and (c) Proposed spiking LSTM architecture and dataflow with the parallel pipelined execution for the example of 5 time steps and 3 input elements in the sequence.
Figure 6.3 Comparison between the accuracies obtained by our direct and Poisson encoded spiking LSTMs (a) with both conversion and SNN fine-tuning and (b) with only conversion.
Figure 6.4 (a) Energy and (b) Delay comparisons between the non-spiking LSTM, proposed direct and Poisson encoded spiking LSTM, and the SOTA spiking LSTM model [6], which does not include any of our proposed approaches.
Figure 7.1 Existing and proposed solutions to alleviate the energy, throughput, and bandwidth bottleneck caused by the segregation of sensing and compute.
Figure 7.2 Algorithm-circuit co-design framework to enable our proposed P2M approach to optimize both the performance and energy-efficiency of vision workloads. We propose the use of (1) large strides, (2) large kernel sizes, (3) reduced number of channels, (4) P2M custom convolution, and (5) shifted ReLU operation to incorporate the shift term of the batch normalization layer, for emulating accurate P2M circuit behaviour.
Figure 7.3 (a) Effect of quantization of the in-pixel output activations, and (b) Effect of the number of channels in the 1st convolutional layer for different kernel sizes and strides, on the test accuracy of our P2M custom model.
Figure 7.4 Comparison of normalized total, sensing, and SoC (a) energy cost and (b) delay between the P2M and baseline model architectures (compressed C, and non-compressed NC). Note, the normalization of each component was done by dividing the corresponding energy (delay) value by the maximum total energy (delay) value of the three components.
Figure 8.1 Difference in frequency distributions of pixel intensities between mosaiced raw, demosaiced, and ISP-processed images.
Figure 8.2 (a) Proposed ISP-less CV system, (b) Invertible NN training on demosaiced raw images, without any white balance or gamma correction, (c) Generation of raw images using the trained inverse network and custom mosaicing, and (d) Application of in-pixel demosaicing and training of the ISP-less CV models. Note the In-Pixel Demosaic implementation in the pixel array is illustrated in Fig. 8.3.
Figure 8.3 Implementation of the proposed (a) demosaicing and (b) demosaicing coupled with in-pixel convolution for ISP-less CV.
Figure 8.4 Comparison of the (a) accuracy and (b) mAP of our proposed demosaicing method with different ISP pipelines on the COCO dataset with the Faster-RCNN framework with ResNet101 backbone and the VWW dataset with MobileNetV2-0.35x, respectively, where DM denotes our proposed demosaicing technique, and WB and GC denote white balancing and gamma correction, respectively. The energy consumptions of our approaches are compared with the normal pixel read-out in (c) and (d) on VWW and COCO, respectively, where IPC denotes in-pixel computing. Note, for (d), the energy unit is µJ for 'sensor' & 'data comm.', and 100µJ for 'CNN' & 'total'.
Figure 9.1 Illustration of locality based pooling and non-local self-attentive pooling. The pooling weight has the same shape as the input activation I, of which only a local region is displayed in this figure. F(·) denotes the locality based pooling and π(·) denotes the proposed non-local self-attentive pooling. For the locality based pooling, each pooling weight has a limited sensitive field, as shown in the red box. For the proposed non-local self-attentive pooling, the input activation is divided into several patches and encoded into a series of patch tokens. Based on these patch tokens, the pooling weights have a global view, which makes it superior for capturing long-range dependencies and aggregating features.
Figure 9.2 Architecture of the non-local self-attentive pooling.
Figure 9.3 Illustration of two ways of using pooling methods.
Figure 9.4 Visualization results for local importance based pooling and the proposed non-local self-attentive pooling. The images are from the STL10 dataset and the heatmaps in each technique highlight the regions of interest, i.e., the regions with high heatmap values will be regarded as effective information and retained while down-sampling.

Abstract

The increasing need for on-chip edge intelligence on various energy-constrained platforms is challenged by the high computation and memory requirements of deep neural networks (DNNs). This has motivated this dissertation research, which focuses on two key thrusts in achieving energy and latency efficient edge intelligence. The first thrust is focused on neuromorphic spiking neural networks (SNN), where we propose novel DNN-to-SNN conversion and SNN fine-tuning algorithms that can improve the latency and energy efficiency for several static computer vision (CV) tasks.
These algorithms involve single-spike hybrid input encoding, shifting and scaling of threshold and post-activation values, quantization-aware spike time dependent backpropagation (STDB), and Hoyer regularized training with Hoyer spike layers. Beyond static tasks, we also propose novel SNN training algorithms and hardware implementations, involving novel activation functions with optimal bias shifts and a pipelined parallel processing scheme, that leverage both the temporal and sparse dynamics of SNNs to reduce the inference latency and energy of large-scale streaming/sequential tasks while achieving state-of-the-art (SOTA) accuracy. The second thrust is focused on hardware-algorithm co-design of in-sensor computing that can bring the SOTA DNNs, including these SNNs, closer to the sensors, further reducing their energy consumption and enabling real-time processing. Here, we propose a novel processing-in-pixel-in-memory (P2M) paradigm for resource-constrained sensor intelligence applications, which embeds the computational aspects of all modern CNN layers inside the CMOS image sensors, compresses the input activation maps via a reduced number of channels and aggressive strides, and thereby mitigates the associated bandwidth, latency, and energy bottlenecks. Moreover, to enable such aggressive compression required for P2M, we propose a novel non-local self-attentive pooling method that efficiently aggregates dependencies between non-local activation patches during down-sampling and that can be used as a drop-in replacement for standard pooling layers. Additionally, to enable P2M, we need to bypass the image signal processing (ISP) pipeline, which degrades the test accuracy obtained on raw images. To mitigate this concern, we propose an ISP reversal pipeline which can convert the RGB images of any dataset to their raw counterparts and enable model training on raw images, thereby improving the accuracy of P2M-implemented systems. Coupled with our optimized SNNs, our P2M paradigm can reduce the bandwidth and the total system energy consumption each by an order of magnitude compared to SOTA vision pipelines.

Thesis supervisor: Dr. Peter A. Beerel, Electrical and Computer Engineering, University of Southern California.

Related Publications with links

Please see the author's Google Scholar page for a full list of publications.

G. Datta, S. Kundu, P. A. Beerel, "Training Energy-Efficient Deep Spiking Neural Networks with Single-Spike Hybrid Input Encoding", The International Joint Conference on Neural Networks (IJCNN), 2021. doi: 10.1109/IJCNN52387.2021.9534306. IEEE: [Paper]

G. Datta, P. A. Beerel, "Can Deep Neural Networks be Converted to Ultra Low-Latency Spiking Neural Networks?", Design, Automation and Test in Europe Conference (DATE), 2022. doi: 10.23919/DATE54114.2022.9774704. IEEE: [Paper]

G. Datta, S. Kundu, A. R. Jaiswal, P. A. Beerel, "ACE-SNN: Algorithm-Hardware Co-Design of Energy-efficient & Low-Latency Deep Spiking Neural Networks for 3D Image Recognition", Frontiers in Neuroscience, Neuromorphic Engineering, 2022. doi: 10.3389/fnins.2022.815258. Frontiers: [Paper]

G. Datta*, S. Kundu*, Z. Yin*, R. T. Lakkireddy, J. Mathai, A. P. Jacob, P. A. Beerel, A. R. Jaiswal, "P2M: A Processing-in-Pixel-in-Memory Paradigm for Resource-Constrained TinyML Applications", Scientific Reports 12, Article number: 14396, 2022. Nature: [Paper] (* = authors have equal contribution).

G. Datta, Z. Yin, A. P. Jacob, A. R. Jaiswal, P. A. Beerel, "Toward Efficient Hyperspectral Image Processing inside Camera Pixels", Computer Vision – ECCV 2022 Workshops, 2022. doi: 10.1007/978-3-031-25075-0_22. Springer: [Paper]

G. Datta*, S. Kundu*, Z. Yin*, J. Mathai, Z. Liu, Z. Wang, M. Tian, S. Lu, R. T. Lakkireddy, A. Schmidt, W. Abd-Almageed, A. P. Jacob, P. A. Beerel, A. R. Jaiswal, "P2M-DeTrack: Processing-in-Pixel-in-Memory for Energy-efficient and Real-Time Multi-Object Detection and Tracking", IFIP/IEEE 30th International Conference on Very Large Scale Integration (VLSI-SoC), 2022. doi: 10.1109/VLSI-SoC54400.2022.9939582 (best paper award nomination). IEEE: [Paper] (* = authors have equal contribution).

F. Chen*, G. Datta*, S. Kundu, P. A. Beerel, "Self-Attentive Pooling for Efficient Deep Learning", IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 3974-3983. CVF: [Paper] (* = authors have equal contribution).

G. Datta, Z. Liu, Z. Yin, A. R. Jaiswal, P. A. Beerel, "Enabling ISPless Low-Power Computer Vision", IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2430-2439. CVF: [Paper]

G. Datta, Z. Liu, M. Kaiser, S. Kundu, J. Mathai, Z. Yin, A. P. Jacob, A. R. Jaiswal, P. A. Beerel, "In-Sensor & Neuromorphic Computing are all you need for Energy Efficient Computer Vision", accepted in International Conference on Acoustics, Speech, and Signal Processing, 2023. Arxiv: [Paper]

G. Datta, Z. Liu, P. A. Beerel, "Hoyer regularizer is all you need for ultra low-latency spiking neural networks", under review in International Conference on Computer Vision, 2023. Arxiv: [Paper]

G. Datta, H. Deng, R. A. Aviles, P. A. Beerel, "Towards Energy-Efficient, Low-Latency and Accurate Spiking LSTMs", accepted in ACM/IEEE International Symposium on Low Power Electronics and Design, 2023. Arxiv: [Paper]

Chapter 1
Introduction

Deep neural networks (DNNs), in particular deep convolutional neural networks (CNNs), have become critical components in many real-world vision applications, ranging from object recognition [7,8] and detection [9,10] to image segmentation [11]. With the demand for high classification accuracy, current state-of-the-art CNNs have evolved to have hundreds of layers [7,12–14], requiring millions of weights and billions of floating-point operations (FLOPs). However, because a wide variety of neural network applications are heavily resource constrained, such as those for embedded and IoT devices, there is increasing interest in DNN training algorithms that can offer a good trade-off between the inference complexity, in terms of FLOPs, memory consumption, etc., and test accuracy [15], and in the associated hardware accelerators that implement these DNNs [16,17].

1.1 Neural Network Basics

1.1.1 Deep Neural Networks

The first mathematical model of an artificial neuron was presented by Warren S. McCulloch and Walter Pitts in 1943 [18]. A McCulloch-Pitts neuron (a.k.a. the threshold logic unit) takes a number of binary excitatory inputs and a binary inhibitory input, compares the sum of excitatory inputs with a threshold, and produces a binary output of one if the sum exceeds the threshold and the inhibitory input is not set. The mathematical formulation of this is as follows.

y = \begin{cases} 1, & \text{if } \sum_{i=1}^{n-1} x_i \geq b \text{ and } x_0 = 0 \\ 0, & \text{otherwise} \end{cases}

where each x_i represents one of the n binary inputs (x_0 is the inhibitory input while the remaining inputs are excitatory), b is the threshold (a.k.a. bias), and y is the binary output of the neuron.
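To make the formulation above concrete, the following is a minimal Python sketch of the McCulloch-Pitts threshold logic unit defined by the equation above. The function name and the example inputs are illustrative assumptions and are not part of the original text.

```python
from typing import Sequence

def mcculloch_pitts(x: Sequence[int], b: int) -> int:
    """Threshold logic unit: x[0] is the binary inhibitory input,
    x[1:] are the binary excitatory inputs, and b is the threshold (bias)."""
    inhibited = x[0] == 1
    excitation = sum(x[1:])
    return int(excitation >= b and not inhibited)

# Example: a 2-input AND gate (threshold b = 2, inhibitory input held at 0).
print(mcculloch_pitts([0, 1, 1], b=2))  # -> 1
print(mcculloch_pitts([0, 1, 0], b=2))  # -> 0
print(mcculloch_pitts([1, 1, 1], b=2))  # -> 0 (inhibitory input is set)
```

Because the inputs are unweighted, the only tunable quantity is the threshold b, which is the limitation the perceptron addresses next.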
However, the absence of any weight prohibits the synaptic connections between the neurons. A perceptron [19], addresses some of the shortcomings of McCulloch-Pitts neu- rons by introducing weighted connectivity between the neurons and its output can be formulated as y = 0 if n− 1 P i=0 θ i x i 0 0, otherwise (1.11) The third term in Eq. 2.3 exhibits soft reset by setting the reset potential to the threshold v l (instead of 0) i.e., reducing the membrane potential u l by v l at time step t, if an output spike is generated at the t th time step. As shown in [30], soft reset enables each spiking neuron to carry forward the surplus potential abovethefiringthresholdtothesubsequenttimestep[30], therebyminimizingthe information loss. 1.2 Efficient Inference with SNNs Spiking Neural Networks (SNNs) attempt to emulate the remarkable energy effi- ciency of the brain in vision, perception, and cognition-related tasks using event- driven neuromorphic hardware [31]. Neurons in an SNN exchange information 7 via discrete binary spikes, representing a significant paradigm shift from high- precision, continuous-valued deep neural networks (DNN) [32,33]. Due to its high activationsparsityanduseofaccumulates(AC)insteadofexpensivemultiply-and- accumulates (MAC), SNNs have emerged as a promising low-power alternative to DNNswhosehardwareimplementationsaretypicallyassociatedwithhighcompute and memory costs. Because SNNs receive and transmit information via spikes, analog inputs have tobeencodedwithasequenceofspikes. Therehavebeenmultipleencodingmeth- odsproposed,suchasratecoding[34],temporalcoding[35],rank-ordercoding[36], and others. However, recent works [5,37] showed that, instead of converting the image pixel values into spike trains, directly feeding the analog pixel values in the first convolutional layer, and thereby, emitting spikes only in the subsequent lay- ers, can reduce the number of time steps needed to achieve SOTA accuracy by an order of magnitude. Although the first layer now requires MACs, as opposed to the cheaper ACs in the remaining layers, the overhead is negligible for deep con- volutional architectures. Hence, we adopt this technique, termed direct encoding, in this work. In addition to accommodating various forms of encoding inputs, supervised learning algorithms for SNNs have overcome various roadblocks associated with the discontinuous derivative of the spike activation function [38,39]. Moreover, SNNscanbeconvertedfromDNNswithlowerrorbyapproximatingtheactivation value of ReLU neurons with the firing rate of spiking neurons [40]. SNNs trained using DNN-to-SNN conversion, coupled with supervised training, have been able to perform similar to SOTA DNNs in terms of test accuracy in traditional image recognition tasks [5,30]. However, the training effort still remains high, because SNNs need multiple time steps (at least 5 with direct encoding [5]) to process an input, and hence, the backpropagation step requires the gradients of the unrolled SNN to be integrated over all these time steps, which significantly increases the memory cost [41]. Moreover, the multiple forward passes result in an increased number of spikes, which degrade the SNN’s energy efficiency, both during training 8 and inference, and possibly offset the compute advantage of the ACs. This moti- vates our exploration of novel training algorithms [42–46] to reduce the test error oftheSNNwhilekeepingthenumberoftimestepsextremelysmallanddecreasing the number of generated spikes during both training and inference. 
Since SNNs havelowercomputecostthantheirnon-spikingCNNcounterparts,wealsoexplore the efficacy of SNNs in 3D CNNs which have higher arithmetic intensity (the ratio of floating point operations to accessed bytes) than 2D CNNs. This motivates our exploration of the effectiveness of SNNs converted from 3D CNNs for 3D image recognition, such as hyperspectral image classification (HSI). 1.3 Efficient Inference with In-Pixel Computing Meanwhile, the demand to process vast amounts of data generated from state-of- the-art high resolution cameras has motivated novel energy-efficient on-device AI hardware solutions [47,48]. Visual data in such cameras are usually captured in the form of analog voltages by a sensor pixel array, and then converted to digital domain for subsequent AI processing using analog-to-digital converters (ADC) [49]. Consequently,high-resolutioninputimagesneedtobeconstantlytransmitted between the camera and the AI processing unit, frame by frame, expending high energy and causing bandwidth and security bottlenecks. Tomitigatethisproblem,priorworkshaveleveragedanalogcomputingthrough in-sensor[50]andin-pixel[51,52]processinginanattempttomitigatetheexcessive transfer of data from sensors to processors. For example, some in-sensor comput- ing works (e.g., [50]) implement the analog multiply and accumulate logic for ML processing in the periphery of the sensor pixel array as shown in Fig. 7.2. How- ever, these techniques often require serial read-out of input kernels into an analog memory and streaming in of weights, incurring significant energy and through- put bottlenecks [50]. Moreover, the architectures are often restricted to simplistic neural network with limited applications. Other efforts rely on bulky exploratory devices(e.g.,2DmaterialbasedonWSe 2 [51]),yieldingpixelsthataremuchhigher 9 than possible using CMOS-based technology. The area in-efficiency of their pix- els does not enable the support for weight-reuse and multiple channels, and hence, theirworkloadremainconfinedtosimpleMLtasks. Finally,thereareCMOS-based in-pixel hardware, organized as pixel-parallel SIMD processor arrays, that offers improved energy-efficiency for simple networks consisting of only fully-connected layers, and hence are limited to toy workloads such as digit recognition [52]. In contrast, our proposal, Processing-in-pixel-in-memory (P 2 M) [53–59] presents a pathwaytoembedcomplexcomputationsinsidepixelarraysthroughtightlyinter- twined circuit-algorithm co-design, catering to the need of modern AI workloads with real-life datasets. We also propose a novel self-attention-based pooling tech- nique [60] that can aggressively aggregate the activation maps generated by the pixel array of the P 2 M-based sensors, thereby further improving the system band- width and energy efficiency. 1.4 Dissertation Contributions The contributions of this dissertation research can be summarized as follows. 1. Improvingtheenergy-efficiencyofSOTASNNmodelsforcomplexMLtasks. We propose a single-spike hybrid input encoding technique that leads to higher activation sparsity and compute efficiency in the SOTA SNN models. We train accurate yet ultra low-latency SNN models by shifting and scaling the threshold and post-activation values to accurately capture the distribu- tions of the source DNN and target SNN pre-activation values and minimize the difference between them. 
We propose a quantization-aware hybrid training algorithm for SNNs, and then a novel circuit framework for energy-efficient hardware implementation of the SNNs obtained by our algorithm. We further extend the use of these SNNs to address the compute energy bottleneck faced by 3D CNN layers used for HSI classification. 10 To enable extreme latency and energy efficiency, we also propose a novel training framework for one-time-step SNNs where the IF layer threshold is set to be a novel function of the Hoyer extremum of a clipped version of the membrane potential tensor, where the clipping threshold (existing SNNs use this as the threshold) is trained using gradient descent with our Hoyer regularizer. Lastly,totargetstreaming/sequentialtaskswherethedynamicsofSNNscan be applied more naturally, we propose an optimized spiking LSTM training framework. Inparticular,weproposenovelactivationfunctionsinthesource LSTM architecture and convert a judiciously selected subset of them to LIF activationswithoptimalbiasshifts. Moreover, weproposeapipelinedparal- lel processing scheme that hides the SNN time steps, significantly improving system latency, especially for long sequences. 2. Reducingenergyandbandwidthbottleneckcausedbylargedatatransferbe- tween high-resolution cameras and AI processing units: We propose a novel processing-in-pixel-in-memory(P 2 M)paradigmforresource-constrainedsen- sor intelligence applications, wherein we optimize the compact on-device AI models specifically for the hardware constraints, and yield significant im- provement in energy-delay product (EDP) on visual TinyML applications shown in the visual wake words (VWW) dataset. We propose a novel self-attention-based pooling technique that efficiently aggregates dependencies between non-local activation patches during down- sampling and that can be used as a drop-in replacement to the standard pooling layers. This helps to aggressively compress the activation maps gen- erated by the pixel array, and further reduce the bandwidth between the sensor and the processing units, thereby improving the energy-efficiency of P 2 M-implemented systems. We propose an ISP reversal pipeline which can convert the RGB images of anydatasettoitsrawcounterparts,andenablemodeltrainingonrawimages. 11 This helps to significantly improving the accuracy of CV applications with P 2 M-implementedsystems,whichotherwiserequiresISP-processedRGBim- agesforSOTAaccuracy(mostlarge-scaleopen-sourceCVdatasetsconsistof RGB images). To further improve the accuracy of ISP-less CV models and to increase the energy/bandwidth benefits obtained by P 2 M, we propose an energy-efficientformofanalogin-pixeldemosaicingthatcanbecoupledwith in-pixel CNN computations, and a novel application of few-shot learning. 1.5 Dissertation Organization The rest of the dissertation is organized as follows. Chapter 2 describes the single- spike hybrid input encoding approach to improve the energy-efficiency of existing SNN models. Chapter 3 presents our training framework that yields accurate yet ultra low-latency SNN models, which further improves the inference compute en- ergy for complex vision tasks. Chapter 4 discusses our quantization-aware hybrid training algorithm that yields accurate SNNs, converted from compute-expensive 3D CNNs, for on-device HSI tasks. Chapter 5 presents our training framework for one-time-step SNNs, involving a novel variant of the Hoyer regularizer and a novel Hoyer spike layer. 
Chapter 6 presents our optimized spiking LSTM training frameworkthatleveragesboththeefficienttemporalandsparsedynamicsofSNNs toreducetheinferencelatencyandenergyoflarge-scalestreamingworkloadswhile achievingclosetoSOTAaccuracy. Chapter7discussesourP 2 Msolutiontoaddress the energy and bandwidth bottleneck caused by the large data transfer between high-resolution image sensor and the downstream AI processing units. Chapter 8 presentsourISP-reversalpipeline,customanalogdemosaicingapproach,andnovel application of few-shot learning that can significantly increase the test accuracy of several complex CV tasks with P 2 M. Chapter 9 presents our novel self-attentive poolingapproachthatcanaggressivelystridethefirstfewlayersofthenetworkre- quiredtoachievetheefficiencygoalsofP 2 M.Chapter10providesdetailsofseveral interesting future research directions and finally concludes this dissertation. 12 Chapter 2 Single-Spike Hybrid-Input Encoding to reduce SNN spiking activity This chapter first provides the introduction and motivation behind the hybrid single-spike encoding in spiking neural networks (SNNs) in Section 2.1. Review of the existing SNN training techniques is illustrated in Section 2.2. Section 2.3 and 2.4providesthenecessarydetailsofourproposedencodingandtrainingframework respectively. We present our detailed experimental evaluation of the classification accuracy and latency in Section 2.5. We show the energy improvement of our proposed framework in Section 2.6 and finally present conclusions in Section 2.7. 2.1 Introduction and Motivation BecauseSNNsreceiveandtransmitinformationthroughspikes,analogvaluesmust beencodedintoasequenceofspikes. Therehasbeenaplethoraofencodingmeth- odsproposed,includingratecoding[34,40],temporalcoding[35,61–63],rank-order coding [36], phase coding [64], [65] and other exotic coding schemes [66]. Among these, rate-coding has shown competitive performance on complex tasks [34,40] while others are either generally limited to simple tasks such as learning the XOR function and classifying digits from the MNIST dataset or require a large num- ber of spikes for inference. In rate coding, the analog value is converted to a spike train using a Poisson generator function with a rate proportional to the in- put pixel value. The number of timesteps in each train is inversely proportional 13 to the quantization error in the representation. Low error requirements force a large number of timesteps at the expense of high inference latency and low acti- vation sparsity [40]. Temporal coding, on the other hand, has higher sparsity and can more explicitly represent correlations in inputs. However, temporal coding is challenging to scale [36] to vision tasks and often requires kernel-based spike response models [61] which are computationally expensive compared to the tradi- tional leaky-integrate-and-fire (LIF) or integrate-and-fire (IF) models. Recently, theauthorsin[5]proposeddirectinputencoding, wheretheyfeedtheanalogpixel values directly into the first convolutional layer, which treats them as input cur- rents to LIF neurons. Another recently proposed temporal encoding scheme uses the discrete cosine transform (DCT) to distribute the spatial pixel information over time for learning low-latency SNNs [67]. However, up to now, there has been no attempt to combine both spatial (captured by rate or direct encoding) and temporal information processed by the SNNs. 
In addition to accommodating the various of forms of encoding inputs, su- pervised learning algorithms for SNNs have overcome many roadblocks associated with the discontinuous derivative of the spike activation function [37–39]. How- ever, effective SNN training remains a challenge, as seen by the fact that SNNs still lag behind ANNs in terms of latency and accuracy in traditional classification tasks [40,68]. A single feed-forward pass in ANN corresponds to multiple forward passesinSNNwhichisassociatedwithafixednumberoftimesteps. Inspike-based backpropagation, the backward pass requires the gradients to be integrated over everytimestepwhichincreasescomputationandmemorycomplexity[30,38]. Itre- quires multiple iterations, is memory intensive (for backward pass computations), and energy-inefficient, and thus has been mainly limited to small datasets (e.g. CIFAR-10) on simple shallow convolutional architectures [30]. Researchers have also observed high spiking activity and energy consumption in these trained SNN models[69],whichfurtherhinderstheirdeploymentinedgeapplications. Thus,the current challenges in SNN models are high inference latency and spiking activity, long training time, and high training costs in terms of memory and computation. 14 To address these challenges, this chapter makes the following contributions: • Hybrid Spatio-Temporal Encoding: Weemployahybridinputencodingtech- nique where the real-valued image pixels are fed to the SNN during the first timestep. During the subsequent timesteps, the SNN follows a single- spike temporal coding scheme, where the arrival time of the input spike is inversely proportional to the pixel intensity. While the direct encoding in the first timestep helps the SNN achieve low inference latency, the temporal encoding increases activation sparsity. • Single Spike LIF Model: To further harness the benefits of temporal coding, we propose a modified LIF model, where neurons in every hidden layer fire at most once over all the timesteps. This leads to higher activation sparsity and compute efficiency. • Novel Loss Function: Wealsoproposeavariantofthegradientdescentbased spike timing dependent backpropagation mechanism to train SNNs with our proposedencodingtechnique. Inparticular,weemployahybridcrossentropy loss function to capture both the accumulated membrane potential and the spike time of the output neurons. 2.2 SNN Training Techniques Recent research on training supervised deep SNNs can be primarily divided into three categories: 1) ANN-SNN conversion-based training, 2) Spike timing depen- dent backpropagation (STDB), and 3) Hybrid training. 2.2.1 ANN-SNN Conversion ANN-SNN conversion involves copying the SNN weights from a pretrained ANN modelandestimatingthethresholdvaluesineachlayerbyapproximatingtheacti- vationvalueofReLUneuronswiththefiringrateofspikingneurons[33,40,70–72]. 15 The ANN model is trained using standard gradient descent based methods and helps an iso-architecture SNN achieve impressive accuracy in image classification tasks [40,70]. However, the SNNs resulting from these conversion algorithms re- quire an order of magnitude more time steps compared to other training tech- niques [40]. 2.2.2 STDB The thresholding-based activation function in the IF/LIF model is discontinu- ous and non-differentiable, which poses difficulty in training SNNs with gradient- descent based learning methods. 
Consequently, several approximate training methodologies have been proposed [38,73-75], where the spiking neuron functionality is either replaced with a differentiable model or the real gradients are approximated as surrogate gradients. However, the backpropagation step requires these gradients to be integrated over all the time steps used to train the SNN, which significantly increases the memory requirements.

2.2.3 Hybrid Training

A recent paper [30] proposed a hybrid training technique in which ANN-SNN conversion is performed as an initialization step and is followed by an approximate gradient descent algorithm. The authors observed that combining the two training techniques helps SNNs converge within a few epochs while requiring fewer time steps. The authors of [76] extended this hybrid learning approach by training the membrane leak and the firing threshold along with the other network parameters (weights) via gradient descent. Moreover, [76] applied direct-input encoding, where the pixel intensities of an image are applied to the SNN input layer as fixed multi-bit values at each time step, to reduce the number of time steps needed to achieve SOTA accuracy by an order of magnitude. Though the first layer now requires MACs, as opposed to the cheaper accumulates in the remaining layers, the overhead is negligible for deep convolutional architectures [76].

2.3 Hybrid Spike Encoding

We propose a hybrid encoding scheme to convert the real-valued pixel intensities of input images into SNN inputs over the total number of timesteps dictated by the desired inference accuracy. As is typical, input images fed to the ANN are normalized to zero mean and unit standard deviation. In our proposed coding technique, we feed the analog pixel value into the input layer of the SNN in the 1st timestep. Next, we convert the real-valued pixels into a spike train, starting from the 2nd timestep, that represents the same information. Considering a gray image with pixel intensity values in the range $[I_{min}, I_{max}]$, each input neuron encodes the temporal information of its corresponding pixel value in a single spike time in the range $[2, T]$, where $T$ is the total number of timesteps. The firing time of the $i$-th input neuron, $T_i$, is computed from the $i$-th pixel intensity value, $I_i$, as

$$T_i = \left\lfloor T + \frac{2-T}{I_{max}-I_{min}} \cdot (I_i - I_{min}) \right\rceil \quad (2.1)$$

where $\lfloor \cdot \rceil$ denotes the nearest-integer function. Eq. 2.1 is the point-slope form of the linear relationship shown in Fig. 2.1(b), and $\lfloor \cdot \rceil$ is applied because $T_i$ must be an integer. Note that Eq. 2.1 also implies that the spike train starts from the 2nd timestep. The encoded value of the $i$-th neuron in the input layer is thus

$$X_i(t) = \begin{cases} I_i, & \text{if } t = 1 \\ 1, & \text{if } t = T_i \\ 0, & \text{otherwise} \end{cases} \quad (2.2)$$

which is further illustrated in Fig. 2.1(b).

Figure 2.1: (a) Hybrid-coded input to the SNN. (b) Mapping between the pixel intensity of images and the firing time of individual neurons, where $\lfloor \cdot \rceil$ denotes the nearest-integer function.

Brighter image pixels have higher intensities and, hence, lower $T_i$. Neurons in the subsequent layers fire as soon as they reach their threshold, and both the membrane potential and the time to reach the threshold in the output layer determine the network decision. The analog pixel value in the 1st time step influences the membrane potential of the output neurons, while the firing times of the input neurons, based on the pixel intensities, are responsible for the spike times of the output neurons.
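To make the mapping of Eqs. 2.1-2.2 concrete, the following is a minimal NumPy sketch of the hybrid encoder; the function name and the assumption that pixel intensities arrive as a flattened float array are ours, not part of the original implementation.

```python
import numpy as np

def hybrid_encode(pixels, T):
    """Hybrid spatio-temporal encoding (Eqs. 2.1-2.2).

    pixels: 1-D float array of pixel intensities.
    T: total number of timesteps (T >= 2).
    Returns an array of shape (T, num_pixels): the analog intensities at
    t = 1 and a single unit spike per pixel at t = T_i in [2, T].
    """
    i_min, i_max = pixels.min(), pixels.max()
    # Eq. 2.1: firing time decreases linearly with intensity (rounded to the nearest integer).
    t_i = np.rint(T + (2.0 - T) / (i_max - i_min) * (pixels - i_min)).astype(int)

    x = np.zeros((T, pixels.size), dtype=np.float32)
    x[0] = pixels                              # t = 1: direct (analog) encoding
    x[t_i - 1, np.arange(pixels.size)] = 1.0   # t = T_i: single temporal spike
    return x

# Example: brighter pixels spike earlier (smaller T_i).
spikes = hybrid_encode(np.array([0.0, 0.5, 1.0]), T=5)
```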
Notably, this hybrid encoding scheme captures both the intensity and the temporal nature of the input neurons and does not need any preprocessing steps, such as the Gabor filters commonly used in SNNs trained with spike-timing-dependent plasticity (STDP) [77,78]. Moreover, our proposed encoding technique is compatible with event-driven cameras, which capture the actual pixel value first and subsequently emit spikes based on the changes in pixel intensity [79]. Lastly, our proposal ensures that there is a single input spike per pixel, and hence the obtained spike train is sparser than that observed with rate/direct coding techniques.

2.4 Proposed Training Scheme

We employ a modified version of the LIF model illustrated in Section 3.2.1 to train energy-efficient SNNs. In our proposed training framework, neurons in all the hidden convolutional and fully-connected layers (except the output layer) spike at most once over all the timesteps. During inference, once a neuron emits a spike, it is shut off and does not participate in the remaining LIF computations. During training, however, the neurons in the hidden layers follow the model given in Eqs. (2.3)-(2.5), which shows that even though each neuron can fire at most once, it still performs computations following the LIF model. This ensures that the error gradients remain non-zero after the spike time and enables our proposed training framework to avoid the dead neuron problem, where learning does not happen in the absence of a spike.

$$U_l^t = \lambda_l U_l^{t-1} + W_l O_{l-1}^t - V_l \cdot (z_l^{t-1} > 0) \quad (2.3)$$

$$z_l^t = \frac{U_l^t}{V_l} - 1 \quad (2.4)$$

$$O_l^t = \begin{cases} 1, & \text{if } z_l^t > 0 \text{ and } z_l^{t_i} \le 0 \;\; \forall t_i \in [1, t) \\ 0, & \text{otherwise} \end{cases} \quad (2.5)$$

Note that $U_l^t$, $O_{l-1}^t$, and $W_l$ are, respectively, the vector of membrane potentials of the neurons of layer $l$ at timestep $t$, the spike signals from layer $(l-1)$, and the weight matrix connecting layers $l$ and $(l-1)$. Also note that $(z_l^{t-1} > 0)$ in Eq. 2.3 denotes a Boolean vector of size equal to the number of neurons in layer $l$. The leak and threshold voltage for all the neurons in layer $l$ are represented by $\lambda_l$ and $V_l$, respectively. In our training framework, both of these parameters (shared by all the neurons in a particular layer) are trained with backpropagation along with the weights to optimize both accuracy and latency.

The neurons in the output layer accumulate the incoming inputs without any leakage, as shown in Eq. 2.6. However, unlike previous works [5,30], the output neurons in our proposed framework emit spikes following the model in Eq. 2.7, where $T_l$ denotes the vector containing the spike times of the output neurons and $T$ is the total number of timesteps.

$$U_l^t = U_l^{t-1} + W_l O_{l-1}^t \quad (2.6)$$

$$T_l = \begin{cases} T, & \text{if } U_l^t < V_l \;\; \forall t \in [1, T] \\ t \text{ s.t. } U_l^t \ge V_l \;\wedge\; U_l^{t-1} < V_l, & \text{otherwise} \end{cases} \quad (2.7)$$

The output layer only emits an output spike if there was no spike in the earlier timesteps and the corresponding membrane potential crosses the threshold. Also, an output neuron is forced to fire at the last timestep if it was unable to emit a spike in any of the earlier timesteps. This ensures that all the neurons in the output layer have a valid $T_l$ that can be included in the loss function.

Let us now derive the expressions for the gradients of the trainable parameters of all the layers. We perform the spatial and temporal credit assignment by unrolling the network in the temporal axis and employing backpropagation through time (BPTT) [30].

Output layer: The loss function is defined on both $U_l^T$ and $T_l$ to correctly capture both the direct and temporal information presented at the input layer.
Therefore, we employ the two softmax functions of the $i$-th output neuron shown in Eq. 2.8, where $N$ denotes the total number of classes, and $U_i^T$ and $t_i$ represent the accumulated membrane potential after the final timestep and the firing time of the $i$-th neuron, respectively.

$$\tilde{U}_i = \frac{e^{U_i^T}}{\sum_{j=1}^{N} e^{U_j^T}}, \qquad \tilde{t}_i = \frac{e^{-t_i}}{\sum_{j=1}^{N} e^{-t_j}} \quad (2.8)$$

The resulting hybrid cross-entropy loss $L$ and its gradient with respect to the accumulated membrane potential vector, $\frac{\partial L}{\partial U_l^T}$, are thus

$$L = -\sum_{i=1}^{N} y_i \log(\tilde{U}_i \tilde{t}_i), \qquad \frac{\partial L}{\partial U_l^T} = \tilde{U}_l^T - y \quad (2.9)$$

where $\tilde{U}_l^T$ is the vector containing the softmax values $\tilde{U}_i$, and $y$ is the one-hot encoded vector of the correct class. Similarly, the gradient with respect to the firing-time vector is $\frac{\partial L}{\partial T_l} = \tilde{T}_l - y$. We then compute the weight update as

$$W_l = W_l - \eta \Delta W_l \quad (2.10)$$

$$\Delta W_l = \sum_t \frac{\partial L}{\partial W_l} = \sum_t \frac{\partial L}{\partial U_l^t} \frac{\partial U_l^t}{\partial W_l} = \frac{\partial L}{\partial U_l^T} \sum_t \frac{\partial U_l^t}{\partial W_l} = (\tilde{U}_l^T - y) \sum_t O_{l-1}^t \quad (2.11)$$

where $\eta$ is the learning rate (LR). In order to evaluate the threshold update at the output layer, we rewrite Eq. 2.7 as

$$T_l = \sum_{t=1}^{T-1} t\, H(a) H(b) + T H(c) \quad (2.12)$$

where $H$ denotes the Heaviside step function, $a = U_l^t - V_l$, $b = V_l - U_l^{t-1}$, and $c = V_l - U_l^T$. Note that $V_l$ here represents a vector of repeated elements of the threshold voltage of the output layer. The derivative $\frac{\partial T_l}{\partial V_l}$ can then be written as

$$\frac{\partial T_l}{\partial V_l} = \sum_{t=1}^{T-1} t\,\big(H(a)\delta(b) - H(b)\delta(a)\big) + T\delta(c) \quad (2.13)$$

where $\delta$ is the Dirac delta function. Since the delta function is zero almost everywhere, it does not allow the gradient of $T_l$ to change and train $V_l$. Hence, we approximate Eq. 2.13 as

$$\sum_{t=1}^{T-1} t\,\big(H(a)(|b|<\beta) - H(b)(|a|<\beta)\big) + T(|c|<\beta) \quad (2.14)$$

where $\beta$ is a vector of size equal to the number of output neurons, consisting of repeated elements of a training hyperparameter that controls the gradient of $T_l$. Note that $(|a|<\beta)$, $(|b|<\beta)$, and $(|c|<\beta)$ are all Boolean vectors of the same size as $\beta$. We then compute the threshold update as

$$V_l = V_l - \eta \Delta V_l, \qquad \Delta V_l = \frac{\partial L}{\partial V_l} = \frac{\partial L}{\partial T_l} \frac{\partial T_l}{\partial V_l} \quad (2.15)$$

Hidden layers: The weight update of the $l$-th hidden layer is calculated from Eqs. (2.3)-(2.5) as

$$\Delta W_l = \sum_t \frac{\partial L}{\partial W_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \frac{\partial z_l^t}{\partial U_l^t} \frac{\partial U_l^t}{\partial W_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \frac{1}{V_l} O_{l-1}^t \quad (2.16)$$

Here $\frac{\partial O_l^t}{\partial z_l^t}$ is the non-differentiable gradient, which can be approximated with the surrogate gradient proposed in [74],

$$\frac{\partial O_l^t}{\partial z_l^t} = \gamma \cdot \max(0, 1 - |z_l^t|) \quad (2.17)$$

where $\gamma$ is a hyperparameter denoting the maximum value of the gradient. The threshold update is then computed as

$$\Delta V_l = \sum_t \frac{\partial L}{\partial V_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \frac{\partial z_l^t}{\partial V_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \cdot \frac{-V_l \cdot (z_l^{t-1}>0) - U_l^t}{V_l^2} \quad (2.18)$$

Given that the threshold is the same for all neurons in a particular layer, it may seem redundant to train both the weights and the threshold together. However, our experimental evaluation detailed in Section 2.6 shows that the number of timesteps required to obtain state-of-the-art classification accuracy decreases with this joint optimization. We hypothesize that this is because the optimizer is able to reach an improved local minimum when both parameters are tunable.
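As a concrete illustration of the hidden-layer surrogate gradient in Eq. 2.17, the sketch below shows one typical way such a piecewise-linear surrogate is wired into autograd; the class and parameter names are illustrative, not taken from the original code.

```python
import torch

class SingleSpikeFire(torch.autograd.Function):
    """Spike generation O = (z > 0) with the surrogate gradient of Eq. 2.17."""

    @staticmethod
    def forward(ctx, z, gamma):
        ctx.save_for_backward(z)
        ctx.gamma = gamma
        return (z > 0).float()          # binary spike output

    @staticmethod
    def backward(ctx, grad_output):
        (z,) = ctx.saved_tensors
        # dO/dz ~= gamma * max(0, 1 - |z|): triangular surrogate centered at z = 0.
        surrogate = ctx.gamma * torch.clamp(1.0 - z.abs(), min=0.0)
        return grad_output * surrogate, None

# Usage: spikes = SingleSpikeFire.apply(u / v_th - 1.0, 0.3)
```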
Finally, the leak update is computed as

$$\lambda_l = \lambda_l - \eta \Delta \lambda_l \quad (2.19)$$

$$\Delta \lambda_l = \sum_t \frac{\partial L}{\partial \lambda_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \frac{\partial z_l^t}{\partial U_l^t} \frac{\partial U_l^t}{\partial \lambda_l} = \sum_t \frac{\partial L}{\partial O_l^t} \frac{\partial O_l^t}{\partial z_l^t} \frac{1}{V_l} U_l^{t-1} \quad (2.20)$$

2.5 Experiments

This section first describes how we evaluate the efficacy of our proposed encoding and training framework and then presents the inference accuracy on the CIFAR-10 and CIFAR-100 datasets with various VGG model variants.

2.5.1 Experimental Setup

ANN Training for Initialization

To train our ANNs, we used the standard data-augmented input set for each model. For ANN training with the various VGG models, we imposed a number of constraints that lead to near-lossless SNN conversion [40]. In particular, our models are trained without the bias term because it complicates the parameter space exploration, which increases conversion difficulty and tends to increase conversion loss. The absence of the bias term implies that Batch Normalization [23] cannot be used as a regularizer during the training process. Instead, we use Dropout [80] as the regularizer for both ANN and SNN training. Also, our pooling operations use average pooling because, for binary spike-based activation layers, max pooling incurs significant information loss. We performed the ANN training for 200 epochs with an initial LR of 0.01 that decays by a factor of 0.1 after 120, 160, and 180 epochs.

Table 2.1: Model performance with single-spike hybrid-encoded SNN training on CIFAR-10 and CIFAR-100 after (a) ANN training, (b) ANN-to-SNN conversion, and (c) SNN training.

Architecture | (a) ANN accuracy (%) | (b) Accuracy (%) after ANN-SNN conversion, T = 200 | (c) Accuracy (%) after proposed SNN training, T = 5
Dataset: CIFAR-10
VGG-6   | 90.22 | 89.98 | 88.89
VGG-11  | 91.02 | 91.77 | 90.66
VGG-16  | 93.24 | 93.16 | 91.41
Dataset: CIFAR-100
VGG-16  | 71.02 | 70.38 | 66.46

ANN-SNN Conversion and SNN Training

Previous works [30,40] set the layer threshold of the first hidden layer by computing the maximum input to a neuron over all its neurons across all T timesteps for a set of input images [40]. The thresholds of the subsequent layers are sequentially computed in a similar manner, taking the maximum across all neurons and timesteps. In our proposed framework, however, the threshold for each layer is computed sequentially as the 99.7th percentile (instead of the maximum) of the neuron input distribution at that layer, which improves the SNN classification accuracy [5]. During threshold computation, the leak in the hidden layers is set to unity and the analog pixel values of an image are applied directly to the input layer [5]. We considered only 512 input images to limit conversion time and used a threshold scaling factor of 0.4 for SNN training and inference, following the recommendations in [30].

Initialized with these layer thresholds and the trained ANN weights, we performed our proposed SNN training with the hybrid input encoding scheme for 150 epochs each on CIFAR-10 and CIFAR-100, jointly optimizing the weights, the membrane leak, and the firing threshold of each layer as described in Section 2.4. We set γ = 0.3 [74] and β = 0.2, and used a starting LR of 10^-4 that decays by a factor of 0.1 every 10 epochs.

2.5.2 Classification Accuracy & Latency

We evaluated the performance of these networks on multiple VGG architectures, namely VGG-6, VGG-11, and VGG-16 for CIFAR-10, and VGG-16 for CIFAR-100. Column 2 of Table 2.1 shows the ANN accuracy; column 3 shows the accuracy after ANN-SNN conversion with 200 timesteps.
Note that we need 200 timesteps to evaluate the thresholds of the SNN for the VGG architectures without any significant loss in accuracy. Column 4 of Table 2.1 shows the accuracy when we perform our proposed training with the hybrid input encoding discussed in Section 2.3. The performance of the SNNs trained via our proposed framework is compared with current state-of-the-art SNNs trained with various encoding and training techniques in Table 2.2. Our proposal requires only 5 timesteps for both SNN training and inference to obtain SOTA test accuracy, representing a 5-300× improvement in inference latency compared to other rate- or temporally-coded spiking networks. Note that the direct encoding in the first time step is crucial for SNN convergence; temporal coding alone leads to a test accuracy of ~10% and ~1% on CIFAR-10 and CIFAR-100, respectively, for all the network architectures.

2.6 Improvement in Energy Efficiency

2.6.1 Reduction in Spiking Activity

To model energy consumption, we assume a generated SNN spike consumes a fixed amount of energy [33]. Based on this assumption, earlier works [30,40] have adopted the average spiking activity (also known as the average spike count) of an SNN layer l, denoted ζ_l, as a measure of the compute energy of the model. In particular, ζ_l is computed as the ratio of the total spike count over T steps across all the neurons of layer l to the total number of neurons in that layer. Thus, the lower the spiking activity, the better the energy efficiency.

Table 2.2: Performance comparison of the proposed single-spike hybrid-encoded SNN with state-of-the-art deep SNNs on CIFAR-10 and CIFAR-100. TTFS denotes time-to-first-spike coding.

Authors | Training type | Input encoding | Architecture | Accuracy (%) | Time steps
Dataset: CIFAR-10
Sengupta et al. (2019) [40] | ANN-SNN conversion | Rate | VGG-16 | 91.55 | 2500
Wu et al. (2019) [37] | Surrogate gradient | Direct | 5 CONV, 2 linear | 90.53 | 12
Rathi et al. (2020) [30] | Conversion + STDB training | Rate | VGG-16 | 91.13 / 92.02 | 100 / 200
Garg et al. (2019) [67] | Conversion + STDB training | DCT | VGG-9 | 89.94 | 48
Kim et al. (2018) [64] | ANN-SNN conversion | Phase | VGG-16 | 91.2 | 1500
Park et al. (2019) [65] | ANN-SNN conversion | Burst | VGG-16 | 91.4 | 1125
Park et al. (2020) [62] | STDB training | TTFS | VGG-16 | 91.4 | 680
Kim et al. (2020) [39] | Surrogate gradient | Rate | VGG-9 | 90.5 | 25
Rathi et al. (2020) [5] | Conversion + STDB training | Direct | VGG-16 | 92.70 / 93.10 | 5 / 10
This work | Conversion + STDB training | Hybrid | VGG-16 | 91.41 | 5
Dataset: CIFAR-100
Lu et al. (2020) [81] | ANN-SNN conversion | Direct | VGG-16 | 63.20 | 62
Garg et al. (2020) [67] | Conversion + STDB training | DCT | VGG-11 | 68.3 | 48
Park et al. (2019) [65] | ANN-SNN conversion | Burst | VGG-16 | 68.77 | 3100
Park et al. (2020) [62] | STDB training | TTFS | VGG-16 | 68.8 | 680
Kim et al. (2020) [39] | Surrogate gradient | Rate | VGG-9 | 66.6 | 50
Rathi et al. (2020) [5] | Conversion + STDB training | Direct | VGG-16 | 69.67 | 5
This work | Conversion + STDB training | Hybrid | VGG-16 | 66.46 | 5

Table 2.3: Convolutional and fully-connected layer FLOPs for ANN and SNN models.

Model | Notation | Convolutional layer l | Fully-connected layer l
ANN | F_ANN^l | (k^l)^2 × H_o^l × W_o^l × C_o^l × C_i^l | f_i^l × f_o^l
SNN | F_SNN^l | (k^l)^2 × H_o^l × W_o^l × C_o^l × C_i^l × ζ^l | f_i^l × f_o^l × ζ^l

Fig. 2.2 shows the average number of spikes in each layer of VGG-16 with our proposed single-spike hybrid encoding and with the direct encoding scheme, evaluated on 1500 samples from the CIFAR-10 test set.
Figure 2.2: Comparison of the average spiking activity per layer for VGG-16 on CIFAR-10 and CIFAR-100 with both direct and hybrid input encoding.

Let this average be denoted by ζ_l, computed by summing all the spikes in a layer over the timesteps and dividing by the number of neurons in that layer. For example, the average spike count of the 11th convolutional layer of the direct-encoded SNN is 0.78, which implies that over a 5-timestep period each neuron in that layer spikes 0.78 times on average over all input samples. As the figure shows, the spiking activity of almost every layer reduces significantly with our proposed encoding technique.

To compare our proposed work with the SOTA SNNs, we perform hybrid training (ANN-SNN conversion followed by STDB) on spiking networks with (a) IF neurons with Poisson rate encoding [30], (b) IF neurons with DCT-based input encoding [67], and (c) LIF neurons with direct encoding [5]. We employ trainable leak and threshold in these SNNs for a fair comparison. We also evaluate the individual impact of the hybrid spatio-temporal encoding with the modified loss function and of the single-spike constraint on the average spike rate and latency under similar accuracy and conditions (trainable threshold and leak). In particular, we train three additional spiking networks: (d) an SNN with LIF neurons and the proposed hybrid encoding, (e) an SNN with LIF neurons and direct encoding with the single-spike constraint over all the layers, and (f) a single-spike hybrid-encoded SNN with LIF neurons. All six networks achieve test accuracies between 90-93% for VGG-16 on CIFAR-10. Fig. 2.3 shows the average spiking rate and the number of timesteps required to obtain the SOTA test accuracy for all these SNNs. Both (d) and (e) result in lower average spiking activity than all the SOTA SNNs, with at most the same number of timesteps. Finally, (f) generates an even lower number of average spikes (2×, 17.2×, and 94.8× fewer compared to direct, DCT, and rate coding, respectively) with the lowest inference latency reported to date for deep SNN architectures [5] and no significant reduction in test accuracy. The improvement stems from both the hybrid input encoding, which reduces spiking activity in the first few layers, and our single-spike constraint, which reduces the average spike rate throughout the network, particularly in the later layers. Because the neurons in the earlier layers cannot fire multiple times and we need only 5 timesteps for classification, it becomes increasingly difficult for the membrane potential of the convolutional layers deep in the network to rise enough to emit a spike.

Figure 2.3: Effect of Poisson rate encoding, DCT encoding, direct encoding, and single-spike hybrid input encoding on the average spike rate and latency for the VGG-16 architecture on the CIFAR-10 dataset.

2.6.2 Reduction in FLOPs and Compute Energy

Let us assume a convolutional layer l with weight tensor $W_l \in \mathbb{R}^{k^l \times k^l \times C_i^l \times C_o^l}$ that operates on an input activation tensor $I_l \in \mathbb{R}^{H_i^l \times W_i^l \times C_i^l}$, where $H_i^l$, $W_i^l$, $C_i^l$, and $C_o^l$ are the input tensor height, width, number of channels, and number of filters, respectively, and $k^l$ represents both the filter height and width. We now quantify the energy consumed to produce the corresponding output activation tensor $O_l \in \mathbb{R}^{H_o^l \times W_o^l \times C_o^l}$ for an ANN and an SNN, respectively. Our model extends to fully-connected layers with $f_i^l$ and $f_o^l$ as the number of input and output features, respectively.
In particular, for an ANN, the total number of FLOPs for layer l, denoted $F_{ANN}^l$, is shown in row 1 of Table 2.3. The formula is easily adjusted for an SNN, in which the number of FLOPs at layer l is a function of the average spiking activity at that layer (ζ_l), denoted $F_{SNN}^l$ in Table 2.3. Thus, as the activation output gets sparser, the compute energy decreases.

For ANNs, FLOPs consist primarily of the multiply-accumulate (MAC) operations of the convolutional and linear layers. For SNNs, in contrast, the FLOPs in all but the first and last layers are limited to accumulates (ACs), because the spikes are binary and thus simply indicate which weights need to be accumulated at the post-synaptic neurons. For the first layer, we need MAC units because we consume an analog input¹ (at timestep one). Hence, the compute energy of an ANN ($E_{ANN}$) and of an iso-architecture SNN model ($E_{SNN}$) can be written as

$$E_{ANN} = \left(\sum_{l=1}^{L} F_{ANN}^{l}\right) \cdot E_{MAC} \quad (2.21)$$

$$E_{SNN} = F_{ANN}^{1} \cdot E_{MAC} + \left(\sum_{l=2}^{L} F_{SNN}^{l}\right) \cdot E_{AC} \quad (2.22)$$

where L is the total number of layers, and $E_{MAC}$ and $E_{AC}$ are the energy consumption of a MAC and an AC operation, respectively. As shown in Table 2.4, $E_{AC}$ is ~32× lower than $E_{MAC}$ [1] in a 45 nm CMOS technology. This number may vary across technologies, but in most technologies an AC operation is significantly cheaper than a MAC operation.

Table 2.4: Estimated energy costs for MAC and AC operations in a 45 nm CMOS process at 0.9 V [1].

Serial No. | Operation | Energy (pJ)
1 | 32-bit int multiplication | 3.1
2 | 32-bit int addition | 0.1
3 | 32-bit MAC | 3.2 (#1 + #2)
4 | 32-bit AC | 0.1 (#2)

Figure 2.4: Comparison of normalized compute cost on CIFAR-10 and CIFAR-100 for VGG-16 ANN and SNN models with direct and hybrid input encoding.

Fig. 2.4 illustrates the energy consumption and FLOPs of the ANN and SNN models of VGG-16 while classifying the CIFAR datasets, where the energy is normalized to that of an equivalent ANN. The number of FLOPs for the SNNs obtained by our proposed training framework is smaller than that of an ANN with a similar number of parameters. Moreover, because ACs consume significantly less energy than MACs (Table 2.4), the SNNs are significantly more energy efficient. In particular, for CIFAR-10 our proposed SNN consumes ~70× less compute energy than a comparable iso-architecture ANN with similar parameters and ~1.2× less compute energy than a comparable SNN with the direct encoding technique and trainable threshold/leak parameters [5]. For CIFAR-100, with hybrid encoding and our single-spike constraint, the energy efficiency can reach up to ~125× and ~1.8×, respectively, compared to ANN and direct-coded SNN models [5] with similar parameters and architecture. Note that we did not consider the memory access energy in our evaluation because it is dependent on the underlying system architecture. Although SNNs incur significant data movement because the membrane potentials need to be fetched at every timestep, there have been many proposals to reduce the memory cost through data buffering [82], computing in non-volatile crossbar memory arrays [83], and data reuse with energy-efficient dataflows [84]. All these techniques can be applied to the SNNs obtained by our proposed training framework to address the memory cost.

¹ For the hybrid-coded input we need to perform MACs at the first layer at t = 1 and AC operations during the remaining timesteps at that layer. For the direct-coded input, a MAC during the 1st timestep alone is sufficient, as neither the inputs nor the weights change during the remaining timesteps (i.e., 2 ≤ t ≤ 5).
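A small sketch of this energy model (Eqs. 2.21-2.22) using the per-operation costs of Table 2.4 follows; the layer FLOP counts and spike rates below are placeholders, not measured values.

```python
E_MAC, E_AC = 3.2e-12, 0.1e-12   # 45 nm, 0.9 V estimates from Table 2.4 (J)

def ann_energy(flops_per_layer):
    # Eq. 2.21: every ANN layer performs MACs.
    return sum(flops_per_layer) * E_MAC

def snn_energy(flops_per_layer, spike_rates):
    # Eq. 2.22: MACs only in the first (analog-input) layer,
    # spike-gated ACs everywhere else.
    first = flops_per_layer[0] * E_MAC
    rest = sum(f * z for f, z in zip(flops_per_layer[1:], spike_rates[1:]))
    return first + rest * E_AC

# Placeholder example: 3 layers with equal FLOPs and decreasing spike rates.
layers = [1e8, 1e8, 1e8]
rates = [1.0, 0.4, 0.2]
print(ann_energy(layers) / snn_energy(layers, rates))   # energy improvement factor
```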
2.7 Conclusions

SNNs that operate with discrete spiking events can potentially unlock the energy wall in deep learning for edge applications. Towards this end, we presented a training framework that leads to low-latency, energy-efficient spiking networks with high activation sparsity. We initialize the parameters of our proposed SNN from a trained ANN to speed up the training with spike-based backpropagation. The image pixels are applied directly as input to the network during the first timestep, while they are converted to a sparse spike train with firing times proportional to the pixel intensities in the subsequent timesteps. We also employ a modified version of the LIF model for the hidden and output layers of the SNN, in which each neuron fires at most once per image. Both of these lead to high activation sparsity in the input, convolutional, and dense layers of the network. Moreover, we employ a hybrid cross-entropy loss function to account for the spatio-temporal encoding in the input layer and train the network weights, firing thresholds, and membrane leaks via spike-based backpropagation to optimize both accuracy and latency. The high sparsity combined with the low inference latency reduces the compute energy by ~70-130× and ~1.2-1.8× compared to an equivalent ANN and a direct-encoded SNN, respectively, with similar accuracy. SNNs obtained by our proposed framework achieve accuracy similar to other state-of-the-art rate- or temporally-coded SNN models with 5-300× fewer timesteps.

Chapter 3
Optimal ANN-to-SNN Conversion to Reduce SNN Latency

This chapter first provides the introduction and motivation behind the conversion from deep neural networks to ultra-low-latency spiking neural networks in Section 3.1. Related work is reviewed in Section 3.2. Section 3.3 explains why these works fail at ultra-low SNN latencies and discusses our proposed methodology. Our accuracy and latency results are presented in Section 3.4, and our analysis of training resources and inference energy efficiency is presented in Sections 3.5 and 3.6, respectively. The chapter concludes in Section 3.7.

3.1 Introduction and Motivation

It has been shown that SNNs can be converted from DNNs with low error by approximating the activation value of ReLU neurons with the firing rate of spiking neurons [40]. SNNs trained using DNN-to-SNN conversion, coupled with supervised training, have been able to perform similarly to SOTA DNNs in terms of test accuracy on traditional image recognition tasks [5,30]. However, the training effort still remains high because SNNs need multiple time steps (at least 5 with direct encoding [5]) to process an input, and hence the backpropagation step requires the gradients of the unrolled SNN to be integrated over all these time steps, which significantly increases the memory cost [41]. Moreover, the multiple forward passes result in an increased number of spikes, which degrades the SNN's energy efficiency during both training and inference and possibly offsets the compute advantage of the ACs. This motivates our exploration of novel training algorithms that reduce both the test error of a DNN and the conversion error to an SNN while keeping the number of time steps extremely small during both training and inference. Thus, the current challenges in SNNs are multiple time steps, large spiking activity, and high training effort, in terms of both compute and memory. To address these challenges, this chapter makes the following contributions.
• We analytically and empirically show that the primary source of error in current DNN-to-SNN conversion strategies [4,85] is the incorrect and simplistic model of the distributions of DNN and SNN activations.

• We propose a novel DNN-to-SNN conversion and fine-tuning algorithm that reduces the conversion error for ultra-low latencies by accurately capturing these distributions and thus minimizing the difference between the SNN and DNN activation functions.

• We demonstrate the latency-accuracy trade-off benefits of our proposed framework through extensive experiments with both VGG [7] and ResNet [8] variants of deep SNN models on CIFAR-10 and CIFAR-100 [86]. We benchmark and compare the models' training time, memory requirements, and inference energy efficiency on both GPU and neuromorphic hardware against two SOTA low-latency SNNs.¹

¹ We use VGG-16 on CIFAR-10 and CIFAR-100 to show compute efficiency.

3.2 Preliminaries and Related Work

3.2.1 Threshold ReLU Activation in DNNs

Neurons in a non-spiking DNN integrate weight-modulated analog inputs and apply a non-linear activation function. Although ReLU is widely used as the activation function, previous work [87] has proposed a trainable threshold term, µ, for similarity with SNNs. In particular, the neuron outputs with threshold ReLU can be expressed as

$$Y_i = \mathrm{clip}\!\left(\sum_j W_{ij} X_j,\; 0,\; \mu\right) \quad (3.1)$$

where clip(x, 0, µ) = 0 if x < 0; x if 0 ≤ x ≤ µ; and µ if x ≥ µ, and $X_j$ and $W_{ij}$ denote the outputs of the neurons in the preceding layer and the weights connecting the two layers. The gradients of µ are estimated using gradient descent during the backward computations of the DNN.

3.2.2 DNN-to-SNN Conversion

Previous research has demonstrated that SNNs can be converted from DNNs with negligible accuracy drop by approximating the activation value of ReLU neurons with the firing rate of IF neurons, using a threshold-balancing technique that copies the weights from the source DNN to the target SNN [33,40,70,71]. Since this technique uses the standard backpropagation algorithm for DNN training, and thus involves only a single forward pass per input, the training procedure is simpler than the approximate-gradient techniques used to train SNNs from scratch. However, the key disadvantage of DNN-to-SNN conversion is that it yields SNNs with much higher latency compared to other techniques. Some previous research [85,88] proposed to down-scale the threshold term to train low-latency SNNs, but the scaling factor was either a hyperparameter or obtained via a linear grid search, and the latency needed for convergence still remained large (>64).

To further reduce the conversion error, [4] minimized the difference between the DNN and SNN post-activation values for each layer. To do this, the activation function of the IF SNN must first be derived [4,85]. We assume that the initial membrane potential of a layer l, $U_l(0)$, is 0. Moreover, we let $\bar{S}_l$ be the average SNN output of layer l, i.e., $\bar{S}_l = \frac{1}{T}\sum_{i=1}^{T} S_l(i)$, where $S_l(i)$ is the discrete output at the i-th time step and T is the total number of time steps. Then

$$\bar{S}_l = \frac{V_{th}}{T}\, \mathrm{clip}\!\left(\frac{T}{V_{th}} W_l \bar{S}_{l-1},\; 0,\; T\right) \quad (3.2)$$

where $V_{th}$ and $W_l$ denote the layer threshold and weight matrix, respectively. Eq. 3.2 is illustrated in Fig. 3.1(a) by the piecewise staircase SNN activation function. Reference [4] also proved that the average difference in the post-activation values can be reduced by adding a bias term δ that shifts the SNN activation curve to the left by δ = V_th/(2T), as shown in Fig. 3.1(a), assuming both the DNN (d) and SNN (s) pre-activation values are uniformly and identically distributed.
To further reduce the difference, [4] added a non-trainable threshold equal to the maximum DNN pre-activation value ($d_{max}$) to the ReLU activation function in each layer and equated it with the SNN spiking threshold, which ensures zero difference between the DNN and SNN post-activation values when the DNN pre-activation values exceed $d_{max}$. However, $d_{max}$ is an outlier, and >99% of the pre-activation values lie in $[0, d_{max}/3]$. Hence, we propose to use the ReLU activation with a trainable threshold for each layer (denoted µ, where µ < $d_{max}$ for all layers), as discussed in Section 3.2.1 and shown in Fig. 3.1(a). This trainable threshold, as described below, also helps reduce the average difference for non-uniform DNN pre-activation distributions.

3.3 Proposed Training Framework

In this section, we analytically and empirically show that the SOTA conversion strategies, along with our proposed modification described above, fail to obtain SOTA SNN test accuracy for small numbers of time steps. We then propose a novel conversion algorithm that scales the SNN threshold and post-activation values to reduce the conversion error for small T.

Figure 3.1: (a) Comparison between the DNN (threshold ReLU) and SNN (both original and bias-added) activation functions, the distribution of DNN and SNN (T = 2) pre-activation values, and the variation of h(T,µ) (see Eq. 3.4) with T (≤ 5) for the 2nd layer of the VGG-16 architecture on CIFAR-10. (b) Proposed scaling of the threshold and output of the SNN post-activation values.

3.3.1 Why Does Conversion Fail for Ultra-Low Latencies?

Even though we can minimize the difference between the DNN and SNN post-activation values with bias addition and thresholding, in practice the SNNs obtained are still not as accurate as their iso-architecture DNN counterparts when T decreases substantially. We empirically show this trend for VGG and ResNet architectures on the CIFAR-10 dataset in Fig. 3.2. This is due to the flawed baseline assumption that the DNN and SNN pre-activations are uniformly distributed. Both distributions are rather skewed (i.e., most of the values are close to 0), as illustrated in Fig. 3.1(a).

To see this analytically, let the DNN and SNN pre-activation probability density functions be $f_D(d)$ and $f_S(s)$, and denote the post-activation values by $d'$ and $s'$, respectively. Assuming $V_{th} = \mu$, derived from DNN training, the expected difference in the post-activation values, $\Delta = E(d') - E(s')$, for a particular layer and T can be written as

$$\Delta \approx \int_0^{\mu} \big(d' f_D(d)\,\partial d - s' f_S(s)\,\partial s\big) = \int_0^{\mu} \big(d f_D(d)\,\partial d - s' f_S(s)\,\partial s\big) = K(\mu)\mu - \left(\sum_{i=1}^{T-1} i\left(\frac{\mu}{T}\right) g_i(T,\mu)\right) - \mu \int_{T'}^{\mu} f_S(s)\,\partial s \quad (3.3)$$

where the first approximation follows because more than 99.9% of both d and s are less than µ, and the subsequent equality holds because d' = d when d ≤ µ. The last equality is based on introducing $g_i(T,\mu) = \int_{(i-\frac{1}{2})\frac{\mu}{T}}^{(i+\frac{1}{2})\frac{\mu}{T}} f_S(s)\,\partial s$, which captures the bias shift of µ/2T, with $T' = \frac{(T-\frac{1}{2})\mu}{T}$, and on the observation that the term $\int_0^{\mu} d f_D(d)\,\partial d$ lies between its upper and lower integral limits and can thus be rewritten as K(µ)µ, where K(µ) lies in the range [0,1]. The exact value of K(µ) depends on the distribution $f_D(d)$. Defining $h(T,\mu) = \sum_{i=1}^{T-1} \left(\frac{i}{T}\right) g_i(T,\mu) + \int_{T'}^{\mu} f_S(s)\,\partial s$, Eq. 3.3 can then be written as

$$\Delta \approx \mu\,\big(K(\mu) - h(T,\mu)\big) \quad (3.4)$$
When $f_D(d)$ and $f_S(s)$ are uniformly distributed in the range [0,µ], they must both equal 1/µ. This implies that $\int_0^{\mu} d f_D(d)\,\partial d = \frac{\mu}{2}$ and, consequently, K(µ) = 1/2. Moreover, $g_i(T,\mu) = \frac{1}{T}$ for all i ∈ [1, T−1], and hence the first term of h(T,µ), $\sum_{i=1}^{T-1}(\frac{i}{T}) g_i(T,\mu)$, equals $\frac{T-1}{2T}$, whereas the second term, $\int_{T'}^{\mu} f_S(s)\,\partial s$, equals $\frac{1}{2T}$. Hence, similar to K(µ), h(T,µ) = 1/2, and Eq. 3.4 evaluates to 0, which implies the error can be completely eliminated, as also concluded in [4].

However, when the distributions are skewed, we observe that while K(µ) is independent of T, h(T,µ) decreases significantly as we reduce T below around 5, as shown in the inset of Fig. 3.1(a). Intuitively, for small T, most of the probability density of s lies to the left of the first staircase step starting at s = µ/2T, due to its sharply decreasing nature. Consequently, the remaining area under the curve captured in h(T,µ) becomes negligible, reducing the number of output spikes significantly. Hence, for ultra-low SNN latencies, the error ∆ per layer remains significant and accumulates over the network.

Figure 3.2: Effect of the number of SNN time steps on the test accuracy of VGG and ResNet architectures on CIFAR-10 with DNN-to-SNN conversion based on both the threshold ReLU and the maximum pre-activation value used in [4].

This analysis explains the accuracy gap observed between the original DNNs and their converted SOTA SNNs for T ≤ 5, as exemplified in Fig. 3.2. Moreover, training with a non-trainable threshold [4] can be modeled by replacing µ with $d_{max} \ge \mu$ in Eq. 3.4. This further increases ∆, as observed from the increased accuracy degradation shown in Fig. 3.2.

3.3.2 Conversion & Fine-tuning for Ultra-Low-Latency SNNs

While Eq. 3.4 suggests that we can tune µ to compensate for low T, this introduces other errors. In particular, if we replace µ with a down-scaled version² αµ, with α ∈ (0,1), the SNN activation curve shifts left, as shown in Fig. 3.1(b), and there is an additional difference between d' and s' that stems from the values of d and s in the range (αµ, µ):

$$\Delta_{\alpha} \approx \alpha\mu\big(K(\alpha\mu) - h(T,\mu)\big) + \int_{\alpha\mu}^{\mu} \big(d' f_D(d)\,\partial d - s' f_S(s)\,\partial s\big) = \alpha\mu\big(K(\alpha\mu) - h(T,\mu)\big) + \int_{\alpha\mu}^{\mu} d f_D(d)\,\partial d - \alpha\mu \int_{\alpha\mu}^{\mu} f_S(s)\,\partial s$$

To mitigate this additional error term, we propose to also optimize the step size of the SNN activation function in the y-direction by modifying the IF model so that

$$S_i^{\beta}(t) = \begin{cases} \beta V_{th}, & \text{if } U_i^{temp}(t) > V_{th} \\ 0, & \text{otherwise} \end{cases} \quad (3.5)$$

which introduces another scaling factor β, illustrated in Fig. 3.1(b). Moreover, we remove the bias term, since it complicates the parameter space exploration and poses difficulty in training the SNNs after conversion, changing h(T,µ) to $h'(T,\mu) = \sum_{i=1}^{T-1} \left(\frac{i}{T}\right) g_{i-1/2}(T)$. This results in a new difference function

$$\Delta_{\alpha\beta} \approx \alpha\mu\big(K(\alpha\mu) - \beta h'(T,\mu)\big) + \int_{\alpha\mu}^{\mu} d f_D(d)\,\partial d - \alpha\beta\mu \int_{\alpha\mu}^{\mu} f_S(s)\,\partial s$$

² Up-scaling µ further reduces the output spike count and increases the error.

Thus, our task reduces to finding the α and β that minimize $\Delta_{\alpha\beta}$ for a given low T. Since it is difficult to compute $\Delta_{\alpha\beta}$ analytically to guide SNN conversion, we estimate it empirically by discretizing d into percentiles P[j], ∀j ∈ {0,1,...,M}, where M is the largest integer satisfying P[M] ≤ µ, using the activations of a particular layer of the trained DNN. In particular, for each α = P[j]/µ, we vary β between 0 and 2 with a step size of 0.01, as shown in Algorithm 1. This percentile-based approach for α is better than a linear search because it enables a finer-grained analysis in the range of d with higher likelihood.
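As a concrete reference, here is a minimal Python sketch of the percentile-based (α, β) search described above and formalized in Algorithm 1 (Section 3.4.1); the function names are ours, and restricting the α candidates to percentiles below µ while evaluating the loss over all percentiles is our reading of the procedure.

```python
import numpy as np

def compute_loss(P, mu, alpha, beta, T):
    """Empirical estimate of the DNN-SNN post-activation difference."""
    loss = 0.0
    for p in P:
        for j in range(T):                      # Seg-I: staircase region below alpha*mu
            if j * alpha * mu / T <= p <= (j + 1) * alpha * mu / T:
                loss += p - j * alpha * beta * mu / T
        if alpha * mu < p <= mu:                # Seg-II: SNN saturated, DNN still linear
            loss += p - alpha * beta * mu
        if p > mu:                              # Seg-III: both saturated
            loss += mu * (1.0 - alpha * beta)
    return loss

def find_scaling_factors(activations, mu, T):
    P_all = [float(np.percentile(activations, i)) for i in range(101)]
    alpha_candidates = [p for p in P_all if 0.0 < p <= mu]   # alpha = p/mu in (0, 1]
    alpha_f, beta_f = 1.0, 1.0
    preloss = compute_loss(P_all, mu, alpha_f, beta_f, T)
    for p in alpha_candidates:
        for beta in np.arange(0.0, 2.0, 0.01):
            loss = compute_loss(P_all, mu, p / mu, beta, T)
            if abs(loss) < abs(preloss):
                alpha_f, beta_f, preloss = p / mu, float(beta), loss
    return alpha_f, beta_f
```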
We find the (α, β) pair that yields the lowest $\Delta_{\alpha\beta}$ for each DNN layer.

For DNN-to-SNN conversion, we copy the SNN weights from a pretrained DNN with trainable threshold µ, set each layer threshold to αµ, and produce an output βV_th whenever the membrane potential crosses the threshold. Although we incur an overhead of two additional parameters per SNN layer, the parameter increase is negligible compared to the total number of weights. Moreover, as the outputs at each time step are either 0 or βV_th, we can absorb the scaling factor into the weight values, avoiding the need for explicit multiplication. After conversion, we apply SGL in the SNN domain, where we jointly fine-tune the threshold, leak, and weights [5]. To approximate the gradient of the ReLU, we compute the surrogate gradient as ∂s'/∂s ≈ 1 if 0 ≤ s ≤ 2αµ, and 0 otherwise, which is used to estimate the gradients of the trainable parameters [5].

3.4 Experimental Results

3.4.1 Experimental Setup

Since we omit the bias term during the DNN-to-SNN conversion described in Section 3.3.2, we avoid Batch Normalization and instead use Dropout as the regularizer for both ANN and SNN training. Although prior works [5,30,40] claim that max pooling incurs information loss for binary-spike-based activation layers, we use max pooling because it improves the accuracy of both the baseline DNN and the converted SNN. Moreover, max pooling layers produce binary spikes at the output, which ensures that the SNN requires only AC operations for all the hidden layers [89], thereby improving energy efficiency.

Algorithm 1: Detailed algorithm for finding the layer-wise scaling factors for the SNN threshold and post-activations
Input: activations A, total time steps T, ReLU threshold µ
Data: percentiles P[i] = i-th percentile of A for i = 0, 1, ..., M, where M is the largest integer satisfying P[M] ≤ µ; initial scaling factors α_i = 1 and β_i = 1
Output: final scaling factors α_f and β_f

Function ComputeLoss(P, µ, α, β, T):
    loss ← 0
    foreach p ∈ P do
        for j ← 0 to (T − 1) do
            if jαµ/T ≤ p ≤ (j+1)αµ/T then
                loss ← loss + (p − jαβµ/T)        # Seg-I in Fig. 3.1(b)
        if αµ < p ≤ µ then
            loss ← loss + (p − αβµ)               # Seg-II in Fig. 3.1(b)
        if p > µ then
            loss ← loss + µ(1 − αβ)               # Seg-III in Fig. 3.1(b)
    return loss

Function FindScalingFactors(P, µ, T):
    preloss ← ComputeLoss(P, µ, α_i, β_i, T)
    foreach p ∈ P do
        for j ← 0 to 2 (step size of 0.01) do
            loss ← ComputeLoss(P, µ, p/µ, j, T)
            if |loss| < |preloss| then
                α_f ← p/µ, β_f ← j, preloss ← loss
    return α_f, β_f

Table 3.1: Model performance with the proposed training framework after (a) DNN training, (b) DNN-to-SNN conversion, and (c) SNN training.

Architecture | Number of time steps | (a) DNN accuracy (%) | (b) Accuracy (%) with DNN-to-SNN conversion | (c) Accuracy (%) after SNN training
Dataset: CIFAR-10
VGG-11 | 2 | 90.76 | 65.82 | 89.39
VGG-11 | 3 | 91.10 | 78.76 | 89.79
VGG-16 | 2 | 93.26 | 69.58 | 91.79
VGG-16 | 3 | 93.26 | 85.06 | 91.93
ResNet-20 | 2 | 93.07 | 61.96 | 90.00
ResNet-20 | 3 | 93.07 | 73.57 | 90.06
Dataset: CIFAR-100
VGG-16 | 2 | 68.45 | 19.57 | 64.19
VGG-16 | 3 | 68.45 | 36.84 | 63.92
ResNet-20 | 2 | 63.88 | 19.85 | 57.81
ResNet-20 | 3 | 63.88 | 31.43 | 59.29

We performed the baseline DNN training for 300 epochs with an initial learning rate (LR) of 0.01 that decays by a factor of 0.1 at 60%, 80%, and 90% of the total number of epochs. Initialized with the layer thresholds and post-activation values, we performed the SNN training with direct input encoding for 200 epochs for CIFAR-10 and 300 epochs for CIFAR-100. We used a starting LR of 0.0001 that decays similarly to that in DNN training. All experiments are performed on an Nvidia 2080 Ti GPU with 11 GB of memory.
3.4.2 Classification Accuracy & Latency

We evaluated the performance of these networks on multiple VGG and ResNet architectures, namely VGG-11, VGG-16, and ResNet-20 for CIFAR-10, and VGG-16 and ResNet-20 for CIFAR-100. We report in Table 3.1 (a) the baseline DNN accuracy, (b) the SNN accuracy with our proposed DNN-to-SNN conversion, and (c) the SNN accuracy with conversion followed by SGL, for 2 and 3 time steps. Note that the models reported in (b) are far from SOTA but act as a good initialization for SGL.

Table 3.2 provides a comparison of the models generated through our training framework with SOTA deep SNNs. On CIFAR-10, our approach outperforms the SOTA VGG-based SNN [5] with 2.5× fewer time steps and a negligible drop in test accuracy. To the best of our knowledge, our results represent the first successful training and inference of CIFAR-100 on an SNN with only 2 time steps, yielding a 2.5-8× reduction in latency compared to others.

Ablation study: The threshold-scaling heuristics proposed in [85,88], coupled with SGL, lead to a statistical test accuracy of ~10% and ~1% on CIFAR-10 and CIFAR-100, respectively, with both 2 and 3 time steps. Also, our scaling technique alone (without SGL) requires 12 time steps, while the SOTA conversion approach [4] needs 16 steps to obtain similar test accuracy.

3.5 Simulation Time & Memory Requirements

Because SNNs require iteration over multiple time steps and storage of the membrane potentials of each neuron, their simulation time and memory requirements can be substantially higher than those of their DNN counterparts. However, reducing their latency can bridge this gap significantly, as shown in Figure 3.3. On average, our low-latency, 2-time-step SNNs represent a 2.38× and 2.33× reduction in training and inference time per epoch, respectively, compared to the hybrid training approach [5], which represents the SOTA in latency, under iso-batch conditions. Also, our proposal uses 1.44× lower GPU memory than [5] during training, while the inference memory usage remains almost identical.

Table 3.2: Performance comparison of the proposed training framework with state-of-the-art deep SNNs on CIFAR-10 and CIFAR-100.

Authors | Training type | Architecture | Accuracy (%) | Time steps
Dataset: CIFAR-10
Wu et al. (2019) [37] | Surrogate gradient | 5 CONV, 2 linear | 90.53 | 12
Rathi et al. (2020) [5] | Hybrid training | VGG-16 | 92.70 | 5
Kundu et al. (2021) [90] | Hybrid training | VGG-16 | 92.74 | 10
Deng et al. (2021) [4] | DNN-to-SNN conversion | VGG-16 | 92.29 | 16
This work | Hybrid training | VGG-16 | 91.79 | 2
Dataset: CIFAR-100
Kundu et al. (2021) [90] | Hybrid training | VGG-16 | 65.34 | 10
Deng et al. (2021) [4] | DNN-to-SNN conversion | VGG-16 | 65.94 | 16
This work | Hybrid training | VGG-16 | 64.19 | 2

3.6 Energy Consumption During Inference

3.6.1 Spiking Activity

As suggested in [44,69], the average spiking activity of an SNN layer l can be used as a measure of the compute energy of the model during inference. This is computed as the ratio of the total number of spikes over T steps across all the neurons of layer l to the total number of neurons in that layer. Fig. 3.4(a) shows the per-image average number of spikes in each layer with our proposed algorithm (using both 2 and 3 time steps), the hybrid training algorithm of [5] (with 5 steps), and the SOTA conversion algorithm [4], which requires 16 time steps, while classifying CIFAR-10 and CIFAR-100 using VGG-16. On average, our approach yields a 1.53× and 4.22× reduction in spike count compared to [5] and [4], respectively.
3.6.2 Floating-Point Operations (FLOPs) & Compute Energy

We use the FLOP count to capture the energy efficiency of our SNNs, since each emitted spike indicates which weights need to be accumulated at the post-synaptic neurons and results in a fixed number of AC operations. This, coupled with the MAC operations required for direct encoding in the first layer (also used in [4,5]), dominates the total number of FLOPs. For DNNs, FLOPs are dominated by the MAC operations in all the convolutional and linear layers. Assuming $E_{MAC}$ and $E_{AC}$ denote the MAC and AC energy, respectively, the inference compute energy of the baseline DNN model can be computed as $\sum_{l=1}^{L} FL_D^l \cdot E_{MAC}$, whereas that of the SNN model is $FL_S^1 \cdot E_{MAC} + \sum_{l=2}^{L} FL_S^l \cdot E_{AC}$, where $FL_D^l$ and $FL_S^l$ are the FLOP counts in the l-th layer of the DNN and SNN, respectively.

Figure 3.3: Comparison between our proposed hybrid training technique for 2 and 3 time steps and the baseline direct-encoded training for 5 time steps [5], based on (a) simulation time per epoch and (b) memory consumption, for the VGG-16 architecture on the CIFAR-10 and CIFAR-100 datasets.

Figure 3.4: Comparison between our proposed hybrid training technique for 2 and 3 time steps, the baseline direct-encoded training for 5 time steps [5], and the optimal DNN-to-SNN conversion technique [4] for 16 time steps, based on (a) average spike count, (b) total number of FLOPs, and (c) compute energy, for the VGG-16 architecture on the CIFAR-10 and CIFAR-100 datasets. An iso-architecture DNN is also included for comparison of FLOP count and compute energy.

Fig. 3.4(b) and (c) illustrate the FLOP counts and compute energy consumption of our baseline DNN and SNN models of VGG-16 while classifying the CIFAR datasets, along with the SOTA comparisons [4,5]. As we can see, the number of FLOPs for our low-latency SNN is smaller than that of an iso-architecture DNN and of the SNNs obtained from the prior works. Moreover, ACs consume significantly less energy than MACs on GPUs as well as on neuromorphic hardware. To estimate the compute energy, we assume a 45 nm CMOS process at 0.9 V, where $E_{AC}$ = 0.1 pJ and $E_{MAC}$ = 3.2 pJ (3.1 for the multiplication and 0.1 for the addition) [1] for 32-bit integer representation. Then, for CIFAR-10, our proposed SNN consumes 103.5× lower compute energy than its DNN counterpart and 1.27× and 5.18× lower energy than [5] and [4], respectively. For CIFAR-100, the improvements are 159.2× over the baseline DNN, 1.52× over the 5-step hybrid SNN, and 4.72× over the 16-step optimally converted SNN.

On custom neuromorphic architectures, such as TrueNorth [91] and SpiNNaker [92], the total energy is estimated as FLOPs · E_compute + T · E_static [62], where the parameters (E_compute, E_static) can be normalized to (0.4, 0.6) and (0.64, 0.36) for TrueNorth and SpiNNaker, respectively [62]. Since the total FLOP count for VGG-16 (>10^9) is several orders of magnitude higher than the SOTA T, the total energy of a deep SNN on neuromorphic hardware is compute bound, and thus we would see similar energy improvements on such hardware.

3.7 Conclusions

This chapter shows that current DNN-to-SNN conversion algorithms cannot achieve ultra-low latencies because they rely on simplistic assumptions about the DNN and SNN pre-activation distributions. The chapter then proposes a novel training algorithm, inspired by the empirically observed distributions, that can more effectively optimize the SNN thresholds and post-activation values.
This approach enables training SNNs with as few as 2 time steps without any significant degradation in accuracy for complex image recognition tasks. The resulting SNNs are estimated to consume 159.2× lower energy than iso-architecture DNNs.

Chapter 4
SNNs for 3D Image Recognition

This chapter first provides the introduction and motivation behind SNNs for compute-heavy 3D image recognition applications in Section 4.1. Sections 4.2 and 4.3 discuss our proposed quantization-aware SNN training method and a PIM architecture that improves the energy efficiency of our proposed SNN models during inference. Section 4.4 focuses on our proposed network architectures, benchmark datasets, and training details. We present detailed experimental results and analysis in Section 4.5. Finally, the chapter concludes in Section 4.6.

4.1 Introduction and Motivation

3D image classification is an important problem, with applications ranging from autonomous drones to augmented reality. 3D content creation has been gaining momentum in the recent past, and the amount of publicly available 3D input data is steadily increasing. In particular, hyperspectral imaging (HSI), which extracts rich spatial-spectral information about the ground surface, has shown immense promise in remote sensing [93] and has thus become an important application for 3D image recognition. HSI is currently used in workloads ranging from geological surveys [94] to the detection of camouflaged vehicles [95]. In hyperspectral images (HSIs), each pixel can be modeled as a high-dimensional vector in which each entry corresponds to the spectral reflectivity at a particular wavelength [93] and constitutes the 3rd dimension of the image. The goal of the classification task is to assign a unique semantic label to each pixel [96]. For HSI classification, several spectral feature-based methods have been proposed, including support vector machines [97], random forests [98], canonical correlation forests [99], and multinomial logistic regression [100]. However, these spectral-spatial feature extraction methods rely on hand-designed descriptors, prior information, and empirical hyperparameters [93].

Lately, convolutional neural networks (CNNs), consisting of a series of hierarchical filtering layers for global optimization, have yielded higher accuracy than hand-designed features [12] and have shown promise in multiple applications, including image classification [8], object detection [101], semantic segmentation [102], and depth estimation [103]. The 2D CNN stacked autoencoder [93] was the first attempt to extract deep features from a compressed latent space to classify HSIs. To extract the spatial-spectral features jointly from the raw HSI, researchers proposed a 3D CNN architecture [104], which achieved SOTA classification results. The authors of [105-107] successfully created multiscale spatio-spectral relationships using 3D CNNs and fused the features using a 2D CNN to extract a more robust representation of the spectral-spatial information. However, compared to the 2D CNNs used to classify traditional RGB images, multi-layer 3D CNNs incur significantly higher power and energy costs [108]. A typical hyperspectral image cube consists of several hundred spectral frequency bands that, for target tracking and identification, require real-time on-device processing [109]. This desire for HSI sensors operating on energy-limited devices motivates exploring alternative lightweight classification models.

In particular, as discussed in Chapter 2, low-latency spiking neural networks (SNNs) [110] have gained attention because they are more computationally efficient than CNNs for a variety of applications, including image analysis. Besides the compute efficiency, low-latency SNNs trained using ANN-SNN conversion, coupled with supervised training, have been able to perform on par with ANNs in terms of classification accuracy on traditional image classification tasks [42,76]. Hence, SNN-based models are particularly useful in 3D convolutional architectures, which have higher arithmetic intensity (the ratio of floating-point operations to accessed bytes) than 2D CNNs, as elaborated below.

Let us evaluate the compute and memory access cost of a 3D CNN layer l with $X_l \in \mathbb{R}^{H_i^l \times W_i^l \times C_i^l \times D_i^l}$ as the input activation tensor and $W_l \in \mathbb{R}^{k_x^l \times k_y^l \times k_z^l \times C_i^l \times C_o^l}$ as the weight tensor.
In particular, as discussed in Chapters 2, low-latency spiking neural networks (SNNs) [110] have gained attention because they are more computational efficient than CNNs for a variety of applications, including image analysis. Besides the computeefficiency,low-latencySNNstrainedusingANN-SNNconversion,coupled with supervised training, have been able to perform at par with ANNs in terms of classification accuracy in traditional image classification tasks [42,76]. Hence, SNN-based models are particularly useful in 3D convolutional architectures which have higher arithmetic intensity (the ratio of floating point operations to accessed bytes) than 2D CNNs, which is further elaborated below. Let us evaluate the compute and memory access cost of a 3D CNN layer l with X l ∈R H i l × W i l × C i l × D i l as the input activation tensor, and W l ∈R k x l × k y l × k z l × C i l × C o l as 51 the weight tensor. Assuming no spatial reduction, the total number of floating point operations (FLOP) and memory accesses (Mem), which involves fetching the input activation (IA) tensor, weight (W) tensor, and writing to the output activation(OA)) tensor, in layer l are given as FLOP l 3D =k x l × k y l × k z l × C i l × C o l × H i l × W i l × D i l (4.1) Mem l 3D =H i l × W i l × C i l × D i l +k x l × k y l × k z l × C i l × C o l +H i l × W i l × C o l × D i l (4.2) where the first, second and third term in Mem l 3D correspond to IA, W, and OA respectively. Notethatweassumethewholeoperationcanbeperformedinasingle compute substrate (e.g. systolic array), without having to incur any additional data movement, and that the number of operations is independent of activation and weight bit-widths. Similarly, for a 2D CNN layer l, the total number of MACs and memory accesses is FLOP l 2D =k x l × k y l × C i l × C o l × H i l × W i l (4.3) Mem l 2D =H i l × W i l × C i l +k x l × k y l × C i l × C o l +H i l × W i l × C o l (4.4) where we do not have the third dimension D. From Eq. (3-6), FLOP l 3D FLOP l 2D =k z l × D i l (4.5) Mem l 3D Mem l 2D = (H i l × W i l × C i l × D i l )+(k x l × k y l × k z l × C i l × C o l )+(H i l × W i l × C o l × D i l ) (H i l × W i l × C i l )+(k x l × k y l × C i l × C o l )+(H i l × W i l × C o l ) (4.6) ≤ H i l × W i l × C i l × D i l H i l × W i l × C i l + (k x l × k y l × k z l × C i l × C o l ) (k x l × k y l × C i l × C o l ) + (H i l × W i l × C o l × D i l ) (H i l × W i l × C o l ) (4.7) ≤ 2D i l +k z l (4.8) 52 Assumingk z l =3(allSOTACNNarchitectureshavefiltersize3ineachdimension), FLOP l 3D FLOP l 2D ≥ Mem l 3D Mem l 2D if D i l ≥ 3 (4.9) Hence, 3D CNNs have higher arithmetic intensity, compared to 2D CNNs, when the spatial dimension D is higher than 3. This holds true in all but the last layer of a deep CNN network. For a 100× 100 input activation tensor with 64 and 128 input and output channels respectively, adding a third dimension of size 100 (typical hyperspectral images has 100s of spectral bands), and necessitating the use of 3D CNNs, increases the FLOP count by 300× , whereas the memory access cost increases by 96.5× . Note that these improvement factors are obtained by setting the input and output activation dimensions above in Eqs. 8 and 9 and assuming k x l =k y l =k z l =3. 
Moreover, as shown in Section 4.5, the energy consumption of a 3D CNN is compute bound on both general-purpose and neuromorphic hardware, and the large increase in FLOPs translates to significant SNN savings in total energy, as an AC operation is significantly cheaper than a MAC operation. Note that SNNs cannot reduce the memory access cost involving the weights.

4.2 Proposed Quantized SNN Training Method

In this section, we evaluate and compare the different choices for SNN quantization in terms of compute efficiency and model accuracy. We then incorporate the chosen quantization technique into STDB, which we refer to as Q-STDB.

4.2.1 Study of Quantization Choice

Uniform quantization transforms a weight element w ∈ [w_min, w_max] to the range [−2^(b−1), 2^(b−1) − 1], where b is the bit-width of the quantized integer representation. There are primarily two choices for this transformation, known as affine and scale quantization. In affine quantization, the quantized value can be written as w_a = s_a · w + z_a, where s_a and z_a denote the scale and the zero point (the quantized value to which the real value zero is mapped), respectively. Scale quantization, in contrast, performs range mapping with only a scale transformation, has no zero-correction term, and has a symmetric representable range [−α, +α]. Hence, affine quantization leads to more accurate representations than its scale counterpart. Detailed descriptions of these two types of quantization can be found in [111,112].

To evaluate the compute cost of our quantization framework, consider a 3D convolutional layer l, the dominant layer in HSI classification models, that performs the tensor operation $O_l = X_l \circledast W_l$, where $X_l \in \mathbb{R}^{H_i^l \times W_i^l \times C_i^l \times D_i^l}$ is the IA tensor, $W_l \in \mathbb{R}^{k_x^l \times k_y^l \times k_z^l \times C_i^l \times C_o^l}$ is the W tensor, and $O_l \in \mathbb{R}^{H_o^l \times W_o^l \times C_o^l \times D_o^l}$ is the OA tensor, using the same notation as in Section 4.1. The result of the real-valued operation $O_l = X_l \circledast W_l$ can be approximated with quantized tensors $X_l^Q$ and $W_l^Q$ by first dequantizing them, producing $\hat{X}_l$ and $\hat{W}_l$, respectively, and then performing the convolution. Note that the same quantization parameters are shared by all elements of the weight tensor, because this reduces the computational cost compared to other granularity choices with no impact on model accuracy. Activations are quantized similarly, but only in the input layer, since they are binary spikes in the remaining layers. Also note that $X_l^Q$ and $W_l^Q$ have the same dimensions as $X_l$ and $W_l$, respectively. Assuming the tensors are scale-quantized per layer,

$$O_l = X_l \circledast W_l \approx \hat{X}_l \circledast \hat{W}_l = X_l^Q \circledast W_l^Q \cdot \left(\frac{1}{s_s^X \cdot s_s^W}\right) \quad (4.10)$$

where $s_s^X$ and $s_s^W$ are the scalar scale values representing the levels of the input and weight tensors, respectively. Hence, scale quantization results in an integer convolution followed by a point-wise floating-point multiplication for each output element. Given that a typical 3D convolution involves a few thousand MAC operations (accumulates for binary spike inputs) to compute one output element, the single floating-point operation for the scaling in Eq. 4.10 is a negligible computational cost. This is because computing $X_l \circledast W_l$ involves element-wise multiplications of the weight kernels across multiple channels (for example, a 3D convolution with a 3×3×3 kernel and 100 channels requires 2700 MACs) and the corresponding overlapping input activation maps.
Although both affine and scale quantization enable the use of low-precision arithmetic, affine quantization results in more computationally expensive inference, as shown below:
\[ O_l \approx \frac{X^{Q}_{l} - z^{X}_{a}}{s^{X}_{a}} \circledast \frac{W^{Q}_{l} - z^{W}_{a}}{s^{W}_{a}} = \frac{X^{Q}_{l}\circledast W^{Q}_{l} - z^{X}_{a}\circledast (W^{Q}_{l} - z^{W}_{a}) - X^{Q}_{l}\circledast z^{W}_{a}}{s^{X}_{a}\, s^{W}_{a}} \tag{4.11} \]
Note that $z^{X}_{a}$ and $z^{W}_{a}$ are tensors of sizes equal to that of $X^{Q}_{l}$ and $W^{Q}_{l}$ respectively, consisting of repeated copies of the scalar zero-points of the input activation and weight tensor, while $s^{X}_{a}$ and $s^{W}_{a}$ are the corresponding scale values. The first term in the numerator of Eq. 4.11 is the integer convolution operation, similar to the one performed in scale quantization as shown in Eq. 4.10. The second term contains integer weights and zero-points, which can be computed offline, and adds an element-wise addition during inference. The third term, however, involves a point-wise multiplication with the quantized activation $X^{Q}_{l}$, which cannot be computed beforehand. As we show in Section 4.5.5, this extra computation can increase the energy consumption of our SNN models by over an order of magnitude.

However, our experiments detailed in Section 4.5 show that ignoring the affine shift during SNN training degrades the test accuracy significantly. Hence, the forward path computations during SNN training follow affine quantization as per Eq. 4.11, while the other steps involved in SNN training (detailed in Section 4.2.2), namely gradient computation and parameter update, use the full-precision weights and membrane potentials, similar to binary ANN training, to aid convergence [113]. After training, the full-precision weights are rescaled for inference using scale quantization, as per Eq. 4.10, which our results show yields a negligible accuracy drop compared to using affine-scaled weights. The membrane potentials obtained as results of the accumulate operations only need to be compared with the threshold voltage once per time step, which consumes negligible energy and can be performed using fixed-point comparators (in the periphery of the memory array for PIM accelerators).

Notice that the affine quantization acts as an intermediate representation that lies between full precision and scale quantization during training; using full precision causes a large mismatch between weight representations during training and inference, while scale quantization during training results in a similar mismatch between its forward and backward computations. Thus, in principle, this approach is similar to incremental quantization approaches [114] in which we incrementally adjust the type of quantization from the more accurate affine form to the more energy-efficient scale form. Lastly, we note that our approach to quantization is also applicable to standard 3D CNNs, but the relative savings are significantly higher in SNNs since the inference is implemented without multiply-accumulates.

4.2.2 Q-STDB based Training

Our proposed training algorithm, illustrated in Fig. 4.1, incorporates the above quantization methodology into the STDB technique [76], where the spatial and temporal credit assignment is performed by unrolling the SNN network in time and employing BPTT.
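For readers unfamiliar with BPTT-based SNN training, the sketch below shows how a single fully-connected LIF layer is unrolled over T time steps so that autograd can accumulate gradients across steps. It is a generic illustration (the soft reset, the leak handling, and all names are assumptions) rather than the exact LIF model of Eq. 2.3.

```python
import torch

def lif_forward(spk_in, w, v_th, leak, T):
    """Generic sketch: unroll a fully-connected LIF layer over T steps for BPTT.
    spk_in: (T, batch, n_in) binary spikes; returns (T, batch, n_out) spikes."""
    u = torch.zeros(spk_in.shape[1], w.shape[0])      # membrane potential state
    out = []
    for t in range(T):                                # unrolling in time records the
        u = leak * u + spk_in[t] @ w.t()              # computation graph across steps
        spike = (u >= v_th).float()                   # a surrogate gradient replaces
        u = u - spike * v_th                          # this step function in practice
        out.append(spike)                             # (soft reset assumed here)
    return torch.stack(out)
```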
Output Layer: The neuron model in the output layer L only accumulates the incoming inputs without any leakage and does not generate an output spike; it is described by
\[ u^{t}_{L} = u^{t-1}_{L} + \hat{w}_{L}\, o^{t}_{L-1} \tag{4.12} \]
where $u_L$ is a vector containing the membrane potentials of the N output neurons (N being the number of output labels), $\hat{w}_{L}$ is the affine-quantized weight matrix connecting the last two layers (L and L-1), and $o_{L-1}$ is a vector containing the spike signals from layer (L-1). The loss function is defined on $u_L$ at the last time step T ($u^{T}_{L}$). Since $u^{T}_{L}$ is a vector of continuous values, we compute the SNN's predicted distribution p as the softmax of $u^{T}_{L}$, similar to the output fully-connected layer of a CNN. Since our SNN is used only for classification tasks, we employ the popular cross-entropy loss. The loss function $\mathcal{L}$ is thus defined as the cross-entropy between the true one-hot encoded output y and the distribution p:
\[ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(p_i), \qquad p_i = \frac{e^{u^{T}_{i}}}{\sum_{j=1}^{N} e^{u^{T}_{j}}} \tag{4.13} \]
The derivative of the loss function with respect to the membrane potential of the neurons in the final layer is
\[ \frac{\partial \mathcal{L}}{\partial u^{T}_{L}} = (p - y) \tag{4.14} \]
Here, p and y are vectors containing the softmax values and the one-hot encoding of the true label respectively. To compute the gradient at the current time step, the membrane potential at the previous step is considered as an input quantity [76]. With the affine-quantized weights in the forward path, gradient descent updates the network parameters $w_L$ of the output layer as
\[ w_L = w_L - \eta\, \Delta w_L \tag{4.15} \]
\[ \Delta w_L = \sum_t \frac{\partial \mathcal{L}}{\partial w_L} = \sum_t \frac{\partial \mathcal{L}}{\partial u^{t}_{L}} \frac{\partial u^{t}_{L}}{\partial \hat{w}_{L}} \frac{\partial \hat{w}_{L}}{\partial w_{L}} = \frac{\partial \mathcal{L}}{\partial u^{T}_{L}} \sum_t \frac{\partial u^{t}_{L}}{\partial \hat{w}_{L}} \frac{\partial \hat{w}_{L}}{\partial w_{L}} \approx (p-y)\sum_t o^{t}_{L-1} \tag{4.16} \]
\[ \frac{\partial \mathcal{L}}{\partial o^{t}_{L-1}} = \frac{\partial \mathcal{L}}{\partial u^{t}_{L}} \frac{\partial u^{t}_{L}}{\partial o^{t}_{L-1}} = (p-y)\,\hat{w}_{L} \tag{4.17} \]
where $\eta$ is the learning rate (LR). Note that the derivative of the affine quantization function of the weights ($\partial \hat{w}_{L}/\partial w_{L}$) is undefined at the step boundaries and zero everywhere else, as shown in Fig. 4.1(a). Our training framework addresses this challenge using the straight-through estimator (STE) [113], which approximates the derivative to be equal to 1 for inputs in the range $[w_{min}, w_{max}]$, as shown in Fig. 4.1(b), where $w_{min}$ and $w_{max}$ are the minimum and maximum weight values of a particular layer. Note that $w_{min}$ and $w_{max}$ are updated at the end of every mini-batch to ensure all the weights lie between $w_{min}$ and $w_{max}$ during the forward and backward computations in each training iteration. Hence, we use $\partial \hat{w}_{L}/\partial w_{L} \approx 1$ to compute the loss gradients in Eq. 4.16.

Figure 4.1: (a) Proposed SNN training framework details with 3D convolutions, and (b) fake quantization forward and backward pass with straight-through estimator (STE) approximation.

Hidden layers: The neurons in all the hidden layers follow the quantized LIF model shown in Eq. 2.3. All neurons in a layer share identical leak and threshold values. This reduces the number of trainable parameters, and we did not observe any noticeable accuracy change by assigning a different threshold/leak value to each neuron, similar to [44]. With a single threshold per layer, it may seem redundant to train both the weights and the threshold together. However, we observe, similar to [44,76], that the latency required to obtain SOTA classification accuracy decreases with the joint optimization, and drops further when the leak term is also trained. This may be because the loss optimizer can reach an improved local minimum when all the parameters are tunable.
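The fake-quantization step of Fig. 4.1(b) can be sketched as a custom autograd function; the class name, the per-tensor affine parameters passed in (tracked $w_{min}$/$w_{max}$, updated per mini-batch as described above), and the exact rounding conventions are illustrative assumptions.

```python
import torch

class FakeAffineQuant(torch.autograd.Function):
    """Affine fake-quantization of weights with a straight-through estimator (STE)."""
    @staticmethod
    def forward(ctx, w, w_min, w_max, bits=6):
        ctx.save_for_backward(w, w_min, w_max)
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        s = (qmax - qmin) / (w_max - w_min).clamp_min(1e-12)   # scale
        z = torch.round(qmin - w_min * s)                       # zero point
        w_q = torch.round(w * s + z).clamp(qmin, qmax)          # quantize
        return (w_q - z) / s                                    # dequantize (w_hat)

    @staticmethod
    def backward(ctx, grad_out):
        w, w_min, w_max = ctx.saved_tensors
        # STE: pass the gradient through unchanged for weights inside [w_min, w_max]
        mask = ((w >= w_min) & (w <= w_max)).float()
        return grad_out * mask, None, None, None
```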
The weight update in Q-STDB is calculated as
\[ \Delta w_l = \sum_t \frac{\partial \mathcal{L}}{\partial w_l} = \sum_t \frac{\partial \mathcal{L}}{\partial z^{t}_{l}} \frac{\partial z^{t}_{l}}{\partial o^{t}_{l}} \frac{\partial o^{t}_{l}}{\partial u^{t}_{l}} \frac{\partial u^{t}_{l}}{\partial \hat{w}_{l}} \frac{\partial \hat{w}_{l}}{\partial w_{l}} \approx \sum_t \frac{\partial \mathcal{L}}{\partial z^{t}_{l}} \frac{\partial z^{t}_{l}}{\partial o^{t}_{l}} \frac{1}{v_l}\, o^{t}_{l-1} \cdot 1 \tag{4.18} \]
where $\partial \hat{w}_{l}/\partial w_{l}$ and $\partial z^{t}_{l}/\partial o^{t}_{l}$ are the two discontinuous gradients. We calculate the former using the STE described above, while the latter is approximated using the surrogate gradient [74] shown below:
\[ \frac{\partial z^{t}_{l}}{\partial o^{t}_{l}} = \gamma \cdot \max(0,\, 1 - |z^{t}_{l}|) \tag{4.19} \]
Note that $\gamma$ is a hyperparameter denoting the maximum value of the gradient. The threshold and leak updates are computed similarly using BPTT [76].

4.3 SRAM-based PIM Acceleration

Efficient hardware implementations of neural network algorithms are being widely explored by the research community in an effort to enable intelligent computation on resource-constrained edge devices [115]. Existing computing systems based on the well-known von Neumann architecture (characterized by physically separated memory and computing units) suffer from an energy and throughput bottleneck, referred to as the memory wall [116,117]. Novel memory-centric paradigms like PIM are being extensively investigated by the research community to mitigate the energy-throughput constraints arising from the memory wall. As discussed in Section 7.1, the first layer of a direct-coded SNN is not as computationally efficient as the other layers, as it processes continuous-valued inputs as opposed to spiking inputs, and it dominates the total energy consumption. Further, for 3D images such as HSI, the number of real-valued computations in the first layer of an SNN is orders of magnitude larger than for 2D images.

In order to enable energy-efficient hardware for SNNs catering to 3D images, we propose to exploit the high-parallelism, high-throughput, and low-energy benefits of analog PIM in SRAM for the first layer of the SNN. As mentioned earlier, the first layer of the SNN requires real-valued MAC operations, which are well suited to be accelerated using analog PIM approaches [3,118]. Moreover, the number of weights in the first layer of a typical 3D CNN architecture is substantially smaller than in the other layers, which ensures that we can perform PIM using a single memory array, thereby reducing the complexity of the peripheral circuits, such as adder trees for partial sum reduction. Several schemes achieving multiple degrees of compute parallelism within on-chip memory based on SRAM arrays have been proposed [116,117,119-121]. Interestingly, both digital [116,117] as well as analog mixed-signal approaches [119,120] have been explored extensively. Analog approaches are of particular importance due to their higher levels of data parallelism and compute throughput compared to digital counterparts in performing MAC computations. Our adopted PIM architecture for the first layer of our proposed SNNs is illustrated in Fig. 4.2.

Figure 4.2: PIM architecture in the first layer to process MAC operations for the first layer of direct coded SNNs. Other layers of the SNN are processed with a highly parallel programmable architecture using simpler accumulate operations.
The PIM architecture leverages analog computing for parallel MAC operations by mapping activations as voltages on the wordlines and weights as data stored in the SRAM bit-cells (represented as Q and QB). As shown in [3], multi-bit MAC operations can be enabled in SRAM arrays by activating multiple rows simultaneously, allowing appropriately weighted voltages to develop on each column of the SRAM array, representing the resulting MAC operations computed in the analog domain. Peripheral ADC circuits are used to convert the analog MAC result into corresponding digital data for further computation.

To summarize, we propose the use of analog PIM to accelerate the MAC-intensive compute requirements of the first layer of the SNN. The remaining layers of the SNN leverage traditional digital hardware implementing simpler accumulate operations. Advantageously, our proposed quantized SNN, with its small number of weights in the first layer, is well suited for low-overhead PIM circuits, as reductions in bit-precision and peripheral complexity drastically improve the energy and throughput efficiency of analog PIM architectures [118].

4.4 Proposed CNN Architectures, Datasets, and Training Details

4.4.1 Model Architectures

We developed two models, a 3D and a hybrid fusion of 3D and 2D convolutional architectures, that are inspired by the recently proposed CNN models [104,106,107] used for HSI classification and compatible with our ANN-SNN conversion framework. We refer to the two models as CNN-3D and CNN-32H.

There are several constraints on the training of the baseline ANN models needed to obtain near-lossless ANN-SNN conversion [34,40]. In particular, we omit the bias term from the ANN models because the integration of the bias term over multiple SNN timesteps tends to shift the activation values away from zero, which causes problems in the ANN-SNN conversion process [40]. In addition, similar to [30,40,76,122], we do not use batch normalization (BN) layers, because using identical BN parameters (e.g., global mean µ, global standard deviation σ, and trainable parameter γ) for the statistics of all timesteps does not capture the temporal dynamics of the spike train in an SNN. Instead, we use dropout [80] as the regularizer for both ANN and SNN training. Recent research [30,76] indicates that state-of-the-art accuracy can still be obtained on complex image recognition tasks, such as CIFAR-100, with models without batch normalization and bias, and we observe the same for the HSI models in this work. Moreover, our initial ANN models employ a ReLU nonlinearity after each convolutional and linear layer (except the classifier layer), due to the similarity between ReLU and LIF neurons. Our pooling operations use average pooling because, for binary spike-based activation layers, max pooling incurs significant information loss. Our SNN-specific architectural modifications are illustrated in Fig. 4.3.

We also modified the number of channels and convolutional layers to obtain compact yet accurate models. 2D patches of sizes 5×5 and 3×3 were extracted for CNN-3D and CNN-32H respectively, without any reduction in dimensionality, from each dataset. Larger patches increase the computational complexity without any significant improvement in test accuracy. Note that magnitude-based structured weight pruning [123], which has been shown to be an effective technique for model compression, can only remove <15% of the weights averaging across the two architectures with <1% degradation in test accuracy for all the three datasets used in our experiments, which also indicates the compactness of our models. The details of both models are given in Table 4.1.
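The conversion constraints above translate into a simple convolutional block structure; the sketch below (with placeholder channel counts, not the exact CNN-3D configuration of Table 4.1) illustrates the bias-free, BN-free, average-pooled style used for the baseline ANNs.

```python
import torch.nn as nn

def conv3d_block(c_in, c_out, k=3, p_drop=0.1):
    """Bias-free 3D conv + ReLU + dropout: bias and BN are omitted and average
    pooling is preferred, per the ANN-SNN conversion constraints above."""
    return nn.Sequential(
        nn.Conv3d(c_in, c_out, kernel_size=k, bias=False),
        nn.ReLU(inplace=True),
        nn.Dropout3d(p_drop),
    )

# Example: a compact stack ending in average pooling (illustrative sizes only).
backbone = nn.Sequential(
    conv3d_block(1, 20),
    conv3d_block(20, 40),
    nn.AvgPool3d(kernel_size=2),
)
```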
4.4.2 Datasets

We used four publicly available datasets, namely Indian Pines, Pavia University, Salinas Scene, and HyRANK. A brief description of each follows, and a few sample images from some of these datasets are shown in Fig. 4.4.

Indian Pines: The Indian Pines (IP) dataset consists of 145×145 spatial pixels and 220 spectral bands in a range of 400-2500 nm. It was captured using the AVIRIS sensor over North-Western Indiana, USA, with a ground sample distance (GSD) of 20 m, and has 16 vegetation classes.

Pavia University: The Pavia University (PU) dataset consists of hyperspectral images with 610×340 pixels in the spatial dimension and 103 spectral bands, ranging from 430 to 860 nm in wavelength. It was captured with the ROSIS sensor with a GSD of 1.3 m over the University of Pavia, Italy. It has a total of 9 urban land-cover classes.

Salinas Scene: The Salinas Scene (SS) dataset contains images with a 512×217 spatial dimension and 224 spectral bands in the wavelength range of 360 to 2500 nm. The 20 water-absorbing spectral bands have been discarded. It was captured with the AVIRIS sensor over Salinas Valley, California, with a GSD of 3.7 m. In total, 16 classes are present in this dataset.

HyRANK: The ISPRS HyRANK dataset is a recently released hyperspectral benchmark. Different from the above HSI datasets that contain a single hyperspectral scene, the HyRANK dataset consists of two hyperspectral scenes, namely Dioni and Loukia. Similar to [124], we use the available labelled samples in the Dioni scene for training and those in the Loukia scene for testing. The Dioni and Loukia scenes comprise 250×1376 and 249×945 spectral samples respectively, and each has 176 spectral reflectance bands.

Figure 4.3: Architectural differences between (a) ANN and (b) SNN for near-lossless ANN-SNN conversion.

Figure 4.4: (i) False color-map and (ii) ground truth images of different HSI datasets used in our work, namely (a) Indian Pines, (b) Pavia University, and (c) Salinas Scene.

For preprocessing, images in all the datasets are normalized to have zero mean and unit variance. For our experiments, all the samples (except those of the HyRANK dataset) are randomly divided into two disjoint training and test sets: 40% of the samples are used for training and the remaining 60% for performance evaluation.

Table 4.1: Model architectures employed for CNN-3D and CNN-32H in classifying the IP dataset. Every convolutional and linear layer is followed by a ReLU non-linearity. The last classifier layer is not shown. The size of the activation map of a 3D CNN is written as (H,W,D,C) where H, W, D, and C represent the height, width, depth of the input feature map and the number of channels. Since the 2D CNN layer does not have the depth dimension, its feature map size is represented as (H,W,C).

Layer type | Size of input feature map | Number of filters | Size of each filter | Stride value | Padding value | Dropout value | Size of output feature map
Architecture: CNN-3D
3D Convolution | (5,5,200,1) | 20 | (3,3,3) | (1,1,1) | (0,0,0) | - | (3,3,198,20)
3D Convolution | (3,3,198,20) | 40 | (1,1,3) | (1,1,2) | (1,0,0) | - | (3,3,99,40)
3D Convolution | (3,3,99,40) | 84 | (3,3,3) | (1,1,1) | (1,0,0) | - | (1,1,99,84)
3D Convolution | (1,1,99,84) | 84 | (1,1,3) | (1,1,2) | (1,0,0) | - | (1,1,50,84)
3D Convolution | (1,1,50,84) | 84 | (1,1,3) | (1,1,1) | (1,0,0) | - | (1,1,50,84)
3D Convolution | (1,1,50,84) | 84 | (1,1,2) | (1,1,2) | (1,0,0) | - | (1,1,26,84)
Architecture: CNN-32H
3D Convolution | (3,3,200,1) | 90 | (3,3,18) | (1,1,7) | (0,0,0) | - | (1,1,27,90)
2D Convolution | (27,90,1) | 64 | (3,3) | (1,1) | (0,0) | - | (25,88,64)
2D Convolution | (25,88,64) | 128 | (3,3) | (1,1) | (0,0) | - | (23,86,128)
Avg. Pooling | (23,86,128) | - | (4,4) | (4,4) | (0,0) | - | (5,21,128)
Dropout | (5,21,128) | - | - | - | - | 0.2 | (5,21,128)
Linear | 13440 | 6881280 | - | - | - | - | 512
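A minimal sketch of the preprocessing described above (normalization, spatial patch extraction around each labelled pixel, and a random 40/60 split) is given below; the per-band normalization granularity, function names, and padding mode are illustrative assumptions rather than the actual data-loading code.

```python
import numpy as np

def extract_patches(cube, labels, patch=5):
    """cube: (H, W, B) hyperspectral image; labels: (H, W) with 0 = unlabelled.
    Returns zero-mean/unit-variance patches centered on each labelled pixel."""
    cube = (cube - cube.mean(axis=(0, 1))) / cube.std(axis=(0, 1))   # per-band (assumed)
    r = patch // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    xs, ys = [], []
    for i, j in zip(*np.nonzero(labels)):                            # labelled pixels only
        xs.append(padded[i:i + patch, j:j + patch, :])
        ys.append(labels[i, j] - 1)
    return np.stack(xs), np.array(ys)

def split(x, y, train_frac=0.4, seed=0):
    """Random disjoint 40%/60% train/test split, as used for IP, PU, and SS."""
    idx = np.random.default_rng(seed).permutation(len(y))
    n = int(train_frac * len(y))
    return (x[idx[:n]], y[idx[:n]]), (x[idx[n:]], y[idx[n:]])
```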
4.4.3 ANN Training and SNN Conversion Procedures

We start by performing full-precision 32-bit ANN training for 100 epochs using the standard SGD optimizer with an initial learning rate (LR) of 0.01, decayed by a factor of 0.1 after 60, 80, and 90 epochs. The ANN-SNN conversion entails the estimation of the values of the weights and per-layer thresholds of the SNN model architecture. The weights are simply copied from a trained DNN model to the iso-architecture target SNN model. The threshold for each layer is computed sequentially as the 99.7th percentile of the pre-activation distribution (the weighted sum of inputs received by each neuron in a layer) over the total number of timesteps [76] for a small batch of HSI images (of size 50 in our case). Note that we use 100 time steps to evaluate the thresholds, while the SNN training and inference are performed with only 5 time steps. In our experiments we scale the initial layer thresholds by 0.8. We keep the leak of each layer set to unity while evaluating these thresholds. Note that employing direct coding, as used in our work and others [76], helps avoid any approximation error arising from the input spike generation (conversion from raw images to spike trains) process and aids ANN-SNN conversion. A lower bit-precision of the weights will most likely not exacerbate the conversion process, assuming the ANN models can be trained accurately with the same bit-precision.

We then perform quantization-aware SNN training as described in Section 4.2 for another 100 epochs. We set γ = 0.3 [74] and use the ADAM optimizer with a starting LR of 10^{-4}, which decays by a factor of 0.5 after 60, 80, and 90 epochs. All experiments are performed on an Nvidia 2080Ti GPU with 11 GB memory.

4.5 Experimental Results and Analysis

This section first describes our inference accuracy results, then analyzes the associated spiking activity and energy consumption. It then describes several ablation studies and a comparison of the training time and memory requirements.

4.5.1 ANN & SNN Inference Results

We report the best Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient measures to evaluate the HSI classification performance of our proposed architectures, similar to [104]. Here, OA represents the fraction of correctly classified samples among the total test samples, AA represents the average of class-wise classification accuracies, and Kappa is a statistical metric used to assess the mutual agreement between the ground truth and classification maps. Column 2 in Table 4.2 shows the ANN accuracies, and column 3 shows the accuracy after ANN-SNN conversion with 50 timesteps (we empirically observe that at least 50 time steps are required for lossless ANN-SNN conversion). Column 4 shows the accuracy when we perform our proposed training without quantization, while columns 5 to 7 show the SNN test accuracies obtained with Q-STDB for different weight bit precisions (4 to 6 bits).

Table 4.2: Model performances with Q-STDB based training on IP, PU, SS, and HyRANK datasets for CNN-3D and CNN-32H after a) ANN training, b) ANN-to-SNN conversion, c) 32-bit SNN training, d) 4-bit SNN training, e) 5-bit SNN training, and f) 6-bit SNN training, with only 5 time steps. A. ANN B. Accuracy after C. Accuracy after D. Accuracy after E. Accuracy after F. Accuracy after Dataset accuracy (%) ANN-to-SNN conv.
(%) FP SNN training (%) 4-bit SNN training (%) 5-bit SNN training (%) 6-bit SNN training (%) OA AA Kappa OA AA Kappa OA AA Kappa OA AA Kappa OA AA Kappa OA AA Kappa Architecture : CNN-3D IP 98.86 98.42 98.55 57.68 50.88 52.88 98.92 98.76 98.80 97.08 95.64 95.56 98.38 97.78 98.03 98.68 98.34 98.20 PU 99.69 99.42 99.58 91.16 88.84 89.03 99.47 99.06 99.30 98.21 97.54 97.75 99.26 98.48 98.77 99.50 99.18 99.33 SS 98.89 98.47 98.70 81.44 76.72 80.07 98.49 97.84 98.06 96.47 93.16 94.58 97.25 95.03 95.58 97.95 97.09 97.43 HyRANK 64.21 63.27 47.34 34.80 58.97 20.64 63.18 61.25 45.25 59.76 56.40 42.28 61.70 60.48 46.06 62.96 61.27 46.82 Architecture : CNN-32H IP 97.60 97.08 97.44 70.88 66.56 67.89 97.27 96.29 96.35 96.63 95.81 95.89 97.23 96.08 96.56 97.45 96.73 96.89 PU 99.50 99.09 99.30 94.96 90.12 93.82 99.38 98.83 99.13 99.17 98.41 98.68 99.25 98.84 98.86 99.35 98.88 98.95 SS 98.88 98.39 98.67 88.16 84.19 85.28 97.92 97.20 97.34 97.34 96.32 96.77 97.65 96.81 96.97 97.99 97.26 97.38 HyRANK 64.43 70.68 52.82 24.26 26.90 19.37 63.72 67.89 49.59 62.27 62.50 46.58 63.27 65.32 47.98 63.34 66.66 48.21 SNNs trained with 6-bit weights result in 5.33× reduction in bit-precision com- pared to full-precision (32-bit) models and, for all three tested data sets, perform similar to the full precision ANNs for both the CNN-3D and CNN-32H architec- tures. Althoughthemembranepotentialsdonotneedtobequantizedasdescribed in Section 4.2, we observed that the model accuracy does not drop significantly even if we quantize them, and hence, the SNN results shown in Table 4.2 corre- spond to 6-bit membrane potentials. Four-bit weights and potentials provide even lower complexity, but at the cost of a small accuracy drop. Fig. 4.5 shows the confusion matrix for the HSI classification performance of the ANN and proposed SNN over the IP dataset for both the architectures. The inference accuracy (OA, AA, and Kappa) of our ANNs and SNNs trained via Q-STDB are compared with the current state-of-the-art ANNs used for HSI classification in Table 4.3. As we can see, simply porting the ANN architectures used in [104,107] to SNNs, and performing 6-bit Q-STDB results in significant drops in accuracy, particularly for the India Pines data set. In contrast, our CNN- 3D-based SNN models suffer negligible OA drop ( <1% for all datasets) compared to the best performing ANN models for HSI classification. 67 Figure 4.5: Confusion Matrix for HSI test performance of ANN and proposed 6-bit SNN over IP dataset for both CNN-3D and CNN-32H. The ANN and SNN confusion matrices look similar for both the network architectures. CNN-32H incurs a little drop in accuracy compared to CNN-3D due to shallow architecture. 4.5.2 Spiking Activity EachSNNspikeinvolvesaconstantnumberofACoperations,andhence,consumes a fixed amount of energy. Consequently, the average spike count of an SNN layer l, denoted ζ l , can be treated as a measure of compute-energy of the model [30,40]. We calculate ζ l as the ratio of the total spike count in T steps over all the neurons of layer l to the number of neurons in the layer. Hence, the energy efficiency of an SNN model can be improved by decreasing the spike count. Fig. 4.6 shows the average spike count for each layer with Q-STDB when evaluated for 200 samples from each of the three datasets (IP, PU, SS) for the CNN-3D and CNN-32H architecture. 
For example, the average spike count of the 3 rd convolutional layer of the CNN-3D-based SNN for IP dataset is 0.568, which meanseachneuroninthatlayerspikes0.568timesonaverageoverallinputsamples over a 5 time step period. Note that the average spike count is less than 1.4 for all thedatasetsacrossboththearchitectureswhichleadstosignificantenergysavings as described below. 68 Table 4.3: Inference accuracy (OA, AA, and Kappa) comparison of our proposed SNN models obtained from CNN-3D and CNN-32H with state-of-the-art deep ANNs on IP, PU, SS, and HyRANK datasets Authors ANN/SNN Architecture OA (%) AA (%) Kappa (%) Dataset : Indian Pines [125] ANN MSKNet 81.73 71.4 79.2 [126] ANN DFFN 98.52 97.69 98.32 [127] ANN SSRN 99.19 98.93 99.07 [106] ANN HybridSN 99.75 99.63 99.71 [104] ANN 6-layer 3D CNN 98.29 97.52 97.72 SNN 95.88 94.26 95.34 [107] ANN Hybrid CNN 96.15 94.96 95.73 SNN 94.90 94.08 94.78 This work ANN CNN-3D 98.86 98.42 98.55 SNN 98.79 98.34 98.60 This work ANN CNN-32H 97.60 97.08 97.44 SNN 97.45 96.73 96.89 Dataset : Pavia University [125] ANN MSKNet 90.66 88.09 87.64 [126] ANN DFFN 98.73 97.24 98.31 [127] ANN SSRN 99.61 99.56 99.33 [124] ANN DRIN 96.4 95.8 95.2 [104] ANN 6-layer 3D CNN 99.32 99.02 99.09 SNN 98.55 98.02 98.28 [107] ANN Hybrid CNN 99.05 98.35 98.80 SNN 98.40 97.66 98.21 This work ANN CNN-3D 99.69 99.42 99.58 SNN 99.50 99.18 99.33 This work ANN CNN-32H 99.50 99.09 99.30 SNN 99.35 98.88 98.95 Dataset : Salinas Scene [126] ANN DFFN 98.87 98.75 98.63 [124] ANN DRIN 96.7 98.6 96.3 [107] ANN Hybrid CNN 98.85 98.35 98.22 SNN 97.05 97.41 97.18 This work ANN CNN-3D 98.89 98.47 98.70 SNN 97.95 97.09 97.43 This work ANN CNN-32H 98.88 98.39 98.67 SNN 97.99 97.26 97.38 Dataset : HyRANK [124] ANN DRIN 54.4 56.0 43.3 This work ANN CNN-3D 64.21 63.27 47.34 SNN 62.96 61.27 46.82 This work ANN CNN-32H 64.43 69.68 52.82 SNN 63.34 66.66 48.21 4.5.3 Energy Consumption and Delay In this section, we analyze the improvements in energy, delay, and EDP of our proposed SNN models compared to the baseline SOTA ANN models running on digital hardware for all the three datasets. We show that further energy savings 69 Figure 4.6: Layerwise spiking activity plots for (a) CNN-3D and (b) CNN-32H on Indian Pines, Salinas Scene and Pavia University datasets. can be obtained by using the PIM architecture discussed in Section 4.3 to process the first layer of our SNN models. Digital Hardware Letusassumea3Dconvolutionallayer l havingweighttensorW l ∈R k× k× k× C i l × C o l thatoperatesonaninputactivationtensorI l ∈R H i l × W i l × C i l × D i l ,wherethenotations are similar to the one used in Section 4.2. We now quantify the energy consumed to produce the corresponding output activation tensor O l ∈R H o l × W o l × C o l × D o l for an ANN and SNN, respectively. Our model can be extended to fully-connected layers with f i l and f o l as the number of input and output features respectively, and to 2D convolutional layers, by shrinking a dimension of the feature maps. Inparticular,foranylayerl,weextendtheenergymodelof[3,118]to3DCNNs by adding the third dimension of weights (k) and output feature maps (D o l ), as follows E CNN l =C i l C o l k 3 E read +C i l C o l k 3 H o l W o l D o l E mac +P leak T CNN l (4.20) 70 where the first term denotes the memory access energy, the second term denotes thecomputeenergy,whilethethirdtermhighlightsthestaticleakageenergy. Note that T l is the latency incurred to process the layer l, and can be written as T CNN l = C i l C o l k 3 B IO B W N bank ! 
T read + C i l C o l k 3 N mac H o l W o l D o l T mac (4.21) The notations for Equations 4.20 and 4.21, along with their values, obtained from [3,118] are illustrated in Table 7.1. The total energy is compute bound since the compute energy alone consumes∼ 98% of the total energy averaged across all the layers for the CNN-3D architecture on all the datasets. The memory cost only dominates the few fully connected layers, accounting for > 85% of their total energy. Table 4.4: Notations and their values used in energy, delay, and EDP equations for ANN and 6-bit SNNs. Notation Description Value B IO number of bits fetched from SRAM to processor per bank 64 B W bit width of the weight stored in SRAM 6 N col number of columns in SRAM array 256 N bank number of SRAM banks 4 Nmac(Nac) number of MACs (ACs) in processing element (PE) array 175 (175) T read time required to transfer 1-bit data between SRAM and PE 4 ns T BLP time required for one analog in-memory accumulation 4 ns Emac(Eac) energy consumed in a single MAC (AC) 3.1 pJ (0.1 pJ) for 32-bit operation for a particular bit-precision full-precision inputs [1]) Tmac(Tac) time required to perform a single MAC (AC) in PE 4 ns (0.4 ns) T adc time required for a single ADC operation 6 ns E read energy to transfer each weight element between SRAM and PE 5.2 pJ E BLP energy required for a single in-memory analog accumulation 0.08 pJ E adc energy required for an ADC operation 0.268 pJ Similarly, we can extend the energy and delay model of [3,118] with similar FLOPs evaluation strategy as 2D convolution, to our proposed SNNs, as follows E SNN l =C i l C o l k 3 E read +C i l C o l k 3 H o l W o l D o l ζ l E ac +P leak T SNN l (4.22) 71 T SNN l = C i l C o l k 3 B IO B W N bank ! T read + C i l C o l k 3 N ac H o l W o l D o l T ac (4.23) foranylayerlexcepttheinputlayerthatisbasedondirectencoding,whoseenergy anddelaycanbeobtainedfromEq. 4.20and4.21respectively. Thenotationsused in Eq. 23-24, along with their values are also shown in Table 7.1. Notice that the spiking energy in Eq. 4.20 assume the use of zero-gating logic that activates the computeunitonlywhenaninputspikeisreceivedandthusisafunctionofspiking activityζ l . However, toextendthebenefitsofalow ζ l tolatency, werequireeither custom hardware or compiler support [128]. For this reason, unlike energy, this chapter assumes no delay benefit from ζ l as is evident in Eq. 4.23. To compute E MAC for full-precision weights (full-precision and 6-bits) and E AC (6-bits) at 65 nm technology, we use the data from [1] obtained by silicon measurements (see Table 7.1. For 6-bit inputs, we scale the energy according to E mac ∝ Q 1.25 as shown in [129], where Q is the bit-precision. On the other hand, E ac (6-bits) is computed by scaling the full-precision data from [1], accord- ing to [130], which shows E AC is directly proportional to the data bit-width. Our calculations imply that E AC is∼ 13× smaller than E MAC for 6-bit precision. Note that this number may vary for different technologies, but, in most technologies, an AC operation is significantly less expensive than a MAC operation. As required in the direct input encoding layer, we obtain E mac for 8-bit inputs and 6-bit weights from [118], applying voltage scaling for iso-V dd conditions with the other E mac and E ac estimations from [1]. 
We use T ac = 0.1T mac for 6-bit inputs from [131] and the fact that the latency of a MAC unit varies logarithmically with bit precision (assuming a carry-save adder) to calculate the delay, and the resulting EDP of the baseline SOTA ANN and our proposed SNN models. Note that the architec- tural modifications applied to the existing SOTA models to create our baseline ANNs [104,106] only enhance ANN-SNN conversion, and do not lead to signifi- cant changes in energy consumption. Since the total energy is compute bound, we 72 also calculate the total number of floating point operations (FLOPs), which is a standard metric to evaluate the energy cost of ML models. We observe that 6-bit ANN models are 12.5× energy efficient compared to 32- bitANNmodelsduetosignificantimprovementsinMACenergywithquantization, as shown in [132]. Note that we can achieve similar HSI test accuracies shown in Table 4.2 with quantized ANNs as well. We compare the layer-wise and total energy, delay, and EDP of our proposed SNNs with those of equivalent-precision ANNs in Fig. 4.7. The FLOPs for SNNs obtained by our proposed training framework is smaller than that for the baseline ANN due to low spiking activity. Moreover, because the ACs consume significantly less energy than MACs for all bit precisions, SNNs are significantlymorecomputeefficient. Inparticular,forCNN-3DonIP,ourproposed SNN consumes ∼ 199.3× and ∼ 33.8× less energy than an iso-architecture full- precision and 6-bit ANN with similar parameters respectively. The improvements become∼ 560.6× (∼ 9976× in EDP) and∼ 44.8× (∼ 412.2× in EDP) respectively averaging across the two network architectures and three datasets. PIM Hardware Though SNNs improve the total energy significantly as shown above, the first layer needs the expensive MACs due to direct encoding, and accounts for ∼ 27% and ∼ 22% of the total energy on average across the three datasets for CNN-3D and CNN-32H respectively. To address this issue, we propose to adopt an SRAM- based memory array to process the computations incurred in the first layer, in the memory array itself, as discussed in Section 4.3. We similarly extended the energy and delay models of [3,118] to the PIM implementationofthefirstlayerofourproposedSNNarchitectures. Theresulting energy and delay can be written as E SNN 1 =C i 1 C o 1 k 3 E BLP + E ADC R +P leak T SNN 1 (4.24) 73 T SNN 1 = C i 1 C o 1 k 3 N col B W N bank ! H o 1 W o 1 D o 1 T read + T adc R (4.25) where the new notations along with their values are in Table 7.1. Following 65 nm CMOS technology limitations, we keep the array parameters similar to [118], and T adc and E adc for our 6-bit SNN are obtained by extending the circuit simulation results of [3] with the ADC energy and delay models proposed in [133]. The improvements in the total energy, delay and EDP for CNN-3D on IP dataset, are observed to be 1.28× , 1.08× and 1.38× respectively over an iso- architecture-and-precision SNN implemented with digital hardware. The improve- ments become 1.30× , 1.07× and 1.38× respectively averaging across the three datasets. However, since CNN-32H is shallower than CNN-3D, and has relatively cheaper 2D CNNs following the input 3D CNN layer, the PIM implementation in thefirstlayercandecreasethe totalenergyconsumptionsignificantly. Theenergy, delay, and EDP improvements compared to the digital implementations are esti- mated to be 2.12× , 1.04× , and 2.20× for CNN-32H, and 1.71× , 1.06× , and 1.79× on average across the two architectures and three datasets. 
Hence, the total im- provements for our proposed hybrid hardware implementation (PIM in first layer and digital computing in others), coupled with our energy-aware quantization and training technique, become 953× , 17.76× , 16921× compared to iso-architecture full-precision ANNs and 76.16× , 9.2× , 700.7× compared to iso-architecture iso- precision ANNs. Note that analog-PIM based SNNs are more cheaper in terms of energy con- sumptionthantheirCNNcounterparts. Thisisbecauseofthereasonssummarized below. • Since CNN requires both multi-bit activations and multi-bit weights, the precisionofADCsandDACsrequiredinanalog-PIMbasedCNNaccelerator is higher than for analog-SNN based accelerators. As is well known, ADCs are the most energy-expensive components in analog PIM accelerators, thus, this higher precision requirement leads to higher energy consumption. For 74 example, an 8 bit ADC consumes 2× more energy compared to a 4 bit ADC [134]. • ThelimitedprecisionofADCsalsonecessitates‘bit-streaming’[135],wherein multi-bit activations of CNN are serially streamed to analog-PIM crossbars and accumulated over time. Such serial streaming increases both delay and power consumption for computing. • Finally, the higher algorithmic sparsity associated with SNN leads to reduc- tion in energy consumption while performing analog-PIM operations. Note that this sparsity can also be leveraged by custom digital hardware. However,theenergy-delaybenefitassociatedwithanalog-PIMbasedSNNswith respect to digital SNN implementation is lower as compared to analog-PIM based CNN in comparison digital CNN implementation. This is because CNNs require extensive energy-hungry multiplication operations, while SNNs rely on cheaper accumulate operations. Moreover, analog PIM implementation leads to increased non-idealities and can decrease the resulting test accuracy of our HSI models. As the number of weights increases after the first layer (4 .5× in the 2 nd layer to 352.8× inthe6thlayerforCNN-3D),asinglelayerhastobemappedovermultiple memory sub-arrays. This, in turn, requires partial sums generated from individ- ual sub-arrays to be transferred via Network-on-chip (NoC) for accumulation and generation of output activation. The NoC and associated data transfer incurs in- crease in energy-delay and design complexity. Hence, we choose to avoid PIM in the subsequent layers. 4.5.4 Training Time and Memory Requirements Wealsocomparedthesimulationtimeandmemoryrequirementsduringthetrain- ing of the baseline SOTA ANN and our proposed SNN models. Because SNNs require iterating over multiple time steps and storing the membrane potentials for 75 Figure4.7: Energy,delay,andEDPoflayersof(a)CNN-3Dand(b)CNN-32Harchitec- tures, comparing 6-bit ANNs and SNN (obtained via Q-STDB) models while classifying IP. Figure 4.8: (a)Testaccuraciesfordifferentquantizationtechniquesduringtheforward path of training and inference with a 6-bit CNN-3D model on the IP dataset with 5 timesteps, (b) Test accuracies with 6, 9, and 12-bit weight precisions for post-training quantization with a CNN-32H model on the IP dataset with 5 timesteps. each neuron, their simulation time and memory requirements can be substantially higher than their ANN counterparts. However, training with ultra low-latency, as done in this work, can bridge this gap significantly as shown in Fig. 4.10. We compare the simulation time and memory usage during training of the baseline 76 ANNs and our proposed SNN models in Fig. 4.10(a) and (b) respectively. 
As we can see, the training time per epoch is less than a minute for all the architectures and datasets. Moreover, the peak memory usage during training is also lower for our SNN models compared to their ANN counterparts. Hence, we conclude that our approach does not incur any significant training overhead. Note that both the training time and memory usage are higher for CNN-32H than for CNN-3D because the output feature map of its last convolutional layer is very large. 4.5.5 Ablation Studies Weconductedseveralablationstudiesoncombinationsofaffineandscalequantiza- tion during training and inference, quantized training approaches, and the efficacy of ANN-based pre-training. Affine vs Scale Quantization Fig. 4.8(a) compares inference accuracies for three different quantization tech- niques during the forward path of training and test on the CNN-3D architecture withtheIPdatasetusing6-bitquantization. Performingscalequantizationduring trainingsignificantlydegradesperformance,whichfurtherjustifiesouruseofaffine quantization during training. However, using scale quantization during inference results in similar accuracy as affine quantization. We further explored the gap in accuracy for 4-bit and 5-bit quantization, as summarized in Table 4.5. We ob- served that the accuracy gap associated with using scale quantization instead of affine quantization during inference modestly grows to 1 .42% for 4-bit weights. Thissmalldropinrelativeaccuracyforlowbit-precisionsmaybeattributed to thebenefitofthezerofactorinaffinequantizationonquantizationerror. Quantiza- tionerroristypicallymeasuredbyhalfofthewidthofthequantizationbins,where the number of bins N B used is independent of the type of quantization and, due to the 2’s complement representation, centered around zero. However, the range of 77 A. Affine (training) and B. Affine (training) and Bit-precision Affine (inference) Scale (inference), ∆ from Column A. OA (%) AA (%) Kappa (%) ∆ OA (%) ∆ AA (%) ∆ Kappa (%) 6 98.89 98.39 98.21 0.21 0.05 0.01 5 98.79 98.36 98.24 0.41 0.13 0.21 4 98.50 98.01 98.07 1.42 2.37 2.53 Table 4.5: Loss in accuracy associated with use of scale quantization during inference. Evaluated using the CNN-3D model on the IP dataset. Figure 4.9: Weight shift (∆) in each layer of CNN-3D for (a) 4, (b) 5, and (c) 6-bit quantization, while classifying the IP dataset. valuesthesebinsmustspanissmallerforaffinequantizationbecausethezerofactor ensures the distribution of values is also centered at zero. This difference in range can be calculated as ∆ = r scale − r affine = 2· max(w max ,|w min |)− (w max − w min ). Assuming w min =− x· w max , ∆= (1− x)w max , if w max >− w min (x− 1)w max , otherwise. (4.26) As empirically shown in Fig. 4.9, the average ∆ across all the layers increases modestly as we decrease the bit-precision from 6 to 4. In contrast, the increase in quantization error associated with scale quantization is equal to ∆ 2N B and thus grows exponentially as the number of bits decrease. 78 A. Q-STDB from B. Diff. between proposed hybrid training C. Diff. 
between ANN-SNN conversion alone Architecture Dataset scratch and Q-STDB from scratch and Q-STDB from scratch OA (%) AA (%) Kappa (%) ∆ OA (%) ∆ AA (%) ∆ Kappa (%) ∆ OA (%) ∆ AA (%) ∆ Kappa (%) IP 96.83 96.25 96.23 1.85 2.11 1.97 -39.15 -45.37 -43.35 CNN-3D PU 99.38 99.04 99.17 0.14 0.13 0.16 -8.22 -10.2 -10.14 SS 96.05 95.79 95.60 1.90 1.30 1.83 -14.61 -19.07 -15.53 IP 95.93 95.36 95.40 1.53 1.37 1.49 -25.05 -28.8 -27.51 CNN-32H PU 99.12 98.49 98.55 0.23 0.39 0.40 -4.16 -8.37 -4.73 SS 96.04 95.90 95.33 1.95 1.36 1.95 -7.88 -11.71 -10.05 Table 4.6: Comparison between model performances for Q-STDB from scratch, pro- posed hybrid training, and ANN-SNN conversion alone. All cases are for 5 time steps and 6-bits. Q-STDB vs Post-Training Quantization (PTQ) PTQ with scale representation cannot always yield ultra low-precision SNNs with SOTA test accuracy. For example, as illustrated in Fig. 4.8(b), for the IP dataset andCNN-32Harchitecturewith5timesteps,thelowestbitprecisionoftheweights that the SNNs can be trained with PTQ for no more than 1% reduction in SOTA test accuracy is 12, two times larger bit-width than required by Q-STDB. Inter- estingly, the weights can be further quantized to 8-bits with less than 1% accuracy reduction if we increase the time steps to 10, but this costs latency. ComparisonbetweenQ-STDB with and without ANN-SNNConversion To quantify the extent that the ANN-based pre-training helps, we performed Q- STDB from scratch (using 5 time steps), where the weights are initialized from the standard Kaiming normal distribution. The results are reported in Table 4.6, wheretheresultsinthecolumnslabelledBandCareobtainedbycomparingthose fromthecolumnslabelledFandBrespectivelyinTable4.2withQ-STDBwithout ANN-SNN conversion. The results show that while Q-STDB from scratch beats conversion-only approaches, the inference accuracy can often be further improved usingourproposedhybridtrainingcombiningQ-STDBandANN-SNNconversion. 79 Figure 4.10: Comparison between our baseline SOTA ANNs and proposed SNNs with 5timestepsbasedon(a)trainingtimeperepoch,and(b)memoryusageduringtraining. Variation of (a) and (b) with the number of time steps for the IP dataset and CNN-32H architecture are shown in (c). 4.6 Conclusions and Broader Impact In this chapter, we extensively analyse the arithmetic intensities of 3D and 2D CNNs, and motivate the use of energy-efficient, low-latency, LIF-based SNNs for applications involving 3D image recognition, that requires 3D CNNs for accurate processing. We then present a quantization-aware training technique, that yields highly accurate low-precision SNNs. We propose to represent weights during the forwardpathoftrainingusingaffinequantizationandduringtheinferenceforward path using scale quantization. This provides a good trade-off between the SNN accuracy and inference complexity. We propose a 3D and hybrid combination of 3D and 2D convolutional architectures that are compatible with ANN-SNN con- version for HSI classification; the hybrid architecture incurs a small accuracy drop compared to the 3D counterpart, which shows the efficacy of 3D CNNs for HSI. Our quantized SNN models offer significant improvements in energy consumption compared to both full and low-precision ANNs for HSI classification. We also pro- pose a PIM architecture to process the energy-expensive first layer of our direct encoded SNN to further reduce the energy, delay and EDP of the SNN models. 
Our proposal results in energy-efficient SNN models that can be more easily deployed in HSI or 3D image sensors, thereby mitigating the bandwidth and privacy concerns associated with off-loading inference to the cloud. This improvement in energy efficiency is particularly important as the applications of HSI analysis expand and the depth of the SOTA models increases [136].

To the best of our knowledge, this work is the first to address the energy efficiency of HSI models, and it can hopefully inspire more research in algorithm-hardware co-design of neural networks for size, weight, and power (SWAP) constrained HSI applications.

Chapter 5
Hoyer Regularized Training for One-Time-Step SNNs

This chapter first provides the introduction and motivation behind the development of Hoyer-regularized one-time-step SNNs in Section 5.1. Preliminaries on Hoyer regularizers are provided in Section 5.2. Section 5.3 presents our proposed training framework, involving a novel Hoyer spike layer that sets the threshold based upon a novel Hoyer regularized training process, our network architectural modifications, and other training strategies that can be adopted to train one-time-step SNNs. Section 5.4 presents our experimental results on the accuracy, energy, and latency benefits of one-time-step SNNs compared to existing efficient networks. Finally, some discussions and conclusions are provided in Section 5.5.

5.1 Introduction & Related Work

Most SNN training algorithms, including ours, require multiple time steps, which increases training and inference costs compared to non-spiking counterparts for static vision tasks. The training effort is high because backpropagation must integrate the gradients over an SNN that is unrolled once for each time step [41,46]. Moreover, the multiple forward passes result in an increased number of spikes, which degrades the SNN's energy efficiency, both during training and inference, and offsets the compute advantage of the ACs. The multiple time steps also increase the inference complexity because of the need for input encoding logic and one forward pass per time step.

To mitigate these concerns, we propose one-time-step SNNs that do not require non-spiking DNN pre-training and are more compute-efficient than existing multi-time-step SNNs. Without any temporal overhead, these SNNs are similar to vanilla feed-forward DNNs with Heaviside activation functions [18]. These SNNs are also similar to sparsity-induced or uni-polar binary neural networks (BNNs) [137] that have 0 and 1 as their two states. However, such BNNs do not yield SOTA accuracy, unlike the bi-polar BNNs [138] that have 1 and -1 as their two states. A recent SNN work [139] also proposed the use of one time step; however, it required CNN pre-training followed by iterative SNN training from 5 down to 1 time steps, significantly increasing the training complexity, particularly for ImageNet-level tasks. There have been significant efforts in the SNN community to reduce the number of time steps via optimal DNN-to-SNN conversion [4,140], the lottery ticket hypothesis [141], and neural architecture search [142]. However, none of these works have been shown to train one-time-step SNNs without significant accuracy loss.

Our Contributions. Our training framework is based on a novel application of the Hoyer regularizer and a novel Hoyer spike layer.
More specifically, our spike layer threshold is training-input-dependent and is set to be a novel function of the Hoyer extremum of a clipped version of the membrane potential tensor, where the clipping threshold (existing SNNs use this as the threshold) is trained using gradient descent with our Hoyer regularizer. In this way, compared to SOTA one-time-step non-iteratively trained SNNs, our threshold increases the rate of weight updates and our Hoyer regularizer shifts the membrane potential distribution away from this threshold, improving convergence.

We consistently surpass the accuracies obtained by SOTA one-time-step SNNs [139] on diverse image recognition datasets with different convolutional architectures, while reducing the average training time by ∼19×. Compared to binary neural network (BNN) and adder neural network (AddNN) models, our SNN models yield similar test accuracy with a ∼5.5× reduction in floating point operations (FLOPs), thanks to the extreme sparsity enabled by our training framework. Downstream tasks on object detection also demonstrate that our approach surpasses the test mAP of existing BNNs and SNNs. We extend our approach to multiple time steps, which leads to a small but significant accuracy increase at the cost of a significant increase in memory and compute cost. Our experiments on dynamic vision sensing (DVS) tasks demonstrate a 1.30% increase in accuracy on average compared to SOTA works at iso-time-step and iso-architecture.

5.2 Preliminaries on Hoyer Regularizers

Based on the interplay between the L1 and L2 norms, a new measure of sparsity was first introduced in [143], based on which reference [144] proposed a new regularizer, termed the Hoyer regularizer, for the trainable weights, which was incorporated into the loss term to train DNNs. We adopt the same form of Hoyer regularizer for the membrane potential to train our SNN models, $H(u_l) = \frac{\|u_l\|_1^2}{\|u_l\|_2^2}$ [145]. Here, $\|u_l\|_i$ represents the Li norm of the tensor $u_l$, and the superscript t for the time step is omitted for simplicity. Compared to the L1 and L2 regularizers, the Hoyer regularizer is scale-invariant (similar to the L0 regularizer). It is also differentiable almost everywhere (see Eq. (5.1)), where $\mathrm{sign}(u_l)$ denotes the element-wise sign of the tensor $u_l$:
\[ \frac{\partial H(u_l)}{\partial u_l} = \frac{2\,\|u_l\|_1}{\|u_l\|_2^2}\left(\mathrm{sign}(u_l) - \frac{\|u_l\|_1}{\|u_l\|_2^2}\, u_l\right) \tag{5.1} \]
Letting the gradient $\frac{\partial H(u_l)}{\partial u_l} = 0$ and making all the $u_l$ positive, the value of the Hoyer extremum becomes $E(u_l) = \frac{\|u_l\|_2^2}{\|u_l\|_1}$. This extremum is a minimum, because the second derivative is greater than zero for any value of the output element. Training with the Hoyer regularizer can effectively help push the activation values that are larger than the extremum ($u_l > E(u_l)$) even larger and those that are smaller than the extremum ($u_l < E(u_l)$) even smaller.

5.3 Proposed Training Framework

Our approach is inspired by the fact that Hoyer regularizers can shift the pre-activation distributions away from the Hoyer extremum in a non-spiking DNN [144]. Our principal insight is that setting the SNN threshold to this extremum shifts the distribution of the membrane potentials away from the threshold value, reducing noise and improving convergence. To yield one-time-step SNNs, we present a novel Hoyer spike layer that sets the threshold based upon a Hoyer regularized training process, as described below.
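Before detailing the spike layer, the Hoyer regularizer and extremum of Section 5.2 can be sketched as follows (function names are illustrative; autograd provides the gradient of Eq. (5.1) automatically).

```python
import torch

def hoyer_reg(u: torch.Tensor) -> torch.Tensor:
    """Hoyer regularizer H(u) = ||u||_1^2 / ||u||_2^2 (scale-invariant)."""
    u = u.flatten()
    return u.abs().sum().pow(2) / u.pow(2).sum().clamp_min(1e-12)

def hoyer_ext(u: torch.Tensor) -> torch.Tensor:
    """Hoyer extremum E(u) = ||u||_2^2 / ||u||_1, where dH/du vanishes."""
    u = u.flatten()
    return u.pow(2).sum() / u.abs().sum().clamp_min(1e-12)

# Scale-invariance check: H(2u) equals H(u).
u = torch.rand(1000, requires_grad=True)
print(hoyer_reg(u), hoyer_reg(2 * u), hoyer_ext(u))
```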
5.3.1 Hoyer spike layer

In this work, we adopt a time-independent variant of the popular Leaky-Integrate-and-Fire (LIF) representation, as illustrated in Eq. 5.2, to model the spiking neuron with one time step:
\[ u_l = w_l\, o_{l-1}, \qquad z_l = \frac{u_l}{v^{th}_{l}}, \qquad o_l = \begin{cases} 1, & \text{if } z_l \geq 1 \\ 0, & \text{otherwise} \end{cases} \tag{5.2} \]
where $z_l$ denotes the normalized membrane potential. Such a neuron model with a unit step activation function is difficult to optimize even with the recently proposed surrogate gradient descent techniques for multi-time-step SNNs [41,73], which either approximate the spiking neuron functionality with a continuous differentiable model or use surrogate gradients to approximate the real gradients. This is because the average number of spikes with only one time step is too low to adjust the weights sufficiently using gradient descent with only one iteration available per input. If a pre-synaptic neuron does not emit a spike, the synaptic weight connected to it cannot be updated, because its gradient from neuron i to j is calculated as $g_{u_j} \times o_i$, where $g_{u_j}$ is the gradient of the membrane potential $u_j$ and $o_i$ is the output of neuron i. Therefore, it is crucial to reduce the value of the threshold to generate enough spikes for better network convergence. Note that a sufficiently low threshold can generate a spike for every neuron, but that would yield random outputs in the final classifier layer.

Figure 5.1: (a) Comparison of our Hoyer spike activation function with existing activation functions, where the blue distribution denotes the shifting of the membrane potential away from the threshold using Hoyer regularized training, and (b) proposed derivative of our Hoyer activation function.

Previous works [5,44] show that the number of SNN time steps can be reduced by training the threshold term $v^{th}_{l}$ using gradient descent. However, our experiments indicate that, for one-time-step SNNs, this approach still yields thresholds that produce significant drops in accuracy. In contrast, we propose to dynamically down-scale the threshold (see Fig. 5.1(a)) based on the membrane potential tensor using our proposed form of the Hoyer regularizer. In particular, we clip the membrane potential tensor corresponding to each convolutional layer to the trainable threshold $v^{th}_{l}$ obtained from gradient descent with our Hoyer loss, as detailed later in Eq. 5.11. Unlike existing approaches [5,42] that require $v^{th}_{l}$ to be initialized from a pre-trained non-spiking model, our approach can be used to train SNNs from scratch with a Kaiming uniform initialization [146] for both the weights and the thresholds. In particular, the normalized down-scaled threshold value, with which we compare the normalized membrane potential $z_l$ (this threshold is the constant 1 in existing LIF models, as shown in Eq. 5.2), is computed for each layer as the Hoyer extremum of the clipped membrane potential tensor, as shown in Fig. 5.1(a) and below:
\[ z^{clip}_{l} = \begin{cases} 1, & \text{if } z_l > 1 \\ z_l, & \text{if } 0 \leq z_l \leq 1 \\ 0, & \text{if } z_l < 0 \end{cases} \qquad o_l = h_s(z_l) = \begin{cases} 1, & \text{if } z_l \geq E(z^{clip}_{l}) \\ 0, & \text{otherwise} \end{cases} \tag{5.3} \]
Note that our normalized threshold $E(z^{clip}_{l})$ is less than the normalized threshold of existing models, whose value is 1 for any output (proof in supplementary materials). Hence, our actual threshold value $E(z^{clip}_{l}) \times v^{th}_{l}$ is indeed less than the trainable threshold $v^{th}_{l}$ used in earlier works [5,42]. We also observe that the Hoyer extremum in each layer changes only slightly during the later stages of training, which indicates that it is most likely an inherent attribute of the dataset and model architecture. Hence, to estimate the threshold during inference, we calculate the exponential average of the Hoyer extremums during training (similar to BN), and use the same during inference.
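A minimal sketch of the Hoyer spike layer of Eq. 5.3 follows. The boxcar surrogate in the backward pass anticipates Eq. 5.10 of the next subsection, and the exact autograd plumbing (class name, default scale) is an illustrative assumption rather than the released implementation.

```python
import torch

class HoyerSpike(torch.autograd.Function):
    """Forward: fire when z >= E(z_clip) (Eq. 5.3). Backward: boxcar surrogate."""
    @staticmethod
    def forward(ctx, z, scale=1.0):
        ctx.save_for_backward(z)
        ctx.scale = scale
        z_clip = z.clamp(0.0, 1.0)                                       # clip to the trainable threshold
        thr = z_clip.pow(2).sum() / z_clip.abs().sum().clamp_min(1e-12)  # Hoyer extremum E(z_clip)
        return (z >= thr).float()

    @staticmethod
    def backward(ctx, grad_out):
        (z,) = ctx.saved_tensors
        surrogate = ((z > 0) & (z < 2)).float() * ctx.scale              # cf. Eq. 5.10
        return grad_out * surrogate, None

# Usage inside a layer: o = HoyerSpike.apply(u / v_th)
```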
Hence, to estimate the threshold during inference, we calculate the exponential average of the Hoyer extremums during training (similar to BN), and use the same during inference. 5.3.2 Hoyer Regularized Training The loss function (L total ) of our proposed approach is shown below in Eq. 5.4. L total =L CE +λ H L H =L CE +λ H L− 1 X l=1 H(z clip l ) (5.4) where L CE denotes the cross-entropy loss calculated on the softmax output of the lastlayerL,andL H representstheHoyerregularizercalculatedontheinputofour 87 Hoyer spike layer after dividing the threshold term v th l and clipping. The weight update for the penultimate layer is computed as ∆ W L− 1 = ∂L CE ∂w L− 1 +λ H ∂L H ∂w L− 1 = ∂L CE ∂o L− 1 ∂o L− 1 ∂u L− 1 ∂u L− 1 ∂w L− 1 +λ H ∂L H ∂u L− 1 ∂u L− 1 ∂w L− 1 = ∂L CE ∂o L− 1 ∂o L− 1 ∂u L− 1 +λ H ∂H(z clip L− 1 ) ∂u L− 1 ! o L− 2 (5.5) ∂L CE ∂o L− 1 = ∂L CE ∂u L ∂u L ∂o L− 1 =(s− y)w L (5.6) where s denotes the output softmax tensor, i.e., s i = e u i L P N k=1 u k L where u i L and u k L denote the i th and k th elements of the membrane potential of the last layer L, and N denotes the number of classes. Note thaty denotes the one-hot encoded tensor of the true label, and ∂H(u L ) ∂u L is computed using Eq. 5.1. The last layer does not have any threshold and hence does not emit any spike. For a hidden layer l, the weight update is computed as ∆ W l = ∂L CE ∂w l +λ H ∂L H ∂w l = ∂L CE ∂o l ∂o l ∂z l ∂z l ∂u l ∂u l ∂w l +λ H ∂L H ∂u l ∂u l ∂w l = ∂L CE ∂o l ∂o l ∂z l o l− 1 v th l +λ H ∂L H ∂u l o l− 1 (5.7) where ∂L H ∂u l can be computed as ∂L H ∂u l = ∂L H ∂u l+1 ∂u l+1 ∂o l ∂o l ∂z l ∂z l ∂u l + ∂H(z clip l ) ∂u l = ∂L H ∂u l+1 w l+1 ∂o l ∂z l 1 v th l + ∂H(z clip l ) ∂z clip l ∂z clip l ∂z l 1 v th l (5.8) where ∂L H ∂u l+1 is the gradient backpropagated from the (l+1) th layer, that is itera- tively computed from the last layer L (see Eqs. 5.6 and 5.9). Note that for any 88 hidden layer l, there are two gradients that contribute to the Hoyer loss with re- spect to the potential u l ; one is from the subsequent layer (l+1) and the other is directlyfromitsHoyerregularizer. Similarly, ∂L CE ∂o l iscomputediteratively,starting from the penultimate layer (L− 1) defined in Eq. 5.6, as follows. ∂L CE ∂o l = ∂L CE ∂o l+1 ∂o l+1 ∂z l+1 ∂z l+1 ∂u l+1 ∂u l+1 ∂o l = ∂L CE ∂o l+1 ∂o l+1 ∂z l+1 w l+1 v th l (5.9) All the derivatives in Eq. 5-9 can be computed by Pytorch autograd, except the spike derivative ∂o l ∂z l , whose gradient is zero almost everywhere and undefined at z l =0. We extend the existing idea of surrogate gradient [75] to compute this derivative for one-time-step SNNs with Hoyer spike layers, as illustrated in Fig. 5.1(b) and mathematically defined as follows. ∂o l ∂z l = scale× 1 if 0<z l <2 0 otherwise (5.10) wherescale denotesahyperparameterthatcontrolsthedampeningofthegradient. Finally, the threshold update for the hidden layer l is computed as ∆ v th l = ∂L CE ∂v th l +λ H ∂L H ∂v th l = ∂L CE ∂o l ∂o l ∂z l ∂z l ∂v th l +λ H ∂L H ∂v th l = ∂L CE ∂o l ∂o l ∂z l − u l (v th l ) 2 +λ H ∂L H ∂u l+1 ∂u l+1 ∂v th l (5.11) ∂u l+1 ∂v th l = ∂u l+1 ∂o l · ∂o l ∂v th l =w l+1 · ∂o l ∂z l · − u l (v th l ) 2 (5.12) Note that we use this v th l , which is updated in each iteration, to estimate the threshold in our spiking model using Eq. 5.3. 5.3.3 Network Structure We propose a series of network architectural modifications of existing SNNs [5,42, 139] for our one-time-step models. As shown in Fig. 
5.3.3 Network Structure

We propose a series of architectural modifications to existing SNNs [5,42,139] for our one-time-step models. As shown in Fig. 5.2(a), for the VGG variant, we place the max pooling layer immediately after the convolutional layer, which is common in many BNN architectures [147], and introduce the BN layer after max pooling. Similar to recently developed multi-time-step SNN models [148-151], we observe that BN helps increase the test accuracy with one time step. For the ResNet variants, inspired by [152], we observe that models with shortcuts that bypass every block can further improve the performance of the SNN. We also observe that the sequence of BN layer, Hoyer spike layer, and convolution layer outperforms the original bottleneck in ResNet. More details are shown in Fig. 5.2(b).

Figure 5.2: Spiking network architectures corresponding to (a) VGG and (b) ResNet based models.

5.3.4 Possible Training Strategies

Based on the existing SNN literature, we hypothesize that two training strategies, other than our proposed approach, can be effectively used to train one-time-step SNNs.

Pre-trained DNN, followed by SNN fine-tuning. Similar to the hybrid training proposed in [30], we pre-train a non-spiking DNN model and copy its weights to the SNN model. Initialized with these weights, we train a one-time-step SNN with the normal cross-entropy loss.

Iteratively convert ReLU neurons to spiking neurons. First, we train a DNN model that uses the ReLU activation with a threshold; then we iteratively reduce the number of ReLU neurons whose output activation values are multi-bit. Specifically, we first force the neurons with values in the top N percentile to spike (set the output to 1) and those in the bottom N percentile to die (set the output to 0), and gradually increase N until there is a significant drop in accuracy or all neuron outputs are either 1 or 0.

Proposed training from scratch. With our proposed Hoyer spike layer and Hoyer regularized training, we train an SNN model from scratch.

Table 5.1: Accuracies from different strategies to train one-step SNNs on CIFAR10.

Training Strategy | Pretrained DNN (%) | SNN (%) | Spiking activity (%)
Pre-trained + fine-tuning | 93.15 | 91.39 | 23.56
Iterative training (N=10) | 93.25 | 92.68 | 10.22
Iterative training (N=20) | 92.68 | 92.24 | 9.54
Proposed training | - | 93.13 | 22.57

Our results with these training strategies are shown in Table 5.1, which indicates that it is difficult for training strategies that involve pre-training and fine-tuning to approach the accuracy of non-spiking models with one time step. One possible reason for this might be the difference in the distribution of the pre-activation values between the DNN and SNN models [42]. It is also intuitive to obtain a one-time-step SNN model by iteratively reducing the proportion of ReLU neurons from a pretrained full-precision DNN model. However, our results indicate that this method also fails to generate enough spikes at one time step to yield SOTA accuracy. Finally, with our network modifications to existing SNN works, our Hoyer spike layer, and our Hoyer regularizer, we can train a one-time-step SNN model with SOTA accuracy from scratch.

5.4 Experimental Results

Datasets & Models: Similar to existing SNN works [5,30], we perform object recognition experiments on the CIFAR10 [153] and ImageNet [154] datasets using

Table 5.2: Comparison of the test accuracy of our one-time-step SNN models with the non-spiking DNN models for object recognition. Model ∗ indicates that we remove the first max pooling layer, and SA denotes spiking activity.
Network dataset DNN (%) SNN (%) SA (%) VGG16 CIFAR10 94.10 93.44 21.87 ResNet18 CIFAR10 93.34 91.48 25.83 ResNet18 ∗ CIFAR10 94.28 93.67 16.12 ResNet20 CIFAR10 93.18 92.38 23.69 ResNet34 ∗ CIFAR10 94.68 93.47 16.04 ResNet50 ∗ CIFAR10 94.90 93.00 17.79 VGG16 ImageNet 70.08 68.00 24.48 ResNet50 ImageNet 73.12 66.32 23.89 VGG16 [155] and several variants of ResNet [156] architectures. For object de- tection, we use the MMDetection framework [157] with PASCAL VOC2007 and VOC2014[158]astrainingdataset, andbenchmarkourSNNmodelsandthebase- lines on the VOC2007 test dataset. We use the Faster R-CNN [159] and Reti- naNet[160]framework,andsubstitutetheoriginalbackbonewithourSNNmodels pretrained on ImageNet1K. Object Recognition Results: For training the recognition models, we use the Adam [161] optimizer for VGG16, and use SGD optimizer for ResNet models. As shown in Table 5.2, we obtain the SOTA accuracy of 93.44% on CIFAR10 with VGG16 with only one time step; the accuracy of our ResNet-based SNN models on ImageNet also surpasses the existing works. On ImageNet, we obtain a 68.00% top-1 accuracy with VGG16 which is only ∼ 2% lower compared to the non-spiking counterpart. All our SNN models yield a spiking activity of∼ 25% or lower on both CIFAR10 and ImageNet, which is significantly lower compared to the existing multi-time-step SNN models as shown in Fig. 5.3. Object Detection Results: For object detection on VOC2007, we compare theperformanceobtainedbyourspikingmodelswithnon-spikingDNNsandBNNs in Table 5.3. For two-stage architectures, such as Faster R-CNN, the mAP of our 92 Table 5.3: Comparisonofourone-time-stepSNNmodelswithnon-spikingDNN,BNN, and multi-step SNN counterparts on VOC2007 test dataset. Framework Backbone mAP(%) Faster R-CNN Original ResNet50 79.5 Faster R-CNN Bi-Real [152] 65.7 Faster R-CNN ReActNet [162] 73.1 Faster R-CNN Our spiking ResNet50 73.7 Retinanet Original ResNet50 77.3 Retinanet SNN ResNet50 (ours) 70.5 YOLO SNN DarkNet [163] 53.01 SSD BNN VGG16 [164] 66.0 one-time-stepSNNmodelssurpasstheexistingBNNsby>0.6% 1 . Forone-stagear- chitectures,suchasRetinaNet(chosenbecauseofitsSOTAperformance),ourone- time-step SNN models with a ResNet50 backbone yields a mAP of 70.5% (highest among existing BNN, SNN, AddNNs). Note that our spiking VGG and ResNet- based backbones lead to a significant drop in mAP with the YOLO framework thatismorecompatiblewiththeDarkNetbackbone(evenexistingDarkNet-based SNNs lead to very low mAP with YOLO as shown in Table 5.3). However, our models suffer 5 .8− 6.8% drop in mAP compared to the non-spiking DNNs which may be due to the significant sparsity and loss in precision. Accuracy Comparison: We compare our results with various SOTA ultra low-latency SNNs for image recognition tasks in Table 6.3. Our one-time-step SNNs yield comparable or better test accuracy compared to all the existing works for both VGG and ResNet architectures, with significantly lower inference latency. The only exception for the latency reduction is the one-time-step SNN proposed in [139], however, it increases the training time significantly as illustrated later in Fig. 3. Other works that have training complexity similar or worse than ours, such as [42] yields 1.78% lower accuracy with a 2× more number of time steps. 1 We were unable to find existing SNN works for two-stage object detection architectures. 93 Table 5.4: Comparison of our one-time-step SNN models to existing low-latency coun- terparts. 
SGD and hybrid denote surrogate gradient descent and pre-trained DNN fol- lowed by SNN fine-tuning respectively. (qC, dL) denotes an architecture with q convo- lutional and d linear layers. Ref. Training Architecture Acc. (%) Timesteps Dataset : CIFAR10 [4] DNN-SNN conversion VGG16 92.29 16 [37] SGD 5C, 2L 90.53 12 [90] Hybrid VGG16 92.74 10 [165] Tandem Learning 5C, 2L 90.98 8 [166] DNN-SNN coonversion VGG16 90.96 8 [167] SGD 5C, 2L 91.41 5 [5] Hybrid VGG16 92.70 5 [148] STBP-tdBN ResNet19 93.16 6 [42] Hybrid VGG16 91.79 2 [140] DNN-SNN conversion VGG16 91.18 2 [89] SGD 5C, 2L 93.50 8 [139] Hybrid VGG16 93.05 1 [139] Hybrid ResNet20 91.10 1 Ours Adam+Hoyer Reg. VGG16 93.44 1 Dataset : ImageNet [85] DNN-SNN conversion VGG16 63.64 32 [140] DNN-SNN conversion ResNet34 59.35 16 [165] Tandem Learning AlexNet 50.22 12 [5] Hybrid VGG16 69.00 5 [168] SGD ResNet34 67.04 4 [168] SGD ResNet152 69.26 4 [5] Hybrid VGG16 69.00 5 [148] STBP-tdBN ResNet34 67.05 6 [139] Hybrid VGG16 67.71 1 Ours Adam+Hoyer Reg. VGG16 68.00 1 Across both CIFAR10 and ImageNet, our proposed training framework demon- strates 2-32× improvement in inference latency with similar or worse training complexity compared to other works while yielding better test accuracy. Table 6.3 also demonstrates that the DNN-SNN conversion approaches require more time steps compared to our approach at worse test accuracies. Inference Efficiency : We compare the energy-efficiency of our one-time-step SNNs with non-spiking DNNs and existing multi-time-step SNNs in Fig. 5.3. The compute-efficiency of SNNs stems from two factors: 1) sparsity, that reduces the number of floating point operations in convolutional and linear layers compared to non-spikingDNNsaccordingtoSNN flops l =S l × DNN flops l [139],whereS l denotes the average number of spikes per neuron per inference over all timesteps in layer 94 Figure 5.3: Layerwise spiking activities for a VGG16 across time steps ranging from 5 to 1 (average spiking activity denoted as S in parenthesis) representing existing low- latency SNNs including our work on (a) CIFAR10, (b) ImageNet, (c) Comparison of the total energy consumption between SNNs with different time steps and non-spiking DNNs. l. 2) Use of only AC (0.9pJ) operations that consume 5.1× lower compared to eachMAC(4.6pJ)operationin45nmCMOStechnology[1]forfloating-point(FP) representation. NotethatthebinaryactivationscanreplacetheFPmultiplications with logical operations, i.e., conditional assignment to 0 with a bank of AND gates. These replacements can be realized using existing hardware (eg. standard GPUs) depending on the compiler and the details of their data paths. Building a custom accelerator that can efficiently implement these reduced operations is also possible [169–171]. In fact, in neuromorphic accelerators such as Loihi [172], FPmultiplicationsaretypicallyavoidedusingmessagepassingbetweenprocessors that model multiple neurons. The total compute energy (CE) of a multi-time-step SNN (SNN CE ) can be estimated as SNN CE =DNN flops 1 ∗ 4.6+DNN com 1 ∗ 0.4 + L X l=2 S l ∗ DNN flops l ∗ 0.9+DNN com l ∗ 0.7 (5.13) because the direct encoded SNN receives analog input in the first layer ( l=1) without any sparsity [5,42,139]. Note that DNN com l denotes the total number of comparison operations in the layer l with each operation consuming 0.4pJ en- ergy. 
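The following helper illustrates this energy model. It simply evaluates Eq. 5.13 for the SNN, together with the non-spiking baseline described immediately below, using the per-operation energies quoted above (4.6 pJ per MAC, 0.9 pJ per AC, and the comparison energies written into Eq. 5.13); the per-layer FLOP and comparison counts and the spiking activities S_l are assumed to be profiled separately.

```python
# Compute-energy model of Eq. 5.13 (values in pJ per operation, 45nm CMOS).
def snn_compute_energy_pj(flops, comps, spike_rates):
    """flops[l], comps[l]: per-layer op counts; spike_rates[l]: average spikes per neuron S_l."""
    # Direct-encoded first layer (l = 1) sees analog inputs with no sparsity: full MACs.
    energy = flops[0] * 4.6 + comps[0] * 0.4
    # Remaining layers: sparsity-gated ACs plus threshold comparisons.
    for f, c, s in zip(flops[1:], comps[1:], spike_rates[1:]):
        energy += s * f * 0.9 + c * 0.7
    return energy


def dnn_compute_energy_pj(flops):
    # Non-spiking baseline: every FLOP is a MAC; the ReLU sign check is ignored.
    return sum(f * 4.6 for f in flops)
```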
The CE of the non-spiking DNN (DNN CE ) is estimated as DNN CE = P L l=1 DNN flops l ∗ 4.6, where we ignore the energy consumed by the ReLU opera- tion since that includes only checking the sign bit of the input. 95 Figure 5.4: Normalized training and inference time per epoch with iso-batch (256) and hardware (RTX 3090 with 24 GB memory) conditions for (a) CIFAR10 and (b) ImageNet with VGG16. We compare the layer-wise spiking activities S l for time steps ranging from 5 to 1 in Fig. 5.3(a-b) that represent existing low-latency SNN works, including our work. Note, the spike rates decrease significantly with time step reduction from 5 to1,leadingtoconsiderablylowerFLOPsinourone-time-stepSNNs. Theselower FLOPs, coupled with the 5.1× reduction for AC operations leads to a 22.9× and 32.1× reduction in energy on CIFAR10 and ImageNet respectively with VGG16. Though we focus on compute energies for our comparison, multi-time-step SNNs also incur a large number of memory accesses as the membrane potentials and weightsneedtobefetchedfromandreadtotheon-/off-chipmemoryforeachtime step. Our one-time-step models can avoid these repetitive read/write operations asitdoesinvolveany state andleadtoa∼ T× reductioninthenumberofmemory accessescomparedtoaT-time-stepSNNmodel. Consideringthismemorycostand the overhead of sparsity [173], as shown in Fig. 5.3(c), our one-time-step SNNs lead to a 2.08− 14.74× and 22.5− 31.4× reduction of the total energy compared to multi-time-step SNNs and non-spiking DNNs respectively on a systolic array accelerator. Training & Inference Time Requirements: Because SOTA SNNs require iteration over multiple time steps and storage of the membrane potentials for each neuron, their training and inference time can be substantially higher than their DNN counterparts. However, reducing their latency to 1 time step can bridge this 96 Table 5.5: Ablation study of the different methods in our proposed training framework on CIFAR10. Arch. Network Structure Hoyer Reg. Hoyer Spike Acc. (%) Spiking Activity (%) VGG16 × × × 88.42 15.62 VGG16 ✓ × × 90.33 20.43 VGG16 ✓ ✓ × 90.45 20.48 VGG16 ✓ × ✓ 92.90 21.70 VGG16 ✓ ✓ ✓ 93.13 22.57 ResNet18 × × × 87.41 22.78 ResNet18 ✓ × × 91.08 27.62 ResNet18 ✓ ✓ × 90.95 20.50 ResNet18 ✓ × ✓ 91.17 25.87 ResNet18 ✓ ✓ ✓ 91.48 25.83 gap significantly, as shown in Figure 5.4. On average, our low-latency, one-time- step SNNs represent a 2.38× and 2.33× reduction in training and inference time perepochrespectively, comparedtothemulti-time-steptrainingapproaches[5,42] with iso-batch and hardware conditions. Compared to the existing one-time-step SNNs [139], we yield a 19× and 1.25× reduction in training and inference time. Such significant savings in training time, which translates to power savings in big data centers, can potentially reduce AI’s environmental impact. Ablation Studies: We conduct ablation studies to analyze the contribution of each technique in our proposed approach. For fairness, we train all the ablated models on CIFAR10 dataset for 400 epochs, and use Adam as the optimizer, with 0.0001 as the initial learning rate. Our results are shown in Table 5.5, where the modelwithoutHoyerspikelayerindicatesthatwesetthethresholdasv th l similarto existingworks[5,42]ratherthanourproposedHoyerextremum. WithVGG16,our optimal network modifications lead to a 1 .9% increase in accuracy. Furthermore, adding only the Hoyer regularizer leads to negligible accuracy and spiking activity improvements. 
This might be because the regularizer alone may not be able to sufficiently down-scale the threshold for optimal convergence with one time step. However, with our Hoyer spike layer, the accuracy improves by 2.68% to 93.13% 97 Table 5.6: Accuracies of weight quantized one-time-step SNN models based on VGG16 on CIFAR10 where FP is 32-bit floating point. CE denotes compute energy. Bits Acc. (%) Spiking Activity (%) CE (mJ) FP 93.13 22.57 297.42 6 93.11 22.46 61.9 4 92.84 21.39 39.4 2 92.34 22.68 21.6 while also yielding a 2.09% increase in spiking activity. We observe a similar trend for our network modifications and Hoyer spike layer with ResNet18. However, Hoyerregularizersubstantiallyreducesthespikingactivityfrom27.62%to20.50%, while also negligibly reducing the accuracy. Note that the Hoyer regularizer alone contributes to 0.20% increase in test accuracy on average. In summary, while our network modifications significantly increase the test accuracy compared to the SOTA SNN training with one time step, the combination of our Hoyer regularizer and Hoyer spike layer yield the SOTA SNN performance. Effect on Quantization : In order to further improve the compute-efficiency of our one-time-step SNNs, we perform quantization-aware training of the weights in our models to 2− 6 bits. This transforms the full-precision ACs to 2− 6 bit ACs, therebyleadingtoa4.8− 13.8reductionincomputeenergyasobtainedfromFPGA simulations on the Kintex7 platform using custom RTL specifications. Note that we only quantize the convolutional layers, as quantizing the linear layers lead to a noticeable drop in accuracy. From the results shown in Table 5.6, when quantized to 6 bits, our one-time-step VGG-based SNN incur a negligible accuracy drop of only 0.02%. Even with 2-bit quantization, our model can yield an accuracy of 92.34% with any special modification, while still yielding a spiking activity of ∼ 22%. 98 Table 5.7: Comparison of our one-time-step SNN models to AddNNs and BNNs that also incur AC-only operations for improved energy-efficiency. CE denotes compute en- ergy. Reference Dataset Acc.(%)CE (J) BNNs [174] CIFAR10 89.6 0.022 [137] CIFAR10 90.2 0.019 [137] ImageNet 59.7 3.6 [138] CIFAR10 91.9 0.073 AddNNs [175] (FP weights) CIFAR10 93.72 1.62 [175] (2-bit weights) CIFAR10 92.08 0.12 [175] (FP weights) ImageNet 67.0 77.8 [176] (FP weights) CIFAR10 91.56 1.62 Our SNNs This work (FP weights) CIFAR10 93.44 0.297 This work (2-bit weights) CIFAR10 92.34 0.021 This work (FP weights) ImageNet 68.00 14.28 ComparisonwithAddNNs&BNNs: WecomparetheaccuracyandCEof our one-time-step SNN models with recently proposed AddNN models [175] that also removes multiplications for increased energy-efficiency in Table 5.7. With the VGG16 architecture, on CIFAR10, we obtain 0.6% lower accuracy, while on ImageNet, we obtain 1.0% higher accuracy. Moreover, unlike SNNs, AddNNs do not involve any sparsity, and hence, consume ∼ 5.5× more energy compared to our SNN models on average across both CIFAR10 and ImageNet (see Table 5.7). We also compare our SNN models with SOTA BNNs in Table 7 that replaces the costly MAC operations with cheaper pop-count counterparts, thanks to the binary weights and activations. Both our full-precision and 2-bit quantized one- time-step SNN models yield accuracies higher than BNNs at iso-architectures on both CIFAR10 and ImageNet. 
Additionally, our 2-bit quantized SNN models also consume 3.4× lower energy compared to the bi-polar networks (see [138] in Table 5.7) due to the improved trade-off between the low spiking activity ( ∼ 22% as shown in Table 7) provided by our one-time-step SNN models, and less energy due 99 Table 5.8: Test accuracy obtained by our approach with multiple time steps on CI- FAR10. Architecture Time steps Acc. (%) Spiking activity (%) VGG16 1 93.44 21.87 VGG16 2 93.71 44.06 VGG16 4 94.14 74.88 VGG16 6 94.04 101.22 ResNet18 1 91.48 25.83 RseNet18 2 91.93 33.24 ResNet18 4 91.93 59.20 ResNet18 6 92.01 83.82 toXORoperationscomparedtoquantizedACs. Ontheotherhand, ourone-time- step SNNs consume similar energy compared to unipolar BNNs (see [137,174] in Table 5.7) while yielding 3.2% higher accuracy on CIFAR10 at iso-architecture. The energy consumption is similar because the∼ 20% advantage of the pop-count operations is mitigated by the∼ 22% higher spiking activity of the unipolar BNNs compared to our one-time-step SNNs. Extension to multiple time-steps: We extend our proposed approach to multi-time-step SNN models by adopting the standard LIF model [40] for the neu- rons but computing the threshold based on the Hoyer extremum defined in Eq. 3 and unrolling the gradients derived in Section 3.2 using traditional backpropaga- tionthroughtime[5]. AsshowninTable5.8,astimestepincreasesfrom1to4,the accuracy of the model also increases from 93.44% to 94.14%, which validates the effectiveness of our method. However, this accuracy increase comes at the cost of asignificantincreaseinspikingactivity(seeTable5.8wherethespikingactivityis computedacrossthetotalnumberoftimestepssimilarto[40]), therebyincreasing the compute energy. The memory cost also increases due to the repetitive access ofthemembranepotentialandweightsacrossthedifferenttimesteps. Inthisway, our single and multiple time steps form a bridge between sparsity-induced BNNs and low time-step SNNs, connecting the BNN and SNN communities. 100 Table 5.9: Comparison of our one- and multi-time-step SNN models to existing SNN models on DVS-CIFAR10 dataset. Reference Training Architecture Acc. (%) T.S [150] TET VGGSNN 83.17 10 [150] TET VGGSNN 75.20 4 [150] TET VGGSNN 68.12 1 [177] tdBN+NDA VGG11 81.7 10 [178] SALT+BN VGG16 67.1 20 This work Hoyer reg. VGGSNN 83.68 10 This work Hoyer reg. VGGSNN 76.17 4 This work Hoyer reg. VGGSNN 69.80 1 Extension toDVSdatasets: TheinherenttemporaldynamicsinSNNsmay be leveraged in DVS and event-based tasks [150,177–179]. This motivates the application of our framework on the DVS-CIFAR10 dataset, which provides each labelwithonly0.9ktrainingsamples,andisconsideredthemostchallengingevent- based dataset [150]. As illustrated in Table 5.9, we surpass the test accuracy of existing works [177,178] by 1.30% on average at iso-time-step and architecture. Note that the architecture VGGSNN employed in our work and [150] is based on VGG11 with two dense layers removed because [150] found that additional dense layers were unnecessary for neuromorphic datasets. In fact, our accuracy gain is moresignificantatlowtimesteps,therebyimplyingtheportabilityofourapproach to DVS tasks. Note that similar to static datasets, a large number of time steps increasethetemporaloverheadinSNNs, resultinginalargecomputeandmemory footprints. 5.5 Discussions & Future Impact Current SNN training works use ANN-SNN conversion to yield high accuracy or SNN fine-tuning to yield low latency or a hybrid of both for a balanced accuracy- latency trade-off. 
However, none of the existing works can discard the temporal 101 dimension completely, which can enable the deployment of SNN models in multi- ple real-time applications, without significantly increasing the training cost. This chapter presents a SNN training framework from scratch involving a novel com- bination of a Hoyer regularizer and Hoyer spike layer for one and ultra-low time steps. Our SNN models incur similar training time as non-spiking DNN models and achieveSOTAaccuracy, outperformingtheexistingSNN,BNN,andAddNNmod- els. Additionally, our models yield a 24.34− 34.21× reduction in energy compared to standard DNNs, which can enable the deployment of our models on power- constrained devices. However, our work can also enable cheap and real-time com- puter vision systems that might be susceptible to adversarial attacks. Preventing the application of this technology from abusive usage is an important and inter- esting area of future work. 102 Chapter 6 Efficient Spiking LSTMs for Streaming Workloads This chapter first provides the introduction and motivation behind the develop- ment of efficient spiking LSTMs 6.1. Section 6.2 presents our proposed training framework,involvinganovelANN-to-SNNconversionframework,followedbySNN fine-tuning via backpropagation through time (BPTT). Section 6.3 presents our pipelined parallel processing scheme that hides the SNN time steps, significantly improving system latency, especially for long sequences. Our accuracy, latency, and energy-efficiency results are presented in Section 6.4. This chapter concludes in Section 6.5. 6.1 Introduction & Related Work Though there has been a lot of efforts in developing SNNs for static vision tasks, there has been relatively little research that target SNNs for sequence learning tasks. Some proposed spiking LSTM works are limited to binary inputs [180,181], which might not represent several real-world use cases, and others [4,34,182] are obtainingfromvanillaRNNs,whichcanleadtolargeaccuracydrops. Otherworks [6]exploredconvertingSNNsfromLSTMsbutincurredlargedropsinaccuracyand lackedarchitecturalparallelismthatledtolonglatencies. Amorerecentwork[181] uses a more complex neuron model than the leaky-integrate-and-fire (LIF) model to improve the recurrence dynamics and multi-bit activation maps to improve training at the cost of requiring multiplication and the associated higher energy consumption. Finally, although transformers have recently attracted significant attention for temporal tasks [183], existing works on spiking transformers only 103 Figure 6.1: Spiking (hard) and non-spiking activation (IF and LIF) functions corre- sponding to (a) sigmoid and (b) tanh activation for T = 4, and V th sig = 4, V th tanh+ = 3, V th tanh− =− 2. We show the proposed bias shifts for IF activations. The green and red dotted lines show the continuous versions of the discrete LIF activation functions. target static vision tasks [184] and come at the cost of a very high compute and memory footprint. Moreover, LSTMs rather can ingest naturally temporal inputs, unlike transformers that can start processing only after the entire input sequence arrives. ThischapterleveragesboththeefficienttemporalandsparsedynamicsofSNNs toreducetheinferencelatencyandenergyoflarge-scalestreamingworkloadswhile achieving close to SOTA accuracy. In particular, we adopt an LSTM which pro- vides memorization using an unrolled architecture over a sequence length and is independent of the number of SNN time steps. 
We aim to reduce the number of time steps to lower latency and energy consumption, and to improve the resulting inference accuracy through novel training techniques. The key contributions of this work are summarized below.

• We propose a training framework that starts with a spiking LSTM converted from a pre-trained non-spiking LSTM. Our framework has four steps: i) convert the traditional sigmoid and tanh activation functions in the source LSTM to clipped versions; ii) judiciously convert a subset of these functions to IF activation functions such that the SNN does not require expensive MAC operations; iii) find the optimal shifts of the IF activation functions for IID inputs; and iv) fine-tune the SNN with LIF activation functions, updating the shifts for non-IID training inputs.

• We propose a high-level parallel and pipelined implementation of the resulting SNN-based models that incurs negligible latency overhead compared to the baseline LSTM and improves hardware utilization over existing spiking LSTMs.

• We demonstrate the energy-latency-accuracy trade-off benefits of our framework through FPGA synthesis and place-and-route and through algorithmic experiments with different LSTM architectures on sequential tasks from computer vision (temporal MNIST), spoken term classification (Google Speech Commands), and human activity recognition (UCI Smartphone), along with comparisons with existing spiking and non-spiking LSTMs.

6.2 Proposed Training Framework

6.2.1 Non-spiking LSTM

For traditional LSTM-based architectures, the non-linear activation functions are tanh and sigmoid. In order to yield accurate LSTM-based SNN models, we first replace them with hard (clipped) versions, as illustrated by the blue dotted lines in Fig. 6.1(a-b). In particular, we clip the hard sigmoid function to 1 at $x = \frac{V_{th}^{sig}}{2}$ and to 0 at $x = -\frac{V_{th}^{sig}}{2}$, where $V_{th}^{sig}$ is the intended threshold value of the converted SNNs. To increase the representative power, we model the hard tanh function with two hard sigmoid functions, one for when its input $x \ge 0$, clipped to 1 at $x = V_{th}^{tanh+}$, and the other for when $x < 0$, clipped to -1 at $x = V_{th}^{tanh-}$. Note that in our training framework the SNN threshold values are initialized with their values in the trained non-spiking LSTM models.

Figure 6.2: LIF activation function corresponding to the (a) sigmoid and (b) tanh activation function used in the spiking LSTM architecture, and (c) proposed spiking LSTM architecture and dataflow with the parallel pipelined execution for the example of 5 time steps and 3 input elements in the sequence.

6.2.2 Conversion from non-spiking to spiking LSTMs

Let $U_{sig}^{temp}(t)$ and $U_{tanh}^{temp}(t)$ denote the accumulated membrane potentials at time step $t$ associated with our novel sigmoid and tanh SNN modules. As per our proposed clipping, the associated spike outputs $S_{sig}(t)$ and $S_{tanh}(t)$ are

$$S_{sig}(t) = \begin{cases} 1, & \text{if } U_{sig}^{temp}(t) > \frac{V_{th}^{sig}}{2} \\ 0, & \text{otherwise,} \end{cases} \qquad (6.1)$$

$$S_{tanh}(t) = \begin{cases} 1, & \text{if } U_{tanh}^{temp}(t) > V_{th}^{tanh+} \\ -1, & \text{if } U_{tanh}^{temp}(t) < V_{th}^{tanh-} \\ 0, & \text{otherwise.} \end{cases} \qquad (6.2)$$

Note that the two types of spikes (positive and negative) we propose for $S_{tanh}(t)$ may be less bio-plausible, but they can easily be implemented in hardware, as illustrated in Fig. 6.2(a)-(b), and do not significantly degrade the energy-efficiency.
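A minimal sketch of these spike-generation rules is shown below. The accumulate-then-compare step follows Eqs. 6.1-6.2 directly, while the reset-by-subtraction after a spike is an illustrative assumption, since the exact reset behavior is not restated here.

```python
import torch


def spiking_sigmoid_step(u, weighted_input, v_th_sig):
    """One time step of the sigmoid-derived SNN module (Eq. 6.1)."""
    u = u + weighted_input                      # accumulate membrane potential
    spike = (u > v_th_sig / 2).float()          # binary spike
    u = u - spike * v_th_sig                    # assumed soft reset on spiking
    return spike, u


def spiking_tanh_step(u, weighted_input, v_th_pos, v_th_neg):
    """One time step of the tanh-derived SNN module (Eq. 6.2); v_th_neg < 0 < v_th_pos."""
    u = u + weighted_input
    spike = (u > v_th_pos).float() - (u < v_th_neg).float()   # +1, -1, or 0
    u = u - (spike == 1).float() * v_th_pos - (spike == -1).float() * v_th_neg  # assumed reset
    return spike, u
```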
Our SNN training framework starts by training a baseline non-spiking LSTM model that we convert into a spiking LSTM. To minimize the error during this conversion, inspired by [4,85,185], which focus on CNNs, we first define a notion of IF activation functions $\bar{Y}^{IF}_{sig}$ and $\bar{Y}^{IF}_{tanh}$ that represent the average of the IF activation outputs over all time steps.

$$\bar{Y}^{IF}_{sig} = \frac{1}{T}\,\mathrm{clip}\!\left(\left\lfloor \frac{T}{V_{th}^{sig}}\left(W\bar{X}_{sig} + B + \frac{V_{th}^{sig}}{2}\right)\right\rfloor,\, 0,\, T\right)$$

$$\bar{Y}^{IF}_{tanh} = \begin{cases} \dfrac{1}{T}\,\mathrm{clip}\!\left(\left\lfloor \dfrac{T}{V_{th}^{tanh+}}\left(W\bar{X}_{tanh} + B\right)\right\rfloor,\, 0,\, T\right), & \text{if } A \\ \dfrac{1}{T}\,\mathrm{clip}\!\left(\left\lfloor \dfrac{T}{V_{th}^{tanh-}}\left(W\bar{X}_{tanh} + B\right)\right\rfloor,\, -T,\, 0\right), & \text{otherwise} \end{cases}$$

Here, $T$ denotes the total number of SNN time steps, $\bar{X}_{sig}$ and $\bar{X}_{tanh}$ denote the time-averaged inputs, and $A$ corresponds to the condition $W\bar{X}_{tanh} > 0$. $\mathrm{clip}(x,y,z) = y$ when $x \le y$, $\mathrm{clip}(x,y,z) = x$ when $y < x < z$, and $\mathrm{clip}(x,y,z) = z$ when $x \ge z$.

We then observe, as illustrated in Fig. 6.1(a), that the IF activation $\bar{Y}^{IF}_{sig}$ with $B = 0$ is always less than its sigmoid counterpart. Hence the error accumulates over the multiple time steps and input elements in the sequence. To mitigate this error, we propose to shift the IF activation curve to the left by appropriately shifting the bias term $B$, as shown in Fig. 6.1(a). Under the assumption that $\bar{X}_{sig}$ is IID and the converted spiking bias value is 0 (i.e., we do not include the bias term in the pre-trained non-spiking LSTM), the optimal value of $B$ is $V_{th}^{sig}/2T$, as shown in Fig. 6.1(a). Similarly, we can show that, under the assumption that $\bar{X}_{tanh}$ is IID for $\bar{X}_{tanh} \ge 0$ and $\bar{X}_{tanh} < 0$ separately, the optimal value is $B = V_{th}^{tanh+}/2T$ for $\bar{X}_{tanh} \ge 0$ and $B = V_{th}^{tanh-}/2T$ for $\bar{X}_{tanh} < 0$.

6.2.3 SNN Training with LIF Activation Shifts

After the initial conversion to a spiking LSTM, we aim to further optimize the error between the outputs of the IF and non-spiking LSTM activations in order to reduce the number of time steps and the resulting energy consumption. Towards this goal, we convert the IF model to its LIF counterpart by incorporating the leak term, which provides a tunable control knob to minimize this error. The LIF activations converted from the hard sigmoid and hard tanh functions can be computed as

$$\bar{Y}^{LIF}_{sig} = \begin{cases} \dfrac{1}{T}\left\lfloor \dfrac{T}{t_{sig}} \right\rfloor, & \text{if } W\bar{X}_{sig} > V_{th}^{sig}(1-\lambda) \\ 0, & \text{otherwise,} \end{cases} \qquad (6.3)$$

$$\bar{Y}^{LIF}_{tanh} = \begin{cases} \dfrac{1}{T}\left\lfloor \dfrac{T}{t_{tanh+}} \right\rfloor, & \text{if } W\bar{X}_{tanh} > V_{th}^{tanh+}(1-\lambda) \\ -\dfrac{1}{T}\left\lfloor \dfrac{T}{t_{tanh-}} \right\rfloor, & \text{else if } W\bar{X}_{tanh} < V_{th}^{tanh-}(1-\lambda) \\ 0, & \text{otherwise} \end{cases}$$

The functions have been illustrated for two different settings of $\lambda$ in Fig. 6.1(a), one for which $\lambda > 1$ and one for which $\lambda < 1$. Assuming $K = \frac{1-\lambda}{W}$,

$$t_{sig} = \frac{\log\left(1 - \frac{K V_{th}^{sig}}{\bar{X}_{sig}}\right)}{\log(\lambda)} \qquad (6.4)$$

$$t_{tanh+} = \frac{\log\left(1 - \frac{2K V_{th}^{tanh+}}{\bar{X}_{tanh}}\right)}{\log(\lambda)}, \qquad t_{tanh-} = \frac{\log\left(1 + \frac{2K V_{th}^{tanh-}}{\bar{X}_{tanh}}\right)}{\log(\lambda)} \qquad (6.5)$$

In both cases, the optimal value of $\lambda$ can minimize the difference between the LIF and non-spiking LSTM output activations, but it depends on the input distributions, which are difficult to model. We thus propose to optimize the leak term, along with the threshold terms and weights, via SGL after conversion, during the proposed SNN fine-tuning phase.
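The small numerical sketch below illustrates the time-averaged IF approximation of the hard sigmoid and the proposed bias shift $B = V_{th}^{sig}/2T$. It follows the clip/floor definitions above for the sigmoid branch only and is meant purely to make the shift visible, not to reproduce the full conversion pipeline; the hard-sigmoid form is the clipped version described in Section 6.2.1.

```python
import numpy as np


def clip(x, lo, hi):
    return np.minimum(np.maximum(x, lo), hi)


def hard_sigmoid(x, v_th):
    """Clipped sigmoid of Section 6.2.1: 0 below -v_th/2, 1 above +v_th/2."""
    return clip(x / v_th + 0.5, 0.0, 1.0)


def if_sigmoid_avg(x_avg, w, v_th, T, B=None):
    """Average IF output over T steps approximating the hard sigmoid."""
    if B is None:
        B = v_th / (2 * T)          # proposed left shift of the IF curve for IID inputs
    return clip(np.floor((T / v_th) * (w * x_avg + B + v_th / 2)), 0, T) / T


# Example: compare the hard sigmoid with its IF approximations with and without the shift.
x = np.linspace(-3.0, 3.0, 13)
print(hard_sigmoid(x, v_th=4.0))
print(if_sigmoid_avg(x, w=1.0, v_th=4.0, T=4, B=0.0))
print(if_sigmoid_avg(x, w=1.0, v_th=4.0, T=4))
```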
6.2.4 Selective conversion of LSTM activation functions

Instead of converting all the sigmoid and tanh activation functions in the non-spiking LSTM architecture to spiking counterparts, we judiciously select a subset of them such that only one of the two inputs of each LSTM multiplication operation is spiking. This ensures that all multiplications are replaced with the conditional addition or subtraction of the weights. It also avoids the unnecessary accumulated error due to the spiking gradients and improves the inference accuracy at low time steps.

To motivate this conversion, let us review the equations governing the LSTM architecture in Eqs. 6.6-6.8, where $h_t$ and $c_t$ denote the hidden and cell state tensors respectively. We denote $f_t$, $i_t$, $g_t$, and $o_t$ as the outputs of the forget, input, cell, and output gates respectively. We assume the weight tensors $W_{a,h}$, $W_{a,x}$, $W_{g,h}$, $W_{g,x}$ to be multi-bit, as per typical SNN setups [30].

$$a_t = \mathrm{sig}_a(W_{a,h} h_{t-1} + W_{a,x} x_t) \quad \forall a \in \{f, i, o\} \qquad (6.6)$$

$$g_t = \mathrm{tanh}_g(W_{g,h} h_{t-1} + W_{g,x} x_t) \qquad (6.7)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t, \qquad h_t = o_t \odot \mathrm{tanh}_c(c_t) \qquad (6.8)$$

We propose to encode $x_t$ using a spike tensor (of length equal to the number of hidden neurons in a layer), as otherwise the MAC operation with the weight tensors would require costly multiplications. Similarly, $h_t$ also needs to be a spike tensor, which implies that $o_t$ should be a spike tensor and $\mathrm{tanh}_c$ should be converted to a LIF activation. A spiking $o_t$ necessitates conversion of $\mathrm{sig}_o$ to LIF (see Eq. 6.8). On the other hand, Eq. 6.8 also implies that either $f_t$ or $c_{t-1}$, and either $i_t$ or $g_t$, need to be spike tensors for multiplier-less operation. Between $f_t$ and $c_{t-1}$, we have to choose $f_t$ as the spike tensor because $c_{t-1}$ is the sum of two tensors and is not naturally a spike tensor. Moreover, $\mathrm{sig}_f$ can easily be converted to a LIF activation using our framework. Between $i_t$ and $g_t$, we can arbitrarily choose one to spike.

Table 6.1: Test accuracy on temporal MNIST, GSC, and UCI datasets obtained by the proposed approaches with direct encoding for 2 time steps. S and NS denote the spiking and non-spiking LSTM variants respectively, while P and NP denote the accuracies with and without a pre-trained non-spiking LSTM model respectively.

LSTM Model | V_th Shift | V_th Train | λ Train | NS sig_i | NS tanh_g | T-MNIST P | T-MNIST NP | GSC P | GSC NP | UCI P | UCI NP
NS | × | × | – | – | – | – | 98.6±0.2 | – | 95.42±0.1 | – | 90.37±0.2
S | × | × | × | × | × | 97.84±0.2 | 97.74±0.3 | 90.59±0.2 | 63.45±0.2 | 88.17±0.2 | 87.63±0.1
S | ✓ | × | × | × | × | 97.98±0.2 | 97.87±0.1 | 92.05±0.1 | 91.45±0.2 | 88.60±0.2 | 88.13±0.3
S | ✓ | ✓ | × | × | × | 97.92±0.1 | 97.84±0.2 | 92.87±0.2 | 91.33±0.3 | 88.64±0.3 | 86.87±0.2
S | ✓ | ✓ | ✓ | × | × | 98.0±0.2 | 97.95±0.2 | 93.57±0.1 | 92.14±0.1 | 89.13±0.4 | 87.50±0.3
S | ✓ | ✓ | ✓ | ✓ | × | 98.1±0.3 | 97.98±0.1 | 94.75±0.1 | 92.63±0.2 | 89.23±0.2 | 89.20±0.2
S | ✓ | ✓ | ✓ | × | ✓ | 98.15±0.1 | 98.12±0.2 | 94.53±0.2 | 92.61±0.3 | 89.40±0.3 | 89.12±0.1

6.3 Pipelined Parallel SNN Processing

The operation of each LSTM unit (both spiking and non-spiking) is repeated, i.e., unrolled, as many times as the sequence length of each input, which we denote as N. In each unrolling, the LSTM operation works with a different element of the sequence. Each such element is direct encoded and repeatedly inputted to the spiking LSTM unit T times. To hide the latency incurred by the LIF temporal dynamics, we propose a pipelined and parallel processing scheme, exemplified in Fig. 6.2(c) for N=5 and T=3.

The LSTM states $h_t$ and $c_t$ for element n are updated every time step t, modulated by the weights, and processed by the LIF activation function. This state update allows us to update the states of element n for time step t+1 while also starting to process time step t of input element n+1 in the sequence, provided we have sufficient LSTM hardware to pipeline/parallelize these operations. Because the LSTM algorithm is limited to processing a maximum of T input elements at the same time, and because we achieve relatively small T, this level of hardware pipelining is quite manageable. By doing so, a new input element in the sequence begins to be processed in each time step, and the first spike input of the N-th input element is processed at the N-th time step.
Toprocesstheremaining T− 1 spike inputs of its encoding, we need an additional T− 1 time steps. Hence, 110 the total number of time steps required to process the whole input sequence with our spiking LSTM is (T+N− 1). For hardware with built-in parallel processing capability such as GPUs, our approach improves the hardware utilization compared to non-spiking LSTMs that are sequential in nature. Note that previous research on LSTM-based SNNs [6] accumulates the spike outputs of the different gates over all the time steps for pro- cessing a single input element. As a result, it uses T× N time steps to process the entireinputsequence. Moreover,thehiddenstateinputtothenextunrolledLSTM blockbecomesmulti-bitwhichnecessitatesusingenergy-expensivemultiplications. 6.4 Experimental Results We validate our proposed techniques on temporal MNIST [186], Google speech commands (GSC) with 11 classes [187], and UCI smartphone datasets [188]. For temporal MNIST (T-MNIST), we use row-wise sequential inputs, resulting in 32 image pixels each over a sequence of 32 frames [6,189]. For GSC, we pre-process therawaudioinputsusinglog-melspectrogramsresultingin20frequencyfeatures over a sequence of 81 frames [190]. For UCI smartphone, we pre-process the sen- sor signals obtained from the smartphone worn on the waist of the participating humans by applying butterworth low-pass filters within a fixed-width sliding win- dowsof2.56secondsand50%overlap(128readingsperwindow). Forallthethree datasets, we use both one and two-layer LSTMs with 128 hidden neurons in each layer. While we use a single fully-connected (FC) classifier layer for the T-MNIST and UCI datasets, we use two FC layers of 32 and 11 neurons each, with softmax output for the GSC dataset, following [190]. We do not convert the FC layers to spiking counterparts as they consume <0.03% of total energy. 111 6.4.1 Inference Accuracy OurresultsforsinglelayerLSTMsareillustratedinTable6.2for2timesteps,both with and without a pre-trained non-spiking LSTM model. Each of our proposed techniques improve the test accuracy for large-scale tasks such as GSC, with an overall improvement of 4.1% (90.59% to 94.75). For T-MNIST, which is an easier task with less room for improvement, our techniques lead to a 0.31% improvement inaccuracy,whilethethresholdandleakoptimizationsyieldinghardlyanybenefit. For the relatively more challenging UCI dataset, the leak optimization leads to a maximum accuracy increase of +0.49% with a pre-trained model, while the total increase due to all our techniques is 1.23%. While the UCI accuracy can be further increased with the use of bi-directional and stacked LSTMs [191] (+1.5%), it increases energy consumption. We also evaluate the scalability of our approach on the bi-directional and two-layer LSTM architectures for the large-scale tasks, as shown in Table III. The results indicate that our approach can yield stacked and bi-directional spiking LSTMs with a 0.4− 0.5% drop in accuracy compared to non-spiking counterparts. Note that the accuracies obtained without pre-trained models are lower, particularly for the more complex applications. We also compare the impact of direct and Poisson encoding on the SNN accu- racy in Fig. 6.3(a-b). Note that the test accuracies obtained after only ANN-to- SNN conversion shown in Fig. 6.3(b) are significantly lower than those obtained after SNN training, especially for more complex tasks. 
In particular, our ap- proach, with SNN training yields close to state-of-the-art (SOTA) accuracy with only 2 time steps, providing more than 7× reduction in the latency compared to our conversion-only approach for the GSC dataset. Our conversion framework provides the optimal value of the SNN threshold (before SNN fine-tuning) backed by our theoretical insights. Thus, our conversion framework acts as a good initializer for the weights and the membrane potential, without which the test accuracy of more complex tasks, such as GSC, drops by >3.4%. Note that our conversion framework alone also outperforms the existing 112 Table 6.2: Test accuracy on GSC and UCI datasets obtained by proposed approaches withdirectencodingfor4timestepsonbi-directionalandstackedLSTMs. ‘St.’ denotes a two-layer stacked LSTM, with both layers having 128 nodes each. ‘Bi-St.’ denotes a two-layer LSTM, with the first layer being bi-directional having 128 nodes. LSTM V th V th λ NS NS GSC Acc. (%) UCI Acc. (%) Model Shift Train Train sig i tanhg St. Bi-St. St. Bi-St. NS × × – – – 95.03± 0.3 94.72± 0.3 91.42± 0.1 91.90± 0.1 S × × × × × 63.24± 0.2 62.87± 0.3 86.24± 0.2 87.51± 0.1 ✓ × × × × 92.77± 0.2 89.75± 0.2 87.80± 0.2 88.46± 0.2 ✓ ✓ × × × 93.12± 0.2 91.51± 0.1 87.72± 0.3 88.80± 0.2 ✓ ✓ ✓ × × 93.23± 0.3 91.88± 0.2 89.26± 0.1 89.65± 0.3 ✓ ✓ ✓ ✓ × 94.61± 0.1 94.22± 0.3 91.00± 0.2 91.68± 0.2 ✓ ✓ ✓ × ✓ 94.39± 0.2 94.16± 0.1 89.53± 0.2 91.07± 0.3 Figure 6.3: Comparison between the accuracies obtained by our direct and Poisson encoded spiking LSTMs (a) with both conversion and SNN fine-tuning and (b) with only conversion. ANN-to-SNN conversion frameworks for LSTMs. For example, our conversion- only spiking LSTM models yield 94.3% test accuracy with 15 time steps; the conversion-only baseline spiking models (without our proposed threshold shifts) require 32 time steps to obtain similar accuracy. On the other hand, the baseline models can attain only 86% accuracy with 15 time steps. We compare the test accuracies obtained by our training framework with that of existing works in Table 6.3. Our accuracies are close to SOTA (within 0.6%) obtained by the non-spiking models, while yielding significant energy and latency 113 Table6.3: Accuracycomparisonofthebestperformingmodelsobtainedbyourtraining framework with SOTA spiking and non-spiking LSTM models on different datasets. Ref. Model Training technique Architecture Accuracy (%) Dataset : temporal MNIST [192] Spiking SGD LSTM(128) 97.29 [193] Spiking SGD LSTM(220) 96.4 [180] Spiking SGD LSTM(1000) 98.23 [6] Spiking ANN-SNN conv. LSTM(128) 98.72 (T=64) [189] Spiking SGD RNN(64-256-256) 98.7 [194] Non-spiking SGD u-RNN(128) 98.2 [190] Non-spiking RC-BPTT LSTM (320) 98.14 This work Spiking Hybrid training LSTM(128) 98.93 (T=8) Dataset : GSC [195] Spiking BPTT CNN(64-64-64) 94.5 [190] Partly spiking RC-BPTT LSTM(128) 95.6 [196] Spiking BPTT LSTM(128) 91.2 This work Spiking Hybrid training LSTM(128) 95.02 (T=4) Dataset : UCI [195] Non-Spiking SGD Bi-dirLSTM(-) 91.1 This work Spiking Hybrid training LSTM(128) 90.78 (T=4) savings as shown in Fig. 6.4. We surpass the SOTA spiking models in terms of accuracy for both the T-MNIST and GSC datasets. 1 6.4.2 Inference Energy Efficiency Theinferenceenergyisdominatedbythetotalnumberoffloatingpointoperations (FLOPs)andmemoryaccesses. Fornon-spikingLSTMs,theFLOPsconsistsofthe MAC, AC, hard sigmoid and hard tanh operations required in the four gates, and the memory cost includes the weight accesses for each unrolled LSTM operation. 
On the contrary, for spiking LSTMs, each emitted spike indicates which weights needtobeaccumulatedatthepost-synapticneuronsandresultsinafixednumber ofACoperations. This,coupledwiththecomparisonoperationsforthemembrane potential in each time step dominates the spiking compute energy. However, spik- ing LSTMs incur higher memory energy compared to non-spiking counterparts, due to the repetitive membrane potential and weight accesses at each time step. 1 Note that we were unable to find any deep SNN architectures, classifying the UCI dataset for comparison. 114 Figure 6.4: (a) Energy and (b) Delay comparisons between the non-spiking LSTM, proposeddirectandPoissonencodedspikingLSTM,andtheSOTAspikingLSTMmodel [6], that does not include any of our proposed approaches. Ourframeworkcanreducethismemorycostcomparedtoexistingsolutions,thanks to the use of only a few number of time steps. We use custom RTL specifications and 28 nm Kintex-7 FPGA platform to estimate the post place-and-route energy consumption of the hardware implemen- tations of the spiking and non-spiking networks. In particular, we develop Verilog RTL block-level models to design, simulate, and synthesize an inference pipeline thatcapturestheLSTMprocessingincludingthewritingandreadingoftheweights (andmembranepotentialsforthespiking)fortheLSTMsonourtargetFPGAde- vice. Inaddition,forcomparisonpurposes,wedevelopasimilarsynthesizableRTL design for the non-spiking LSTMs. Fig. 6.4(a) and (b) illustrate the energy consumption for our spiking and non- spiking LSTM architectures used for classifying the three datasets, along with the SOTA spiking LSTM implementation [6]. As we can see, we obtain 2.8-5.1× and 10.1-13.2× lower energy than the non-spiking and SOTA spiking implementations respectively for direct coding. The reductions obtained by Poisson encoding are a little lower (1.8-3.5× compared to non-spiking and 6.6-9.0× compared to SOTA spiking) due to the degraded trade-off between more time steps and less energy due to ACs. 115 On custom neuromorphic architectures, such as TrueNorth [91], and SpiN- Naker[92],thetotalenergyisestimatedasFLOPs∗ E compute +T∗ E static [62],where (E compute ,E static )canbenormalizedto(0.4,0.6)and(0.64,0.36)forTrueNorthand SpiNNaker,respectively[62]. SincetheFLOPsfortheLSTMarchitecturesaregen- erally several orders of magnitude higher than T [6,190], we expect to see similar compute energy improvements on them. 6.4.3 Inference Latency The latency incurred by the non-spiking LSTM architecture depends on the la- tencyofourRTL-implementedMAC,AC,multiplication, hardsigmoid, hardtanh modulesandthememoryaccesses. Dependingonthenumberofthecomputeunits available in each LSTM block, we parallelize the MAC operations, followed by the AC and activation functions in the four LSTM gates. We incur additional AC and hard tanh delay to produce c t and h t respectively. In contrast, with our proposed implementation, both the direct and Poisson encoded spiking LSTM architectures can process T different input elements simultaneously by internally pipelining the RTL models. Note that we use the similar RTL models and FPGA evaluation setup illustrated above to evaluate the latency of the LSTM implementations. As shown in Fig. 6.4(b), our processing scheme, coupled with the low time steps and accumulate-only operations in SNN, results in ∼ 4× and 25.9-105.7× reductioninlatencycomparedtothenon-spikingLSTMandSOTAspikingLSTM implementation respectively. 
This significant improvement over the SOTA spiking implementation can be attributed to three factors. Firstly, the SOTA spiking LSTM architecture require more time steps (3-8× ) to encode the original multi- bit input tensor than ours to achieve similar test accuracy. Secondly, while our proposed spiking architecture requires a total of (T+N− 1) time steps to process the whole input sequence, the existing spiking counterpart requires T ′ × N time steps where T and T ′ denote the total number of SNN time steps for the proposed and existing networks respectively. Lastly, since the hidden and cell state tensors 116 are multi-bit tensors, the LSTM block requires MACs for certain computations, which also increases the latency by 5.1× obtained from our FPGA simulations compared to our AC-only approach. 6.5 Conclusions & Broader Impact ML models for large-scale streaming tasks are typically compute-intensive and of- ten demand significant processing power that have large carbon footprints. This work proposes a spiking LSTM training framework which reduces the inference latency by up to 4× and energy efficiency by up to 3 .2× when implemented on FPGA hardware compared to existing works with minimal (< 0.6%) accuracy drop for diverse large-scale streaming use cases. Finally, our LIF derivations that approximate sigmoid and tanh activation functions may help bridge the gap be- tween traditional deep learning and SNNs for other use cases of these forms of non-linearity. 117 Chapter 7 In-Pixel Computing for several CV applications This chapter first provides the introduction and motivation behind processing-in- pixel for on-device compute vision applications in 7.1. Section 7.2 discusses our approach for P 2 M-constrained algorithm-circuit co-design. Section 7.3 presents our TinyML benchmarking dataset, model architectures, test accuracy and EDP results. Finally, some conclusions are provided in Section 7.4. 7.1 Introduction & Motivation Today’s widespread applications of computer vision spanning surveillance [197], disastermanagement[198],cameratrapsforwildlifemonitoring[199],autonomous driving, smartphones, etc., are fueled by the remarkable technological advances in image sensing platforms [200] and the ever-improving field of deep learning algorithms [201]. However, hardware implementations of vision sensing and vision processing platforms have traditionally been physically segregated. For example, current vision sensor platforms based on CMOS technology act as transduction entities that convert incident light intensities into digitized pixel values, through a two-dimensional array of photodiodes [202]. The vision data generated from such CMOS Image Sensors (CIS) are often processed elsewhere in a cloud environment consisting of CPUs and GPUs [203]. The physical segregation of vision sensing and computing platforms leads to multiple bottlenecks concerning throughput, bandwidth, and energy-efficiency. To address these bottlenecks, many researchers are trying to bring intelligent data processing closer to the source of the vision data, i.e., closer to the CIS, 118 taking one of three broad approaches - near-sensor processing [204,205], in-sensor processing [50], and in-pixel processing [51,52,206]. Near-sensor processing aims to incorporate a dedicated machine learning accelerator chip on the same printed circuit board [204], or even 3D-stacked with the CIS chip [205]. 
Although this enables processing of the CIS data closer to the sensor rather than in the cloud, it still suffers from the data transfer costs between the CIS and processing chip. On the other hand, in-sensor processing solutions [50] integrate digital or analog circuits within the periphery of the CIS sensor chip, reducing the data transfer between the CIS sensor and processing chips. Nevertheless, these approaches still often require data to be streamed (or read in parallel) through a bus from CIS photo-diodearraysintotheperipheralprocessingcircuits[50]. Incontrast,in-pixel processing solutions, such as [51,52,206,207], aim to embed processing capabilities within the individual CIS pixels. Initial efforts have focused on in-pixel analog convolution operation [207] but many [51,207,208] require the use of emerging non-volatile memories or 2D materials. Unfortunately, these technologies are not yet mature and thus not amenable to the existing foundry-manufacturing of CIS. Moreover, these works fail to support multi-bit, multi-channel convolution oper- ations, batch normalization (BN), and Rectified Linear Units (ReLU) needed for most practical deep learning applications. Furthermore, works that target digi- tal CMOS-based in-pixel hardware, organized as pixel-parallel single instruction multiple data (SIMD) processor arrays [52], do not support convolution operation, and are thus limited to toy workloads, such as digit recognition. Many of these works rely on digital processing which typically yields lower levels of parallelism comparedtotheiranalogin-pixelalternatives. Incontrast,theworkin[206],lever- ages in-pixel parallel analog computing, wherein the weights of a neural network are represented as the exposure time of individual pixels. Their approach requires weightstobemadeavailableformanipulatingpixel-exposuretimethroughcontrol pulses, leading to a data transfer bottleneck between the weight memories and the sensor array. Thus, an in-situ CIS processing solution where both the weights and input activations are available within individual pixels that efficiently implements 119 critical deep learning operations such as multi-bit, multi-channel convolution, BN, andReLUoperationshasremainedelusive. Furthermore,allexistingin-pixelcom- puting solutions have targeted datasets that do not represent realistic applications of machine intelligence mapped onto state-of-the-art CIS. Specifically, most of the existing works are focused on simplistic datasets like MNIST [52], while few [206] use the CIFAR-10 dataset which has input images with a significantly low reso- lution (32× 32), that does not represent images captured by state-of-the-art high resolution CIS. Towards that end, we propose a novel in-situ computing paradigm at the sen- sornodescalledProcessing-in-Pixel-in-Memory (P 2 M),illustratedinFig. 7.1,that incorporates both the network weights and activations to enable massively paral- lel, high-throughput intelligent computing inside CISs. In particular, our circuit architecture not only enables in-situ multi-bit, multi-channel, dot product analog acceleration needed for convolution, but re-purposes the on-chip digital correlated double sampling (CDS) circuit and single slope ADC (SS-ADC) typically available inconventionalCIStoimplementalltherequiredcomputationalaspectsforthefirst few layers of a state-of-the-art deep learning network. 
Furthermore, the proposed architecture is coupled with a circuit-algorithm co-design paradigm that captures thecircuitnon-linearities,limitations,andbandwidthreductiongoalsforimproved latency and energy-efficiency. The resulting paradigm is the first to demonstrate feasibility for enabling complex, intelligent image processing applications (beyond toy datasets), on high resolution images of Visual Wake Words (VWW) dataset, catering to a real-life TinyML application. We choose to evaluate the efficacy of P 2 M on TinyML applications, as they impose tight compute and memory bud- gets, that are otherwise difficult to meet with current in- and near-sensor process- ing solutions, particularly for high-resolution input images. Key highlights of the presented chapter are as follows: 120 Figure 7.1: Existing and Proposed Solutions to alleviate the energy, throughput, and bandwidth bottleneck caused by the segregation of Sensing and Compute. 1. We propose a novel processing-in-pixel-in-memory (P 2 M) paradigm for resource-constrained sensor intelligenceapplications, wherein novel memory- embedded pixels enable massively parallel dot product acceleration using in- situ input activations (photodiode currents) and in-situ weights all available within individual pixels. 2. We further develop a compact MobileNet-V2 based model optimized specif- ically for P 2 M-implemented hardware constraints, and benchmark its accu- racyandenergy-delayproduct(EDP)ontheVWWdataset,whichrepresents a common use case of visual TinyML. Embedding part of the deep learning network within pixel arrays in an in-situ manner can lead to a significant reduction in data bandwidth (and hence energy consumption) between sensor chip and downstream processing for the rest of the convolutional layers. This is because the first few layers of carefully designed CNNs, as explained in Section 7.2, can have a significant compressing property, i.e., theoutputfeaturemapshavereducedbandwidth/dimensionalitycomparedto the input image frames. In particular, our proposed P 2 M paradigm enables us to mapallthecomputationsofthefirstfewlayersofaCNNintothepixelarray. The paradigm includes a holistic hardware-algorithm co-design framework that cap- tures the specific circuit behavior, including circuit non-idealities, and hardware limitations, during the design, optimization, and training of the proposed machine 121 learning networks. The trained weights for the first few network layers are then mappedtospecifictransistorsizesinthepixel-array. Becausethetransistorwidths are fixed during manufacturing, the corresponding CNN weights lack programma- bility. Fortunately, it is common to use the pre-trained versions of the first few layers of modern CNNs as high-level feature extractors are common across many visiontasks[209]. Hence, thefixedweightsinthefirstfewCNNlayersdonotlimit the use of our proposed scheme for a wide class of vision applications. Moreover, we would like to emphasize that the memory-embedded pixel also work seamlessly well by replacing fixed transistors with emerging non-volatile memories. Finally, the presented P 2 M paradigm can be used in conjunction with existing near-sensor processing approaches for added benefits, such as, improving the energy-efficiency of the remaining convolutional layers. 
7.2 P²M-constrained Algorithm-Circuit Co-Design

In this section, we present our algorithmic optimizations to standard CNN backbones that are guided by 1) the P²M circuit constraints arising from the analog computing nature of the proposed pixel array and the limited conversion precision of the on-chip SS-ADCs, 2) the need to achieve state-of-the-art test accuracy, and 3) the goals of maximizing the desired hardware metrics of high bandwidth reduction, energy-efficiency, and low latency of P²M computing, while meeting the memory and compute budget of the VWW application. The reported improvement in hardware metrics (illustrated in Section 7.3.3) is thus a result of intricate circuit-algorithm co-optimization.

Figure 7.2: Algorithm-circuit co-design framework to enable our proposed P²M approach to optimize both the performance and energy-efficiency of vision workloads. We propose the use of (1) large strides, (2) large kernel sizes, (3) a reduced number of channels, (4) P²M custom convolution, and (5) a shifted ReLU operation to incorporate the shift term of the batch normalization layer, for emulating accurate P²M circuit behaviour.

7.2.1 Custom Convolution for the First Layer Modeling Circuit Non-Idealities

From an algorithmic perspective, the first layer of a CNN is a linear convolution operation followed by BN and a non-linear (ReLU) activation. The P²M circuit scheme implements the convolution operation in the analog domain using modified memory-embedded pixels. The constituent entities of these pixels are transistors, which are inherently non-linear devices. As such, in general, any analog convolution circuit consisting of transistor devices will exhibit non-ideal, non-linear behavior with respect to the convolution operation. Many existing works, specifically in the domain of memristive analog dot product operation, ignore the non-idealities arising from non-linear transistor devices [210,211]. In contrast, to capture these non-linearities, we performed extensive simulations of the presented P²M circuit spanning a wide range of circuit parameters, such as the width of the weight transistors and the photodiode current, based on the commercial 22nm GlobalFoundries transistor technology node. The resulting SPICE results, i.e., the pixel output voltages corresponding to a range of weights and photodiode currents, were modeled using a behavioral curve-fitting function. The generated function was then included in our algorithmic framework, replacing the convolution operation in the first layer of the network. In particular, we accumulate the output of the curve-fitting function, one for each pixel in the receptive field (we have 3 input channels and a kernel size of 5×5, and hence our receptive field size is 75), to model each inner product generated by the in-pixel convolutional layer. This algorithmic framework was then used to optimize the CNN training for the VWW dataset.
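The sketch below shows, schematically, how a behaviorally fitted pixel response can replace the multiply inside the first convolutional layer during training. The polynomial `pixel_response` used here is a hypothetical placeholder: in this work, the actual function is fitted to the SPICE sweeps described above and is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pixel_response(weight, inp):
    """Placeholder for the SPICE-fitted, non-ideal 'product' of a pixel weight and input."""
    ideal = weight * inp
    return ideal - 0.05 * ideal * inp.abs()   # hypothetical mild compressive non-linearity


class P2MConv2d(nn.Module):
    """First-layer convolution evaluated through the fitted pixel response."""

    def __init__(self, in_ch=3, out_ch=8, k=5, stride=5):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch * k * k))
        self.k, self.stride = k, stride

    def forward(self, x):
        b, _, h, w = x.shape
        patches = F.unfold(x, self.k, stride=self.stride)        # (B, C*k*k, L)
        resp = pixel_response(self.weight[None, :, :, None],     # (B, out_ch, C*k*k, L)
                              patches[:, None, :, :])
        out = resp.sum(dim=2)                                    # accumulate the receptive field
        oh = (h - self.k) // self.stride + 1
        ow = (w - self.k) // self.stride + 1
        return out.view(b, -1, oh, ow)
```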
7.2.2 Circuit-Algorithm Co-optimization of the CNN Backbone subject to P2M Constraints

The P2M circuit scheme maximizes parallelism and data bandwidth reduction by activating multiple pixels and reading multiple parallel analog convolution operations for a given channel in the output feature map. The analog convolution operation is repeated serially for each channel in the output feature map. Thus, parallel convolution in the circuit tends to improve parallelism, bandwidth reduction, energy-efficiency, and speed, but increasing the number of channels in the first layer increases the serial aspect of the convolution and degrades all of these metrics. This creates an intricate circuit-algorithm trade-off, wherein the backbone CNN has to be optimized for larger kernel sizes (which increase the concurrent activation of more pixels, helping parallelism), non-overlapping strides (to reduce the dimensionality in the downstream CNN layers, thereby reducing the number of multiply-and-adds and the peak memory usage), and a smaller number of channels (to reduce the serial operation for each channel), while maintaining close to state-of-the-art classification accuracy and taking into account the non-idealities associated with the analog convolution operation. Also, decreasing the number of channels decreases the number of weight transistors embedded within each pixel (each pixel has as many weight transistors as there are channels in the output feature map), improving area and power consumption. Furthermore, the resulting smaller output activation map (due to the reduced number of channels and the larger kernel sizes with non-overlapping strides) reduces the energy incurred in transmitting data from the CIS to the downstream CNN processing unit, as well as the number of floating point operations (and consequently, the energy consumption) in the downstream layers.

In addition, we propose to fuse the BN layer, partly into the preceding convolutional layer and partly into the succeeding ReLU layer, to enable its implementation via P2M. Let us consider a BN layer with γ and β as the trainable parameters, which remain fixed during inference. During training, the BN layer normalizes feature maps using mini-batch statistics while tracking a running mean μ and a running variance σ²; during inference, these running statistics are kept fixed [212], and hence, the BN layer implements a linear function, as shown below.

Y = \gamma \frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \cdot X + \left( \beta - \frac{\gamma \mu}{\sqrt{\sigma^2 + \epsilon}} \right) = A \cdot X + B        (7.1)

We propose to fuse the scale term A into the weights (the value of the pixel-embedded weight tensor is A·θ, where θ is the final weight tensor obtained by our training) that are embedded as the transistor widths in the pixel array. Additionally, we propose to use a shifted ReLU activation function following the convolutional layer, as shown in Fig. 7.2, to incorporate the shift term B. We use a counter-based ADC implementation to realize the shifted ReLU activation. This can be easily achieved by resetting the counter to a non-zero value corresponding to the term B at the start of the convolution operation, as opposed to resetting the counter to zero.

Moreover, to minimize the energy cost of the analog-to-digital conversion in our P2M approach, we must also quantize the layer output to as few bits as possible, subject to achieving the desired accuracy. We train a floating-point model with close to state-of-the-art accuracy, and then perform quantization in the first convolutional layer to obtain low-precision weights and activations during inference [213]. We also quantize the mean, variance, and trainable parameters of the BN layer, as all of these affect the shift term B (see Eq. 7.1), which should be quantized for the low-precision shifted ADC implementation. We avoid quantization-aware training [113] because it significantly increases the training cost with negligible reduction in bit-precision for our model.
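Since the fusion in Eq. 7.1 is purely linear, it can be applied offline to a trained layer pair. The sketch below, assuming a standard nn.Conv2d/nn.BatchNorm2d pair without a convolution bias, folds the scale A into the weights and exposes the shift B for a shifted ReLU; the names and shapes are illustrative rather than the exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return fused weights and the shift term of Eq. 7.1.

    A = gamma / sqrt(var + eps) is absorbed into the (fixed) transistor weights,
    while B = beta - gamma*mu/sqrt(var + eps) is realized as the non-zero reset
    value of the counter-based ADC (shifted ReLU).
    """
    a = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # per-channel scale A
    b = bn.bias - bn.running_mean * a                     # per-channel shift B
    w_fused = conv.weight * a.view(-1, 1, 1, 1)           # A * theta
    return w_fused, b

def shifted_relu(x, b):
    # ReLU applied after adding the BN shift, emulating the offset ADC counter.
    return torch.relu(x + b.view(1, -1, 1, 1))

# Usage on a trained first layer (hypothetical shapes matching this chapter):
conv = nn.Conv2d(3, 8, kernel_size=5, stride=5, bias=False)
bn = nn.BatchNorm2d(8).eval()
w_fused, b = fold_bn_into_conv(conv, bn)
y = shifted_relu(F.conv2d(torch.rand(1, 3, 560, 560), w_fused, stride=5), b)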
With the bandwidth reduction obtained by all these approaches, the output feature map of the P2M-implemented layers can more easily be implemented in micro-controllers with an extremely low memory footprint, while P2M itself greatly improves the energy-efficiency of the first layer. Our approach can thus enable TinyML applications that usually have a tight compute and memory budget, as illustrated in Section 7.3.1.

7.2.3 Quantification of Bandwidth Reduction

To quantify the bandwidth reduction (BR) after the first layer obtained by P2M (the BN and ReLU layers do not yield any BR), let the number of elements in the RGB input image be I and in the output activation map after the ReLU activation layer be O. Then, BR can be estimated as

BR = \frac{O}{I} \cdot \frac{4}{3} \cdot \frac{12}{N_b}        (7.2)

Here, the factor 4/3 represents the compression from the Bayer pattern of RGGB pixels to RGB pixels, because we can either ignore the additional green pixel or design the circuit to effectively take the average of the photodiode currents coming from the two green pixels. The factor 12/N_b represents the ratio of the bit-precision between the image pixels captured by the sensor (pixels typically have a bit-depth of 12 [214]) and the quantized output of our convolutional layer, denoted as N_b. Let us now substitute

O = \left( \frac{i - k + 2p}{s} + 1 \right)^2 \cdot c_o, \quad I = i^2 \cdot 3        (7.3)

into Eq. 7.2, where i denotes the spatial dimension of the input image, and k, p, and s denote the kernel size, padding, and stride of the in-pixel convolutional layer, respectively. These hyperparameters, along with N_b, are obtained via a thorough algorithmic design space exploration with the goal of achieving the best accuracy, subject to meeting the hardware constraints and the memory and compute budget of our TinyML benchmark. We show their values in Table 7.1, and substitute them in Eq. 7.2 to obtain a BR of 21×.

Table 7.1: Model hyperparameters and their values to enable bandwidth reduction in the in-pixel layer.
Hyperparameter | Value
kernel size of the convolutional layer (k) | 5
padding of the convolutional layer (p) | 0
stride of the convolutional layer (s) | 5
number of output channels of the convolutional layer (c_o) | 8
bit-precision of the P2M-enabled convolutional layer output (N_b) | 8

7.3 Experimental Results

7.3.1 Benchmarking Dataset & Model

This chapter focuses on the potential of P2M for TinyML applications, i.e., with models that can be deployed on low-power IoT devices with only a few kilobytes of on-chip memory [215-217]. In particular, the Visual Wake Words (VWW) dataset [218] presents a relevant use case for visual TinyML. It consists of high-resolution images that include visual cues to "wake up" AI-powered home assistant devices, such as Amazon's Astro [219], that require real-time inference in resource-constrained settings. The goal of the VWW challenge is to detect the presence of a human in the frame with very few resources - close to 250 KB peak RAM usage and model size [218]. To meet these constraints, current solutions involve downsampling the input image to a medium resolution (224×224), which costs some accuracy [213]. We choose MobileNetV2 [220] as our baseline CNN architecture, with 32 and 320 channels for the first and last convolutional layers respectively, that supports full-resolution (560×560) images. In order to avoid overfitting to only two classes in the VWW dataset, we decrease the number of channels in the last depthwise separable convolutional block by 3×. MobileNetV2, similar to other models in the MobileNet class, is very compact [220], with a size less than the maximum allowed in the VWW challenge.
It performs well on complex datasets like ImageNet [221] and, as shown in Section 7.3, does very well on VWW. To evaluate P2M on MobileNetV2, we create a custom model that replaces the first convolutional layer with our P2M custom layer, which captures the systematic non-idealities of the analog circuits, the reduced number of output channels, and the limitation of non-overlapping strides, as discussed in Section 7.2.

We train both the baseline and P2M custom models in PyTorch using the SGD optimizer with momentum equal to 0.9 for 100 epochs. The baseline model has an initial learning rate (LR) of 0.03, while the custom counterpart has an initial LR of 0.003. Both learning rates decay by a factor of 0.2 at every 35 and 45 epochs. After training a floating-point model with the best validation accuracy, we perform quantization to obtain 8-bit integer weights, activations, and parameters (including the mean and variance) of the BN layer. All experiments are performed on an Nvidia 2080Ti GPU with 11 GB memory.

7.3.2 Classification Accuracy

Comparison between baseline and P2M custom models: We evaluate the performance of the baseline and P2M custom MobileNet-V2 models on the VWW dataset in Table 7.2. Our baseline model currently yields the best test accuracy on the VWW dataset among the models available in the literature that do not leverage any additional pre-training or augmentation. Note that our baseline model requires a significant amount of peak memory and MAdds (~30× more than that allowed in the VWW challenge); however, it serves as a good benchmark for comparing accuracy. We observe that the P2M-enabled custom model can reduce the number of MAdds by ~7.15× and the peak memory usage by ~25.1×, with a 1.47% drop in the test accuracy compared to the uncompressed baseline model for an image resolution of 560×560. With this memory reduction, our P2M model can run on tiny micro-controllers with only 270 KB of on-chip SRAM. Note that peak memory usage is calculated using the same convention as [218]. Notice also that both the baseline and custom model accuracies drop (albeit the drop is significantly higher for the custom model) as we reduce the image resolution, which highlights the need for high-resolution images and the efficacy of P2M in both alleviating the bandwidth bottleneck between sensing and processing and reducing the number of MAdds for the downstream CNN processing.

Table 7.2: Test accuracies, number of MAdds, and peak memory usage of the baseline and P2M custom compressed models while classifying the VWW dataset for different input image resolutions.
Image Resolution | Model | Test Accuracy (%) | Number of MAdds (G) | Peak memory usage (MB)
560×560 | baseline | 91.37 | 1.93 | 7.53
560×560 | P2M custom | 89.90 | 0.27 | 0.30
225×225 | baseline | 90.56 | 0.31 | 1.2
225×225 | P2M custom | 84.30 | 0.05 | 0.049
115×115 | baseline | 91.10 | 0.09 | 0.311
115×115 | P2M custom | 80.00 | 0.01 | 0.013

Comparison with SOTA models: Table 7.3 provides a comparison of the performance of the models generated through our algorithm-circuit co-simulation framework with SOTA TinyML models for VWW. Our P2M custom models yield test accuracies within 0.37% of the best performing model in the literature [222].
Table 7.3: Performance comparison of the proposed P2M-compatible models with state-of-the-art deep CNNs on the VWW dataset.
Authors | Description | Model architecture | Test Accuracy (%)
Saha et al. (2020) [213] | RNNPooling | MobileNetV2 | 89.65
Han et al. (2019) [222] | ProxylessNAS | Non-standard architecture | 90.27
Banbury et al. (2021) [217] | Differentiable NAS | MobileNet-V2 | 88.75
Zhou et al. (2021) [223] | Analog compute-in-memory | MobileNet-V2 | 85.7
This work | P2M | MobileNet-V2 | 89.90

Note that we have trained our models solely on the training data provided, whereas ProxylessNAS [222], which won the 2019 VWW challenge, leveraged additional pre-training with ImageNet. Hence, for consistency, we report the test accuracy of ProxylessNAS with identical training configurations on the final network provided by the authors, similar to [213]. Note that [223] leveraged massively parallel, energy-efficient analog in-memory computing to implement MobileNet-V2 for VWW, but incurs accuracy drops of 5.67% and 4.43% compared to our baseline and the previous state-of-the-art [222] models, respectively. This likely underscores the need for intricate algorithm-hardware co-design and accurate modeling of the hardware non-idealities in the algorithmic framework, as shown in our work.

Figure 7.3: (a) Effect of quantization of the in-pixel output activations, and (b) effect of the number of channels in the 1st convolutional layer for different kernel sizes and strides, on the test accuracy of our P2M custom model.

Effect of quantization of the in-pixel layer: As discussed in Section 7.2, we quantize the output of the first convolutional layer of our proposed model after training to reduce the power consumption of the sensor ADCs and compress the output, as outlined in Eq. 7.2. We sweep across output bit-precisions of {4, 6, 8, 16, 32} to explore the trade-off between accuracy and compression/efficiency, as shown in Fig. 7.3(a). We choose a bit-width of 8, as it is the lowest precision that does not yield any accuracy drop compared to the full-precision models. As shown in Fig. 7.3, the weights in the in-pixel layer can also be quantized to 8 bits with an 8-bit output activation map, with less than a 0.1% drop in accuracy.

Ablation study: We also study the accuracy drop incurred due to each of the three modifications (non-overlapping strides, reduced channels, and custom function) in the P2M-enabled custom model. Incorporating the non-overlapping strides (a stride of 5 for 5×5 kernels, compared to a stride of 2 for 3×3 kernels in the baseline model) leads to an accuracy drop of 0.58%. Reducing the number of output channels of the in-pixel convolution by 4× (8 channels, down from 32 channels in the baseline model), on top of the non-overlapping striding, reduces the test accuracy by 0.33%. Additionally, replacing the element-wise multiplication with the custom P2M function in the convolution operation reduces the test accuracy by a total of 0.56% compared to the baseline model. Note that we can further compress the in-pixel output by either increasing the stride value (changing the kernel size proportionately for non-overlapping strides) or decreasing the number of channels, but both of these approaches reduce the VWW test accuracy significantly, as shown in Fig. 7.3(b).

Table 7.4: Energy estimates for different hardware components. The energy values are measured for designs in 22nm CMOS technology. For e_mac, we convert the corresponding value in 45nm to that of 22nm by following the standard scaling strategy [2].
Model type | Sensing e_pix (pJ) | ADC e_adc (pJ) | SoC comm. e_com (pJ) | MAdds e_mac (pJ) | Sensor output pixels (N_pix)
P2M (ours) | 148 | 41.9 | 900 | 1.568 | 112×112×8
Baseline (C) | 312 | 86.14 | 900 | 1.568 | 560×560×3
Baseline (NC) | 312 | 86.14 | 900 | 1.568 | 560×560×3
Table 7.5: The description and values of the notations used for the computation of delay. Note that we calculated the delay in 22nm technology for 32-bit read and MAdd operations by applying standard technology scaling rules to initial values in 65nm technology [3]. We directly evaluated T_read and T_adc through circuit simulations in the 22nm technology node.
Notation | Description | Value
B_IO | I/O band-width | 64
B_W | Weight representation bit-width | 32
N_bank | Number of memory banks | 4
N_mult | Number of multiplication units | 175
T_sens | Sensor read delay | 35.84 ms (P2M), 39.2 ms (baseline)
T_adc | ADC operation delay | 0.229 ms (P2M), 4.58 ms (baseline)
t_mult | Time required to perform 1 mult. in SoC | 5.48 ns
t_read | Time required to perform 1 read from SRAM in SoC | 5.48 ns

7.3.3 EDP Estimation

We develop a circuit-algorithm co-simulation framework to characterize the energy and delay of our baseline and P2M-implemented VWW models. The total energy consumption for both of these models can be partitioned into three major components: sensor energy (E_sens), sensor-to-SoC communication energy (E_com), and SoC energy (E_soc). The sensor energy can be further decomposed into the pixel read-out (E_pix) and analog-to-digital conversion (ADC) cost (E_adc). E_soc, on the other hand, is primarily composed of the MAdd operation (E_mac) and parameter read (E_read) costs. Hence, the total energy can be approximated as

E_{tot} \approx \underbrace{(e_{pix} + e_{adc}) \cdot N_{pix}}_{E_{sens}} + \underbrace{e_{com} \cdot N_{pix}}_{E_{com}} + \underbrace{e_{mac} \cdot N_{mac}}_{E_{mac}} + \underbrace{e_{read} \cdot N_{read}}_{E_{read}}        (7.4)

Here, e_pix, e_adc, and e_com represent the per-pixel read-out, ADC, and communication energy, respectively, e_mac is the energy incurred in one MAC operation, e_read represents a parameter's read energy, and N_pix denotes the number of pixels communicated from the sensor to the SoC. For a convolutional layer that takes an input I ∈ R^{h_i×w_i×c_i} and a weight tensor θ ∈ R^{k×k×c_i×c_o} to produce an output O ∈ R^{h_o×w_o×c_o}, N_mac and N_read can be computed as

N_{mac} = h_o \cdot w_o \cdot k^2 \cdot c_i \cdot c_o        (7.5)
N_{read} = k^2 \cdot c_i \cdot c_o        (7.6)

The energy values we have used to evaluate E_tot are presented in Table 7.4. While e_pix is obtained from our circuit simulations, e_adc and e_com are obtained from [224] and [225], respectively. We ignore the value of E_read as it corresponds to only a small fraction (<10^-4) of the total energy, similar to [42,44,69]. Fig. 7.4(a) shows the comparison of energy costs for the standard and P2M-implemented models. In particular, P2M can yield an energy reduction of up to 7.81×. Moreover, the energy savings are larger when the feature map needs to be transferred from an edge device to the cloud for further processing, due to the high communication costs. Note that we assumed two baseline scenarios, one with compression and one without compression. The first baseline is a MobileNetV2 that aggressively down-samples the input similar to P2M (h_i/w_i: 560 → h_o/w_o: 112). For the second baseline model, we assumed standard first-layer convolution kernels causing standard feature down-sampling (h_i/w_i: 560 → h_o/w_o: 279).

To evaluate the delay of the models, we assume sequential execution of the layer operations [3,43,226] and compute a single convolutional layer delay as [3]

t_{conv} \approx \left\lceil \frac{k^2 c_i c_o}{(B_{IO}/B_W) N_{bank}} \right\rceil \cdot t_{read} + \left\lceil \frac{k^2 c_i c_o}{N_{mult}} \right\rceil \cdot h_o \cdot w_o \cdot t_{mult}        (7.7)

where the notations of the parameters and their values are shown in Table 7.5.
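For reference, Eqs. 7.4-7.7 can be packaged into a small estimator. The sketch below is a simplified re-implementation, not the actual co-simulation framework; the example call plugs in the per-operation values from Tables 7.4 and 7.5 purely to illustrate units and usage, and is not intended to reproduce the exact numbers reported in Section 7.3.3.

import math

def total_energy(e_pix, e_adc, e_com, e_mac, e_read, n_pix, n_mac, n_read):
    """Eq. 7.4: sensing + communication + MAC + parameter-read energy (joules)."""
    e_sens = (e_pix + e_adc) * n_pix
    e_comm = e_com * n_pix
    return e_sens + e_comm + e_mac * n_mac + e_read * n_read

def conv_counts(h_o, w_o, k, c_i, c_o):
    """Eqs. 7.5-7.6: MAC and weight-read counts of one convolutional layer."""
    return h_o * w_o * k**2 * c_i * c_o, k**2 * c_i * c_o

def conv_delay(h_o, w_o, k, c_i, c_o, t_read, t_mult,
               b_io=64, b_w=32, n_bank=4, n_mult=175):
    """Eq. 7.7: per-layer delay under the sequential-execution assumption."""
    reads = math.ceil(k**2 * c_i * c_o / ((b_io / b_w) * n_bank))
    mults = math.ceil(k**2 * c_i * c_o / n_mult)
    return reads * t_read + mults * h_o * w_o * t_mult

# Illustrative call using the P2M first-layer shape and Table 7.4/7.5 values
# (energies in joules, times in seconds); e_read is ignored as in the text.
n_mac, n_read = conv_counts(h_o=112, w_o=112, k=5, c_i=3, c_o=8)
e = total_energy(e_pix=148e-12, e_adc=41.9e-12, e_com=900e-12, e_mac=1.568e-12,
                 e_read=0.0, n_pix=112 * 112 * 8, n_mac=n_mac, n_read=n_read)
t = conv_delay(112, 112, 5, 3, 8, t_read=5.48e-9, t_mult=5.48e-9)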
Based on this sequential assumption, the approximate compute delay for a single forward pass of our P2M model can be given by

T_{delay} \approx T_{sens} + T_{adc} + T_{conv}        (7.8)

Here, T_sens and T_adc correspond to the delays associated with the sensor read and ADC operations, respectively. T_conv corresponds to the delay associated with all the convolutional layers, where each layer's delay is computed by Eq. 7.7.

Figure 7.4: Comparison of normalized total, sensing, and SoC (a) energy cost and (b) delay between the P2M and baseline model architectures (compressed C, and non-compressed NC). Note that the normalization of each component was done by dividing the corresponding energy (delay) value by the maximum total energy (delay) value of the three components.

Fig. 7.4(b) shows the comparison of delay between P2M and the corresponding baselines, where the total delay is computed under the sequential sensing and SoC operation assumption. In particular, the proposed P2M approach can yield an improved delay of up to 2.15×. Thus, the total EDP advantage of P2M can be up to 16.76×. On the other hand, even with the conservative assumption that the total delay is estimated as max(T_sens + T_adc, T_conv), the EDP advantage can be up to ~11×.

7.4 Conclusions

With the increased availability of high-resolution image sensors, there has been a growing demand for energy-efficient on-device AI solutions. To mitigate the large amount of data transmission between the sensor and the on-device AI accelerator/processor, we propose a novel paradigm called Processing-in-Pixel-in-Memory (P2M), which leverages advanced CMOS technologies to enable the pixel array to perform a wider range of complex operations, including many operations required by modern convolutional neural network (CNN) pipelines, such as multi-channel, multi-bit convolution, BN, and ReLU activation. Consequently, only the compressed, meaningful data, for example after the first few layers of custom CNN processing, is transmitted downstream to the AI processor, significantly reducing the power consumption associated with the sensor ADC and the required data transmission bandwidth. Our experimental results yield a reduction of data rates after the sensor ADCs of up to ~21× compared to standard near-sensor processing solutions, significantly reducing the complexity of downstream processing. This, in fact, enables the use of relatively low-cost micro-controllers for many low-power embedded vision applications and unlocks a wide range of visual TinyML applications that require high-resolution images for accuracy but are bounded by compute and memory usage. We can also leverage P2M for even more complex applications, where downstream processing can be implemented using existing near-sensor computing techniques that leverage advanced 2.5D and 3D integration technologies [227].

Chapter 8
ISP-less CV for P2M

This chapter first provides the introduction and motivation behind the development of ISP-less vision models in Section 8.1. Reviews of related works on ISP reversal and removal, and on few-shot object detection, are provided in Section 8.2. Section 8.3 presents our proposed ISP inversion pipeline. Section 8.4 presents our custom analog demosaicing implementation. Section 8.5 presents our novel application of few-shot learning to further increase the accuracy of ISP-less vision systems. Sections 8.6 and 8.7 present our experimental setup and results, respectively. Finally, some discussions and conclusions are provided in Section 8.8.
8.1 Introduction & Motivation

Modern high-resolution cameras generate a huge amount of visual data arranged in the form of raw Bayer color filter arrays (CFA), also known as a mosaic pattern, as shown in Fig. 8.1, that need to be processed for downstream CV tasks [197,200]. An ISP unit, consisting of several pipelined processing stages, is typically used before the CV processing to convert the raw mosaiced images to their RGB counterparts [228-231]. The ISP step that converts these single-channel CFA images to three-channel RGB images is called demosaicing. Historically, the ISP has proven to be extremely effective for computational photography applications, where the goal is to generate images that are aesthetically pleasing to the human eye [231,232]. However, is it important for high-level CV applications, such as face detection by smart security cameras, where the sensor data is unlikely to be viewed by any human? Existing works [228-230] show that most ISP steps can be discarded with a small drop in the test accuracy for large-scale image recognition tasks. The removal of the ISP can potentially enable existing in-sensor [50,204,205] and in-pixel [51-54] computing paradigms to process CV computations, such as CNNs, partly in the sensor, and reduce the bandwidth and energy incurred in the data transfer between the sensor and the CV system. Moreover, most low-power cameras with a few-MPixel resolution do not have an on-board ISP [233], thereby requiring the ISP to be implemented off-chip, increasing the energy consumption of the total CV system.

Figure 8.1: Difference in frequency distributions of pixel intensities between mosaiced raw, demosaiced, and ISP-processed images.

Although ISP removal can facilitate model deployment on resource-constrained edge devices, one key challenge is that most large-scale datasets that are used to train CV models are ISP-processed. Since there is a large co-variance shift between the raw and RGB images (please see Fig. 8.1, where we show the histograms of the pixel intensity distributions of RGB and raw images), models trained on ISP-processed RGB images and inferred on raw images, thereby removing the ISP, exhibit a significant drop in accuracy. One recent work has leveraged trainable flow-based invertible neural networks [234] to convert raw to RGB images and vice-versa using open-source ISP datasets. These networks have recently yielded SOTA test performance in photographic tasks, and we propose to modify them to invert the ISP pipeline and build the raw version of any large-scale ISP-processed database for high-level vision applications, such as object detection. This raw dataset can then be used to train CV models that can be efficiently deployed on low-power edge devices without any of the ISP steps, including demosaicing. To further improve the performance of these ISP-less models, we propose a novel hardware-software co-design approach, where a form of demosaicing is applied on the raw mosaiced images inside the pixel array using analog summation during the pixel read-out operation, i.e., without a dedicated ISP unit. Our models trained on this demosaiced version of the Visual Wake Words (VWW) dataset lead to an 8.2% increase in the test accuracy compared to standard training on RGB images and inference on raw images (to simulate the ISP removal and the in-pixel/in-sensor implementation). Even compared to standard RGB training and inference, our models yield 0.7% (1.6%) higher accuracy (mAP) on the VWW (COCO) dataset. Lastly, we propose a novel application of few-shot learning to improve the accuracy
of real raw images captured directly by a camera (which have a limited number of annotations), with our generated raw images constituting the base dataset. The key contributions of this chapter can be summarized as follows.

• Inspired by the energy and bandwidth benefits obtained by in-sensor computing approaches and the removal of most ISP steps in a CV pipeline, we present and release a large-scale raw image database that can be used to train accurate CV models for low-power ISP-less edge deployments. This dataset is generated by reversing the entire ISP pipeline using the recently proposed flow-based invertible neural networks and custom mosaicing. We demonstrate the utility of this dataset to train ISP-less CV models with raw images.

• To improve the accuracy obtained with raw images, we propose a low-overhead form of in-pixel demosaicing that can be implemented directly on the pixel array alongside other CV computations enabled by recent in-pixel/in-sensor computing approaches, and that also reduces the data bandwidth.

• We present a thorough evaluation of our approach with both simulated (our released dataset) and real (captured by a real camera) raw images, for a diverse range of use-cases with different memory/compute budgets.

• To improve the accuracy on real raw images, we propose a novel application of few-shot learning, with the simulated raw images having a large number of labelled classes constituting the base dataset.

Figure 8.2: (a) Proposed ISP-less CV system, (b) invertible NN training on the demosaiced raw image, without any white balance or gamma correction, (c) generation of raw images using the trained inverse network and custom mosaicing, and (d) application of in-pixel demosaicing and training of the ISP-less CV models. Note that the in-pixel demosaic implementation in the pixel array is illustrated in Fig. 8.3.

8.2 Related Works

8.2.1 ISP Reversal & Removal

Since most ISP steps are irreversible and depend on the camera manufacturer's proprietary color profile [235], it is difficult to invert the ISP pipeline. To mitigate this challenge, a few recent works [236-238] proposed learning-based methods, but they result in large losses, and the recovered RAW images may be significantly different from the originals captured by the camera. To reduce this loss, a more recent work [234] used a stack of k invertible and bijective functions f = f_1 ∘ f_2 ∘ ⋯ ∘ f_k to invert the ISP pipeline. For a raw input x, the RGB output y and the inverted raw input x are computed as y = f_1 ∘ f_2 ∘ ⋯ ∘ f_k(x) and x = f_k^{-1} ∘ f_{k-1}^{-1} ∘ ⋯ ∘ f_1^{-1}(y). The bijective function f_i is implemented through affine coupling layers [234]. In each affine coupling layer, given a D-dimensional input m and d < D, the output n is

n_{1:d} = m_{1:d} + r(m_{d+1:D})        (8.1)
n_{d+1:D} = m_{d+1:D} \odot \exp(s(n_{1:d})) + t(n_{1:d})        (8.2)

where s and t represent scale and translation functions from R^d to R^{D-d} that are realized by neural networks, ⊙ represents the Hadamard product, and r represents an arbitrary function from R^{D-d} to R^d. The inverse step is

m_{d+1:D} = (n_{d+1:D} - t(n_{1:d})) \odot \exp(-s(n_{1:d}))        (8.3)
m_{1:d} = n_{1:d} - r(m_{d+1:D})        (8.4)

The authors then utilize the invertible 1×1 convolution proposed in [239] as a learnable permutation function to revert the channel order for the subsequent affine coupling layer.
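A minimal affine coupling layer implementing Eqs. 8.1-8.4 is sketched below. The scale, translation, and r functions are reduced to single linear layers purely for illustration ([234] uses deeper convolutional sub-networks), and the round-trip check verifies that Eqs. 8.3-8.4 exactly invert Eqs. 8.1-8.2.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible coupling step (Eqs. 8.1-8.4), split at index d."""
    def __init__(self, dim: int, d: int):
        super().__init__()
        self.d = d
        self.r = nn.Linear(dim - d, d)        # r: R^{D-d} -> R^{d}
        self.s = nn.Linear(d, dim - d)        # s: R^{d}   -> R^{D-d}
        self.t = nn.Linear(d, dim - d)        # t: R^{d}   -> R^{D-d}

    def forward(self, m):
        m1, m2 = m[:, :self.d], m[:, self.d:]
        n1 = m1 + self.r(m2)                                  # Eq. 8.1
        n2 = m2 * torch.exp(self.s(n1)) + self.t(n1)          # Eq. 8.2
        return torch.cat([n1, n2], dim=1)

    def inverse(self, n):
        n1, n2 = n[:, :self.d], n[:, self.d:]
        m2 = (n2 - self.t(n1)) * torch.exp(-self.s(n1))       # Eq. 8.3
        m1 = n1 - self.r(m2)                                  # Eq. 8.4
        return torch.cat([m1, m2], dim=1)

# Round-trip check: inverse(forward(x)) recovers x up to numerical error.
layer = AffineCoupling(dim=6, d=3)
x = torch.randn(4, 6)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-4)

Stacking several such layers, interleaved with learnable channel permutations, yields a bijective raw-to-RGB mapping whose inverse can be evaluated exactly.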
Recent works have also investigated the role of the ISP in image classification and the impact of its removal or trimming on accuracy for energy and bandwidth benefits. For example, [228] demonstrated that removal of the whole ISP during edge inference results in a ~8.6% loss in accuracy with MobileNets [240] on ImageNet [12], which can mostly be recovered by using just the tone-mapping stage. Another work [229] attempted to integrate the ISP and CV processing using tone mapping and feature-aware downscaling blocks that reduce both the number of bits per pixel and the number of pixels per frame. A more recent work [241] used knowledge distillation on an ISP neural network model to align the logit predictions of an off-the-shelf pretrained model for raw images with those for ISP-processed RGB images.

8.2.2 Few-Shot Object Detection

In recent years, few-shot object detection (FSOD) has gained significant traction as ML accuracy in low-data scenarios continues to improve. There are two mainstream training paradigms in FSOD: meta-learning and finetune-based methods. Meta-learning methods attempt to capture aggregated information from multiple annotated, data-rich support datasets. Thus, when required to train on a dataset with novel classes and less data, the model can leverage the prior knowledge learned from the support datasets to generalize to new classes. For example, [242] used a re-weighting module to adjust the coefficients of the query image meta-features by capturing global features of the support images, making them suitable for novel object detection. The authors in [243] proposed a Predictor-head Remodeling Network (PRN) module to generate class-attentive vectors that provide aggregated features between support and query images for the meta-learner predictor head. Additionally, [244] introduced an attention-based region proposal network to match the candidate proposals with the support images, and a multi-relation detector which can measure the similarity between proposal boxes from the query and the support objects. Compared to meta-learning, which requires a complicated training process, finetune-based methods have a simpler pipeline. For instance, [245] proposed the two-stage fine-tuning approach (TFA), which only finetunes the bounding box classification and regression parts on a class-balanced training set, yet outperforms many meta-learning methods. Moreover, to mitigate misclassifying novel instances as confusing base classes, [246] introduced contrastive learning into the FSOD pipeline, which helps the learned target features exhibit high intra-class similarity and inter-class variability.

8.3 Inverting ISP Pipeline

Similar to [234], we propose to generate the raw demosaiced images from ISP-processed RGB images using the affine coupling layers described in Section 8.2.1. However, [234] models the ISP pipeline on the demosaiced, white-balanced, and gamma-corrected raw image, and hence, the invertible ISP pipeline does not generate the raw images that are captured directly by a camera. The authors apply gamma correction on the RAW data (i.e., without storing it on disk) to compress the dynamic range for faster convergence. Hence, for ISP-less in-sensor CV systems, a naive application of the invertible ISP pipeline proposed in [234] would require performing these operations in the sensor. This is challenging due to the limited compute/memory footprint available in the pixel array and the periphery. In particular, traditional demosaicing involves matrix operations with interpolation (nearest neighbour, bi-linear, bi-cubic, etc.) techniques that scale with the input resolution.
Moreover, white balancing involves a variable gain amplification for each pixel location, which requires complex control logic, and gamma correction involves logarithmic computation, which is challenging to process using analog logic in advanced high-density pixels.

For these reasons, we propose to train an invertible network on the demosaiced images from the MIT-Adobe 5K dataset [247]. Despite our focus on classification/detection tasks, we propose to use this photographic dataset to train the invertible ISP because we do not have large-scale ground-truth raw-RGB image pairs for those tasks. We train using demosaiced images because the input size of the invertible neural network must be equal to its output size. Once trained, we use this network to obtain the raw demosaiced images from the ISP-processed RGB images of the large-scale classification/detection datasets.

We then invert the demosaicing, i.e., perform the mosaicing operation, by selecting the appropriate pixel color corresponding to each location, as shown in Fig. 8.2. For example, to generate the red pixel in a particular mosaiced RGGB patch, we select the pixel intensity of the red channel at the same location in the demosaiced image. Although this final mosaiced image is obtained after inverting the entire ISP pipeline, it might still be slightly different from the raw image captured by a camera. This is partially because we do not explicitly model the latent distribution of the different ISP steps to stabilize the training of the invertible network. We mitigate this concern using few-shot learning.

Figure 8.3: Implementation of the proposed (a) demosaicing and (b) demosaicing coupled with in-pixel convolution for ISP-less CV.

8.4 Proposed Demosaicing Technique

Although training on raw images in the Bayer CFA format increases the test accuracy of ISP-less CV applications, such images might lack the representation capacity that multiple colors spanning different spectral bands can provide at each pixel location. Hence, a natural question is: can we increase this capacity without an additional ISP unit? Since demosaicing is the ISP technique that yields separate RGB channels from the raw CFA format, one intuitive idea is to implement demosaicing directly in the pixel array, and then process the CV computations required for CNNs using in-pixel/in-sensor computing. However, as explained above, traditional demosaicing approaches involve complex operations which are hard to map on the pixel array, especially when the pixel array needs to process the initial CNN layers in the in-pixel computing paradigms. Hence, we propose a low-overhead custom in-pixel demosaicing approach that significantly increases the test accuracy for our benchmarks compared to inference on raw images.

We propose to implement a simple but effective custom demosaicing operation inside the analog pixel array. Let us consider a demosaiced RGB image with shape X×Y×3, to be processed for CV applications. Then, our custom demosaicing technique requires the input mosaiced raw image to have a shape of 2X×2Y, such that each 2×2 RGGB patch produces the corresponding 3 channels for a single pixel, thereby yielding a 25% reduction in data dimensionality. Functionally, the custom demosaicing copies the red and blue pixel intensities from the camera to the demosaiced RGB channel output, while the two green pixels from the RGGB patch of the camera pixel array are averaged to produce one effective value for the green pixel intensity.
While the summation is performed by analog computation inside the pixel array, as described below, the division is performed in the digital domain after the analog-to-digital converter (ADC) in the pixel periphery by a simple logical right-shift operation.

The proposed implementation of the pixel array to accomplish this custom demosaicing functionality is shown in Fig. 8.3(a). We propose to include two select lines for each row of the pixel array: the first set of select lines, called 'Row-Select', are connected to the select transistors of the red and blue pixels, while the second set of select lines, called 'Green-Select', are connected only to the green pixels. Essentially, the pixels in the RGGB Bayer pattern are connected in an interleaved manner to the two select lines. Therefore, the read-out of the red and blue pixels is controlled by the 'Row-Select' lines, while the 'Green-Select' lines control the read-out of the green pixels. Now consider activating two rows of 'Row-Select' lines (Row-Select-1 and Row-Select-2) in the 2×2 pixel array of Fig. 8.3(a). This would result in the read-out of the red and blue pixels on 'Column-Line-1' and 'Column-Line-2', respectively. The two green pixels would remain deactivated, as the 'Green-Select' lines are kept at a low voltage. In a subsequent cycle, the two 'Row-Select' lines are kept at a low voltage and the two 'Green-Select' lines are activated by pulling them to a high voltage. Further, the two 'Column-Lines' are connected together by closing the 'Column-Switch' shown in Fig. 8.3(a). Consequently, the voltage on the now-connected 'Column-Lines' represents the accumulated response of the two green pixels, which is fed to the column ADCs for analog-to-digital conversion. Note that the proposed scheme is similar to pixel binning approaches [248-251], except that in this case binning is selectively performed only on the two green pixels in each patch of the Bayer RGGB pattern, using interleaved connections to the 'Row-Select' and 'Green-Select' lines. In summary, in two cycles, wherein two rows of 'Row-Select' and 'Green-Select' lines are activated in each cycle, the proposed scheme can generate the demosaiced red, blue, and green pixels. Note that since we are able to read two rows (consisting of RGGB pixels) in two cycles, the proposed scheme does not incur any overhead in terms of the read-out speed (or frame rate) of the camera.

In yet another approach, we propose to combine the custom demosaicing and the computations of the first layer of the CNN inside the pixel array using the P2M (processing-in-pixel-in-memory) paradigm proposed in [54], as shown in Fig. 8.3(b). Modifying the P2M pixel array of [54], Fig. 8.3(b) presents a novel pixel array that can combine demosaicing and convolution computations using memory-embedded pixels. Essentially, the CNN weights associated with the two green pixels in a single patch of the Bayer RGGB pattern are kept the same. This is achieved by keeping the sizes of the weight transistors the same across the two green pixels. Further, these transistor weights are set such that each set of weight transistors for a single green pixel carries half of the effective algorithmic weight value associated with the green channel of the input convolutional layer. This ensures that the resultant analog dot product obtained from the P2M scheme [54] involves an effective averaging of the intensities of the green pixels followed by multiplication with the corresponding weights associated with the convolutional layer.
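Functionally, the proposed in-pixel demosaicing can be emulated offline with simple tensor slicing. The sketch below assumes an RGGB layout with R at the top-left and B at the bottom-right of each 2×2 patch (the exact Bayer phase depends on the sensor), and mimics the analog green summation followed by the digital right shift; it is a behavioral model, not the circuit itself.

import torch

def inpixel_demosaic(raw: torch.Tensor) -> torch.Tensor:
    """Emulate the proposed in-pixel demosaicing on an RGGB Bayer frame.

    raw: (N, 1, 2X, 2Y) single-channel mosaiced image.
    returns: (N, 3, X, Y) image where each 2x2 RGGB patch collapses to one
    RGB pixel; the two greens are summed (analog read-out on the shared
    column line) and halved (right shift after the ADC).
    """
    r  = raw[:, :, 0::2, 0::2]     # top-left of each 2x2 patch (assumed red)
    g1 = raw[:, :, 0::2, 1::2]     # top-right green
    g2 = raw[:, :, 1::2, 0::2]     # bottom-left green
    b  = raw[:, :, 1::2, 1::2]     # bottom-right of each patch (assumed blue)
    g  = (g1 + g2) / 2             # summed on the column line, then divided by 2
    return torch.cat([r, g, b], dim=1)

# A 560x560 Bayer frame becomes a 280x280x3 input for the downstream CNN,
# i.e., 4XY raw values collapse to 3XY values (the 25% dimensionality reduction).
rgb = inpixel_demosaic(torch.rand(1, 1, 560, 560))
print(rgb.shape)  # torch.Size([1, 3, 280, 280])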
While the in-pixel convolution on the demosaiced image can lead to a significantly higher bandwidth reduction [54] (quantified later in Section 8.7.5), the analog non-idealities involved in the multiply-and-accumulate operation and the weight mismatch in the green pixels can lead to large errors, require re-training the entire CNN, and introduce manufacturing challenges, which might require non-trivial changes to the design pipeline of sensors.

8.5 Few-Shot Learning

Compared to the abundance of RGB image datasets, it is difficult to obtain large-scale annotated raw images. For example, to the best of our knowledge, the only raw image database for classification/detection tasks, PASCALRAW, contains only 4,259 annotated images with 3 object classes, which is not enough to train a deep CV model. Even with a model pre-trained on a large-scale RGB dataset, it might be difficult to fine-tune on this small-scale raw dataset (due to the co-variance shift) and yield satisfactory performance.

As described in Section 8.2.2, recent works have proposed a plethora of few-shot learning approaches that achieve strong performance on datasets with some novel classes and only a few images per class. Our problem is not exactly the same as a typical few-shot learning setup, given that we can find a large-scale annotated RGB image database having the same classes as the raw dataset. For example, the Microsoft COCO dataset consists of 80 classes, which cover objects from a range of applications, such as autonomous driving and aerial imagery recognition, and can be used for fine-tuning on a raw image database with a subset of these classes. We propose to leverage TFA [245] (see Section 8.2.2) for this fine-tuning process, and to the best of our knowledge, this is the first application of few-shot learning to improve the accuracy/mAP on raw images.

However, naively applying TFA with COCO as the base dataset can only bring limited improvement in accuracy, due to the co-variance shift between the RGB and raw images. Note that in typical few-shot learning setups, such as TFA, the images in the base and novel datasets are assumed to have similar intensity distributions [245]. Hence, we propose a novel application of few-shot learning which leverages our simulated raw COCO dataset as the base dataset to increase the test mAP on the real raw dataset. We choose a class-balanced subset of the real raw dataset as the samples with 'novel classes' to perform TFA on the model pretrained on our raw COCO dataset to further improve the mAP.

8.6 Experimental Setup

8.6.1 Implementation Details

We evaluate our proposed method on three CNN backbones/frameworks with varying complexities and use-cases, as described below. For the object detection experiments, we use the [157] framework, while for few-shot learning, we use the mmfewshot [252] and FsDet [245] frameworks. Our training details are provided in the supplementary materials.

MobileNetV2 [240]: A lightweight depthwise convolutional neural network that has gained significant traction for deployment on resource-constrained edge devices, such as mobile devices. In this work, we use a lower-complexity version of MobileNetV2, namely MobileNetV2-0.35x [213], which shrinks the output channel count by 0.35× to satisfy the compute budget of 60M floating point operations (FLOPs), representative of standard micro-controllers, where ISP-less CV may be the most relevant.

Faster R-CNN [253]: A two-stage object detection framework that consists of a feature extraction, a region proposal, and a RoI pooling module.
For our experiments, we use ResNet101 as the backbone network for feature extraction, since MobileNetV2 significantly degrades the test mAP compared to SOTA.

YOLOv3 [254]: YOLOv3 (You Only Look Once, version 3) is a real-time object detection framework that identifies specific objects in videos or images. We use MobileNetV2 as the backbone network for feature extraction in YOLOv3.

8.6.2 Dataset Details

We evaluate our proposed approaches on the simulated raw versions of the Visual Wake Words (VWW) and COCO datasets, and on a real raw dataset captured by a real camera, introduced in [255]. The details of the datasets are below.

VWW [218]: The Visual Wake Words (VWW) dataset consists of high-resolution images that include visual cues to "wake up" AI-powered, resource-constrained home assistant devices that require real-time inference. The goal of the VWW challenge is to detect the presence of a human in the frame (a binary classification task with 2 labels) with very few resources - close to 250 KB peak RAM usage and model size - which is only satisfied by MobileNetV2-0.35x, and hence that model is used in our experiments.

Microsoft COCO: To evaluate the multi-object detection task, we use the popular Microsoft COCO dataset [256]. Specifically, we use an image resolution of 1333×800 for the Faster R-CNN framework and 416×416 for the YOLOv3 framework, the same as used in [254]. We use all 80 available classes for our experiments. We evaluate the performance of each method using the mAP averaged over IoU ∈ {0.5, 0.75, [0.5:0.05:0.95]}, denoted as mAP@0.5, mAP@0.75, and mAP@[0.5,0.95], respectively. Note that we also report the individual mAPs for small (area < 32² pixels), medium (area between 32² and 96² pixels), and large (area > 96² pixels) objects.

PASCALRAW: This RAW image database was developed to simulate the effect of algorithmic hardware implementations, such as embedded feature extraction at the image sensor or read-out level, on end-to-end object detection performance. The annotations of this dataset were made in accordance with the original PASCAL VOC guidelines [255]. For the few-shot learning experiments, we choose 29 images containing the class 'bicycle', 25 images containing the class 'car', and 21 images containing the class 'person' to construct a balanced training set where each class has 30 annotated objects (i.e., 30-shot), and use the remaining 4178 images as the test dataset.

8.7 Experimental Results

8.7.1 VWW Results

For VWW, we compare the accuracy of the tinyML-oriented MobileNetV2-0.35x model with our proposed demosaicing and in-pixel computing technique against inference on mosaiced raw and RGB images in Table 8.1. We also compare our approach with traditional demosaicing (the opencv library in Python), white balancing (the rawpy package in Python), and gamma correction. Note that, as shown in Table 8.1, using identically distributed images during training yields the best accuracy during inference. Table 8.1 further illustrates that using the off-the-shelf model pre-trained on ISP-processed images yields an accuracy of 81.97% when deployed on an ISP-less CV system with our in-pixel demosaicing, which is 7.32% lower compared to ISP-processed inference. Note that we cannot avoid the demosaicing step, because the pre-trained model is trained with 3-channel input images. With the mosaiced image database generated from our invertible pipeline, the accuracy gap (training and testing both on the mosaiced image) reduces to 2.82%.
Additionally, with our in-pixel demosaicing on this mosaiced image, we yield an accuracy of 89.92%, which is even 0.63% higher than the RGB test accuracy. Appending the first-layer convolution inside the pixel, coupled with the demosaicing, results in a slightly lower accuracy of 89.07%.

Table 8.1: Evaluation of our approach on ISP-less CV systems with MobileNetV2-0.35x on the VWW dataset. Demosaiced¹ denotes traditional demosaicing, while demosaiced² denotes our in-pixel demosaicing. WB, GC, and IPC denote white balance, gamma correction, and in-pixel computing. Also, note that models trained on mosaiced images can only be tested with mosaiced images.
Training method | Test Acc. (%), inference on mosaiced | Test Acc. (%), inference on demosaiced² | Test Acc. (%), inference with IPC
Mosaiced | 87.47 | - | -
demosaiced¹ | - | 88.84 | 88.04
demosaiced² | - | 89.92 | 89.07
demosaiced¹+WB | - | 86.47 | 86.23
demosaiced¹+WB+GC | - | 82.70 | 81.45
ISP-processed | - | 81.97 | 81.43

8.7.2 COCO raw Results

The detailed results on the COCO raw dataset are summarized in Table 8.2. Our experiments indicate that direct inference on the COCO demosaiced raw dataset using the model pre-trained on the COCO ISP-processed RGB dataset yields an mAP of 33.8%, which is 7.2% lower compared to ISP-processed inference. Note that the mAP for small objects reduces significantly, by nearly 35%. However, with finetuning on our COCO demosaiced raw dataset, the mAP increases to 42.8%. Unlike in VWW, where models can be accurately trained from scratch, training and testing on the COCO mosaiced raw images leads to a reduced mAP of 29.4%. This reduction might be because the pre-trained model (where the backbone is also pre-trained on ImageNet) cannot be leveraged due to the difference in the number of input channels. Lastly, applying our proposed in-pixel demosaicing on the mosaiced raw dataset yields an mAP of 37.0%, which is 5.0% lower than ISP-processed inference, unlike VWW. This might be because our demosaicing reduces the spatial resolution of the image, which can be detrimental for the complex object detection task. Interestingly, our approach is effective in detecting medium-sized objects, and achieves the highest mAP of 48.6%.

Table 8.2: mAP on different versions of the COCO raw dataset to emulate ISP-less CV systems using a Faster R-CNN framework with a ResNet101 backbone.
Model | mAP@[0.5:0.95] | mAP@0.5 | mAP@0.75 | S | M | L
baseline | 33.8 | 50.5 | 37.0 | 16.6 | 36.6 | 46.7
demosaiced¹ | 42.8 | 64.1 | 47.1 | 25.6 | 46.9 | 55.0
mosaiced | 29.4 | 45.7 | 31.8 | 12.7 | 32.1 | 42.9
demosaiced² | 37.8 | 57.7 | 39.8 | 20.2 | 48.6 | 53.2
1 'baseline' indicates testing on our proposed COCO raw dataset with a model pretrained on the ISP-processed COCO dataset.
2 'demosaiced¹' indicates training and testing on our proposed COCO raw dataset.
3 'mosaiced' indicates training and testing on mosaiced images obtained from the COCO raw dataset from our invertible ISP.
4 'demosaiced²' indicates training and testing on our in-pixel demosaiced images.

8.7.3 PASCALRAW Results

YOLOv3

Table 8.3 shows the performance of six different methods with YOLOv3 on the PASCALRAW dataset. Direct inference on this dataset with models pre-trained on the ISP-processed COCO dataset yields only 2.7% mAP, due to the significant co-variance shift between the two datasets. Using the ISP-processed base dataset, we compare two different few-shot learning approaches: one where we use 30 shots for both the base and novel classes, and another where we use 30 shots only for the novel classes.
We observe that the latter leads to a 1.0% higher mAP compared to the former, which might be because the former may underfit to the three target classes due to its improved generalization. Note that, due to the difference in the dataset distributions, few-shot learning fails to significantly increase the mAP, as observed from the modest mAP improvement from 2.7% to 6.2%. On the other hand, after finetuning on our custom demosaiced COCO raw dataset (without any few-shot learning), the mAP increases by more than 4×, to 13.4%. This strongly demonstrates the effectiveness of our large-scale raw database. Lastly, applying few-shot learning with this base raw dataset further increases the mAP to 20.8%.

Table 8.3: Comparison of our proposed approach on the PASCALRAW dataset.
Framework | Method | mAP@[0.5,0.95] | mAP@0.5 | mAP@0.75 | small | medium | large
YOLOv3 | ISP-processed | 2.7 | 8.2 | 1.2 | 0.2 | 2.2 | 4.4
YOLOv3 | ISP-processed+few-shot* | 5.2 | 15.4 | 2.4 | 0.6 | 5.5 | 7.9
YOLOv3 | ISP-processed+few-shot** | 6.2 | 17.0 | 3.3 | 0.2 | 3.9 | 11.2
YOLOv3 | demosaiced raw | 13.4 | 38.5 | 5.4 | 0.9 | 12.2 | 22.9
YOLOv3 | demosaiced raw+few-shot* | 16.9 | 40.6 | 10.9 | 0.5 | 17.8 | 26.3
YOLOv3 | demosaiced raw+few-shot** | 20.8 | 47.4 | 14.5 | 0.9 | 17.3 | 30.4
Faster R-CNN | ISP-processed | 1.2 | 4.2 | 0.2 | 0.0 | 1.3 | 3.5
Faster R-CNN | ISP-processed+few-shot* | 5.9 | 14.8 | 3.3 | 0.0 | 3.8 | 8.6
Faster R-CNN | ISP-processed+few-shot** | 9.5 | 26.0 | 4.2 | 0.0 | 6.6 | 15.0
Faster R-CNN | demosaiced raw | 9.3 | 29.9 | 2.2 | 1.7 | 10.5 | 19.5
Faster R-CNN | demosaiced raw+few-shot* | 27.4 | 52.8 | 25.7 | 6.9 | 27.1 | 37.3
Faster R-CNN | demosaiced raw+few-shot** | 29.8 | 58.1 | 28.0 | 8.0 | 28.1 | 40.6
* These experiments apply few-shot learning with 30 shots of both the base classes and the novel classes.
** These experiments apply few-shot learning with 30 shots of only the novel classes.

Faster R-CNN

We perform a series of similar experiments with the Faster R-CNN model with a ResNet101 backbone on the PASCALRAW dataset. As we can see in Table 8.3, the results are consistent with those from the YOLOv3 model, except that there is no mAP increase with fine-tuning on the COCO raw dataset compared to applying few-shot learning with the ISP-processed base dataset. This might be because our demosaicing approach, which incurs a 4× spatial down-sampling, might not be that competitive with the Faster R-CNN framework at ultra-high input resolution. Applying few-shot learning with 30 shots of only the novel classes on our custom demosaiced COCO raw dataset yields an mAP of 29.8%, which is
Even compared to testing on ISP-processed RGB images which require the entire ISP pipeline, we obtain 0.63% (1.6%)increaseinaccuracy(mAP)ontheVWW(COCO)dataset. Itisdifficultto directly compare our approach with other works [228,229], as they do not release the ISP model, and evaluate the impact of the removal of different ISP stages on in-the-wild datasets, such as ImageNet [12] and KITTI [257], which may not be a relevant use-case of ISP-less low-power edge deployment. 153 8.7.5 Bandwidth & Energy Benefits Removing the entire ISP pipeline, and applying the proposed in-pixel demosaicing operation directly on the raw images can lead to significant energy and bandwidth savings, thereby aiding the deployment of CNN models on ultra low-power edge devices. The complete image captured by the sensors is transmitted to a down- streamSoCprocessingtheISPandCVunitstypicallythroughenergy-hungryMIPI interfaces, which cost significant bandwidth [258]. As explained in Section 8.4, the demosaicing operation leads to a dimensionality reduction of 4 3 , which implies a 25% reduction in bandwidth. Quantizing the demosaiced outputs to 8-bits using custom ADCs (inputs to modern CNNs have unsigned 8-bit representation) leads to a ( 12 8 ) or 50% reduction in bandwidth, assuming the raw image has a bit-depth of 12 [214]. Lastly, appending the first convolution layer inside the sensor yields a 3× increase in bandwidth for MobileNetV2-0.35x. This is convolutional layer has a stride of 2, which implies a 4× dimensionality reduction, while there is a ( 8 3 ) dimensionality increase due to the 3 channels in the input demosaiced image and 8 output channels in the first convolutional layer. In summary, the total bandwidth/data transmission energy reduction due to our proposed demosaicing operation is 75%, while for the in-pixel computing approach (on the proposed demosaiced image as illustrated in Section 8.4) is 12× . Note that this energy benefit is in addition to the energy savings obtained by removing the ISP operations in an SoC, and transferring the ISP output to a CV processing unit. It is difficult to accurately quantify this saving as it depends on the underlying hardware implementation and dataflow, as well as the propri- etary implementation of ISP. That said, we compare the sensor (pixel+ADC), data communication, and the CNN energy consumption of our demosaicing and in-pixel computing approaches with normal pixel read-out in Fig. 8.4(c-d). While Fig. 8.4(c) represents the tinyML use-case on VWW using MobileNetV2-0.35x, Fig. 8.4(d) represents the more difficult use-case on COCO using Yolov3. We compute the pixel energies using our in-house circuit simulation framework, while 154 the ADC, data communication, and CNN energies are obtained from [54]. While our demosaicing approach incurs a sensor energy overhead of∼ 5% on average, the proposed in-pixel implementation reduces (increases) the sensor energy by 33% (23%) on VWW (COCO) with MobileNetV2-0.35x (YoloV3). The energy increase is due to the increased number of convolutional output channels (first layer) in the MobileNet backbone of YoloV3. 8.8 Discussions In this work, we propose an ISP-less computer vision paradigm to enable the deploymentofCNNmodelsonlow-poweredgedevicesthatinvolveprocessingclose tothesensornodeswithlimitedcompute/memoryfootprint. 
Our proposal has two significant benefits: 1) we release a large-scale RAW image database that can be used to train and deploy CNNs for a wide range of vision tasks (including those related to photography), and 2) our hardware-software co-design approach leads to significant bandwidth savings compared to traditional CV pipelines. To the best of our knowledge, this is the first work to address the widely overlooked ISP pipeline in near-sensor and in-sensor processing paradigms, while also proposing novel in-pixel schemes for custom demosaicing coupled with convolution computation. Our proposed approach increases the test accuracy (mAP) of a tinyML (generic object detection) application by 7.32% (7.2%) compared to the direct deployment of off-the-shelf pre-trained models on ISP-less CV systems. Our approach, coupled with few-shot learning, has been shown to be effective in detecting real raw objects captured directly by a camera from the PASCALRAW dataset.

Chapter 9
Self-Attentive Pooling for Aggressive Compression in P2M

This chapter first provides the introduction and motivation behind the development of self-attentive pooling techniques that enable aggressive striding in the first few layers of a CNN in Section 9.1. Preliminaries on pooling techniques, model compression, and attention-based models are provided in Section 9.2. Section 9.3 provides a background of the multi-head self-attention module. Section 9.4 presents our proposed training framework, and Section 9.5 illustrates how our self-attentive pooling technique can be applied in different CNN backbones. Section 9.6 presents our experimental results on accuracy, and the compute and memory efficiency benefits compared to existing pooling approaches. Finally, some discussions and conclusions are provided in Section 9.7.

9.1 Introduction & Motivation

In the recent past, CNN architectures have shown impressive strides in a wide range of complex vision tasks, such as object classification [8] and semantic segmentation [102]. With the ever-increasing resolution of images captured by modern camera sensors, the large activation maps in the initial CNN layers consume a large amount of on-chip memory, hindering the deployment of CNN models on resource-constrained edge devices [54]. Moreover, these large activation maps increase the inference latency, which impedes real-time use cases [54]. Pooling is one of the most popular techniques that can reduce the resolution of these activation maps and aggregate effective features. Historically, pooling layers (either as strided convolution layers or standalone average/max pooling layers) have been used in
For example, with the sea in the background, it is highly unlikely that we can find the class 'car' in the foreground, and more likely that we can find classes such as 'boat' or 'ship'.

To reduce the significant on-chip memory consumed by the initial activation maps, large kernel sizes and strides are often required in the pooling layers to fit the models in resource-constrained devices. This might lead to a loss of feature information when only locality is leveraged for aggregation. Moreover, recently proposed in-sensor [50,225] and in-pixel [53-55] computing approaches can benefit from aggressive bandwidth reduction in the initial CNN layers via down-sampling. We hypothesize that the accuracy loss typically associated with aggressive down-sampling can be minimized by considering both local and non-local information during down-sampling.

To explore this hypothesis, we divide the activation map into patches and propose a novel non-local self-attentive pooling method to aggregate features and capture long-range dependencies across different patches. The proposed method consists of a patch embedding layer, a multi-head self-attention layer, and a spatial-channel restoration layer, followed by a sigmoid and an exponential activation function. The patch embedding layer encodes each patch into a one-pixel token that consists of multiple channels. The multi-head self-attention layer models the long-range dependencies between different patch tokens. The spatial-channel restoration layer helps in decoding and restoring the patch tokens to non-local self-attention maps. The sigmoid and exponential activation functions rectify and amplify the non-local self-attention maps, respectively. Finally, the pooled activation maps are computed as the patch-wise average of the element-wise multiplication of the input activation maps and the non-local self-attention maps (a conceptual sketch of this weighted-pooling step follows the chapter highlights below).

Our method surpasses the test accuracy (mAP) of all existing pooling techniques in CNNs for a wide range of on-device object recognition (detection) tasks, particularly when the initial activation maps need to be significantly down-sampled for memory efficiency. Our method can also be coupled with structured model compression techniques, such as channel pruning, that further reduce the compute and memory footprint of our models. In summary, the key highlights of this chapter are:

• Inspired by the potential benefits of non-local feature aggregation, we propose the use of multi-head self-attention to aggressively downsample the activation maps in the initial CNN layers that consume a significant amount of on-chip memory.

• We propose the use of spatial-channel restoration, weighted averaging, and custom activation functions in our self-attentive pooling approach. Additionally, we jointly optimize our approach with channel pruning to further reduce the memory and compute footprint of our models.

• We demonstrate the memory-compute-accuracy (mAP) trade-off benefits of our proposed approach through extensive experiments with different on-device CNN architectures on both object recognition and detection tasks, and through comparisons with existing pooling and memory-reduction approaches. Moreover, we provide visualization maps obtained by our non-local pooling technique that give deeper insights into the efficacy of our approach.
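To make the weighted-pooling step concrete, the following is a minimal PyTorch-style sketch of importance-weighted down-sampling: a non-negative importance map π(x) scales the input before a patch-wise (window) average. This template is shared by locality-based pooling such as LIP and by the non-local method proposed here; only the way π(x) is produced differs. The function name and the use of avg_pool2d are illustrative assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def weighted_pool(x: torch.Tensor, pi: torch.Tensor, stride: int) -> torch.Tensor:
    """Importance-weighted down-sampling of an activation map.

    x  : input activation of shape (B, C, H, W)
    pi : non-negative importance map of the same shape
    Returns a map of shape (B, C, H/stride, W/stride).
    """
    # Window average of pi * x, normalized by the window average of pi,
    # i.e., a weighted mean of x within each stride x stride region.
    num = F.avg_pool2d(pi * x, kernel_size=stride, stride=stride)
    den = F.avg_pool2d(pi, kernel_size=stride, stride=stride)
    return num / (den + 1e-6)  # small epsilon for numerical stability

# Example: 4x down-sampling of a 64-channel activation map with a stand-in importance map.
x = torch.randn(1, 64, 96, 96)
pi = torch.exp(torch.sigmoid(torch.randn(1, 64, 96, 96)))
y = weighted_pool(x, pi, stride=4)  # -> torch.Size([1, 64, 24, 24])
```

In LIP, π(x) comes from a local convolution over x, whereas in the proposed method it comes from the non-local self-attention pipeline detailed in Section 9.4.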
Figure 9.1: Illustration of locality-based pooling and non-local self-attentive pooling. The pooling weight has the same shape as the input activation I, of which only a local region is displayed in this figure. F(·) denotes the locality-based pooling and π(·) denotes the proposed non-local self-attentive pooling. For locality-based pooling, each pooling weight has a limited sensitive field, as shown in the red box. For the proposed non-local self-attentive pooling, the input activation is divided into several patches and encoded into a series of patch tokens. Based on these patch tokens, the pooling weights have a global view, which makes them superior for capturing long-range dependencies and aggregating features.

9.2 Related Work

9.2.1 Pooling Techniques

Most popular CNN backbones contain pooling layers for feature aggregation. For example, VGG [7], Inception [259], and DenseNet [260] use average/max pooling layers, while ResNet [8], MobileNet [240], and their variants use convolutions with stride greater than 1 as trainable pooling layers, applied in a hierarchical fashion at regular locations for feature down-sampling. However, these naive pooling techniques might not be able to extract useful and relevant features, particularly when the pooling stride needs to be large for significant down-sampling. This has resulted in a plethora of novel pooling layers being introduced in the recent past.

In particular, mixed [262] and hybrid pooling [263] use a combination of average and max pooling, which can be learnt during training. $L_p$ pooling [264] extracts features from the local window using the $L_p$ norm, where the parameter $p$ can be learnt during training. Another work proposed detail-preserving pooling (DPP) [265], where the authors argued that there are fine-grained details in an activation map that should be preserved while the redundant features can be discarded. However, the detail score is an arbitrary function of the statistics of the pixel values in the receptive field, which may not be optimal. A more recent work introduced local importance pooling (LIP) [261], which uses a trainable convolution filter that captures the local importance of different receptive fields and is used to scale the activations before pooling. Gaussian-based pooling [266] formulates the pooling operator as a probabilistic model to flexibly represent the activation maps. RNNPool [213] uses recurrent neural networks (RNNs) to aggregate features of large 1D receptive fields across different dimensions. In order to extract context-aware rich features for fine-grained visual recognition, another recent work [267], called CAP, proposed a novel attentive pooling that correlates different regions of the convolutional feature map to help discriminate between subcategories and improve accuracy. In particular, CAP is applied late in the network (after all convolutional layers) and is not intended to reduce the model's memory footprint, in contrast to our work, which applies pooling to down-sample the large activation maps early in the network. Interestingly, CAP transforms each feature using a novel form of attention (involving only query and key) rather than the traditional self-attention module adopted in this work.
Lastly, while CAP uses bi-linear pooling, global average pooling, and an LSTM, our approach uses patch embedding, spatial-channel restoration, and weighted pooling.

9.2.2 Model Compression

Pruning is one of the well-known forms of model compression [268,269] that can effectively reduce DNN inference costs [268]. A recent surge of interest in pruning has produced various methods for extracting subnetworks, including iterative magnitude pruning (IMP) [270], reinforcement-learning-driven methods [271], and additional-optimization-based methods [272]. However, these methods require additional training iterations and thus demand significantly more training compute. In this work, we adopt a more recent method of model pruning, namely sparse learning [273], that can effectively yield a pruned subnetwork while training from scratch. In particular, since this method always updates a sparse subnetwork to non-zero values and ensures that the target pruning ratio is met, we can safely avoid the fine-tuning stage yet obtain good accuracy. Readers interested in pruning and sparse learning can refer to [274] for more details. Recently, neural architecture search [275] has also enabled significant model compression, particularly for memory-limited devices. A recent work [276] proposed patch-based inference and network redistribution to shift the receptive field to later stages to reduce the memory overhead.

9.2.3 Low-Power Attention-based Models

There are a few self-attention-based transformer models in the literature that aim to reduce the compute/memory footprint for edge deployments. MobileViT [277] proposed a light-weight and general-purpose vision transformer, combining the strengths of CNNs and ViTs. LVT [278] proposed two enhanced self-attention mechanisms for low- and high-level features to improve model performance on mobile devices. MobileFormer [279] parallelized MobileNet and a Transformer with a two-way bridge for information sharing, which achieved SOTA performance in the accuracy-latency trade-off on ImageNet. For other vision tasks, such as semantic segmentation and point cloud downsampling, recent works have proposed transformer-based models for mobile devices. For example, LighTN [280] proposed a single-head self-correlation module to aggregate global contextual features and a downsampling loss function to guide training for point cloud recognition. TopFormer [281] utilized a token pyramid from various scales as input to generate scale-aware semantic features for semantic segmentation.

9.3 Background

In this section, we explain the multi-head self-attention [183] module that was first introduced to computer vision by the ViT architecture [282].

In ViT, the input image $I \in \mathbb{R}^{H\times W\times C}$ is reshaped into a sequence of non-overlapping patches $I_p \in \mathbb{R}^{(\frac{H\cdot W}{P^2})\times(P^2\cdot C)}$, where $(H\times W)$ is the size of the input RGB image, $C$ is the number of channels, and $P^2$ is the number of pixels in a patch. The flattened 2D image patches are then fed into the multi-head self-attention module. Specifically, the patch sequence $I_p$ is divided into $m$ heads $I_p = \{I_p^1, I_p^2, \ldots, I_p^m\} \in \mathbb{R}^{N\times\frac{C_p}{m}}$, where $N = \frac{H\cdot W}{P^2}$ is the number of patches and $C_p = P^2\cdot C$ is the number of channels in $I_p$. These tokens are fed into the multi-head self-attention module $\mathrm{MSA}(\cdot)$:

$I_a = \mathrm{LN}(\mathrm{MSA}(I_p)) + I_p, \qquad (9.1)$

where $\mathrm{LN}(\cdot)$ is layer normalization [283,284]. In the $j$-th head, the token series $I_p^j \in \mathbb{R}^{N\times\frac{C_p}{m}}$ is first projected onto $L(I_p^j) \in \mathbb{R}^{N\times d_k}$ by a linear layer.
Then three weight matrices $\{W_q, W_k, W_v\} \in \mathbb{R}^{D\times d_k}$ are used to obtain the query, key, and value tokens as $Q^j = W_q L(I_p^j)$, $K^j = W_k L(I_p^j)$, and $V^j = W_v L(I_p^j)$, respectively, where $D$ is the hidden dimension and $d_k = D/m$. The output $I_a^j \in \mathbb{R}^{N\times D}$ of the self-attention layer is given by:

$I_a^j = \mathrm{softmax}\left(\frac{Q^j {K^j}^T}{\sqrt{d_k}}\right) V^j. \qquad (9.2)$

Finally, the results of the $m$ heads are concatenated and projected back onto the original space:

$I_a = \mathrm{concat}(I_a^1, I_a^2, \ldots, I_a^m)\, W_O, \qquad (9.3)$

where $W_O \in \mathbb{R}^{C_p\times D}$ is the projection weight and the final output is $I_a \in \mathbb{R}^{N\times C_p}$.

9.4 Proposed Method

The weights of local pooling approaches are associated with only a local region of the input feature maps, as shown in Fig. 9.1. These pooling approaches are limited by the locality of the convolutional layer and need a large number of layers to acquire a large sensitive field. To mitigate this issue, we can intuitively encode global and non-local information into the pooling weights, as shown in Fig. 9.1. To realize this intuition, we propose a form of self-attentive pooling that is based on a multi-head self-attention mechanism, which captures the non-local information as self-attention maps that perform feature down-sampling. We then jointly optimize the proposed pooling method with channel pruning to further reduce the memory footprint of the entire CNN model.

Figure 9.2: Architecture of the non-local self-attentive pooling, consisting of patch embedding (strided convolution, batch norm & ReLU, sequentialization, and positional encoding), multi-head self-attention, spatial-channel restoration (interpolation/up-sampling, 1×1 convolution, batch norm & sigmoid, and an exponential), and weighted (average) pooling.

9.4.1 Non-Local Self-Attentive Pooling

The overall structure of the proposed method is shown in Fig. 9.2. It consists of four main modules: patch embedding, multi-head self-attention, spatial-channel restoration, and weighted pooling.

1) Patch embedding is used to compress spatial-channel information. We use a strided convolution layer to encode and compact local information for different patches along the spatial and channel dimensions of the input. More precisely, the input to the embedding is a feature map denoted as $x \in \mathbb{R}^{h\times w\times c_x}$ with resolution $(h\times w)$ and $c_x$ input channels. The output of the embedding is a token series $x_p \in \mathbb{R}^{(\frac{h\cdot w}{\epsilon_p^2})\times(\epsilon_r\cdot c_x)}$, where $\epsilon_p$ is the patch size and $\epsilon_r$ sets the number of output channels as $\epsilon_r\cdot c_x$. The patch embedding consists of a strided convolution layer with kernel size and stride both equal to $\epsilon_p$, followed by a batch norm layer and a ReLU function [285]. For each patch indexed by $[n_i, n_j]$, the patch embedding layer output can be formulated as:

$x_p[n_i, n_j] = \phi_{relu}\left(\sum_{i=0}^{\epsilon_p}\sum_{j=0}^{\epsilon_p} w^c_{i,j}\cdot x_{(n_i\cdot\epsilon_p + i,\; n_j\cdot\epsilon_p + j)} + b^c\right), \qquad (9.4)$

where $w^c, b^c$ are the weight and bias of the convolution kernel, respectively, and $\phi_{relu}$ denotes the ReLU activation function. After patch embedding, a learnable positional encoding [282] is added to the token series $x_p$ to mitigate the loss of positional information caused by sequentialization.

2) Multi-head self-attention is used to model the long-range dependencies between different patch tokens. The input patch token series $x_p$ is fed into the module, and the output $x_{attn}$ is a self-attentive token sequence with the same shape as $x_p$.

3) Spatial-channel restoration decodes spatial and channel information from the self-attentive token sequence $x_{attn}$.
The token sequence $x_{attn} \in \mathbb{R}^{(\frac{h\cdot w}{\epsilon_p^2})\times(\epsilon_r\cdot c_x)}$ is first reshaped to $\mathbb{R}^{\frac{h}{\epsilon_p}\times\frac{w}{\epsilon_p}\times(\epsilon_r\cdot c_x)}$ and then expanded to the original spatial resolution $(h, w)$ via bilinear interpolation. A subsequent convolutional layer with a 1×1 kernel projects the output to the same number of channels $c_x$ as the input tensor $x$. A batch norm layer normalizes the response of the output attention map $x_r \in \mathbb{R}^{h\times w\times c_x}$. A sigmoid function is then used to rectify the output range of $x_r$ to [0,1], followed by an exponential function to amplify the self-attentive response.

4) Weighted pooling is used to generate the down-sampled output feature map from the output of the spatial-channel restoration block, denoted as $\pi(x)$ in Fig. 9.2. In particular, assuming a kernel and stride size of $(s\times s)$ in our pooling method, and considering a local region in $x$ from $(p,q)$ to $(p+s, q+s)$, the pooled output corresponding to this region can be estimated as

$O = \frac{\sum_{i=p}^{p+s}\sum_{j=q}^{q+s}\pi_{i,j}(x)\, x_{i,j}}{\sum_{i=p}^{p+s}\sum_{j=q}^{q+s}\pi_{i,j}(x)}, \qquad (9.5)$

where $\pi_{i,j}(x)$ denotes the value of $\pi(x)$ at index $(i,j)$. Similarly, the whole output activation map can be estimated from each local region separated by a stride of $s$.

9.4.2 Optimizing with Channel Pruning

To further reduce the activation map dimension, we leverage the popular channel pruning [273] method. In particular, channel pruning forces all the values in some of the convolutional filters to be zero, which in turn makes the associated activation map channels redundant. Let us assume a layer $l$ with a corresponding 4D weight tensor $\theta^l \in \mathbb{R}^{M\times N\times h\times w}$. Here, $h$ and $w$ are the height and width of a 2D kernel of the tensor, with $M$ and $N$ representing the number of filters and channels per filter, respectively. To perform channel pruning of the layer weights, we first convert the weight tensor $\theta^l$ to a 2D weight matrix, with $M$ and $N\times h\times w$ being the number of rows and columns, respectively. We then partition this matrix into $N$ sub-matrices of $M$ rows and $h\times w$ columns, one for each channel. To rank the importance of the channels, for a channel $c$ we compute the Frobenius norm (F-norm) of its associated sub-matrix, i.e., $O_l^c = \|\theta^l_{:,c,:,:}\|_F^2$. Based on the fraction of non-zero weights that need to be rewired during an epoch $i$, denoted by the pruning rate $p_i$, we compute the number of channels that must be pruned from each layer, $c_l^{p_i}$, and prune the $c_l^{p_i}$ channels with the lowest F-norms. We then leverage the normalized momentum contributed by a layer's non-zero channels to compute its layer importance, which is then used to determine the number of zero-F-norm channels $r_l^i \geq 0$ that should be re-grown for each layer $l$. Note that we first pre-train CNN models with our self-attentive pooling, and then jointly fine-tune our pooled models with this channel pruning technique. While the pooling layers are applied to all down-sampling layers, the channel pruning is only applied to the initial activation maps (only in the first stage of the CNN backbones illustrated in Fig. 9.3) to maximize its impact on reducing the memory footprint of the models.

Figure 9.3: Illustration of the two ways of using the pooling methods: inner-stage pooling, where the activation is down-sampled after the first block of a stage, and outer-stage pooling, where it is down-sampled after the whole stage.

9.5 Self-Attentive Pooling in CNN Backbones

The proposed pooling method can be used in any backbone network, such as VGG [7], MobileNet [240], and ResNet [8].
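Before discussing where this pooling layer is placed inside a backbone, the following minimal PyTorch-style sketch ties together the four modules of Section 9.4.1 and reuses the multi-head self-attention of Section 9.3 via nn.MultiheadAttention. It is only a sketch under simplifying assumptions: the class name, the default hyperparameters, and the omission of the learnable positional encoding are illustrative choices, not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Illustrative sketch of the non-local self-attentive pooling of Sec. 9.4.1."""

    def __init__(self, in_channels, patch_size=4, channel_ratio=1.0, num_heads=2, stride=2):
        super().__init__()
        embed_dim = int(channel_ratio * in_channels)
        assert embed_dim % num_heads == 0, "embedding width must be divisible by num_heads"
        self.stride = stride
        # 1) Patch embedding (Eq. 9.4): strided conv + BN + ReLU, one token per patch.
        #    (The learnable positional encoding of Sec. 9.4.1 is omitted here for brevity.)
        self.embed = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        # 2) Multi-head self-attention over the patch tokens (Sec. 9.3, Eqs. 9.1-9.3).
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # 3) Spatial-channel restoration: bilinear up-sampling, then 1x1 conv + BN.
        self.restore = nn.Sequential(
            nn.Conv2d(embed_dim, in_channels, kernel_size=1),
            nn.BatchNorm2d(in_channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = self.embed(x)                          # (B, E, h/p, w/p)
        e, hp, wp = tokens.shape[1:]
        tokens = tokens.flatten(2).transpose(1, 2)      # (B, N, E) with N = hp * wp
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(attn_out) + tokens           # residual form of Eq. (9.1)
        tokens = tokens.transpose(1, 2).reshape(b, e, hp, wp)
        # Restore to the input resolution, rectify with sigmoid, amplify with exp.
        pi = F.interpolate(tokens, size=(h, w), mode="bilinear", align_corners=False)
        pi = torch.exp(torch.sigmoid(self.restore(pi)))
        # 4) Weighted pooling (Eq. 9.5): window average of pi * x, normalized by pi.
        num = F.avg_pool2d(pi * x, self.stride, self.stride)
        den = F.avg_pool2d(pi, self.stride, self.stride)
        return num / (den + 1e-6)

# Example: a drop-in replacement for a stride-2 down-sampling layer on a 24-channel map.
pool = SelfAttentivePooling(in_channels=24, patch_size=4, num_heads=2, stride=2)
out = pool(torch.randn(2, 24, 56, 56))                  # -> torch.Size([2, 24, 28, 28])
```

If used inside a backbone, such a module would replace a strided convolution or max-pooling layer at the down-sampling points discussed next; the $\epsilon_p$, $\epsilon_r$, and $m$ settings of Table 9.1 correspond to the patch_size, channel_ratio, and num_heads arguments above.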
Generally, a backbone network can be roughly divided into several stages, and the down-sampling layer (either a strided convolution or max/average pooling), if present in a stage, is only applied at the first block. Specifically, there are two ways to replace this down-sampling layer with our (or any other SOTA) pooling method in the backbone network, i.e., outer-stage pooling and inner-stage pooling, as shown in Fig. 9.3. Outer-stage pooling means that the activation is down-sampled by the pooling layer after each stage, which helps to reduce the size of the final output activation map in each stage and customizes the pooling layer to learn the stage information. Inner-stage pooling means that the activation is down-sampled after the first block of each stage, which helps to reduce the initial activation map. We optimize the use of these pooling methods for each backbone evaluated, as specified in Section 9.6.1.

Table 9.1: Hyperparameter settings of different pooling techniques.

  Strided Conv.:    kernel size 3×3
  LIP:              kernel size 1×1
  Ours:             ε_p ∈ {1, 2, 4, 8}, ε_r ∈ {0.25, 1}, m = 2
  Channel Pruning:  2×

9.6 Experiments

9.6.1 Experimental Setup

The proposed pooling method is compared to several pooling methods, namely strided convolution, LIP, GaussianPool, and RNNPool. All these methods are widely used in deep learning and, to the best of our knowledge, yield SOTA performance. Our proposed method is implemented in PyTorch; its hyperparameter settings, along with those of the methods we compare against, are listed in Table 9.1. Specifically, we evaluate the pooling approaches on two compute- and memory-efficient backbone networks, MobileNetV2 and ResNet18. For both, we keep the same pooling settings except for the first pooling layer, where we employ aggressive striding for memory reduction. For example, in MobileNetV2, we use strides (s1,2,2,2,1,2,1), where s1 ∈ {1,2,4}. More details are in the supplementary materials.

To evaluate the performance of the pooling methods on multi-object feature aggregation, we use two object detection frameworks, namely SSD [286] and Faster R-CNN [253]. To holistically evaluate the pooling methods, we use three image recognition datasets, namely STL-10, VWW, and ImageNet, which have varying complexities and use-cases. Their details are in the supplementary materials. To evaluate on the multi-object detection task, we use the popular Microsoft COCO dataset [256]. Specifically, we use an image resolution of 300×300 for the SSD framework, the same as used in [286], 608×608 for the YoloV3 framework, the same as used in [254], and 1333×800 for the Faster RCNN framework. Eight classes related to autonomous driving, namely {'person', 'bicycle', 'car', 'motorcycle', 'bus', 'train', 'truck', 'traffic light'}, are used for our experiments. We evaluate the performance of each pooling method using mAP averaged over IoU ∈ {0.5, 0.75, [0.5:0.05:0.95]}, denoted as mAP@0.5, mAP@0.75, and mAP@[0.5,0.95], respectively. We also report the individual mAPs for small (area less than 32² pixels), medium (area between 32² and 96² pixels), and large (area more than 96² pixels) objects.

9.6.2 Accuracy & mAP Analysis

The experimental results on the image recognition benchmarks are illustrated in Tables 9.2, 9.3, and 9.5, where each pooling method is applied on the different backbone networks described in Section 9.6.1. Note that the resulting network is named as 'pooling method's name'-'backbone network's name'.
For example, 'Strided Conv.-MobileNetV2' means we use the strided convolution as the pooling layer in the MobileNetV2 backbone network. On the STL10 dataset, when evaluated with the MobileNetV2 and ResNet18 backbone networks, the proposed method outperforms the existing pooling approaches by approximately 0.7% for s1=1. On ImageNet, the accuracy gain ranges from 0.86% to 1.66% (1.2% on average) for s1=1. Since VWW is a relatively simple task, the accuracy gain of our proposed method is only 0.14-0.7% across different values of s1. Further analysis of the memory-accuracy trade-off with channel pruning and other s1 values is presented in Section 9.6.4.

The object detection experimental results for s1=1 are listed in Table 9.4. When evaluated in the SSD framework, our proposed method outperforms the SOTA pooling approach by 0.5-1% for mAP@0.5, 0.3-0.5% for mAP@0.75, and 0.5-0.8% for mAP@[0.5,0.95], which illustrates the superiority of our method for multi-object feature aggregation. When evaluated in the Faster RCNN framework, the proposed method also achieves state-of-the-art performance on mAP@0.5, mAP@0.75, and mAP@[0.5,0.95], with an approximately 0.1-0.6% mAP gain.

Table 9.2: Comparison of different pooling methods for different CNN backbones on the STL10 dataset (Top-1 accuracy, %, for first-pool stride s1).

  Method*                              s1=1    s1=2    s1=4
  Strided Conv.-MobileNetV2            79.69   72.49   36.49
  LIP-MobileNetV2                      79.16   68.23   36.50
  GaussianPool-MobileNetV2             81.50   74.56   33.31
  RNNPool-MobileNetV2                  81.62   74.62   37.42
  Ours-MobileNetV2                     81.75   75.39   40.66
  Ours+CP**-MobileNetV2                82.38   74.12   37.44
  Strided Conv.-MobileNetV2-0.35x      69.89   63.72   31.45
  LIP-MobileNetV2-0.35x                73.02   65.91   33.97
  GaussianPool-MobileNetV2-0.35x       71.67   67.88   35.03
  RNNPool-MobileNetV2-0.35x            72.90   67.41   35.09
  Ours-MobileNetV2-0.35x               77.99   69.30   36.68
  Ours+CP-MobileNetV2-0.35x            77.43   68.08   33.30
  Strided Conv.-ResNet18               79.80   76.05   66.49
  LIP-ResNet18                         81.94   80.53   78.55
  GaussianPool-ResNet18                81.57   78.70   74.61
  RNNPool-ResNet18                     81.80   80.26   78.62
  Ours-ResNet18                        82.25   81.11   79.39
  Ours+CP-ResNet18                     82.68   79.81   76.19

  * Methods are named as pooling method's name-backbone's name. MbNetV2 indicates the MobileNetV2 backbone network. 'Ours' indicates the standard proposed pooling method.
  ** 'Ours+CP' is the proposed method with 2× channel pruning in the backbone network before the 1st pooling layer.

Table 9.3: Comparison of different pooling methods for MobileNetV2-0.35x on the VWW dataset (Top-1 accuracy, %, for first-pool stride s1).

  Method                               s1=1    s1=2    s1=4
  Strided Conv.-MobileNetV2-0.35x      91.72   83.52   78.83
  LIP-MobileNetV2-0.35x                91.24   83.30   79.48
  GaussianPool-MobileNetV2-0.35x       91.09   82.81   79.51
  RNNPool-MobileNetV2-0.35x            90.85   83.41   79.20
  Ours-MobileNetV2-0.35x               91.86   83.87   80.21
  Ours+CP-MobileNetV2-0.35x            91.60   82.46   76.11

All results, except those for ImageNet and COCO (due to compute constraints), are reported as the mean of three runs with distinct seeds; the variance across these runs is <0.1%, which is well below our accuracy gains.

9.6.3 Qualitative Results & Visualization

To intuitively illustrate the superiority of the proposed method, we visualize the heatmaps corresponding to the different attention mechanisms on images from the STL10 dataset, as shown in Fig. 9.4. Specifically, the heatmap is calculated by GradCAM [287], which computes the gradient of the ground-truth class with respect to each of the pooling layers. The heatmap value is directly proportional to the pooling weights at a particular location, which implies that the regions with high heatmap values contain effective features that are retained during down-sampling.
Compared with LIP, the representative locality-based pooling method, our proposed method is more concerned with the details of an image and the long-range dependencies between different local regions. As shown in the first and second columns, LIP focuses only on the main local regions with large receptive fields. In contrast, our method focuses on the features from different local regions, such as the dog's mouth, ears, and legs in the first column and the bird and branches in the second column. These non-local features are related and might establish long-range dependencies for feature aggregation. As shown in the fifth and sixth columns, our pooling method mainly focuses on the texture of the cat's fur, which might be a discriminative feature for classification/detection, while LIP focuses on the general shape of the cat. This kind of general information might fail to guide feature aggregation when the goal is to compress and retain effective detailed information.

Table 9.4: Comparison on the COCO dataset (mAP, %).

  Framework     Method                        @0.5    @0.75   @[0.5,0.95]   large   medium   small
  SSD           Strided Conv.-MobileNetV2     36.30   23.00   21.90         44.60   14.40    0.80
  SSD           LIP-MobileNetV2               37.50   23.10   22.30         44.80   15.30    0.90
  SSD           GaussianPool-MobileNetV2      37.00   24.00   22.80         46.50   16.00    0.70
  SSD           Ours-MobileNetV2              38.00   24.50   23.30         47.00   16.50    0.80
  SSD           Strided Conv.-ResNet18        38.80   24.70   23.40         47.00   15.70    1.10
  SSD           LIP-ResNet18                  40.60   25.10   24.20         47.80   18.00    1.70
  SSD           GaussianPool-ResNet18         40.40   24.90   24.10         47.20   17.70    1.40
  SSD           Ours-ResNet18                 41.60   25.40   24.90         48.80   19.30    1.60
  Faster RCNN   Strided Conv.-ResNet18        63.60   40.80   38.70         52.70   36.70    21.00
  Faster RCNN   LIP-ResNet18                  65.30   42.00   39.90         52.10   39.00    23.90
  Faster RCNN   GaussianPool-ResNet18         55.30   33.10   31.80         44.40   29.10    16.00
  Faster RCNN   Ours-ResNet18                 65.50   42.60   40.00         51.50   39.90    22.80

Table 9.5: Comparison of different pooling methods for MobileNetV2-0.35x on the ImageNet dataset (Top-1 accuracy, %, for first-pool stride s1).

  Method                               s1=1    s1=2
  Strided Conv.-MobileNetV2            70.02   60.18
  LIP-MobileNetV2                      71.62   61.86
  GaussianPool-MobileNetV2             72.02   61.24
  RNNPool-MobileNetV2                  70.97   59.24
  Ours-MobileNetV2                     72.88   62.89
  Strided Conv.-MobileNetV2-0.35x      56.64   49.20
  LIP-MobileNetV2-0.35x                58.24   49.95
  GaussianPool-MobileNetV2-0.35x       59.26   49.91
  RNNPool-MobileNetV2-0.35x            57.80   49.10
  Ours-MobileNetV2-0.35x               60.92   51.16

9.6.4 Compute & Memory Efficiency

Assuming the same input and output dimensions for down-sampling, and denoting the FLOPs counts of our self-attentive pooling layer and the SOTA LIP layer as $F_{SA}$ and $F_{LIP}$, respectively, $F_{SA} \approx \frac{3}{n^2} F_{LIP}$. Hence, adopting a patch size $n > 1$ makes our pooling cost cheaper than that of LIP. In particular, we use larger patch sizes (ranging from 2 to 8) for the initial pooling layers and a patch size of 1 for the later layers (see Table 9.1). This still keeps the total FLOPs count of our entire model lower than that with LIP, as shown in Table 9.6, because the FLOPs count of both pooling methods is significantly higher in the initial layers than in the later layers due to the large size of the activation maps. Note that, in most standard backbones, the channel dimension only increases by a factor of 2 when each spatial dimension reduces by a factor of 2, which implies that the total size of the activation map progressively reduces as we go deeper into the network. Our method also consumes 11.66% lower FLOPs, on average, compared to strided-convolution-based pooling, as shown in Table 9.6.
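As a back-of-the-envelope illustration of why the initial layers dominate both the pooling FLOPs and the activation memory, the short script below tabulates the activation-map size after each down-sampling point of a stride schedule (s1, 2, 2, 2, 1, 2, 1); the channel widths are illustrative assumptions rather than the exact backbone configuration.

```python
# Back-of-the-envelope activation-map sizes for a stride schedule (s1, 2, 2, 2, 1, 2, 1).
# The channel widths below are illustrative assumptions, not the exact backbone widths.

def activation_elements(h=224, w=224, s1=1, channels=(16, 24, 32, 64, 96, 160, 320)):
    strides = (s1, 2, 2, 2, 1, 2, 1)
    sizes = []
    for s, c in zip(strides, channels):
        h, w = h // s, w // s
        sizes.append(h * w * c)  # elements in the activation map after this stage
    return sizes

for s1 in (1, 2, 4):
    kb = [n / 1024 for n in activation_elements(s1=s1)]  # KB, assuming 8-bit activations
    print(f"s1={s1}: " + ", ".join(f"{v:.0f} KB" for v in kb))
```

With these assumed widths, the first one or two stages account for most of the activation footprint for every listed s1, which is why the aggressive first-pool stride and the early-stage channel pruning target precisely these maps.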
The memory consumption of the whole CNN network is similar for self-attentive pooling and LIP with identical backbone configurations and identical down-sampling in the pooling layers. Though our self-attentive pooling contains significantly more trainable parameters for the query, key, and value computation compared to local trainable pooling layers, these parameters are fixed during inference and can be saved off-line in the on-chip memory. Also, the memory consumed by these parameters is still significantly lower than that consumed by the initial activation maps, and hence it does not significantly increase the memory overhead.

Table 9.6: Comparison of the total FLOPs count of the whole CNN backbone with different pooling methods on the STL10 dataset.

  Architecture        Ours (G)   LIP (G)   GP (G)   Sd. Conv. (G)
  MbNetV2             0.272      0.264     0.295    0.303
  MbNetV2-0.35x       0.06       0.061     0.059    0.065
  ResNet18            1.82       1.93      1.77     2.07

Figure 9.4: Visualization results for local-importance-based pooling (LIP) and the proposed non-local self-attentive pooling on images from the STL10 dataset. The heatmaps for each technique highlight the regions of interest, i.e., the regions with high heatmap values are regarded as effective information and retained while down-sampling.

Note that reducing s1 by a factor of 2 approximately halves the total memory consumption, enabling CNN deployment on devices with tighter memory budgets. As illustrated in Tables 9.2, 9.3, and 9.5, the accuracy gain of our proposed pooling method over the SOTA grows as we increase s1. A similar trend is also observed as we go from MobileNetV2 to MobileNetV2-0.35x to reduce the memory consumption. For example, the accuracy gain further increases from 0.25% to 4.97% when evaluated on STL10, which implies that the non-local self-attention map can extract more discriminative features from a memory-constrained model. For ImageNet, with the aggressive down-sampling of the activation maps in the initial layers (providing up to a 22× reduction in memory consumption, where 11× is due to MobileNetV2-0.35x and 2× is due to aggressive striding), the test accuracy gap with the SOTA techniques at iso-memory increases from 1.2% on average to 1.43%. All of this motivates the applicability of our approach to resource-constrained devices.

Channel pruning can further reduce the memory consumption of our models without much reduction in test accuracy. We consider 2× channel pruning in the 1st stage of all the backbone networks, as illustrated in Tables 9.2 and 9.3. As we can see, adding channel pruning with s1=1 can retain (or sometimes even outperform) the accuracy obtained by our proposed pooling technique alone. However, channel pruning does not improve the accuracy obtained for more aggressive down-sampling with our pooling technique (s1=2,4). Hence, the nominal down-sampling schedule (s1=1) with channel pruning is the most suitable configuration for reducing the memory footprint.

9.6.5 Ablation Study

We conduct ablation studies of our proposed pooling method when evaluated with the ResNet18 backbone on the STL10 dataset. Our results are shown in Table 9.7. Note that bn1 and bn2 denote the BN layers in the patch embedding and multi-head self-attention modules, respectively, and pe denotes the positional encoding layer. SelfAttn directly uses the multi-head self-attention module before each strided convolution layer, without spatial-channel restoration and weighted pooling. Removing either of the BN layers results in a slight drop in test accuracy.
We hypothesize that the batch norm (BN) layers normalize the input data distribution, which helps the non-linear activation extract better features and helps speed up convergence. Note that this argument holds for BN layers in CNNs in general and is not particular to self-attentive pooling. Our pooling method without the exponential function degrades significantly. This might be because each value in the attention map after the sigmoid function is limited to the range 0-1, without amplifying the response of effective features. Removing the positional encoding also slightly reduces the accuracy, which illustrates the importance of positional information. We hypothesize that the positional encoding layer merges the positional information into the patch tokens, thereby compensating for the broken spatial relationships between different tokens. Also, our pooling method without the sigmoid yields only chance-level test accuracy. This is because, without the sigmoid rectification, the output of the spatial-channel restoration module goes to infinity after amplification by the exponential function, resulting in gradient explosion. Compared to using only the self-attention module (instead of our proposed pooling technique) before the strided convolution, our proposed method is more effective. As illustrated in Table 9.7, our accuracy increase is due to the proposed methods, not only the self-attention mechanism.¹

Table 9.7: Ablation study of our proposed pooling technique (Top-1 accuracy, %, on STL10, for first-pool stride s1).

  Method                                    s1=1    s1=2
  w\o (bn1)-ResNet18 (Outer Stage)          80.34   78.45
  w\o (bn2)-ResNet18 (Outer Stage)          82.01   80.36
  w\o (exp)-ResNet18 (Outer Stage)          81.95   80.00
  w\o (pe)-ResNet18 (Outer Stage)           82.01   79.73
  w\o (sigmoid)-ResNet18 (Outer Stage)      10.00   10.00
  SelfAttn-MobileNetV2 (Inner Stage)        13.44   -
  SelfAttn-MobileNetV2 (Outer Stage)        13.23   -
  SelfAttn-ResNet18 (Inner Stage)           26.17   -
  SelfAttn-ResNet18 (Outer Stage)           58.71   -
  Ours-ResNet18 (Outer Stage)               82.25   81.11
  Ours-ResNet18 (Inner Stage)               81.45   79.17
  Ours-MobileNetV2 (Outer Stage)            79.45   68.81
  Ours-MobileNetV2 (Inner Stage)            81.75   75.39

9.7 Conclusion & Societal Implications

In this chapter, we propose self-attentive pooling, which aggregates non-local features from the activation maps, thereby enabling the extraction of more complex relationships between the different features compared to existing local pooling layers. Our approach outperforms the existing pooling approaches with popular memory-efficient CNN backbones on several object recognition and detection benchmarks. Hence, we hope that our approach can enable the deployment of accurate CNN models on various resource-constrained platforms such as smart home assistants and wearable sensors.

¹ We do not find the self-attention module alone to be effective, probably because we do not pre-train it on large datasets, such as JFT-300M [282].

Chapter 10
Future Work

This chapter concludes this dissertation proposal and presents some interesting future research directions in neuromorphic and in-sensor computing catering to efficient edge intelligence. In particular, we partition the planned future work into two major categories: 1. future work in efficient inference with SNNs, and 2. future work in efficient inference with in-sensor computing. Sections 10.1 and 10.2 detail a concrete plan for future work in these two areas, respectively. Finally, this dissertation concludes in Section 10.3.

10.1 Future Work in SNNs

Beyond this dissertation research, there are several unexplored yet important areas where SNNs can be optimized.
Only then can they expand beyond their neuromorphic niche and replace traditional DNNs for extreme edge efficiency.

10.1.1 SNNs for beyond-classification tasks

It is hard to train SNNs with low time steps for complex CV tasks (e.g. multi-object tracking). Although these tasks involve multiple image frames, the associated temporal information is not leveraged by standard LIF-based SNN models. Such tasks are also more susceptible to spatial precision loss due to spike-based activations compared to static CV tasks. It is thus important to develop novel SNNs (possibly with feedback connections) and associated training techniques that can accurately learn task-specific temporal dynamics using a small number of time steps to enable such complex tasks.

10.1.2 SNNs with efficient backbones and ViTs

Most existing SNN models are based on older CNN backbones such as VGG [155] and ResNet [156]. Based on my experience and the existing literature [85], they lead to a large drop in accuracy when implemented with more compact edge-oriented backbones, including MobileNet [220], EfficientNet [288], and RegNet [289], at low time steps. The last two years have also witnessed the dominance of vision transformers (ViTs) [282], which have outperformed SOTA CNNs in complex CV tasks. This implies that there are large gaps in both the accuracy and the efficiency advantages of the traditional CV and SNN models. These gaps can be mitigated by investigating the key sources of precision loss in SNNs based on these modern backbones and by jointly optimizing the neural architecture and SGL-based backpropagation training.

10.1.3 SNNs for dynamic vision sensing (DVS)

Recently, DVS/neuromorphic tasks such as event-stream classification and optical flow estimation have gained significant traction because neuromorphic cameras have high temporal resolution (µs range) and consume less power due to their asynchronous and sparse spike-based event generation. For these tasks, SNNs have also attracted attention due to improved training algorithms [290-292]. However, realizing a fully event-driven pipeline that accumulates the temporal spikes emitted by the DVS cameras within the SNN is highly challenging from an accuracy point of view, yet it is desirable for significant energy-efficiency and throughput gains. Implementing such SNN pipelines for these complex tasks requires extensive algorithm-hardware co-design.

10.2 Future work in in-sensor computing

Though P2M has shown immense potential using manufacturing-friendly hardware technologies, we are still a long way from developing smart yet cost-effective sensors. This is partly due to current manufacturing challenges, which are mostly beyond the scope of this dissertation. However, with the recent manufacturing advances in sensors and 3D interconnects, this dissertation research and the future research illustrated below can prepare us for a prospect where SOTA DL models can be (almost) completely embedded inside commercial sensor chips.

10.2.1 Distributed Computing and Sensor Fusion

Embedding only a few layers in the sensor chip can result in only marginal energy and cost savings (from a manufacturing point of view), especially for a deep network (e.g. ResNet152), even with significant strides in ISP removal, aggressive pooling, and hardware-software co-design. This concern can be largely mitigated with a distributed computing paradigm where the DL (e.g. CNN or ViT) computation is split between the pixel chip that processes the first few layers as illustrated in P2M, a logic chip heterogeneously (e.g.
monolithic 3D or µTSV, which consumes significantly lower energy compared to standard camera interfaces) integrated with the sensor that processes some intermediate layers (the number of layers being subject to peak memory and compute constraints), and off-chip hardware, such as an FPGA, that processes the remaining layers. We can process many more layers in the logic chip than in the pixel chip, which can significantly reduce the spatial dimensions of the network, thereby reducing the data bandwidth and alleviating the compute burden of the off-chip hardware. A SuperNet-based neural architecture search (NAS) framework can be developed that yields the optimal network configurations (number of layers, stride, channels, bit-precision, etc.) and the network splitting points, subject to the compute and memory constraints of each hardware platform. This also motivates the development of such distributed-computing-enabled prototype hardware with the help of advanced sensor PDKs and appropriate foundry support. In this context, it is also worth exploring sensor-fusion applications (e.g. autonomous driving and satellite imaging), where the first few or several layers of each feature extraction network can be embedded inside each heterogeneously integrated sensor chip. This will broaden the energy-efficiency promise of in-sensor computing.

10.2.2 Frame Skipping

Skipping frames, based on their relative importance to the accuracy of the downstream task, may be the most effective technique to improve the energy efficiency of in-sensor computing. It can reduce the sensor and the overall system energy consumption by (approximately) the rate of frame skipping, assuming the associated overhead is manageable. For complex CV tasks such as multi-object tracking on the large-scale BDD100K dataset, alternate frames can be skipped naively without any impact on the final accuracy. This implies that there are significant redundancies in the scenes captured by the sensors. This motivates the development of training frameworks that can achieve significant frame skipping (~10×) by penalizing the processing of each frame depending on the motion offset of the detected objects and the training loss of the downstream task in the previous frame.

10.2.3 Low-level CV tasks

Although this dissertation research has shown that the ISP can be largely removed for object recognition/detection tasks, there exist many important low-level CV tasks (e.g. image super-resolution, image inpainting) that require an ISP. To enable these tasks, we can integrate the ISP and CV models to yield on-sensor inference with a limited footprint. In particular, the NAS framework illustrated above can be extended to obtain the optimal architectural blocks specific to the ISP and CV models, subject to the downstream task and hardware-specific energy/latency constraints.

10.3 Conclusions

Most of this dissertation and the short-term research plans are focused on CV applications. However, humans leverage multiple modalities in their lives to make intelligent decisions. Moreover, most of the DL models that have made headlines in the recent past, such as DALL-E [293] for text-to-image generation, are multi-modal. Hence, a tangible long-term research goal can be to deploy these multi-modal models individually inside each sensor that captures a particular modality. We all envision an interconnected future where sensing and intelligence can aid in every aspect of our lives. This includes smart sensors being implanted inside our bodies, embedded in infrastructure including roads and houses, swaying in grain fields, and so on.
This would require significant advancements in sensor (microphone, camera, etc.) design and architecture, interconnects, advanced manufacturing, and multi-modal DL algorithms. This dissertation can be a stepping stone towards such multidisciplinary research and can help realize this dream!

Bibliography

[1] M. Horowitz, "Computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10-14.
[2] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," Integration, vol. 58, pp. 74-81, 2017.
[3] M. Ali, A. Jaiswal, S. Kodge, A. Agrawal, I. Chakraborty, and K. Roy, "IMAC: In-memory multi-bit multiplication and accumulation in 6T SRAM array," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 8, pp. 2521-2531, 2020.
[4] S. Deng and S. Gu, "Optimal conversion of conventional artificial neural networks to spiking neural networks," in ICLR, 2021.
[5] N. Rathi and K. Roy, "DIET-SNN: Direct input encoding with leakage and threshold optimization in deep spiking neural networks," arXiv preprint arXiv:2008.03658, 2020.
[6] W. Ponghiran and K. Roy, "Hybrid analog-spiking long short-term memory for energy efficient computing on edge devices," in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2021, pp. 581-586.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[8] K. He et al., "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580-587.
[10] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.
[11] H. Tao, W. Li, X. Qin, and D. Jia, "Image semantic segmentation based on convolutional neural network and conditional random field," in 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI). IEEE, 2018, pp. 568-572.
[12] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[13] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, "Deep learning with COTS HPC systems," in International conference on machine learning, 2013, pp. 1337-1345.
[14] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818-2826.
[15] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243-254, 2016.
[16] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2016.
[17] Y.-H. Chen, T.-J. Yang, J. Emer, and V.
Sze, “Eyeriss v2: A flexible accel- erator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019. [18] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–133, 1943. [19] F. Rosenblatt, “The perceptron: a probabilistic model for information stor- age and organization in the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958. [20] Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski, “A theoretical frame- work for back-propagation,” in Proceedings of the 1988 connectionist models summer school, vol.1. CMU,Pittsburgh, Pa: MorganKaufmann, 1988, pp. 21–28. 183 [21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015. [22] N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,andR.Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, p. 1929–1958, jan 2014. [23] N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger, “Understanding batch normalization,” in Advances in Neural Information Processing Sys- tems, 2018, pp. 7694–7705. [24] R. Sutton, “Two problems with back propagation and other steepest de- scent learning procedures for networks,” in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, 1986, pp. 823–832. [25] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016. [26] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999. [27] A.Burkitt,“Areviewoftheintegrate-and-fireneuronmodel: I.homogeneous synaptic input,” Biological cybernetics, vol. 95, pp. 1–19, 08 2006. [28] C. Lee, S. S. Sarwar, P. Panda, G. Srinivasan, and K. Roy, “Enabling spike- basedbackpropagationfortrainingdeepneuralnetworkarchitectures,”Fron- tiers in Neuroscience, vol. 14, p. 119, 2020. [29] S. S. Chowdhury, C. Lee, and K. Roy, “Towards understanding the effect of leak in spiking neural networks,” arXiv preprint arXiv:2006.08761, 2020. [30] N. Rathi, G. Srinivasan, P. Panda, and K. Roy, “Enabling deep spiking neural networks with hybrid conversion and spike timing dependent back- propagation,” arXiv preprint arXiv:2005.01807, 2020. [31] G. Indiveri and T. Horiuchi, “Frontiers in neuromorphic engineering,” Fron- tiers in Neuroscience, vol. 5, p. 118, 2011. [32] M.PfeifferandT.Pfeil, “Deeplearningwithspikingneurons: Opportunities and challenges,” Frontiers in Neuroscience, vol. 12, p. 774, 2018. [33] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural net- worksforenergy-efficientobjectrecognition,” International Journal of Com- puter Vision, vol. 113, pp. 54–66, 05 2015. 184 [34] P. U. Diehl et al., “Conversion of artificial recurrent neural networks to spik- ing neural networks for low-power neuromorphic hardware,” in 2016 IEEE International Conference on Rebooting Computing (ICRC). IEEE, 2016, pp. 1–8. [35] I. M. Comsa et al., “Temporal coding in spiking neural networks with alpha synaptic function,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, no. 1, 2020, pp. 8529–8533. 
[36] S.R.KheradpishehandT.Masquelier,“Temporalbackpropagationforspik- ing neural networks with one spike per neuron,” International Journal of Neural Systems, vol. 30, no. 06, May 2020. [37] Y. Wu et al., “Direct training for spiking neural networks: Faster, larger, better,” in Proceedings of the AAAI Conference on Artificial Intelligence , vol. 33, 2019, pp. 1311–1318. [38] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training deep spiking neural net- works using backpropagation,” Frontiers in Neuroscience, vol. 10, p. 508, 2016. [39] Y. Kim and P. Panda, “Revisiting batch normalization for training low-latency deep spiking neural networks from scratch,” arXiv preprint arXiv:2010.01729, 2020. [40] A. Sengupta et al., “Going deeper in spiking neural networks: VGG and residual architectures,” Frontiers in Neuroscience, vol. 13, p. 95, 2019. [41] P. Panda, S. A. Aketi, and K. Roy, “Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax,andhybridization,” Frontiers in Neuroscience,vol.14,p.653,2020. [42] G. Datta and P. A. Beerel, “Can deep neural networks be converted to ultra low-latency spiking neural networks?” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2022, pp. 718–723. [43] G. Datta, S. Kundu, A. R. Jaiswal, and P. A. Beerel, “ACE-SNN: Algorithm-hardware co-design of energy-efficient & low-latency deep spiking neural networks for 3D image recognition,” Frontiers in Neuroscience, vol. 16, 2022. [Online]. Available: https://www.frontiersin.org/articles/10. 3389/fnins.2022.815258 [44] G.Datta,S.Kundu,andP.A.Beerel,“Trainingenergy-efficientdeepspiking neural networks with single-spike hybrid input encoding,” in 2021 Interna- tional Joint Conference on Neural Networks (IJCNN), vol. 1, no. 1, 2021, pp. 1–8. 185 [45] G. Datta, Z. Liu, and P. A. Beerel, “Hoyer regularizer is all you need for ultralow-latencyspikingneuralnetworks,”arXivpreprintarXiv:2212.10170, 2022. [46] G. Datta, H. Deng, R. Aviles, and P. A. Beerel, “Towards energy-efficient, low-latencyandaccuratespikingLSTMs,” arXiv preprint arXiv:2110.05929, 2022. [47] “On-device AI with Developer-Ready Software Stacks,” https://developer. qualcomm.com/blog/device-ai-developer-ready-software-stacks, 2021, ac- cessed: 03-31-2022. [48] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in International Conference on Learning Representations, 2020. [Online]. Available: https://arxiv.org/pdf/1908.09791.pdf [49] A. Y.-C. Chiou and C.-C. Hsieh, “An ULV PWM CMOS imager with adaptive-multiple-sampling linear response, HDR imaging, and energy har- vesting,” IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 298–306, 2019. [50] Z. Chen, H. Zhu, E. Ren, Z. Liu, K. Jia, L. Luo, X. Zhang, Q. Wei, F. Qiao, X. Liu, and H. Yang, “Processing near sensor architecture in mixed-signal domain with CMOS image sensor of convolutional-kernel-readout method,” IEEE Transactions on Circuits and Systems I: Regular Papers,vol.67,no.2, pp. 389–400, 2020. [51] L. Mennel, J. K. Symonowicz, S. Wachter, D. K. Polyushkin, A. J. Molina- Mendoza,andT.Mueller,“Ultrafastmachinevisionwith2Dmaterialneural network image sensors,” Nature, vol. 579, pp. 62–66, 2020. [52] L. Bose, P. Dudek, J. Chen, S. J. Carey, and W. W. 
Mayol-Cuevas, “Fully embedding fast convolutional networks on pixel processor arrays,” in Com- puter Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, Au- gust 23-28, 2020, Proceedings, Part XXIX, vol. 12374. Springer, 2020, pp. 488–503. [53] G. Datta, Z. Yin, A. P. Jacob, A. R. Jaiswal, and P. A. Beerel, “To- wards energy-efficient hyperspectral image processing inside camera pixels,” inComputerVision–ECCV2022Workshops,L.Karlinsky,T.Michaeli,and K. Nishino, Eds. Cham: Springer Nature Switzerland, 2023, pp. 303–316. [54] G. Datta, S. Kundu, Z. Yin, R. T. Lakkireddy, J. Mathai, A. P. Jacob, P. A. Beerel,andA.R.Jaiswal,“P2M:Aprocessing-in-pixel-in-memoryparadigm 186 for resource-constrained TinyML applications,” Scientific Reports , vol. 12, no. 14396, 2022. [55] G. Datta, S. Kundu, Z. Yin, J. Mathai, Z. Liu, Z. Wang, M. Tian, S. Lu, R. T. Lakkireddy, A. Schmidt, W. Abd-Almageed, A. Jacob, A. Jaiswal, and P. Beerel, “P2M-DeTrack: Processing-in-Pixel-in-Memory for energy-efficient and real-time multi-object detection and tracking,” in 2022 IFIP/IEEE 30th International Conference on Very Large Scale Inte- gration (VLSI-SoC), 2022, pp. 1–6. [56] G.Datta, Z.Liu, M.A.-A.Kaiser, S.Kundu, J.Mathai, Z.Yin, A.P.Jacob, A. R. Jaiswal, and P. A. Beerel, “In-sensor & neuromorphic computing are all you need for energy efficient computer vision,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. [57] G. Datta, Z. Liu, Z. Yin, L. Sun, A. R. Jaiswal, and P. A. Beerel, “Enabling ISPless low-power computer vision,” in Proceedings of the IEEE/CVF Win- ter Conference on Applications of Computer Vision (WACV),January2023, pp. 2430–2439. [58] M. A.-A. Kaiser, G. Datta, S. Sarkar, S. Kundu, Z. Yin, M. Garg, A. P. Jacob, P. A. Beerel, and A. R. Jaiswal, “Technology-circuit-algorithm tri-design for processing-in-pixel-in-memory (P2M),” in Proceedings of the GreatLakesSymposiumonVLSI2023. ACM,jun2023.[Online].Available: https://doi.org/10.1145%2F3583781.3590235 [59] M. A.-A. Kaiser, G. Datta, Z. Wang, A. P. Jacob, P. A. Beerel, and A. R. Jaiswal, “Neuromorphic-P2M: processing-in-pixel-in-memory paradigm for neuromorphic image sensors,” Frontiers in Neuroinformatics, vol. 17, 2023. [Online]. Available: https://www.frontiersin.org/articles/10. 3389/fninf.2023.1144301 [60] F. Chen, G. Datta, S. Kundu, and P. A. Beerel, “Self-attentive pooling for efficientdeeplearning,”in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV),January2023,pp.3974–3983. [61] S. Zhou, X. LI, Y. Chen, S. T. Chandrasekaran, and A. Sanyal, “Temporal- coded deep spiking neural network with easy training and robust perfor- mance,” arXiv preprint arXiv:1909.10837, 2020. [62] S. Park, S. Kim, B. Na, and S. Yoon, “T2FSNN: Deep spiking neural networks with time-to-first-spike coding,” arXiv preprint arXiv:2003.11741, 2020. 187 [63] M. Zhang, J. Wang, B. Amornpaisannon, Z. Zhang, V. Miriyala, A. Bela- treche, H. Qu, J. Wu, Y. Chua, T. E. Carlson, and H. Li, “Rectified linear postsynaptic potential function for backpropagation in deep spiking neural networks,” 2020. [64] J. Kim, H. Kim, S. Huh, J. Lee, and K. Choi, “Deep neural networks with weighted spikes,” Neurocomputing, vol. 311, pp. 373–386, 2018. [65] S. Park, S. Kim, H. Choe, and S. Yoon, “Fast and efficient information transmission with burst spikes in deep spiking neural networks,” in 2019 56th ACM/IEEE Design Automation Conference (DAC), vol. 1, no. 1, 2019, pp. 1–6. [66] D. A. 
Abstract
The increasing need for on-chip edge intelligence on energy-constrained platforms is challenged by the high computation and memory requirements of deep neural networks (DNNs). This challenge motivates the dissertation research presented here, which focuses on two key thrusts toward energy- and latency-efficient edge intelligence.

The first thrust focuses on neuromorphic spiking neural networks (SNNs), for which we propose novel DNN-to-SNN conversion and SNN fine-tuning algorithms that improve latency and energy efficiency on several static computer vision (CV) tasks. These algorithms involve single-spike hybrid input encoding, shifting and scaling of the threshold and post-activation values, quantization-aware spike time dependent backpropagation (STDB), and Hoyer regularized training with Hoyer spike layers. Beyond static tasks, we also propose SNN training algorithms and hardware implementations, involving novel activation functions with optimal bias shifts and a pipelined parallel processing scheme, that leverage both the temporal and sparse dynamics of SNNs to reduce the inference latency and energy of large-scale streaming/sequential tasks while achieving state-of-the-art (SOTA) accuracy.

The second thrust focuses on hardware-algorithm co-design of in-sensor computing, which brings SOTA DNNs, including these SNNs, closer to the sensors, further reducing their energy consumption and enabling real-time processing. Here, we propose a novel processing-in-pixel-in-memory (P2M) paradigm for resource-constrained sensor intelligence applications that embeds the computational aspects of all modern CNN layers inside CMOS image sensors and compresses the input activation maps via a reduced number of channels and aggressive strides, thereby mitigating the associated bandwidth, latency, and energy bottlenecks. To enable the aggressive compression required by P2M, we propose a novel non-local self-attentive pooling method that efficiently aggregates dependencies between non-local activation patches during down-sampling and can be used as a drop-in replacement for standard pooling layers. Additionally, P2M requires bypassing the image signal processing (ISP) pipeline, which degrades the test accuracy when inference is performed on raw images. To mitigate this concern, we propose an ISP reversal pipeline that converts the RGB images of any dataset to their raw counterparts and enables model training on raw images, thereby improving the accuracy of P2M-implemented systems. Coupled with our optimized SNNs, our P2M paradigm can reduce the bandwidth and the total system energy consumption each by an order of magnitude compared to SOTA vision pipelines.
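To make the Hoyer-regularized training mentioned in the first thrust more concrete, the following is a minimal, illustrative sketch (not the dissertation's actual implementation) of the Hoyer sparsity measure from the literature this work builds on: the ratio of the squared L1 norm to the squared L2 norm of a tensor, added as an auxiliary penalty to the task loss. The function name `hoyer_regularizer`, the weighting factor `beta`, and the stand-in tensors are hypothetical choices for illustration only.

```python
# Minimal sketch, assuming PyTorch and illustrative names; not the
# dissertation's exact code. The Hoyer measure
#   H(x) = (sum_i |x_i|)^2 / (sum_i x_i^2)
# is smaller for sparser tensors, so adding it to the task loss
# encourages sparse (pre-)activations and hence sparse spiking activity.
import torch

def hoyer_regularizer(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Ratio of the squared L1 norm to the squared L2 norm of a tensor."""
    l1 = x.abs().sum()
    l2_sq = (x * x).sum()
    return (l1 * l1) / (l2_sq + eps)

# Hypothetical usage: penalize a layer's pre-activation (membrane-potential-like)
# tensor during training; beta weights the regularizer against the task loss.
pre_activation = torch.randn(32, 128, requires_grad=True)  # stand-in tensor
task_loss = pre_activation.relu().mean()                   # stand-in task loss
beta = 1e-4
total_loss = task_loss + beta * hoyer_regularizer(pre_activation)
total_loss.backward()
```

In the dissertation's setting, a regularizer of this general form is paired with Hoyer spike layers whose firing thresholds are derived from a Hoyer-type statistic of the membrane potentials; the exact formulation there may differ from this sketch.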
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Energy efficient design and provisioning of hardware resources in modern computing systems
Compiler and runtime support for hybrid arithmetic and logic processing of neural networks
Circuit design with nano electronic devices for biomimetic neuromorphic systems
SLA-based, energy-efficient resource management in cloud computing systems
Dendritic computation and plasticity in neuromorphic circuits
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Design of modular multiplication
Semiconductor devices for vacuum electronics, electrochemical reactions, and ultra-low power in-sensor computing
Memristive device and architecture for analog computing with high precision and programmability
Improving efficiency to advance resilient computing
Energy proportional computing for multi-core and many-core servers
Advanced cell design and reconfigurable circuits for single flux quantum technology
Algorithms and frameworks for generating neural network models addressing energy-efficiency, robustness, and privacy
Memristor for parallel and analog data processing in the era of big data
Dynamic neuronal encoding in neuromorphic circuits
High performance and ultra energy efficient computing using superconductor electronics
Towards green communications: energy efficient solutions for the next generation cellular mobile communication systems
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
Astrocyte-mediated plasticity and repair in CMOS neuromorphic circuits
Asset Metadata
Creator
Datta, Gourav
(author)
Core Title
Towards efficient edge intelligence with in-sensor and neuromorphic computing: algorithm-hardware co-design
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2023-12
Publication Date
09/06/2023
Defense Date
04/14/2023
Publisher
Los Angeles, California (original), University of Southern California (original), University of Southern California. Libraries (digital)
Tag
energy-constrained, hardware-algorithm co-design, in-sensor computing, ISP, neuromorphic, OAI-PMH Harvest
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Beerel, Peter A. (committee chair), Nakano, Aiichiro (committee member), Pedram, Massoud (committee member), Yang, Joshua (committee member)
Creator Email
gdatta@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC113303426
Unique identifier
UC113303426
Identifier
etd-DattaGoura-12324.pdf (filename)
Legacy Identifier
etd-DattaGoura-12324
Document Type
Thesis
Format
theses (aat)
Rights
Datta, Gourav
Internet Media Type
application/pdf
Type
texts
Source
20230907-usctheses-batch-1092 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu