IMPROVING EFFICIENCY TO ADVANCE RESILIENT COMPUTING

by Ji Li

A Dissertation Presented to the Committee: Dr. Jeffrey Draper (chair), Dr. Sandeep Gupta, Dr. Aiichiro Nakano, and Dr. Shahin Nazarian (co-chair)

In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2018
Copyright 2018 Ji Li

To my dearest parents, Longhua Li and Yingjie Miao

Acknowledgments

First and foremost, I would like to give my deepest gratitude to my Ph.D. advisors, Prof. Jeffrey Draper and Prof. Shahin Nazarian, for being constant sources of guidance, support, expertise, and inspiration during the past four years. They put their trust in me and provided me with the invaluable opportunity to do research in a Ph.D. program when I was a master's student with no publications to demonstrate my research ability. Over the difficult times of research, Prof. Draper has always been supportive and has worked actively to provide me with the academic guidance to help me solve problems thoroughly. He showed his support by attending every conference presentation I gave, no matter how far away the city was and how busy his schedule was at the time. My Ph.D. research with Prof. Draper has been a stimulating, rewarding, and pleasant journey under his guidance. Prof. Nazarian introduced me to the world of research when I was a master's student in his courses. He has taught me, both consciously and unconsciously, how good research is done and guided me through every step of doing meaningful research. Throughout my Ph.D. years, he kept reminding me to explore new ideas and to always push ideas to their limits. His unbounded passion for research, ingenious ideas for new topics, and dedication to making technical contributions will always be an inspiration to me.

Next, I would like to thank the other committee members in my qualifying exam and dissertation defense, including Prof. Sandeep K. Gupta, Prof. Aiichiro Nakano, Prof. Paul Bogdan, and Prof. Xuehai Qian. Thanks a million to Prof. Sandeep K. Gupta for his kindness, help, and valuable suggestions for my defense, and for his great support of both my M.S. and Ph.D. studies at USC. Thanks to Prof. Aiichiro Nakano for his excellent teaching in the Scientific Computing and Visualization course and his strong support in both my qualifying exam and defense. Also thanks to Prof. Paul Bogdan, who has always been kind to me, and to Prof. Xuehai Qian for his invaluable input on the stochastic computing based deep learning system project.

During my graduate journey at USC, I was honored to have the privilege of working with top researchers inside and outside USC, including Prof. Yanzhi Wang, Prof. Xue Lin, Prof. Massoud Pedram, Prof. Paul Bogdan, Prof. Bo Yuan, Prof. Xuehai Qian, Prof. Peter A. Beerel, Prof. Naehyuck Chang, Prof. Qinru Qiu, Prof. Qi Zhu, Prof. Jintong Hu, Prof. Weiwei Zheng, and Prof. Yongpan Liu. Without them I could not have accomplished what I have done. Special thanks go to Prof. Yanzhi Wang for the four-year-long collaboration and its outcome of 16 papers. I was very fortunate to work with Yanzhi in my first semester, when he was a legendary senior student with more than 100 publications. His dedication and attentiveness in writing papers, his broad research insights, and his passion for publishing top-tier papers have significantly shaped my working style. I will always appreciate Prof. Wang's help in every possible way, from reading papers to improving my English, from developing algorithms to presenting ideas, and from being a productive researcher to being a good collaborator.
My sincere thanks also go out to my collaborators inside and outside of USC. They include Tiansong Cui, Qing Xie, Zihao Yuan, Huimei Cheng, Yang Zhang, Raghav Mehta, Alireza Shaefei, Kun Yue, Mengshu Sun, Yuyang Huang, Shrey Bagga, Nishant Mathur, and Yiwei Zhao at USC; Mingxi Cheng at Duke University; Caiwen Ding, Zhe Li, Ao Ren, Hongjia Li, Ruizhe Cai, and Ning Liu at Syracuse University; Feiyang Kang at Zhejiang University; and Hanchen Yang at Beijing University of Posts and Telecommunications. Special thanks to Tiansong Cui for his great support and help throughout my Ph.D. years, and to Qing Xie for showing me how to be an excellent researcher, including how to draw beautiful graphs using Excel. Without Tiansong and Qing, I could not have made it through the Ph.D. program. Also special thanks to Caiwen Ding, who supported me when I was in Syracuse.

My sincere thanks also go to my colleagues and friends at USC: Lihang Zhao, Yuankun Xue, Fangzhou Wang, Xuan Zuo, Luhao Wang, Haozhe Xu, Di Zhu, Siyu Yue, Qingzhou Liu, Yihang Liu, Liang Chen, Fanqi Wu, Yuqiang Ma, Yu Cao, Valeriu Balaban, Gaurav Gupta, Tian Zhu, Zihao Lu, Haipeng Zha, Jizhe Zhang, Wentao Zhang, Jianwei Zhang, Yihuan Shao, Han Zou, Weining Xin, Da Cheng, Bo Zhang, Yao Xiao, Shuang Chen, Mohammad Javad Dousti, Ting-Ru Lin, and Woojoo Lee.

Finally, I would like to express my sincerest appreciation to my parents and grandparents for their unconditional support and love. Without their understanding and encouragement, this dissertation would not have been possible.

Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract
1 Introduction
1.1 Thesis Contribution in Soft Error Rate (SER) Evaluation
1.1.1 Accelerated SER Estimation for Combinational Circuits
1.1.2 Schematic and Layout Co-Simulation for Multiple Cell Upset (MCU) Modeling
1.1.3 Fast and Comprehensive SER Evaluation Framework
1.2 Thesis Contribution in Emerging High Performance Resilient Systems
1.2.1 Deep Reinforcement Learning (DRL)-Based Power Management for Data Centers using Deep Neural Networks
1.2.2 Stochastic Computing (SC)-Based Deep Convolutional Neural Network (DCNN) Block Design and Optimization
1.2.3 Highly-Scalable SC-Based DCNN Design and Optimization
1.2.4 SC-Based DCNN Hardware-Driven Nonlinear Activation Design
1.2.5 SC-Based DCNN Softmax Regression Design
1.2.6 Normalization and Dropout for SC-Based DCNNs
1.3 Thesis Organization
2 Background of Soft Error Rate (SER) Evaluation, Cloud Computing, Stochastic Computing (SC), and Deep Convolutional Neural Network (DCNN)
2.1 SER Evaluation Background
2.1.1 Radiation-Induced Soft Error Basics
2.1.2 Review of the State-of-the-Art
2.2 Background of Cloud Computing, DCNN and SC
2.2.1 Resource Allocation in Cloud Computing
2.2.2 DCNN Architecture Overview
2.2.3 Stochastic Computing
2.2.4 Network Accuracy vs. Hardware Accuracy
2.2.5 Review of the State-of-the-Art
2.3 Summary
2.3.1 Limitation of the Previous Works on SER Evaluation
2.3.2 Limitation of Prior Approaches in Cloud Resource Allocation and Open Problems in SC-Based DCNNs
3 Accelerated Soft Error Rate (SER) Estimation for Combinational Circuits
3.1 Introduction
3.2 Overall Flow
3.3 Characterization
3.3.1 Parasitic Transient Current Pulse Model
3.3.2 Generation-Lookup Tables (LUTs) and Propagation-Lookup Tables (LUTs)
3.3.3 Flip-Flop Characterization
3.4 Propagation Methodology
3.4.1 SER Estimation Method
3.4.2 Top-Down Memoization Algorithm to Accelerate Propagation
3.5 Experimental Results
3.6 Conclusion
4 Schematic and Layout Co-Simulation for Multiple Cell Upset (MCU) Modeling
4.1 Introduction
4.2 Improved Overall Flow
4.3 Combinational Soft Error Rate (SER) Estimation
4.3.1 Characterization Phase of Combinational SER
4.3.2 Computation Phase of Combinational SER
4.4 Sequential SER Estimation
4.4.1 Characterization Phase of Sequential SER
4.4.2 Computation Phase of Sequential SER
4.5 Experimental Results
4.6 Conclusion
5 Fast and Comprehensive Soft Error Rate (SER) Evaluation Framework
5.1 Introduction
5.2 Flowchart of the Proposed Framework
5.3 Combinational Logic Characterization
5.3.1 Generation and Propagation
5.3.2 Latching Window Characterization
5.4 Sequential Element Characterization
5.4.1 Non-hardened Flip-Flop (FF) Characterization
5.4.2 Hardened-FF Characterization
5.5 Combinational SER Computation
5.5.1 Combinational SER Computation Method
5.6 Sequential SER Computation using Time Frame Expansion
5.7 Experimental Results
5.8 Conclusion
6 DRL-Cloud: Deep Reinforcement Learning-Based Resource Provisioning and Task Scheduling for Cloud Service Providers
6.1 Introduction
6.2 System Model for DRL-Based Energy Cost Minimization in Cloud Computing
6.2.1 User Workload Model
6.2.2 Cloud Platform Model
6.2.3 Energy Consumption Model
6.2.4 Realistic Price Model
6.2.5 Problem Formulation
6.3 DRL-Cloud: DRL-Based Cloud Resource Provisioning and Task Scheduling System
6.3.1 Task Decorrelation
6.3.2 Two-Stage RP-TS Processor Based on Deep Q-Learning
6.3.3 Semi-Markov Decision Process (SMDP) Formulation
6.4 Deep Q-learning Algorithm for DRL-Cloud With Experience Replay
6.4.1 Training Details for Deep Q-Networks
6.4.2 System Control Algorithm and the Two-Stage RP-TS Processor Algorithm with Experience Replay
6.5 Experimental Results
6.5.1 Experiment Setup
6.5.2 Experiments on Small-Scale Workloads and Platforms
6.5.3 Experiments on Large-Scale Workloads and Platforms
6.5.4 Long-Term Experiments and Convergence
6.6 Conclusion
7 Stochastic Computing (SC) Based Deep Convolutional Neural Network (DCNN) Block Design and Optimization
7.1 Introduction
7.2 Hardware-Based DCNN Design and Optimization using SC
7.2.1 Approximate Parallel Counter (APC)-Based Neuron
7.2.2 Multiplexer (MUX)-Based Neuron
7.2.3 Pooling Operation
7.2.4 Structure Optimization for the Entire DCNN Architecture
7.3 Experimental Results
7.4 Conclusion
8 Highly-Scalable Stochastic Computing Based Deep Convolutional Neural Network (SC-DCNN) Design and Optimization
8.1 Introduction
8.2 Design and Optimization for Function Blocks and Feature Extraction Blocks in SC-DCNN
8.2.1 Inner Product/Convolution Block Design
8.2.2 Pooling Block Designs
8.2.3 Activation Function Block Designs
8.2.4 Design and Optimization for Feature Extraction Blocks
8.3 Weight Storage Scheme and Optimization
8.3.1 Efficient Filter-Aware SRAM Sharing Scheme
8.3.2 Weight Storage Method
8.3.3 Layer-Wise Weight Storage Optimization
8.4 Overall SC-DCNN Optimizations and Results
8.4.1 Optimization Results on Feature Extraction Blocks
8.4.2 Overall Optimizations and Results on SC-DCNNs
8.5 Conclusion
9 Hardware-Driven Nonlinear Activation for Stochastic Computing Based Deep Convolutional Neural Networks
9.1 Introduction
9.2 Related Work
9.2.1 Activation Function Studies
9.2.2 Hardware-Based DCNN Studies
9.3 Overview of Hardware-Based DCNN
9.3.1 General Architecture of DCNNs
9.3.2 Hardware-Based Neuron Cell
9.4 Proposed Hardware-Driven Nonlinear Activation for DCNNs
9.4.1 Stochastic Computing for Neuron Design
9.4.2 Proposed Neuron Design and Nonlinear Activation
9.5 Experimental Results
9.5.1 Performance Evaluation and Comparison among the Proposed Neuron Designs
9.5.2 Comparison with Binary ASIC Neurons
9.5.3 DCNN Performance Evaluation and Comparison
9.6 Conclusion
10 Softmax Regression Design for Stochastic Computing Based Deep Convolutional Neural Networks
10.1 Introduction
10.2 DCNN Architecture and Softmax Regression Function
10.2.1 Deep Convolutional Neural Network
10.2.2 Stochastic Computing (SC)
10.2.3 Softmax Regression (SR) Function
10.3 SC-Softmax Regression Design
10.3.1 Overall Structure
10.3.2 SC-exponential
10.3.3 SC-normalization
10.4 Experimental Results
10.4.1 Performance analysis for SC-SR
10.4.2 Comparison with Binary ASIC SR
10.4.3 DCNN Accuracy Evaluation
10.5 Conclusion
11 Normalization and Dropout for Stochastic Computing-Based Deep Convolutional Neural Networks
11.1 Introduction
11.2 Proposed Stochastic Computing-Based Feature Extraction Block
11.2.1 APC-Based Inner Product
11.2.2 Pooling Design
11.2.3 Activation Design
11.3 Proposed Normalization and Dropout for SC-based DCNNs
11.3.1 Proposed Stochastic Normalization Design
11.3.2 Integrating Dropout into SC-Based DCNN
11.4 Experimental Results
11.4.1 Performance Evaluation of the Proposed FEB Designs
11.4.2 Performance Evaluation of the Proposed SC-LRN Designs
11.4.3 Impact of SC-LRN and Dropout on the Overall DCNN Performance
11.5 Conclusion
12 Conclusion
Reference List

List of Figures

2.1 The general DCNN architecture.
2.2 Illustration of the convolution process.
2.3 Three types of basic operations (function blocks) in DCNN. (a) Inner Product, (b) pooling, and (c) activation.
2.4 Stochastic multiplication. (a) Unipolar multiplication and (b) bipolar multiplication.
2.5 Stochastic addition. (a) OR gate, (b) MUX, (c) APC, and (d) two-line representation-based adder.
2.6 Stochastic hyperbolic tangent.
3.1 Current source model of a particle strike at a circuit node.
3.2 Overall flow of our SER estimation framework.
3.3 Parasitic transient current pulse model. (a) double exponential current shape (b) probability of charge deposition.
3.4 Simulation setup for establishing G-LUTs and P-LUTs considering various factors.
3.5 SER analysis for one node.
4.1 System diagram of the SER estimation framework.
4.2 Overall flow of the proposed SER estimation framework.
4.3 Simulation setup for the characterization phase of combinational SER.
4.4 Two independent current sources attached to a pair of cross-coupled storage nodes N1 and N2 in a DICE FF.
4.5 An example error map for radiation-hardened structures.
4.6 Modeling MCU effects from layout, where N1 and N2 are cross-coupled storage nodes.
5.1 System diagram of soft error generation and propagation in a circuit.
5.2 The flowchart of our SER estimation framework.
5.3 Simulation setup for characterizing a combinational standard cell considering various factors.
5.4 Latching window masking mechanism. (a) an example where pulses A and C are masked (b) simulation setup for characterizing latching windows in FFs.
5.5 Hardened-FF characterization. (a) an example simulation setup for a FERST latch (b) a sample error map.
5.6 Time frame expansion for sequential circuits.
6.1 System model of the cloud platform and structure for DRL-Cloud. The system model is defined in Section 6.2. The structure and algorithm of the proposed DRL-Cloud are described in Section 6.3 and Section 6.4.
6.2 The structure of the DRL-Cloud framework: the details of task decorrelation are described in Section 6.3.1, and the details of the two-stage RP-TS processor are described in Section 6.3 and Section 6.4.
6.3 Runtime and energy cost comparisons with baselines for small scale workload and platform configuration. Energy cost is normalized with regard to the energy cost of the Greedy method for 100 servers and 5,000 requests.
6.4 (a) Convergence of DRL-Cloud. (b) Energy cost comparison with RR in long-run (29 days) on large scale workloads and platform configuration.
7.1 Various hardware neuron designs. (a) APC-based neuron, and (b) MUX-based neuron.
7.2 Using the fixed bit stream length 1024, the number of inputs versus (a) accuracy, (b) area, (c) power and (d) energy for an APC-based neuron.
7.3 The length of bit stream versus accuracy under different input numbers for an APC-based neuron.
7.4 Using the fixed bit stream length 1024, the number of inputs versus (a) accuracy, (b) area, (c) power and (d) energy for a MUX-based neuron.
7.5 The length of bit stream versus accuracy under different input numbers for a MUX-based neuron.
7.6 A 4-to-1 pooling example.
7.7 The impact of errors in different layers on the overall DCNN test error.
7.8 Structure optimization method for the entire DCNN.
8.1 16-bit Approximate Parallel Counter.
8.2 The Proposed Hardware-Oriented Max Pooling.
8.3 Output comparison of Stanh vs tanh.
8.4 The structure of a feature extraction block.
8.5 Structure of optimized Stanh for MUX-Max-Stanh.
8.6 Filter-Aware SRAM Sharing Scheme.
8.7 The impact of inaccuracies at each layer on the overall SC-DCNN network accuracy.
8.8 The impact of precision of weights at different layers on the overall SC-DCNN network accuracy.
8.9 Input size versus absolute inaccuracy for (a) MUX-Avg-Stanh, (b) MUX-Max-Stanh, (c) APC-Avg-Btanh, and (d) APC-Max-Btanh with different bit stream lengths.
8.10 Input size versus (a) area, (b) path delay, (c) total power, and (d) total energy for four different designs of feature extraction blocks.
9.1 A general DCNN architecture.
9.2 A hardware-based neuron cell in DCNN.
9.3 Stochastic computing for neuron design: (a) XNOR gate for bipolar multiplication, (b) binary adder for average pooling, and (c) FSM-based tanh for stochastic inputs.
9.4 The result comparison between the proposed SC neuron (bit stream m = 1024) and the corresponding original software neuron: (a) SC-tanh vs Tanh, (b) SC-logistic vs Logistic, and (c) SC-ReLU vs ReLU.
9.5 Input size versus absolute inaccuracy under different bit stream lengths for (a) SC-tanh neuron, (b) SC-logistic neuron, and (c) SC-ReLU neuron.
9.6 Input size versus (a) area, (b) total power, and (c) total energy for the neuron designs using tanh, logistic and ReLU activation functions.
10.1 Stochastic computing for neuron design: (a) XNOR gate for bipolar multiplication, (b) binary adder, and (c) unipolar division.
10.2 Structure for SC based Softmax Regression function.
10.3 Input size versus (a) total power, (b) area, and (c) total energy for the proposed SC-SR.
10.4 Input size versus absolute inaccuracy under different bit stream lengths for SC-SR.
11.1 APC-based inner product.
11.2 Using the fixed bit stream length of 1024, the number of inputs versus (a) accuracy, (b) area, (c) power and (d) energy for an FEB using APC for inner product, MUX for average pooling and the Btanh proposed in [KKY+16] for activation.
11.3 The length of bit stream versus accuracy under different input numbers for an FEB using APC inner product, MUX based average pooling and Btanh activation.
11.4 Pooling design in SC: (a) average pooling and (b) near-max pooling.
11.5 Stochastic square circuit using a DFF and an XNOR gate.
11.6 The overall stochastic normalization design.
11.7 Input size versus (a) total power, (b) area, and (c) total energy for the FEB design (4-to-1 pooling).
11.8 Input size versus (a) total power, (b) area, and (c) total energy for the FEB design (9-to-1 pooling).
11.9 Performance of the proposed LRN: (a) number of adjacent neurons versus absolute inaccuracy under different bit stream lengths and (b) different K values versus absolute inaccuracy under different.

List of Tables

3.1 Experimental results of various ISCAS85 benchmark circuits
4.1 Experimental results of various ISCAS85 benchmark circuits
5.1 Experimental Results of Various ISCAS89 Combinational and Sequential Benchmark Circuits
6.1 Comparison of Energy Cost, Runtime and Reject Task Number between DRL-Cloud and Round-robin
7.1 Comparison between APC-Based Neuron and MUX-Based Neuron using 1024 Bit Stream
7.2 Comparison among Various Hardware-Based DCNNs and Software-Based DCNNs
8.1 Inaccuracies of OR Gate-Based Inner Product Block
8.2 Inaccuracies of MUX-Based Inner Product Block
8.3 Inaccuracies of the APC-Based Compared with the Conventional Parallel Counter-Based Inner Product Blocks
8.4 Relative Result Deviation of Hardware-Oriented Max Pooling Block Compared with Software-Based Max Pooling
8.5 The Relationship Between State Number and Relative Inaccuracy of Stanh
8.6 Comparison among Various SC-DCNN Designs Implementing LeNet 5
8.7 Comparison with Existing Hardware Platforms
9.1 Naming Conventions in a Stochastic Computing Based Neuron
9.2 Neuron Cell Performance Comparison with 8 Bit Fixed Point Binary Implementation when n = 25 and m = 1024
9.3 Comparison among Software DCNN, Binary ASIC DCNN, and Various SC Based DCNN Designs Implementing LeNet 5
10.1 Naming Conventions in a SC based SR
10.2 Network Accuracy
10.3 Performance Comparison with 8 Bit Fixed Point Binary Design when n = 800 and q = 10
11.1 Comparison between APC-based FEB and MUX-based FEB using MUX for average pooling and tanh activation under 1024 bit stream
11.2 Precision of the improved max pooling for an FEB with 16-bit input size under 1024 bit stream
11.3 Absolute error of FEBs with 4-to-1 pooling (commonly used in LeNet-5 [LJB+95]) under different bit stream lengths and input sizes.
11.4 Absolute error of FEBs with 9-to-1 pooling (commonly used in AlexNet [KSH12]) under different bit stream lengths and input sizes.
11.5 SC-LRN versus Binary-LRN hardware cost
11.6 AlexNet Accuracy results

Abstract

This thesis is dedicated to improving the efficiency of resilient computing through both a classic approach and, in parallel, a novel approach involving emerging resilient systems.

The first part of this thesis is focused on one of the most important problems in resilient computing, i.e., evaluating the impact of radiation-induced soft errors, which are one of the major threats to the resilience of modern electronic systems. A fast and comprehensive Soft Error Rate (SER) evaluation framework is developed for conventional computing circuits in three steps. The first step is an accelerated SER estimation algorithm for combinational logic, which accelerates the most computationally expensive process of the SER estimation framework, i.e., the propagation of Single-Event Transient (SET) pulses, by using dynamically maintained lookup tables (LUTs). Simulation results demonstrate that a 560.2X speedup is achieved with less than 3% difference in SER results compared with the baseline algorithm. With the aggressive downscaling of process technology, multiple upsets can be induced by a single particle strike due to charge sharing and parasitic bipolar effects, which are called Multiple Cell Upsets (MCUs). Hence, the second step integrates MCU modeling into the framework through a schematic and layout co-simulation method. The third step introduces an efficient time frame expansion method for analyzing feedback loops in sequential logic. Simulation results show that the presented SER evaluation framework can analyze the largest ISCAS89 benchmark circuit, with more than 3,000 flip-flops and 17,000 gates, in 119.23 s.
Then, the thesis adopts the Stochastic Computing (SC) technology to achieve significantly improved area, power and energy efficiency, in order to bring the DCNN resilient architecture to resource-constrained Internet-of-Things (IoT) and wearable devices. Basic operational blocks in DCNNs are first designed and optimized, then the entire DCNN network is designed with joint optimizations for feature extraction blocks and optimized weight storage schemes. The LeNet5 implemented in SC-based DCNN achieves 55X, 151X, and 2X improvement in terms of area, power and energy, respectively, while the error is increased by 2.86%, compared with the conventional binary ASIC implementation. Non-linear activation is design for SC-DCNNs, which achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198,200X and 96,443X of the energy for the LeNet-5 implementation, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. Finally, softmax regression is designed for SC-DCNNs, that can reach the same level of accuracy with the improvement of 295X, 62X, 2,617X in terms of power, area and energy, respectively, compared with the binary version under long bit stream. xviii Chapter 1 Introduction Resilience is a major roadblock for high-performance computing (HPC) executions on future exascale systems, as the increased likelihood of much higher error rates results in systems that fail frequently and make little progress in computations or in systems that may return erroneous results [CGG + 14, SWA + 14]. Meanwhile, hardware failure mechanisms are impacting the resilience of commercial electronic systems at ground level [MBS10]. Therefore, it is imperative to develop resilient computing techniques for both high-end computing systems and commercial electronic systems, in order to keep applications running to correct solutions despite the underlying hardware failures. Among all the hardware failure mechanisms, radiation-induced soft errors have become one of the most challenging issues [KMH12, WDT + 14], which can lead to silent data corruptions and system failures, with potentially disastrous results in mission- critical systems such as mainstream servers, automobiles and spacecrafts [Nic10]. Hence, the first part of the thesis is dedicated to a classical resilient computing prob- lem: what is the Soft Error Rate (SER) of a circuit? In the process, Deep Neural Network (DNN) and Deep Convolutional Neural Network (DCNN) have emerged as high performance resilient systems, which com- pletely tolerate radiation-induced soft errors. More importantly, DNN and DCNN have achieved breakthroughs in many application fields that require detection and recogni- tion, such as image classification, pattern recognition, and natural language process- ing [LBH15]. Nevertheless, there are two challenges faced by these high performance resilient systems: (i) how to extend the success of such resilient systems from detection 1 and recognition tasks to complicated control problems which have broader impacts, and (ii) how to promote the adoption of such resilient systems that are usually implemented in high-performance server clusters to the widespread IoT and wearable devices with limited computation capacities. Accordingly, the second part of this thesis is dedicated to solve the aforementioned challenges. 
A Deep Reinforcement Learning (DRL)-based framework is proposed, which utilizes the resilient DNNs together with the reinforcement learning method to solve one complicated control problem, i.e., cloud computing resource allocation prob- lem, which cannot be resolved efficiently by previous algorithms. Then, a Stochastic Computing (SC)-based DCNN architecture is proposed, which maps the latest DCNNs to application-specific hardware, in order to achieve orders of magnitude improvement in performance, energy efficiency and compactness. Unlike traditional binary comput- ing systems, SC-based DCNN architecture is resilient to radiation-induced soft errors, and the main source of errors is the inaccuracy in SC components and hardware-based network design. Hence, the accuracy improvement of the state-of-the-art DCNNs is treated as the main objective together with the power/area/energy efficiency in this part. In conclusion, this thesis is dedicated to improving the efficiency of resilient com- puting through both a classical approach, i.e., fast and comprehensive SER evaluation framework for conventional computing circuits, and another novel approach in paral- lel involving the extension of the emerging resilient DNNs for complicated control problems with broader impacts and improving the efficiency of resilient DCNNs for widespread deployment in IoT/wearable devices. 2 1.1 Thesis Contribution in Soft Error Rate (SER) Eval- uation The first part of this thesis presents a comprehensive accelerated SER estimation frame- work for combinational and sequential circuits. The proposed SER assessment frame- work comprehensively considers Single-Event Transients (SETs) in combinational com- ponents as well as Flip Flops (FFs) without redundancy, the radiation-induced Multiple Cell Upsets (MCUs) in several representative radiation-hardened structures, and soft error propagation in the feedback paths of sequential logic. The contributions made in the related chapters are described as follows. 1.1.1 Accelerated SER Estimation for Combinational Circuits In Chapter 3, we propose an efficient SER estimation framework of combinational cir- cuits in the presence of SETs, which significantly reduces the runtime with improved scalability and preserves the solution quality in terms of accuracy at the same time. We carry out a detailed analysis on the soft error vulnerabilities in CMOS combinational circuits and determine the key parameters that need to be extracted during the character- ization process. A top-down memoization algorithm is proposed to effectively acceler- ate the computationally expensive propagation process. The proposed SER estimation framework is also compatible for the FinFET technology since the Lookup Table (LUT) data structure is highly flexible. Experimental results on various benchmarks demon- strate that the proposed framework achieves up to 560.2X speedup with less than 3% SER difference compared to the baseline algorithm. 3 1.1.2 Schematic and Layout Co-Simulation for Multiple Cell Upset (MCU) Modeling In Chapter 4, we jointly consider radiation-induced soft errors in combinational logic and sequential elements for advanced CMOS technologies. MCU effects are consid- ered for radiation-hardened sequential elements, and we evaluate two representative radiation-hardened FF structures, i.e., feedback redundant SEU-tolerant (FERST) and dual interlocked storage cell (DICE). 
Simulation results on a variety of benchmarks demonstrate that both combinational and sequential components contribute to the total SER. We propose a general schematic and layout co-simulation method for evaluating SER caused by MCUs in redundant storage structures. Simulation results demonstrate that the SER that considers MCUs is significantly higher than the SER without con- sidering MCUs in radiation-hardened structures, indicating the importance of modeling MCUs in advance technologies. We further compare the area and soft error resilience among different FF structures, which can be used to guide circuit designers to choose the best FF structure based on their needs. 1.1.3 Fast and Comprehensive SER Evaluation Framework In Chapter 5, we consider the propagation of the soft errors, which are generated in combinational logic or sequential elements and get latched in the stage FFs, through the subsequent sequential logic. We propose an efficient and comprehensive SER assess- ment framework for combinational and sequential circuits with feedback loops, which is featured by the improved runtime and scalability. The time frame expansion method is used to efficiently calculate the SER contributed by the propagation of soft errors in the sequential logic. Results on ISCAS89 combinational and sequential benchmarks demonstrate that MCU effects cannot be ignored in hardened FFs, and the runtime of 4 the proposed SER estimation framework is of the order of hundreds of seconds, even for a relatively large scale combinational and sequential circuit (with more than 3,000 FFs and more than 17,000 gates). 1.2 Thesis Contribution in Emerging High Performance Resilient Systems While conventional binary computing systems are becoming increasingly susceptible to radiation-induced soft errors in advanced technology nodes, DNN and DCNN have emerged as promising resilient systems that tolerate radiation-induced soft errors, and have become the dominant approach for almost all recognition and detection tasks [LBH15]. In order to promote these resilient systems from recognition tasks to con- trol tasks that have broader impacts, the second part of this thesis first utilize DNN systems to solve the important cloud computing resource allocation problem, which cannot be solve efficiently by the traditional approaches when problem scale is large. Then, the remainder of the second part of this thesis presents the comprehensive design and optimization framework of SC-based DCNNs, using a bottom-up approach. The contributions made in the corresponding chapters are discussed below. 1.2.1 Deep Reinforcement Learning (DRL)-Based Power Manage- ment for Data Centers using Deep Neural Networks In chapter 6, we present DRL-Cloud, a novel Deep Reinforcement Learning (DRL)- based Resource Provisioning (RP) and Task Scheduling (TS) system, to minimize energy cost for large-scale Cloud Service Providers (CSPs) with very large number of 5 servers that receive enormous numbers of user requests per day. A deep Q-learning- based two-stage RP and TS processor is designed to automatically generate the best long-term decisions by learning from the changing environment such as user request patterns and realistic electric price. With training techniques such as target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves remarkably high energy cost efficiency, low reject rate as well as low runtime with fast convergence. 
Compared with one of the state-of-the-art energy efficient algorithms, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining lower reject rate on average. For an example CSP setup with 5; 000 servers and 200; 000 tasks, compared to a fast round-robin baseline, the proposed DRL-Cloud achieves up to 144% runtime reduction. 1.2.2 Stochastic Computing (SC)-Based Deep Convolutional Neural Network (DCNN) Block Design and Optimization In chapter 7, we conduct a detailed investigation of the Approximate Parallel Counter (APC) based neuron and multiplexer-based neuron using SC, and analyze the impacts of various design parameters, such as bit stream length and input number, on the energy/power/area/accuracy of the neuron cell. From an architecture perspective, the influence of inaccuracy of neurons in different layers on the overall DCNN accuracy (i.e., software accuracy of the entire DCNN) is studied. Accordingly, a structure opti- mization method is proposed for a general DCNN architecture, in which neurons in different layers are implemented with optimized SC components, so as to reduce the area, power, and energy of the DCNN while maintaining the overall network perfor- mance in terms of accuracy. Experimental results show that the proposed approach can 6 find a satisfactory DCNN configuration, which achieves 55X, 151X, and 2X improve- ment in terms of area, power and energy, respectively, while the error is increased by 2.86%, compared with the conventional binary ASIC implementation. 1.2.3 Highly-Scalable SC-Based DCNN Design and Optimization In chapter 8, we propose the first comprehensive design and optimization framework of SC-based DCNNs (SC-DCNNs), using a bottom-up approach. We present effec- tive designs and optimizations on weight storage to reduce the corresponding area and power (energy) consumptions, including efficient filter-aware SRAM sharing, effective weight storage methods, and layer-wise weight storage optimizations. We conduct thor- ough optimizations on the overall SC-DCNN, with feature extraction blocks carefully selected, to minimize area and power (energy) consumption while maintaining a high network accuracy level. The optimization procedure leverages the important observa- tion that hardware inaccuracies in different layers in DCNN have different effects on the overall network accuracy, therefore different designs may be exploited to minimize area and power (energy) consumptions. Overall, the proposed SC-DCNN achieves the lowest hardware cost and energy consumption in implementing LeNet5 compared with reference works. 1.2.4 SC-Based DCNN Hardware-Driven Nonlinear Activation Design One major challenge in SC based DCNNs is designing accurate nonlinear activation functions, which have a significant impact on the network-level accuracy but cannot be implemented accurately by existing SC computing blocks. In chapter 9, we design and optimize SC based neurons, and we propose highly accurate activation designs for the 7 three most frequently used activation functions in software DCNNs, i.e, hyperbolic tan- gent, logistic, and rectified linear units. Experimental results on LeNet-5 using MNIST dataset demonstrate that compared with a binary ASIC hardware DCNN, the DCNN with the proposed SC neurons can achieve up to 61X, 151X, and 2X improvement in terms of area, power, and energy, respectively, at the cost of small precision degrada- tion. 
In addition, the SC approach achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198200X and 96443X of the energy, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. ReLU acti- vation is suggested for future SC based DCNNs considering its superior performance under a small bit stream length. 1.2.5 SC-Based DCNN Softmax Regression Design In chapter 10, we design and optimize the SC based Softmax Regression function. Experiment results show that compared with a binary SR, the proposed SC-SR under longer bit stream can reach the same level of accuracy with the improvement of 295X, 62X, 2617X in terms of power, area and energy, respectively. Binary SR is suggested for future DCNNs with short bit stream length input whereas SC-SR is recommended for longer bit stream. 1.2.6 Normalization and Dropout for SC-Based DCNNs In chapter 11, we introduce normalization and dropout, which are essential techniques for the state-of-the-art DCNNs, to the existing SC-based DCNN frameworks. In this work, the feature extraction block of DCNNs is implemented using an approximate parallel counter, a near-max pooling block and an SC-based rectified linear activation unit. A novel SC-based normalization design is proposed, which includes a square and 8 summation unit, an activation unit and a division unit. The dropout technique is inte- grated into the training phase and the learned weights are adjusted during the hardware implementation. Experimental results on AlexNet with the ImageNet dataset show that the SC-based DCNN with the proposed normalization and dropout techniques achieves 3.26% top-1 accuracy improvement and 3.05% top-5 accuracy improvement compared with the SC-based DCNN without these two essential techniques, confirming the effec- tiveness of our normalization and dropout designs. 1.3 Thesis Organization The remainder of this thesis is organized as follows. Chapter 2 reviews the background of SER estimation, cloud computing, SC, and DCNNs. Chapter 3 presents the proposed SER estimation framework of combinational circuits in the presence of SETs, and the proposed acceleration algorithm. Chapter 4 provides the details of the proposed MCU modeling technique. The comprehensive SER evaluation framework is described in Chapter 5. The DRL-Cloud framework is given in Chapter 6. Chapter 7 presents the neuron block design and optimization, and the comprehensive design and optimization framework of SC-based DCNNs is given in Chapter 8. The non-linear activation, soft- max regression design and normalization/dropout are described in Chapter 9, Chapter 10, and Chapter 11, respectively. Finally, this thesis is concluded in Chapter 12. 9 Chapter 2 Background of Soft Error Rate (SER) Evaluation, Cloud Computing, Stochastic Computing (SC), and Deep Convolutional Neural Network (DCNN) 2.1 SER Evaluation Background 2.1.1 Radiation-Induced Soft Error Basics A radiation-induced soft error occurs in a semiconductor device when the free mobile carriers generated by the passage of an energetic radiation particle are collected by the depletion region of a revere-biased p-n junction [KOTN14, RRV + 08]. Consequently, a transient noise pulse is generated due to the momentary current flowing through the device [RRV + 08]. This single-event transient (SET), if propagated through subsequent circuitry and captured by a storage element, becomes a single-event upset (SEU), i.e., a bit error [KMH12, VLTC11]. 
Such SEUs caused by the SETs are referred to as “soft” errors since there are no permanent damage to the hardware, and the rate at which they occur is called the soft error rate (SER). 10 Traditionally, memory elements have been much more sensitive to soft errors than combinational logic circuits [MZM07]. As a result, extensive error detection and cor- rection techniques have been implemented mainly for register files and on-chip SRAMs [ZMM + 06]. Nevertheless, memory protection is not enough for advanced technologies, because drastic device shrinking, reduction of parasitic nodal capacitances, low operat- ing voltages, and high operating frequency have added to the increase of sensitivity of both combinational circuits and sequential elements to the radiation-induced soft errors [HW14, VLTC11, Nic10], posing a severe problem for system resilience. In order to improve soft error resilience in sequential elements, many radiation- hardened structures, such as the dual interlocked storage cell (DICE) latch [KKM + 14], feedback redundant SEU-tolerant (FERST) latch [FPME07], and triple modular redun- dancy (TMR) latch [PK15], have come into existence to mitigate SEUs, i.e., change of state resulting from one single particle hit. However, the continued downscaling of device dimensions has propelled the severity and relevance of multiple-node charge col- lection mechanisms, such as charge sharing and the parasitic bipolar effect [ZFKO14], which makes the aforementioned hardened structures more sensitive to multiple cell upsets (MCUs), i.e., a single particle strike causing simultaneous failures at multiple bits (nodes). This necessitates the modeling of MCU effects for evaluating soft error vulnerability of hardened sequential elements in advanced technologies. On the other hand, combinational soft errors, unlike errors in memory that can be corrected with effi- cient error code correction (ECC) techniques or errors in sequential elements that can be mitigated by redundant structures, cannot be rectified without incurring significant area overhead and performance penalties [VLTC11]. Therefore, a combinational SER estimation method is required, in order to quantify the degree of soft error tolerance and identify the most vulnerable sites for enhancement. The soft errors generated in com- binational logic and sequential elements are referred to as logic soft errors [ZMM + 06]. 11 These logic soft errors, once latched in stage flip-flops (FFs), can propagate through the subsequent sequential logic and may appear in primary outputs more than once [MZM07]. Hence, it is imperative to analyze the propagation of logic soft errors in sequential logic, in order to accurately estimate the total SER. 2.1.2 Review of the State-of-the-Art Considerable research efforts have been conducted in the context of computing SER in digital circuits. The previous studies can be categorized into soft error characterization studies, combinational SER studies, redundant sequential element studies, and sequen- tial SER studies. Soft Error Characterization Studies The first step of characterization is the generation of soft errors, which models the phys- ical effects of particle strike as current pulses at the striking nodes. A number of current models, such as Weibull function [KR11], exponential current pulse [HW14], and dou- ble exponential current model [RRV + 08, VLTC11, LR12, RKV + 06], have been adopted by different circuit level SER estimation works. 
Double exponential current model is one of the most widely accepted models in the circuit level works, and several works [WDT + 14, KR11] have concentrated on determining technology dependent parameters in the double exponential formula. In addition to the soft error generation related works, the authors in [LR12] proposed a soft error characterization method that captured the pulse widths of SETs, whereas the authors in [GBB + 12] considered both pulse widths and pulse heights of SETs during soft error characterization. The authors in [CDLQ14] proposed a sensitive area calculation method in order to model the actual sensitive area. The aforementioned works achieve high accuracy, however, the added parameters will lead to increased workload and processing time in the following propagation phase. 12 Combinational SER Studies The authors in [EACC13] proposed an RTL-based combinatorial SER estimation to achieve fast RTL level SER analysis. Compared with RTL level SER analysis, circuit- level and logic-level approaches are more accurate. Binary decision diagrams were used during the SER estimation in FASTER [ZWO06]. SEAT-LA [RKV + 06] presented an SER estimation framework which characterized the SET parametric waveforms using analytical equations. For structural combinational logic, HSEET [RRV + 08] provided a hierarchical approach to improve the speed of the SER estimation process by block level partition. Multi-cycle effects and striking effects were considered in [HW14], and the authors in [CDLQ14] proposed the effective sensitive area calculation method, in order to model the actual sensitive area. The aforementioned works achieve high accuracy, however, the added parameters will lead to increased workload and processing time in the following computation phase. Redundant Sequential Element Studies The TMR latch tolerates SEU at the cost of large area overhead and high power dissipa- tion due to the added three identical static latches and a majority voter [PK15, HLH15]. In order to reduce the area and power penalties, the FERST latch was proposed in [FPME07], where a redundant feedback line and Muller C-elements were employed to achieve SEU resilience. Similar to FERST, the DICE structure duplicates the stor- age nodes by the half C-element and the clocked half C-element, in order to mitigate SEUs [KKM + 14]. These works are mainly focused on individual sequential element enhancement. 13 Sequential SER Studies The sequential SER studies can be roughly divided into three subgroups: (i) Markov Chain analysis-based SER estimation, which provides accurate steady-state SER esti- mation following a particle hit [MZM08], however, it suffers from the potential state explosion problems, (ii) fault simulation approach [ESCT15], which injects faults into circuits and simulates for a typical workload, in order to find whether the faults propa- gates to the primary outputs, and (iii) time frame expansion method [MZM07, WA10], where the sequential circuit is unrolled for time-dependent SER analysis. The fault simulation approach has high computational complexity as each soft error needs to be tracked and the circuit needs to be simulated for each workload. Hence, we adopt the more efficient time frame expansion approach. 
2.2 Background of Cloud Computing, DCNN and SC 2.2.1 Resource Allocation in Cloud Computing Cloud computing has emerged as a cogent and powerful paradigm that delivers omnipresent and on-demand access to a shared pool of configurable computing resources as a service through the Internet [LWL + 16]. Virtualization is the fundamental technology of cloud computing, which enables multiple operating systems to run on the same physical platform, and structures servers into Virtual Machines (VMs) [RR16]. VMs are used by Cloud Service Providers (CSPs) to provide infrastructures, platforms, and resources (e.g., CPU, memory, storage, etc.). In the cloud computing paradigm, CSPs are incentivized by the benefit of charging users for cloud service access, resource utilization and VM rental, whereas users are attracted by the opportunity of eliminating 14 expenditure of implementing computational, time and power consuming applications on cloud based on their own requirements [GWGP13]. Despite the success of many well-known CSPs such as Google App Engine (GAE) and Amazon Elastic Compute Cloud (EC2), the tremendous energy costs in terms of electricity consumed by data centers is a serious challenge. Data center electricity con- sumption is projected to be roughly 140 billion kilowatt-hours annually by 2020, which costs 13 billion US dollars annually in electric bills [Del14]. Hence, in order to increase the profit margin and as well, reduce the carbon footprint for sustainable development and abstemious economical society, it is imperative to minimize the data center electric- ity consumption for large-scale CSPs. According to [BCH13], energy usage of data centers has two important features: (i) servers tend to be more energy inefficient under low utilization rate (with the optimal power efficient utilization rate of most servers ranging between 70% and 80%), and (ii) servers may consume a considerable amount of power in idle mode. Therefore, server consolidation and load balancing can be applied to improve the overall energy efficiency through selectively shutting down idle servers and improving the utilization levels in active servers. Meanwhile, the agreements in the Service-Level Agreement (SLA) should be consistently met, which is negotiated by the CSP and users regarding privacy, security, availability, and compensation [WBTY11]. Energy consumption and electric cost reduction become challenging for CSPs, and the reasons are twofold: First, scalability of expenditure control is critical due to the large-scale server farms and enormous numbers of incoming requests per day, and both of which are still growing. Second, as user request patterns can change both in short- term (within a day) and long-term (from month/year to month/year), the adaptability and self-learning capacity of the energy and electric cost reduction method are required. 15 2.2.2 DCNN Architecture Overview Deep convolutional neural networks are biologically inspired variants of multilayer per- ceptrons (MLPs) by mimicking the animal visual mechanism [len16]. An animal visual cortex contains two types of cells and they are only sensitive to a small region (receptive field) of the visual field. Thus a neuron in a DCNN is only connected to a small recep- tive field of its previous layer, rather than connected to all neurons of previous layer like traditional fully connected neural networks. As shown in Figure 2.1, each layer of DCNN is a 3D volume that has neurons arranged in three dimensions: heightwidthdepth. 
Height and width refer to the size of one feature map, while depth represents the number of feature maps. A whole feature map is covered by tiling receptive fields [len16]. A DCNN is in the simplest case a stack of three types of layers: Convolutional Layer, Pooling Layer, and Fully Connected Layer. The convolutional layer is the core building block of a DCNN, and its main operation is the convolution that calculates the dot-product of receptive fields and a set of learnable filters (or kernels) [cs216]. Figure 2.2 illustrates the convolution process. Suppose that the size of the input feature map is 7x7 and the size of a filter is 3x3; the feature map is then divided into nine receptive fields if the stride is two. The first and ninth elements of the output feature map are computed by convolving the first and ninth receptive fields with the filter, respectively.
Figure 2.1: The general DCNN architecture.
Figure 2.2: Illustration of the convolution process.
After the convolution operations, nonlinear down-sampling is conducted in the pooling layers to reduce the dimension of the data. The most common pooling strategies are max pooling and average pooling: max pooling picks the maximum value from the candidates, whereas average pooling calculates their average value. The extracted feature maps after down-sampling are then sent to activation functions that conduct non-linear transformations, such as the Rectified Linear Unit (ReLU) f(x) = max(0, x), the sigmoid function f(x) = (1 + e^(-x))^(-1), and the hyperbolic tangent (tanh) function f(x) = 2/(1 + e^(-2x)) - 1. The high-level reasoning is completed via the fully connected layer, whose neurons are connected to all activation results in the previous layer. Finally, the loss layer is normally the last layer of a DCNN and specifies how the deviation between the predicted and true labels is penalized during network training. Various loss functions, such as the softmax loss and the sigmoid cross-entropy loss, may be used for different tasks. The main operations in DCNNs are the inner product, pooling, and activation function operations, as shown in Figure 2.3 (a), (b), and (c), respectively. In convolutional layers, the inner product operation is performed by a convolutional neuron to calculate the dot-product of a receptive field (the x_i's in Figure 2.3 (a)) and a filter (the w_i's in Figure 2.3 (a)). Generally, the inner products are then subsampled through pooling operations performed by pooling neurons; Figure 2.3 (b) shows the average pooling and max pooling studied in this work. The subsampled outputs are transformed by an activation function, shown in Figure 2.3 (c), to ensure the inputs of the next layer fall within the [-1, 1] range. In the fully connected layer, x_i is the i-th activation output from the previous layer and w_i is the weight of the corresponding link; together they are the inputs of the neurons in the fully connected layer. The concept of a "neuron" is widely used in the software/algorithm domain. In the context of DCNNs, a neuron may consist of one or multiple basic operations.
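To make the arithmetic concrete, the following C++ sketch performs the valid convolution described above: a 7x7 input feature map, a 3x3 filter, and a stride of two, yielding a 3x3 output whose elements are the dot-products of the receptive fields with the filter. The input and filter values are arbitrary placeholders for illustration only; they are not the values shown in Figure 2.2.

#include <cstdio>

int main() {
    const int IN = 7, K = 3, STRIDE = 2;
    const int OUT = (IN - K) / STRIDE + 1;   // = 3 for the case in the text

    double input[IN][IN];
    double filter[K][K] = {{1, 0, -1}, {0, 1, 0}, {-1, 0, 1}};  // placeholder

    // Fill the input with a simple pattern (placeholder data).
    for (int r = 0; r < IN; ++r)
        for (int c = 0; c < IN; ++c)
            input[r][c] = (r + c) % 3;

    // Each output element is the dot-product of one receptive field with the
    // filter, i.e., the inner product performed by a convolutional neuron.
    double output[OUT][OUT];
    for (int orow = 0; orow < OUT; ++orow) {
        for (int ocol = 0; ocol < OUT; ++ocol) {
            double acc = 0.0;
            for (int kr = 0; kr < K; ++kr)
                for (int kc = 0; kc < K; ++kc)
                    acc += input[orow * STRIDE + kr][ocol * STRIDE + kc] *
                           filter[kr][kc];
            output[orow][ocol] = acc;
        }
    }

    for (int r = 0; r < OUT; ++r) {
        for (int c = 0; c < OUT; ++c)
            printf("%6.1f ", output[r][c]);
        printf("\n");
    }
    return 0;
}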
For example, neurons in convolutional layers implement inner product operations only; those in pooling layers implement pooling and activation operations; and those in fully connected layers implement inner product and activation operations. Since this thesis focuses on hardware designs and optimizations, we focus on the basic operations, i.e., inner product, pooling, and activation, and the corresponding SC-based designs of these fundamental operations are termed function blocks. Furthermore, different function blocks (main operations) need to be jointly optimized with respect to the bit-stream length and structure compatibilities (e.g., an APC-based inner product block needs to connect to a Btanh-based activation function block). The composition of an inner product block, a pooling block, and an activation function block is referred to as the feature extraction block, which takes charge of extracting features from feature maps. The design and optimizations of the basic function blocks and feature extraction blocks will be discussed in Sections 7.2 and 8.2.4.
Figure 2.3: Three types of basic operations (function blocks) in DCNN. (a) Inner product, (b) pooling, and (c) activation.
2.2.3 Stochastic Computing
Stochastic computing is a technology that represents a probabilistic number by counting the number of ones in a bit-stream. For instance, the bit-stream 0100110100 contains four ones in a ten-bit stream, thus it represents P(X = 1) = 4/10 = 0.4. In addition to this unipolar encoding format, SC can also represent numbers in the range of [-1, 1] using the bipolar encoding format. In the bipolar encoding scheme, a real number x is represented through P(X = 1) = (x + 1)/2; thus 0.4 can be represented by 1011011101. To represent a number beyond the range [0, 1] in the unipolar format or beyond [-1, 1] in the bipolar format, a pre-scaling operation [YZW16] can be used. The major advantage of stochastic computing is its much lower hardware cost for a large category of arithmetic calculations when compared to conventional binary computing. The abundant area budget offers immense design space for optimizing hardware performance via efficient trade-offs between area and other metrics, such as power, latency, and degree of parallelism, which makes SC a promising technology for implementing large-scale DCNNs.
Multiplication. Figure 2.4 shows the basic multiplication components in the SC domain. A unipolar multiplication can be performed by an AND gate, since P(AB = 1) = P(A = 1)P(B = 1) (assuming independence of the two random variables), and a bipolar multiplication is performed by means of an XNOR gate, since c = 2P(C = 1) - 1 = 2(P(A = 1)P(B = 1) + P(A = 0)P(B = 0)) - 1 = (2P(A = 1) - 1)(2P(B = 1) - 1) = ab.
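As a concrete illustration of the bipolar encoding and the XNOR-based multiplication just described, the following C++ sketch encodes two real numbers as pseudo-random bipolar bit-streams, multiplies them bit-wise with XNOR, and decodes the result. The bit-stream length, seed, and operand values are illustrative choices only; they are not parameters of the SC-DCNN designs discussed later in this thesis.

#include <cstdio>
#include <cstdlib>
#include <vector>

// Encode a real value x in [-1, 1] as a bipolar bit-stream of length L:
// each bit is 1 with probability P(X = 1) = (x + 1) / 2.
std::vector<int> encode_bipolar(double x, int L) {
    std::vector<int> bits(L);
    double p = (x + 1.0) / 2.0;
    for (int i = 0; i < L; ++i)
        bits[i] = (std::rand() / (double)RAND_MAX) < p ? 1 : 0;
    return bits;
}

// Decode a bipolar bit-stream back to a real value: x = 2*P(X = 1) - 1.
double decode_bipolar(const std::vector<int>& bits) {
    int ones = 0;
    for (int b : bits) ones += b;
    return 2.0 * ones / bits.size() - 1.0;
}

int main() {
    std::srand(7);                    // fixed seed for repeatability
    const int L = 4096;               // bit-stream length (illustrative)
    double a = 0.4, b = -0.6;

    std::vector<int> sa = encode_bipolar(a, L);
    std::vector<int> sb = encode_bipolar(b, L);

    // Bipolar multiplication: bit-wise XNOR of the two streams.
    std::vector<int> sc(L);
    for (int i = 0; i < L; ++i)
        sc[i] = (sa[i] == sb[i]) ? 1 : 0;

    printf("exact product      : %f\n", a * b);
    printf("stochastic product : %f\n", decode_bipolar(sc));
    return 0;
}

The decoded product fluctuates randomly around the exact value; longer bit-streams reduce this fluctuation at the cost of latency, which is the accuracy-versus-latency trade-off exploited throughout the SC-based designs in this thesis.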
Addition. In this thesis, four popular stochastic addition methods are investigated, optimized, and carefully selected for SC-DCNNs. The OR gate in Figure 2.5 (a) is the simplest method and consumes the least hardware footprint to perform an addition, but it introduces considerable accuracy loss because the computation "logic 1 OR logic 1" generates only a single logic 1. The second component in Figure 2.5 is a multiplexer, which is the most popular method to perform additions in either the unipolar or the bipolar format [BC01b]. For example, a bipolar addition is performed as c = 2P(C = 1) - 1 = 2(1/2 P(A = 1) + 1/2 P(B = 1)) - 1 = 1/2 ((2P(A = 1) - 1) + (2P(B = 1) - 1)) = 1/2 (a + b). The approximate parallel counter (APC) depicted in Figure 2.5 (c) is proposed in [KLC15]; it calculates the summation of its inputs by accumulating the number of ones, and it consumes fewer logic gates than the conventional accumulative parallel counter [KLC15, PY95]. The fourth implementation of stochastic addition uses the two-line representation of a stochastic number proposed in [TQF00]. The two-line representation consists of a magnitude stream M(X) and a sign stream S(X), in which 1 represents a negative bit and 0 represents a positive bit. The value of the represented stochastic number is calculated by x = (1/L) Σ_{i=0}^{L-1} (1 - 2S(X_i)) M(X_i), where L is the length of the bit-stream. As an example, -0.5 can be represented by M(-0.5): 10110001 and S(-0.5): 11111111.
Hyperbolic Tangent (tanh). The tanh function is highly suitable for stochastic computing-based implementations because (i) it can be easily implemented with a K-state finite state machine (FSM) in the SC domain [BC01b] and incurs less hardware cost than the piecewise linear approximation (PLAN)-based implementation [LKMO06] in the conventional computing domain, and (ii) replacing the ReLU or sigmoid function by the tanh function does not cause accuracy loss in DCNNs [KSH12].
Figure 2.4: Stochastic multiplication. (a) Unipolar multiplication and (b) bipolar multiplication.
Figure 2.5: Stochastic addition. (a) OR gate, (b) MUX, (c) APC, and (d) two-line representation-based adder.
Therefore we choose tanh as the activation function in SC-DCNNs in this work. The diagram of the FSM is shown in Figure 2.6. It outputs a zero when the current state is in the left half of the diagram and outputs a one otherwise. The value calculated by the FSM satisfies Stanh(K, x) = tanh(K/2 x), where Stanh denotes the stochastic tanh.
Figure 2.6: Stochastic hyperbolic tangent.
2.2.4 Network Accuracy vs. Hardware Accuracy
The overall network accuracy (e.g., the overall recognition or classification rate) is one of the key optimization goals of the SC-based hardware DCNN. On the other hand, the SC-based function blocks and feature extraction blocks exhibit a certain degree of inaccuracy due to their inherent stochastic nature. The network accuracy and the hardware accuracies are different but correlated, i.e., high accuracy in each function block will likely lead to a high overall network accuracy. Hence, the hardware accuracies are optimized in the design of the SC-based function blocks and feature extraction blocks.
2.2.5 Review of the State-of-the-Art
Related Works of Resource Allocation in Cloud Computing
The power cost reduction problem is of imperative importance in the cloud computing field, and different approaches have been proposed in prior works to solve it. J. Li et al. [LWL + 16] and H. Li et al.
[LLY + 17] propose two methods, NBRPTS and FERPTS respectively, to minimize the electric cost of each task under a user workload model and a dynamic price model. Allocating VM resources and choosing the optimum time slot in an iterative fashion gives the optimal electric cost for every single step in the short term, which may lead to a relatively high total electric cost in the long term. Moreover, the time consumption of such iteration-based approaches grows exponentially with the number of servers, which becomes a severe issue in real-world deployments with hundreds or even thousands of servers. Y. Gao et al. proposed a genetic algorithm that requires a large amount of memory to process many generations and is therefore not scalable [GWGP13].
Recently, DRL has achieved breakthroughs in problems with large state spaces and finite action spaces, such as AlphaGo [SHM + 16] and Atari games [MKS + 13, MKS + 15]. S. Wang et al. applied DRL to the dynamic multichannel access problem in wireless sensor networks [WLGK17], T. Wei et al. applied DRL to HVAC control in smart buildings [WWZ17], and N. Liu et al. used DRL (partially) to solve the cloud resource allocation problem based on offline training [LLX + 17]. However, there is no work that applies DRL to fully resolve the problem by considering the detailed scheduling decisions, the internal task dependencies for parallel computing, and dynamic user requests.
Related Works of Efficient DCNN Implementations
The authors in [LSN12, SDS15, KSH12, JSD + 14] leveraged the parallel computing and storage resources of GPUs for efficient DCNN implementations. FPGA-based accelerators, benefiting from programmability, a high degree of parallelism, and short development cycles, are another promising path towards the hardware implementation of DCNNs [ZLS + 15, MGAG16]. However, these GPU- and FPGA-based implementations still exhibit a large margin for performance enhancement and power reduction. This is because (i) GPUs and FPGAs are general-purpose computing devices not specifically optimized for executing DCNNs, and (ii) the relatively limited signal routing resources in such general platforms restrict the performance of DCNNs, which exhibit high inter-neuron communication requirements. ASIC-based implementations of DCNNs have recently been exploited to overcome the limitations of general-purpose computing devices. Two representative recent works on ASIC-based implementations are DaDianNao [CLL + 14] and EIE [HLM + 16]. The former proposes an ASIC "node" that can be connected in parallel to implement a large-scale DCNN, whereas the latter focuses specifically on the fully-connected layers of DCNNs and achieves high throughput and energy efficiency. Novel computing paradigms need to be investigated in order to provide an ultra-low hardware footprint together with the highest possible energy efficiency and scalability. Stochastic computing-based design of neural networks is an attractive candidate to meet the above goals and facilitate the widespread deployment of DCNNs in personal, embedded, and mobile IoT devices.
Although not focusing on deep learning systems, predecessors in [SNA + 03] proposed the design of a neurochip using stochastic logic. Reference [JRML15] uti- lized stochastic logic to implement a radial basis function-based neural network, and the neuron design with SC for deep belief network was presented in [KKY + 16]. How- ever, there is no existing work that investigates comprehensive designs and optimizations of SC-based hardware DCNNs including both computation blocks and weight storing methods. 24 2.3 Summary 2.3.1 Limitation of the Previous Works on SER Evaluation As soft error vulnerability evaluation is an essential part of cost-effective robust circuit design, considerable research efforts have been invested in accurately characterizing and propagating soft errors in combinational circuits [RCBS07, RRV + 08, HW14, CDLQ14] and sequential circuits [MZM07, WA10, ESCT15, EET + 15]. However, the prior works have mainly focused on the accuracy of the SER estimation results, and the efforts to improve the runtime and scalability of the estimation process are limited to circuit par- titioning, where the circuit of interest is divided into sub-blocks for parallel processing, and the parallel fault simulation, in which multiple errors are injected and analyzed in parallel. Nevertheless, the SER estimation process has become computationally expen- sive and the reason is threefold: (i) the SER estimation process is required to process a large set of parameters, in order to describe various complex effects, (ii) runtime of the estimation process increases near-exponentially as the size and logic complexity of the circuit increases [LD16a], and (iii) the analysis of propagation of soft errors in sequen- tial circuits takes more than one clock cycle, as a single particle hit can affect outputs for several clock cycles. Moreover, the prior works have not considered the MCU effects during the soft error evaluation of radiation-hardened sequential elements in the circuit layer, which may result in over optimistic SER estimation results. 25 2.3.2 Limitation of Prior Approaches in Cloud Resource Allocation and Open Problems in SC-Based DCNNs For resource allocation problem, the prior works [ZZBH13, GWGP13, XTQ13, LLY + 17] have scalability issues and their offline algorithms have difficulties in deal- ing with the large size of inputs and adapt to changes, e.g., dealing with different user request patterns. The recent DRL approach proposed by N. Liu et al. (partially) solve the resource allocating problem in cloud computing [LLX + 17], without detailed schedul- ing for tasks with data dependencies, which is critical to guarantee tasks are executed correctly [IBY + 07]. As for the DCNNs, there lacks a detailed investigation of the energy-accuracy trade- offs for DCNNs using different SC components. Moreover, within a DCNN architec- ture, neurons in different layers have various connection patterns and exhibit different degrees of influence on the overall system performance, which indicates a structure opti- mization can be applied to achieve further improvement. There are many open problems in this research direction, which require block design and optimization, network opti- mization, and software-hardware co-design and optimization. 
26 Chapter 3 Accelerated Soft Error Rate (SER) Estimation for Combinational Circuits With the advent of nanoscale computing, soft errors have become one of the most chal- lenging issues that impact the reliability of modern electronic systems at ground level for the semiconductor industry [KMH12, WDT + 14]. It is shown that the proportion of combinational logic soft errors at chip level increases with technology scaling and the combinational logic SER is comparable with the memory SER under high operating frequencies [MGA + 14]. In addition, unlike errors in memory that can be corrected effi- ciently with the error code correction (ECC) techniques [VLTC11, FCMG13], errors in combinational logic cannot be rectified without incurring significant area overhead and performance penalties [FCMG13]. In order to quantify the degree of SET tolerance in combinational logic, circuit-level SER estimation method is required. Many soft error estimation methods, such as FASTER [ZWO06] and SEAT-LA [RKV + 06], have come into existence. However, the prior works have mainly concentrated on the accuracy of the SER esti- mation results, and the efforts to speed up the estimation process is limited to circuit partitioning, where the circuit of interest is divided into sub-blocks for parallel process- ing. Nevertheless, SER estimation process has become computationally expensive and the reason is twofold: (i) the SER estimation process is required to process a large set of 27 parameters, in order to describe various complex effects, and (ii) runtime of the estima- tion process increases near-exponentially as the size and logic complexity of the circuit increases. 3.1 Introduction In this chapter, we propose an efficient SER estimation framework, which significantly reduces the runtime with improved scalability and preserves the solution quality in terms of accuracy at the same time [LD16a]. The proposed SER estimation framework is com- prised of two phases: (i) characterization and (ii) propagation. In the characterization phase, we adopt the double exponential current source model for SET pulse generation and extract SER estimation related parameters using HSPICE simulations. Various two- dimensional lookup tables (LUTs) are established to store the characterization results. All these characterizations need to be performed only once for each technology node, however, the propagation phase needs to be performed for each combinational circuit. In the propagation phase, the SETs are propagated from the particle striking sites towards the outputs (i.e., inputs of next stage flip-flops), and the final SER of the circuit is cal- culated based on the propagation results. In this chapter, we propose a top-down mem- oization algorithm for accelerating the SET propagation. In the proposed algorithm, overlapping SET propagations are only processed once and the results are cached into maps, such that the following recursive call never re-processes a SET propagation if it has been processed before and cached in the maps. The memory overhead is negligible since most temporary data are released immediately after they have been consumed. The contributions of this work are twofold. First, we carry out a detailed anal- ysis on the soft error vulnerabilities in CMOS combinational circuits and determine 28 the key parameters that need to be extracted during the characterization process. Sec- ond, a top-down memoization algorithm is proposed to effectively accelerate the com- putation expensive propagation process. 
The proposed SER estimation framework is also compatible with FinFET technologies, since the LUT data structure is highly flexible. Experimental results on various benchmarks demonstrate that the proposed framework achieves up to 560.2X speedup with less than 3% SER difference compared to the baseline algorithm.
3.2 Overall Flow
In this section, the overall flow of the proposed SER estimation framework is described. When a particle strikes the circuit, it generates electron-hole pairs inside some transistors on the chip, leading to parasitic transient current pulses at the striking nodes. Each generated transient current pulse is modeled by a current source, as shown in Figure 3.1.
Figure 3.1: Current source model of a particle strike at a circuit node.
In the characterization phase, three steps are performed: 1) Based on HSPICE simulations with different SET current pulses, driver states, and output capacitances, a generation LUT (G-LUT) is built for each standard combinational cell in the library, in order to transform the radiation-induced current pulses into voltage pulses for propagation. 2) The next step is to propagate these voltage pulses towards the outputs. Based on HSPICE simulations, a propagation LUT (P-LUT) is established for each standard cell in the library, which records the mapping from voltage pulses at the input of the cell to the propagated voltage pulses at the output for a certain cell state and output capacitance. 3) The propagated SET voltage pulses at the outputs need to be captured by the next-stage flip-flops to become an error. Therefore, apart from the G-LUTs and P-LUTs, the characteristics of the flip-flops in the library are examined.
In the propagation phase after characterization, the final SER is calculated by accumulating the probabilities of SETs that are generated at vulnerable nodes, propagated to the outputs, and captured by flip-flops. Figure 3.2 provides the proposed SER estimation flow. The details of the characterization and the proposed propagation algorithm are explained in Section 3.3 and Section 3.4, respectively.
Figure 3.2: Overall flow of our SER estimation framework.
3.3 Characterization
In this section, SET current pulse generation, the steps to establish LUTs, and flip-flop characterization are presented.
3.3.1 Parasitic Transient Current Pulse Model
In this chapter, the soft error impact is described as a current pulse generated at each particle strike node. Note that there are some debates about the validity of different current models. We choose the double exponential current pulse model because it has been adopted in [VLTC11] for the technology used in our experiments, and our SER estimation framework can accommodate other current models as well. The double exponential current is calculated as
I(t) = I_peak (e^(-t/τ_α) - e^(-t/τ_β))    (3.1)
where the peak current is I_peak = Q/(τ_α - τ_β), in which Q is the maximum collected charge resulting from the particle strike, τ_α is the charge collection time constant, and τ_β is the ion-track establishment time constant. Figure 3.3 (a) shows a typical double exponential current pulse waveform with the time parameters extracted from 3D technology computer aided design (TCAD) simulations for a bulk CMOS technology.
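For illustration, the short C++ sketch below evaluates the double exponential pulse of Equation (3.1) over a 1 ns window. The charge and time-constant values are placeholders chosen for readability; the actual characterization uses constants extracted from TCAD data, as described above.

#include <cstdio>
#include <cmath>

// Double exponential SET current pulse, Equation (3.1):
//   I(t) = (Q / (tau_a - tau_b)) * (exp(-t/tau_a) - exp(-t/tau_b))
// tau_a: charge collection time constant; tau_b: ion-track establishment
// time constant. All values below are illustrative placeholders.
double set_current(double t, double Q, double tau_a, double tau_b) {
    double i_peak = Q / (tau_a - tau_b);
    return i_peak * (std::exp(-t / tau_a) - std::exp(-t / tau_b));
}

int main() {
    const double Q     = 50e-15;   // 50 fC deposited charge (placeholder)
    const double tau_a = 200e-12;  // 200 ps collection constant (placeholder)
    const double tau_b = 50e-12;   // 50 ps establishment constant (placeholder)

    for (int step = 0; step <= 20; ++step) {
        double t = step * 50e-12;  // 0 .. 1 ns in 50 ps steps
        printf("t = %6.1f ps   I = %8.2f uA\n", t * 1e12,
               set_current(t, Q, tau_a, tau_b) * 1e6);
    }
    return 0;
}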
The rising part of the current pulse corresponds to the establishment of the ion track at the particle strike node, whereas the falling part represents the process in which the excess carrier concentration is removed by drift until the carrier concentration is restored to the background doping. The technology-dependent time constants τ_β and τ_α correspond to the rising and falling parts, respectively.
Figure 3.3: Parasitic transient current pulse model. (a) Double exponential current shape and (b) probability of charge deposition.
According to [VLTC11], the charge collection time constant for a bulk CMOS technology is calculated as
τ_α = k_0 / (q μ D N)    (3.2)
where k_0 is the substrate dielectric constant, q is the electron charge, μ is the carrier mobility, D is the doping concentration, and N is a scaling factor that scales the doping concentration D to the generation rate of electron-hole pairs. The ion-track establishment time constant τ_β can be calculated using the method developed in [WDT + 14].
The probability of the deposited charge Q is required during the characterization of G-LUTs. In this chapter, we adopt the exponential charge probability distribution developed in [RRV + 08], which constructs the charge probability from data points for neutron energy and the corresponding differential flux at sea level given in the JEDEC Solid State Technology Association Standard JESD89 [Sta06]. In this model, the probability of charge is calculated as
P(Q) = a_0 e^(-a_1 Q) + a_2    (3.3)
where a_0, a_1, and a_2 are constants. Figure 3.3 (b) shows an example of the normalized probability of charge deposition.
3.3.2 Generation Lookup Tables (G-LUTs) and Propagation Lookup Tables (P-LUTs)
The characterization results are stored in 2D G-LUTs and P-LUTs. As mentioned in Section 3.1, G-LUTs record the mapping that transforms the radiation-induced SET current pulses into voltage pulses at the striking node, whereas P-LUTs save the mapping from SET voltage pulses at the input of a standard cell to the propagated SET voltage pulse at the output of that cell. In order to make the characterization process general, we conduct a comprehensive investigation of all the factors that could potentially affect the generation and propagation of SETs, including deposited charge, load capacitance, cell state (i.e., input combination), gate type, and gate size. Accordingly, the G-LUTs and P-LUTs are obtained for all possible input vector states, all gate sizes, and a wide range of output capacitances as well as deposited charges for the given technology. Figure 3.4 provides the simulation setup for establishing G-LUTs and P-LUTs. The pulse width measured at 50% of the supply voltage is captured as the main parameter that describes the SET voltage pulse, and the rise and fall times of the input pulses are chosen to be typical values for the P-LUT simulations.
Figure 3.4: Simulation setup for establishing G-LUTs and P-LUTs considering various factors.
Without any loss of generality, the proposed characterization method can be extended to consider more parameters, such as the SET pulse height, and to establish more types of LUTs that capture additional radiation-induced physical effects. The flexibility of the LUT data structures also enables characterization for FinFET technologies. For example, based on [KOTN14], the number of electron-hole pairs generated in a FinFET device for different particle energies can be characterized and stored into LUTs.
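A minimal sketch of how such a 2D characterization table might be queried is given below, assuming a G-LUT indexed by deposited charge and output load capacitance. The grid points and pulse-width entries are made up, and the bilinear interpolation between grid points is shown only as one plausible way to handle queries that fall between characterized points; the text does not prescribe a particular interpolation scheme.

#include <cstdio>
#include <vector>

// 2D characterization LUT of the kind described above: rows indexed by
// deposited charge, columns by output load capacitance, entries storing the
// generated SET pulse width. Values below are placeholders, not HSPICE data.
struct Lut2D {
    std::vector<double> charge;          // row axis (fC)
    std::vector<double> load;            // column axis (fF)
    std::vector<std::vector<double>> pw; // pulse width (ps), pw[row][col]

    // Bilinear interpolation between the four surrounding grid points;
    // queries outside the grid are clamped to the boundary.
    double lookup(double q, double c) const {
        int i = index(charge, q), j = index(load, c);
        double tq = frac(charge, i, q), tc = frac(load, j, c);
        double lo = pw[i][j]     + tc * (pw[i][j + 1]     - pw[i][j]);
        double hi = pw[i + 1][j] + tc * (pw[i + 1][j + 1] - pw[i + 1][j]);
        return lo + tq * (hi - lo);
    }

private:
    static int index(const std::vector<double>& axis, double v) {
        int i = 0;
        while (i + 2 < (int)axis.size() && v >= axis[i + 1]) ++i;
        return i;
    }
    static double frac(const std::vector<double>& axis, int i, double v) {
        double t = (v - axis[i]) / (axis[i + 1] - axis[i]);
        return t < 0.0 ? 0.0 : (t > 1.0 ? 1.0 : t);
    }
};

int main() {
    Lut2D glut;
    glut.charge = {10, 30, 50};          // fC (placeholder grid)
    glut.load   = {1, 2, 4};             // fF (placeholder grid)
    glut.pw     = {{0, 0, 0},            // ps (placeholder entries)
                   {60, 40, 20},
                   {120, 95, 70}};
    printf("pulse width at Q=35 fC, C=1.5 fF: %.1f ps\n",
           glut.lookup(35.0, 1.5));
    return 0;
}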
In general, detailed characterization results lead to high accuracy of the SER results but also to an increased workload in the propagation phase. Therefore, the size of the LUTs should be carefully decided in order to avoid long LUT access times. For a given technology file that contains M combinational standard cells with various sizes and logic functions, the maximum number of input pins of a combinational cell is denoted by C, and the total number of types of k-input combinational cells is denoted by M_k (1 ≤ k ≤ C). For each input combination of each gate size of each combinational standard cell type, we establish (i) one G-LUT that uses the deposited charge and the output capacitance as index keys, and (ii) one P-LUT that uses the input SET pulse width and the output capacitance as index keys. Therefore, we construct M_total = 2 Σ_{k=1}^{C} M_k 2^k two-dimensional LUTs in the characterization phase.
3.3.3 Flip-Flop Characterization
The characteristics of flip-flops play an important role in determining the SER because a SET pulse has to be latched into one of the next-stage flip-flops to become an error. A flip-flop is insensitive to arriving SET voltage pulses that fall outside the latching window (i.e., setup time + hold time). Therefore, the setup time and hold time of the flip-flops are required.
3.4 Propagation Methodology
In this section, we present the SER estimation method and the proposed top-down memoization algorithm for accelerating the SER propagation process.
3.4.1 SER Estimation Method
In the propagation phase, the three important masking factors affecting the propagation of a SET voltage pulse through the combinational circuit are logical masking, electrical masking, and latch window masking [RRV + 08, HW14]. Logical masking occurs if the SET pulse arrives at the input of a gate when at least one of its other inputs has a controlling value. Electrical masking occurs when the SET voltage pulse is attenuated or even completely disappears due to the electrical properties of the gates it propagates through. Latch window masking eliminates the error when the propagated pulse cannot get latched into the flip-flop (as mentioned in Section 3.3.3).
The total SER is the summation of the SER contributed by each node n, calculated as
SER_total = Σ_{n=1}^{N} SER_n    (3.4)
where N is the total number of nodes in the circuit. Each node n is susceptible to particle strikes with deposited charge over the range from Q_min to Q_max. Accordingly, SER_n is calculated as
SER_n = ∫_{Q_min}^{Q_max} P(Q) SER(n, Q) dQ    (3.5)
where SER(n, Q) is the SER induced by the SETs that are generated at node n with deposited charge Q, propagated through the circuit under the electrical and logical masking effects, and latched into the flip-flops under the latching window masking effect. P(Q) is the probability of charge calculated in Equation (3.3). SER(n, Q) can be further formulated as
SER(n, Q) = Σ_{d=1}^{D} P_log(n, d) P_ele(n, Q, d) P_lat(n, Q, d)    (3.6)
where d and D represent the d-th output and the total number of outputs, respectively. The terms P_log(n, d), P_ele(n, Q, d), and P_lat(n, Q, d) in Equation (3.6) represent the logical masking, electrical masking, and latch window masking effects, respectively.
More specifically, the first term P_log(n, d) in Equation (3.6) indicates the total sensitized probability of SETs propagating through the circuit from node n to output d along all paths, which can be computed by accumulating the logic probability (P_side) of non-controlling values on all side-inputs along the paths,
P_log(n, d) = Π_{i ∈ n→d} P_side(i)    (3.7)
where i represents a node along a path from n to d. The logic probability of each signal can be calculated using the correlation coefficient method [Ric89], which considers reconvergent paths. Alternatively, it can be estimated by simulations over a large set of typical vectors (possibly obtained by running a set of benchmark programs), which implicitly considers the reconvergent paths; the second approach is adopted in this chapter. The second term P_ele(n, Q, d) in Equation (3.6) denotes the overall probability of SETs that are generated at node n with charge Q deposited and propagated to output d with recognizable electrical strength along all paths, assuming all side-inputs always have non-controlling values. The third term P_lat(n, Q, d) in Equation (3.6) indicates the probability of the SETs induced by charge Q at node n getting latched into the flip-flop at output d, assuming no logical or electrical masking effects.
In order to calculate Equation (3.6), LUT access functions Φ(·) are required. We denote the generated SET voltage pulse width at node n with charge Q by Φ_gen^{h_n, r}(n, Q), where h_n indicates the gate type of node n and r represents the cell state. With SET pulse width pw at the input of gate j, the propagated pulse width from node i to j is denoted by Φ_prop^{h_i, r}(i, Q, j). These two LUT access functions are associated with the G-LUTs and P-LUTs characterized in Section 3.3, respectively. Therefore, SER_n is computed by generating all possible current pulses at node n, transforming the SET current pulses into voltage pulses by accessing G-LUTs, propagating the SET voltage pulses to the outputs along all possible paths, and determining whether they get latched into one of the flip-flops. The three masking effects are inherently considered during this process. Propagating through one gate requires one P-LUT access. Note that in a large-scale circuit, the SET pulses generated at each node need to be propagated along all possible paths towards the outputs, making the propagation process computationally very expensive.
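The structure of Equations (3.3)-(3.6) can be illustrated with a small numerical sketch: the per-node SER is the charge-weighted integral of the per-output masking products, evaluated below with the trapezoidal rule. The constants a0, a1, a2 and the three masking-probability models are arbitrary stand-ins for the characterized values, so the printed number has no physical meaning.

#include <cstdio>
#include <cmath>
#include <functional>

// Charge deposition probability, Equation (3.3), with placeholder constants.
double charge_probability(double Q) {
    const double a0 = 1.0, a1 = 0.08, a2 = 1e-4;
    return a0 * std::exp(-a1 * Q) + a2;
}

// SER(n, Q): sum over outputs d of P_log * P_ele * P_lat, Equation (3.6).
double ser_node_charge(double Q, int num_outputs,
                       const std::function<double(int, double)>& p_log,
                       const std::function<double(int, double)>& p_ele,
                       const std::function<double(int, double)>& p_lat) {
    double s = 0.0;
    for (int d = 0; d < num_outputs; ++d)
        s += p_log(d, Q) * p_ele(d, Q) * p_lat(d, Q);
    return s;
}

int main() {
    const double q_min = 5.0, q_max = 150.0;         // fC (placeholders)
    const int    steps = 200;
    const int    outputs = 3;

    // Illustrative masking models: logical masking independent of Q,
    // electrical and latching-window masking growing with deposited charge.
    auto p_log = [](int d, double)  { return 0.25 / (d + 1); };
    auto p_ele = [](int,  double Q) { return 1.0 - std::exp(-Q / 40.0); };
    auto p_lat = [](int,  double Q) { return 0.3 * (1.0 - std::exp(-Q / 60.0)); };

    // Trapezoidal integration of Equation (3.5).
    double ser_n = 0.0, dq = (q_max - q_min) / steps;
    for (int i = 0; i <= steps; ++i) {
        double Q = q_min + i * dq;
        double f = charge_probability(Q) *
                   ser_node_charge(Q, outputs, p_log, p_ele, p_lat);
        ser_n += (i == 0 || i == steps) ? 0.5 * f : f;
    }
    ser_n *= dq;
    printf("SER_n (arbitrary units) = %g\n", ser_n);
    return 0;
}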
3.4.2 Top-Down Memoization Algorithm to Accelerate Propagation
In this subsection, an efficient propagation algorithm using the top-down memoization technique is proposed in order to accelerate the computationally expensive SER estimation process. The first step of the proposed algorithm is to levelize the circuit logic and find a node evaluation order such that a node is not evaluated until all of its driving nodes have been evaluated. Several levelization algorithms can be found in [JG03]. The levels of the inputs and outputs are set to 0 and L_max, respectively, and we denote the level of node n by l(n), with 0 ≤ l(n) ≤ L_max. From Equation (3.4), the total SER is calculated by accumulating SER_n over each node n.
Figure 3.5: SER analysis for one node.
Figure 3.5 shows the detailed steps for analyzing the SER of one node. First, all the possible SET high-voltage pulses generated at node n and their associated probabilities form a high-voltage vector (HVV), whereas a low-voltage vector (LVV) stores all the possible low-voltage SET pulses with their associated probabilities.
The high- voltage pulses and low-voltage pulses are separated into two different vectors due to the fact that propagating a high-voltage SET and a low-voltage SET through the same gate with the same pulse width can result in different SET pulse widths at the output. This step requires a number of G-LUT accesses and considers all possible charge and input combinations for generating the SET voltage pulses. Next, the HVV and LVV at node n need to be propagated to its output node(s) in the next level. The example in Figure 3.5 propagates the vectors from a node in leveli into two output branch nodes in level i + 1. The propagation continues until the vectors reach the outputs, and theSER n is calculated by accumulating the probabilities of the SET pulses getting latched into the next stage flip-flops (or other storage components from the technology library). 38 In the propagation process, all the possible paths from the particle striking node to the outputs are automatically covered, and logical masking is considered by the accu- mulated logic probability for non-controlling values for all side-inputs along the paths. Electrical masking effects are considered in the P-LUTs, i.e., if a SET voltage is attenu- ated or disappears, the propagated SET pulse width becomes smaller or 0, respectively. Of note, the propagation process requires a large number of P-LUT accesses. The SER estimation framework is provided in Algorithm 1. We have the following important observations: Observation 1: The nodes that are closer to inputs (i.e., in lower levels) tend to require more P-LUT accesses because the paths from these nodes to outputs involve more gates. Observation2: Repeated P-LUT accesses occur frequently at each node because all upper logic nodes that have at least one path to the outputs involving this node need to propagate vectors through this node, and the vectors from upper logic nodes may have many overlapping SET pulses. Algorithm 1: Overall SER estimation framework / * characterization phase * / 1 Read in the technology library; 2 Generate characterization results such as P-LUTs, G-LUTs and flip-flop timing; / * propagation phase * / 3 Initialize the circuit logic, signal probabilities and clock period; 4 Levelization (); 5 for leveli 0 toL max do 6 foreach nodex in this level do 7 Generate (x); / * generate HVV and LVV * / 8 Analyze (x); / * calculate SER x * / 9 UpdateSER (); / * update SER total * / 10 end 11 end 12 returnSER total ; 39 Based on the above observations, we propose an effective top-down memoization algorithm in order to accelerate the SER propagation. In the proposed algorithm, a high-voltage map (HVM) and a low-voltage map (LVM) are dynamically established at each node, which cache the mappings from the high-voltage SET pulses and low-voltage SET pulses at this node with their overall contributed SER at all outputs, respectively. For each SET pulse in the HVV and LVV to be processed, it first checks the maps and removes this SET entry from the vectors if it is a hit. SER is updated accordingly. The remaining “missing” SETs are propagated to form the HVV and LVV at the next node(s), and the previous steps are repeated. In this way, the “missing” SETs will be propagated recursively until they reach one of the output nodes, where their SER can be finally calculated, and the missing entries in the maps will be filled. 
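A condensed C++ sketch of this caching scheme is given below. It assumes a single pulse polarity, replaces each P-LUT access with a fixed attenuation, and uses a stand-in latching-window model, so it only illustrates how the per-node memo map avoids re-processing overlapping propagations: the cached entry at the reconvergence node is reused when the second path reaches it.

#include <cstdio>
#include <unordered_map>
#include <vector>
#include <cmath>

// Each node caches the SER contribution of a SET pulse width that has already
// been propagated from it, so overlapping propagations are processed once.
struct Node {
    bool is_output = false;
    std::vector<int> fanout;                    // indices of driven nodes
    double side_noncontrolling = 1.0;           // P_side for this node's gate
    std::unordered_map<int, double> memo;       // pulse width (ps) -> SER
};

std::vector<Node> circuit;

double propagate(int n, int pw_ps) {
    if (pw_ps <= 0) return 0.0;                 // pulse electrically masked
    Node& node = circuit[n];
    auto hit = node.memo.find(pw_ps);
    if (hit != node.memo.end()) return hit->second;   // cached result

    double ser = 0.0;
    if (node.is_output) {
        // Stand-in latching-window model: wider pulses are more likely latched.
        ser = 1.0 - std::exp(-pw_ps / 100.0);
    } else {
        for (int f : node.fanout) {
            int attenuated = pw_ps - 10;        // stand-in for a P-LUT access
            ser += circuit[f].side_noncontrolling * propagate(f, attenuated);
        }
    }
    node.memo[pw_ps] = ser;                     // fill the missing entry
    return ser;
}

int main() {
    // Tiny hypothetical netlist: node 0 -> {1, 2}, nodes 1 and 2 -> {3}.
    circuit.resize(4);
    circuit[0].fanout = {1, 2};
    circuit[1].fanout = {3};  circuit[1].side_noncontrolling = 0.5;
    circuit[2].fanout = {3};  circuit[2].side_noncontrolling = 0.25;
    circuit[3].is_output = true;

    printf("SER contribution of a 120 ps SET at node 0: %g\n",
           propagate(0, 120));
    return 0;
}

In the full framework, separate maps are kept for high- and low-voltage pulses (the HVM and LVM), and the temporary HVV and LVV entries are released as soon as their map entries are resolved, which keeps the memory overhead small.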
Then the proposed algorithm goes backwards from this output node to the upper nodes along the path (i.e., return the previous recursive function calls), and fills the missing entries in the maps of those nodes. During this process, the SET entries in the HVV and LVV are released after their corresponding entries in the HVM and LVM are found or filled, in order to reduce the memory overhead. Algorithm 2 summarizes the proposed algorithm. 3.5 Experimental Results In this section, we demonstrate the effectiveness of the proposed framework on a set of ISCAS85 benchmarks. The technology library is 45nm Nangate Open Cell Library [nan09], which is characterized using Predictive Technology Model (PTM) [ptmel]. In the characterization phase, a number of HSPICE simulations are conducted to generate G-LUTs and P-LUTs, and several Perl scripts are used to assist the characterization process. The SER propagation phase is implemented in C++ on a laptop with an Intel Core i5 processor and 8GB RAM, and the proposed algorithm is compared with two 40 Algorithm 2: Top-down memoization algorithm for accelerating SER propagation 1 FunctionAnalyze (nodex) / * HVV and LVV have been prepared at node x * / 2 if HVV and LVV empty then return; / * check the cached results in HVM and LVM * / 3 if HVM not empty then 4 foreach SETpw a in HVV do 5 ifpw a exists in HVM then 6 Update SER and update this entry in HVM; 7 Remove this entry from HVV; 8 end 9 end 10 end 11 if LVM not empty then 12 foreach SETpw a in LVV do 13 ifpw a exists in LVM then 14 Update SER and update this entry in SER; 15 Remove this entry from LVV; 16 end 17 end 18 end 19 if HVV and LVV empty then return; / * process the SETs that are not found * / 20 foreach output nodey of nodex do 21 Propagate (nodey); / * propagate vectors * / 22 end / * update the missing SETs to the maps * / 23 foreach SET in LVV or HVV do 24 Add new entry or update the SER to HVM or LVM; 25 end 26 Delete HVV and LVV at nodex; / * release space * / 27 return; baselines. Baseline 1 is the main baseline algorithm, which is implemented in the same C++ environment except that no memoization technique is applied. In order to make the comparison fair, LUT access functions and SER update functions are shared between baseline 1 and the proposed algorithm. Besides, no parallel processing techniques are allowed, in order to make sure that the speedup is achieved by the proposed algorithm (instead of parallel computing). The proposed algorithm is also compared with baseline 41 Table 3.1: Experimental results of various ISCAS85 benchmark circuits Circuit Information Runtime Comparison SER Circuit #nodes #PI #PO L max Proposed(s) Baseline 1(s) Speedup Baseline 2(s) Speedup Proposed(FIT) c432 233 36 7 32 0.0245 0.5862 23.9 1.77 72.1 4.14E-04 c499 638 41 32 34 0.2407 28.6909 119.2 9.38 39.0 8.13E-03 c880a 433 60 26 28 0.1601 1.4876 9.3 1.79 11.2 2.83E-03 c1355 629 41 33 35 0.2645 26.6811 100.9 9.63 36.4 6.88E-03 c1908 425 33 25 45 0.3408 80.0182 234.8 4.29 12.6 4.47E-03 c2670 872 157 64 27 0.1965 1.6831 8.6 2.47 12.6 5.53E-03 c3540 901 50 22 53 1.3183 738.5540 560.2 9.29 7.0 2.48E-03 c5315 1833 178 123 41 1.7709 105.5140 59.6 8.59 4.9 1.09E-02 c6288 2788 32 32 129 26.5855 6271.8153 235.9 120.50 4.5 1.09E-02 c7552 2171 207 108 39 4.7501 182.4250 38.4 13.2 2.8 7.73E-03 average 1.0074 129.5156 128.3 6.71 22.1 2, which is one of the state-of-the-art SER estimation algorithms [HW14]. 
Baseline 1 is more important than baseline 2, since the platform and detailed implementation of baseline 2 are not available whereas baseline 1 is implemented in the same platform as the proposed algorithm. Table 3.1 concludes the experiments on a variety of ISCAS85 benchmarks. The information of each benchmark is provided, including the number of nodes (#nodes), the number of primary inputs (#PI), the number of primary outputs (#PO), and the number of levels (L max ). The middle columns in Table 3.1 conclude the runtime comparison among the proposed algorithm and the two baselines. The last column provides the final SER of each circuit in terms of failure-in-time (FIT) generated by our SER estimation framework. Results in Table 3.1 demonstrate that the proposed algorithm consistently outper- forms both baselines in all the ISCAS85 benchmark circuits. The proposed algorithm achieves up to 560.2X and 72.1X speedup compared to baseline 1 and baseline 2, respec- tively. Of note, the implementation of the proposed algorithm does not take the advan- tage of parallel computing (in order to make comparison fair between the proposed algo- rithm and baseline 1) and baseline 2 is implemented with parallel computing, resulting in the trend that the proposed algorithm has lower speedups over baseline 2 for larger circuit. The proposed algorithm can be further accelerated using circuit partitioning and 42 parallel computing techniques that have already been exploited in prior works. Differ- ence in SER results between the proposed algorithm and baseline 1 have been observed (< 3%), which is caused by the round up/down of the SET pulse widths (key for the maps) during map checking. Considering the significant speedup achieved by the pro- posed algorithm, this amount of SER difference can be accepted. Besides, the peak memory usage reported by the largest benchmark c6288 is 50.2MB, indicating that the memory overhead from the additional maps is negligible. 3.6 Conclusion In this chapter, a novel top-down memoization algorithm was proposed to accelerate the SER estimation process for combination circuits. The proposed algorithm cached solutions to avoid overlapping SET propagations, enabling fast SER estimation. The run-time of the proposed algorithm was of the order of few seconds for various ISCAS85 benchmark circuits. Compared with the baseline algorithm, the proposed algorithm achieved up to 560.2X speedup with less than 3% difference of SER results. 43 Chapter 4 Schematic and Layout Co-Simulation for Multiple Cell Upset (MCU) Modeling Soft error protection is important since the system-level soft error rate (SER) has been rising with the shrinking geometric dimensions and increasing circuit complexity [FCMG13, MGA + 14]. As for exascale computing and mission critical systems, achiev- ing adequate soft error resilience is crucial due to the huge recovery penalty in large scale computing systems and potentially disastrous results in critical applications, such as automobile and spacecraft [KMH12]. Accordingly, digital circuit designers imple- ment extensive error detection and correction techniques mainly for on-chip SRAMs and register files [ZMM + 06]. Nevertheless, memory protection is not enough for advanced technologies because the soft errors in sequential elements and combinational logic, also referred to as logic soft errors, are significant contributors to the system-level SER [ZMM + 06, MGA + 14]. 
To improve soft error resilience in sequential elements, i.e., flip-flops (FFs) and latches, several redundant structures have been employed to mitigate single-event upsets (SEUs), i.e., change of state resulting from one single particle hit, such as triple modu- lar redundancy (TMR) latch [PK15], feedback redundant SEU-tolerant (FERST) latch [FPME07], and dual interlocked storage cell (DICE) latch [KKM + 14]. Nevertheless, 44 the continued downscaling of device dimensions has propelled the severity and rele- vance of multiple node charge collection mechanisms (e.g., charge sharing and parasitic bipolar effect), making the aforementioned hardened FFs vulnerable to multiple cell upset (MCU), in which a single particle hit causes simultaneous failures at multiple bits (nodes). This necessitates the modeling of MCU effects for estimating SER of hardened FFs in advanced technologies. On the other hand, soft errors in combinational logic, unlike memories and sequential components, cannot be rectified without incurring sig- nificant area and performance overheads [FCMG13]. Therefore, a combinational logic SER estimation tool is required, in order to quantify the degree of soft error tolerance and identify the most vulnerable sites for radiation enhancement. Since accurate soft error vulnerability evaluation is an essential part of cost-effective robust circuit design, many techniques have been proposed in the context of estimating SER of digital circuits [KR11, HW14, CDLQ14, FSK15]. However, the aforemen- tioned works either analyze only SER in combinational circuits or evaluate only soft error resilience in sequential elements, and there lacks a joint investigation of soft error vulnerability in both combinational logic and sequential elements. 4.1 Introduction In this chapter, we propose a comprehensive SER assessment framework for both combi- national and sequential circuits, which considers single-event transients (SETs) in com- binational components as well as FFs without redundancy and accurately characterizes soft errors in several representative radiation-hardened FF structures, i.e., FERST and DICE, by modeling the MCU effects. More specifically, on the subject of combina- tional circuits, we adopt the double exponential current model (described in Section 3.3.1) to describe the SET pulse shape and compute the combinational SER resulted 45 from the process where SETs are generated at particle striking sites, propagated through logically sensitized paths under various masking effects [HHW15], and captured by the storage elements at outputs. As for sequential components, we present a unified schematic and layout co-simulation methodology, which models MCUs in radiation- hardened structures, whereas SET-induced soft errors are characterized for FF structures with no redundancy using circuit simulation. The contribution of this work is threefold. First, we jointly consider radiation- induced soft errors in combinational and sequential circuits for advanced CMOS tech- nologies. Simulation results on a variety of benchmarks demonstrate that both combi- national and sequential components contribute to the total SER. Second, we propose a general schematic and layout co-simulation method for evaluating SER caused by MCUs in redundant storage structures. Simulation results demonstrate that the SER that considers MCUs is significantly higher than the SER without considering MCUs in radiation-hardened structures, indicating the importance of modeling MCUs in advance technologies. 
Third, we compare the area and soft error resilience among different FF structures, which can be used to guide circuit designers to choose the best FF structure based on their needs. 4.2 Improved Overall Flow In this chapter, we examine soft error vulnerabilities in one combinational circuit and its output FF stage (i.e., one sequential stage), which are the most basic blocks in pipelined circuits. The FF in the sequential stage can be non-hardened FF, i.e., transmission gate FF (TGFF), or radiation-hardened FF, such as FERST and DICE. Based on the striking location of the particle, the SER is divided into two parts (i) combinational SER and (ii) 46 Figure 4.1: System diagram of the SER estimation framework. sequential SER, as shown in Figure 4.1. Each part requires a characterization phase and a computation phase. In the characterization phase of the combinational SER, for each standard combina- tional cell in the library, a generation lookup table (G-LUT) is established, which trans- forms the generated SET current pulses into voltage pulses at the striking site, whereas a propagation LUT (P-LUT) is built to record the mapping from voltage pulses at the input of the cell to the propagated voltage pulses at the output for a certain cell state and output capacitance. In addition to the G-LUTs and P-LUTs, we also examine the pulse filtering characteristic of the FFs in the library, since the propagated SET volt- age pulses at outputs need to be captured by the next stage FFs to become an error. The computation phase of the combinational SER is the process where the final SER is calculated by accumulating the probabilities of SETs that are generated at vulnerable nodes, propagated to outputs and captured by FFs. For non-hardened FF structures, the characterization phase is conducted by evaluat- ing the possibilities of seeing soft errors when a radiation particle hits an internal node of the FF under certain cell state and input combinations. Hardened FF structures, on 47 Figure 4.2: Overall flow of the proposed SER estimation framework. the other hand, requires an additional MCU modeling step to estimate the final SER. Figure 4.2 shows the overall flow of the proposed SER estimation framework. 4.3 Combinational Soft Error Rate (SER) Estimation In this section, the characterization phase and computation phase of combinational SER are described. 4.3.1 Characterization Phase of Combinational SER For each combinational standard cell in the technology library of interest, a G-LUT and a P-LUT are built, in order to record the mapping that transforms the radiation-induced 48 SET current pulses into voltage pulses at the striking node and the mapping from SET voltage pulses at the input of a standard cell to the propagated SET voltage pulse at the output of this cell, respectively. We conduct a comprehensive investigation of all the factors that could potentially affect the generation and propagation of the SETs, including deposited charge, load capacitance, gate type, gate size, and cell state (i.e., input combinations). More specifically, for each input combination, each gate size of each combinational standard cell, we establish (i) one 2D G-LUT that uses deposited charge and output capacitance as index keys, and (ii) one 2D P-LUT that uses input SET pulse width and output capacitance as index keys. 
Without any loss of generality, other parameters, such as pulse height and slew, can be taken into consideration by adding more indices to the LUTs at the cost of long LUT accessing time and large memory overhead. In addition to the G-LUTs and P-LUTs, the propagated SET pulse needs to be latched into the output FF before becoming a soft error. According to [HW14], any SET pulse width smaller than the summation of setup time and hold time of the FF at the output is filtered. Therefore, the setup time and hold time of FFs under different FF state (logic value of the FF) are required during the characterization phase of combinational SER. Note that characterizing the FF filtering effect is included in the combinational SER part since it is the last step for a SET generated in the combinational circuit to become a soft error. Figure 4.3 provides the simulation setup for the combinational SER characterization phase. 4.3.2 Computation Phase of Combinational SER Three important masking effects are included in the computation phase: (i) logical masking, where at least one of the other inputs of the propagated gate has a controlling value, (ii) electrical masking, where the SET voltage pulse is attenuated or completely disappears due to the electrical properties of propagated gates, and (iii) latch window 49 Figure 4.3: Simulation setup for the characterization phase of combinational SER. masking, where the propagated pulse cannot get latched into the FFs (as mentioned in Section 4.3.1). The total combinational SER is the summation of SER contributed by each noden in the combinational logic, which is calculated as SER total comb = N X n=1 SER n comb (4.1) whereN is the total number of nodes in the circuit. The detailed calculation is given in Section 3.4.1. As the computation phase is computationally expensive, various tech- niques have been proposed to accelerate the computation phase. In this chapter, we adopt the top-down memoization algorithm developed in Section 3.4.2. 4.4 Sequential SER Estimation In this section, the characterization and computation phase of sequential SER are pre- sented. 50 4.4.1 Characterization Phase of Sequential SER A soft error occurs in a FF when the stored value is flipped by the deposited charge from a particle hit. In redundant FFs, such as FERST and DICE, each storage node is associated with a duplicated node, and a soft error occurs when one storage node and its duplicate node are flipped simultaneously. The probability of two simultaneous hits on redundant nodes by different particles are negligible. However, in advanced technology nodes charge sharing and parasitic bipolar effects contribute to the MCU effects, where a single particle hit flips multiple redundant nodes and the generated charge from this single particle hit is shared among these redundant nodes, leading to the increased SER of radiation-hardened FF structures. The objective of the characterization phase is to build FF LUTs for all the FFs in the technology library, which record the mapping from the deposited charge at internal nodes caused by a single particle hit to whether such a particle strike results in a soft error or not under all the combinations of FF logic states and clock signal states. 
More specifically, for each non-hardened FF, a double exponential current source is connected to each storage node, and the FF LUT records the mapping from different amounts of deposited charge at each storage node to a binary value which represents whether it leads to an error or not, given the FF logic state and clock signal value. On the other hand, for radiation-hardened FF structures, two independent double exponential current sources are attached to one cross-coupled pair of storage nodes, in order to flip the stored value. Figure 4.4 shows an example of two independent current sources I gen1 and I gen2 that are attached to nodes N1 and N2 with deposited charge Q 1 and Q 2 , respectively, which can flip the stored value at two cross-coupled nodes N 1 and N 2 , if Q 1 and Q 2 are large enough. By changing the value of Q 1 and Q 2 , we characterize a Shmoo-like error map for each cross-coupled pair of nodes, under a certain FF logic state and clock signal value, as depicted in Figure 4.5. The layout 51 Figure 4.4: Two independent current sources attached to a pair of cross-coupled storage nodesN 1 andN 2 in a DICE FF. spacing information of radiation-hardened FFs is extracted for the MCU modeling step in the following computation phase. 4.4.2 Computation Phase of Sequential SER For each non radiation-hardened FFx, the total sequential SER is calculated as SER x seq = 1 X d=0 M X m=1 Z Qmax Q min SER gen seq (m;Q;d)P (Q)dQ P (d) (4.2) where d is the FF logic state, P (d) is the probability of logic state d, M is the total number of internal storage nodes, andSER gen seq (m;Q;d) represents the combined SER of internal nodem with deposited chargeQ and FF logic stated under all possible clock signal values (assuming 50% duty cycle). Note that for TGFF, M is equal to 4 since there are 4 internal storage nodes. 52 In order to compute the MCUs by a single particle hit, we denote the ratio of col- lected charge in a node, that is close to the striking site, to the deposited charge at the striking node by the charge collection ratio termsR n (x) andR p (x) for NMOS and PMOS regions, respectively, wherex is the distance between these two nodes. From the heavy ion results in [AWM + 06], the charge collection ratio is exponentially reduced by the distance between the drain and the particle striking site, which is calculated as R n (x) =c n 1 e c n 2 x ; forNMOS R p (x) =c p 1 e c p 2 x ; forPMOS (4.3) wherec n 1 ,c n 2 ,c p 1 , andc p 2 are technology related constants. For hardened FFs, we examine one PMOS region or NMOS region in the layout at a time. Figure 4.6 shows an example, where the PMOS region is examined, which con- tainsK rectangle blocks with a pair of cross-coupled nodesN 1 andN 2 . Each transistor occupies at least one block, and in Figure 4.6, transistor N 1 occupies block 1 and 2, whereasN 2 occupies blockK 1 andK. For the particle hit at blockk with charge Q(k), the charge deposited at block 1 andK areQ(k)R p (x) andQ(k)R p (sx), respectively. Similarly, we can calculate the charge at block 2 and K 1. The total charge at N 1 and N 2 (i.e., Q 1 and Q 2 ) can be calculated by summing the charges in Figure 4.5: An example error map for radiation-hardened structures. 53 Figure 4.6: Modeling MCU effects from layout, where N 1 and N 2 are cross-coupled storage nodes. related blocks, and with these values, we can find whether this hit leads to an soft error or not by checking the error map in Figure 4.5. 
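Before turning to the experimental results, the layout-based MCU computation of Section 4.4.2 can be summarized in a short sketch: the charge collected at each of the two cross-coupled nodes is the deposited charge scaled by the distance-dependent collection ratio of Equation (4.3), and the resulting (Q1, Q2) pair is checked against the characterized error map. The constants, distances, and the single critical-charge threshold standing in for the error map are illustrative placeholders, not values extracted from the 45nm library.

#include <cstdio>
#include <cmath>

// Distance-dependent charge collection ratio, Equation (4.3), with
// placeholder constants for a PMOS region.
double collection_ratio(double x_um, double c1, double c2) {
    return c1 * std::exp(-c2 * x_um);
}

// Stand-in for the Shmoo-like error map of Figure 4.5: an upset is assumed
// when both cross-coupled nodes collect more than a critical charge.
bool error_map_upset(double q1_fC, double q2_fC) {
    const double q_crit_fC = 8.0;               // placeholder critical charge
    return q1_fC > q_crit_fC && q2_fC > q_crit_fC;
}

int main() {
    const double c1 = 1.0, c2 = 1.2;            // placeholder PMOS constants
    const double q_strike_fC = 40.0;            // deposited charge at block k
    const double dist_to_n1_um = 0.4;           // strike site to node N1
    const double dist_to_n2_um = 1.1;           // strike site to node N2

    double q1 = q_strike_fC * collection_ratio(dist_to_n1_um, c1, c2);
    double q2 = q_strike_fC * collection_ratio(dist_to_n2_um, c1, c2);

    printf("Q1 = %.1f fC, Q2 = %.1f fC -> %s\n", q1, q2,
           error_map_upset(q1, q2) ? "MCU-induced soft error"
                                   : "no upset (masked by redundancy)");
    return 0;
}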
This charge-collection and error-map lookup process is repeated for each cross-coupled pair, from block 1 to block K, in each PMOS or NMOS region in the layout. Note that this approach can be applied to either latches or FFs. As for triplicated FF structures such as TMR, the storage elements do not influence each other, and successive simulations with a single double exponential current source are sufficient to compute the SER. With the aforementioned process, we obtain the total SER of a radiation-hardened FF.

Table 4.1: Experimental results of various ISCAS85 benchmark circuits

circuit | #nodes | #PI | #PO (#FF) | TGFF comb./seq./total SER (FIT) | FERST comb./seq./total SER (FIT), diff (%) | DICE comb./seq./total SER (FIT), diff (%) | time (s)
c17     | 12   | 5   | 2   | 5.84E-04 / 1.34E-02 / 1.40E-02 | 3.53E-04 / 9.48E-04 / 1.30E-03, 9.31 | 3.55E-04 / 4.02E-04 / 7.57E-04, 5.42 | 0.001
c432    | 233  | 36  | 7   | 4.14E-04 / 4.64E-02 / 4.68E-02 | 4.08E-04 / 3.32E-03 / 3.73E-03, 7.97 | 4.08E-04 / 1.86E-03 / 2.27E-03, 4.84 | 0.010
c499    | 638  | 41  | 32  | 8.13E-03 / 2.15E-01 / 2.23E-01 | 6.05E-03 / 1.52E-02 / 2.12E-02, 9.52 | 6.05E-03 / 5.99E-03 / 1.20E-02, 5.40 | 0.059
c880a   | 433  | 60  | 26  | 2.83E-03 / 1.72E-01 / 1.75E-01 | 2.40E-03 / 1.23E-02 / 1.47E-02, 8.42 | 2.46E-03 / 7.02E-03 / 9.47E-03, 5.42 | 0.048
c1355   | 629  | 41  | 33  | 6.88E-03 / 2.15E-01 / 2.22E-01 | 6.74E-03 / 1.52E-02 / 2.19E-02, 9.89 | 6.88E-03 / 6.00E-03 / 1.29E-02, 5.82 | 0.066
c1908   | 425  | 33  | 25  | 4.47E-03 / 1.66E-01 / 1.70E-01 | 3.71E-03 / 1.19E-02 / 1.56E-02, 9.15 | 3.73E-03 / 6.67E-03 / 1.04E-02, 6.11 | 0.074
c2670   | 872  | 157 | 64  | 5.53E-03 / 3.30E-01 / 3.35E-01 | 5.30E-03 / 2.37E-02 / 2.90E-02, 8.65 | 5.50E-03 / 1.46E-02 / 2.01E-02, 5.99 | 0.074
c3540   | 901  | 50  | 22  | 2.48E-03 / 1.46E-01 / 1.48E-01 | 2.22E-03 / 1.04E-02 / 1.27E-02, 8.55 | 2.25E-03 / 6.06E-03 / 8.31E-03, 5.61 | 0.251
c5315   | 1833 | 178 | 123 | 1.09E-02 / 6.73E-01 / 6.84E-01 | 1.04E-02 / 4.84E-02 / 5.88E-02, 8.60 | 1.07E-02 / 2.96E-02 / 4.02E-02, 5.88 | 0.495
c6288   | 2788 | 32  | 32  | 1.89E-03 / 2.11E-01 / 2.13E-01 | 1.74E-03 / 1.52E-02 / 1.69E-02, 7.95 | 1.75E-03 / 9.44E-03 / 1.12E-02, 5.26 | 2.537
c7552   | 2171 | 207 | 108 | 7.73E-03 / 3.56E-01 / 3.64E-01 | 7.53E-03 / 2.56E-02 / 3.31E-02, 9.11 | 7.60E-03 / 1.58E-02 / 2.34E-02, 6.44 | 1.117
average |      |     |     | 4.72E-03 / 2.31E-01 / 2.36E-01 | 4.26E-03 / 1.66E-02 / 2.08E-02, 8.83 | 4.33E-03 / 9.40E-03 / 1.37E-02, 5.65 | 0.430

4.5 Experimental Results

In this section, we demonstrate the effectiveness of the proposed joint SER estimation framework for combinational logic and sequential elements on a set of ISCAS85 benchmarks. We use the Nangate 45nm Open Cell Library [nan09], which is characterized with the Predictive Technology Model (PTM) [ptmel]. As mentioned in Section 4.1, we consider one combinational stage and its output FF stage. The combinational part of each benchmark circuit is synthesized with the Nangate 45nm standard cell library using Synopsys Design Compiler, and the FF in the sequential stage can be a TGFF, FERST, or DICE. The layout of each FF is created in Cadence Virtuoso, and the distance information is extracted for the computation phase. The areas of the TGFF, FERST, and DICE are 5.7 μm², 18.15 μm², and 10.45 μm², respectively. In the characterization phase, a number of HSPICE simulations are conducted to generate the G-LUTs, P-LUTs, and FF LUTs, and several Perl scripts are used to assist the characterization process. The computation phase is implemented in C++ on a laptop with an Intel Core i5 processor and 8GB RAM. The propagation algorithm developed in Section 3.4.2 is adopted to speed up the SET propagation in the combinational logic. Table 4.1 summarizes the experiments on a variety of ISCAS85 benchmarks.
The information of each benchmark is provided, including the number of nodes (#nodes), the number of primary inputs (#PI), and the number of primary outputs (#PO), which is also the number of FFs in the sequential stage. The middle columns in Table 4.1 summarize the SER comparison among designs that use different types of FFs, where the SER is expressed in terms of failures-in-time (FIT). For each type of FF, the combinational SER, sequential SER, and total SER are listed, and the last column provides the runtime. The diff columns under FERST and DICE show that the total SER of the FERST and DICE designs is, on average, 8.83% and 5.65% of the total SER of the TGFF design, respectively, demonstrating a significant SER reduction after using radiation-hardened FF structures. Since DICE has more redundant nodes, it achieves a larger SER improvement than FERST consistently across all benchmark circuits. For FERST and DICE, without considering the MCUs, the sequential SER would be 0 for all the benchmark circuits. The proposed approach, which considers MCUs, shows that these redundant FFs are still vulnerable to radiation-induced soft errors due to the MCUs in the 45nm CMOS technology. For all the benchmark circuits, using any type of FF, the combinational SER is comparable to the sequential SER, although the sequential SER is still the major part of the total SER. The time column in Table 4.1 shows the runtime of the propagation phase for each benchmark circuit; the average runtime is 0.430s. Given the originality of our work in combining combinational and sequential elements in a cohesive SER analysis, it is difficult to quantitatively compare to prior work.

4.6 Conclusion

In this chapter, a joint SER estimation framework for both combinational logic and sequential elements was proposed, which considered MCU effects in advanced technologies. Various LUTs were built to characterize different combinational and sequential components, and a general schematic and layout co-simulation method for modeling MCUs was presented. Experimental results on various ISCAS85 benchmark circuits compared different FFs and demonstrated that MCU effects cannot be ignored when estimating the SER of sequential elements in advanced technologies.

Chapter 5
Fast and Comprehensive Soft Error Rate (SER) Evaluation Framework

The aggressive downscaling of process technologies, combined with the reduction in supply voltage [CLN+16], has reduced the particle energy required to upset the state of logic gates, registers, and memory circuits [KMH12]. As the energy threshold for causing a soft error decreases, the number of particles with sufficient energy to cause errors increases rapidly, and the system-level soft error rate (SER) grows significantly [KMH12, FCMG13, MGA+14]. Moreover, various aging effects can increase the SER over time [AvSE+14]. Failure to address these radiation-induced soft errors can lead to silent data corruptions and application failures, with huge recovery penalties in today's exascale computing systems and potentially disastrous results in mission-critical systems, such as medical systems, automobiles, and spacecraft [KMH12, GBR+12]. Therefore, logic circuits, especially those in large-scale computing systems and mission-critical applications, should have an adequate degree of soft error resilience. The soft errors generated in combinational logic and sequential elements are referred to as logic soft errors [ZMM+06].
These logic soft errors, once latched in stage flip-flops (FFs), can propagate through the subsequent sequential logic and may appear in primary outputs more than once [MZM07]. Hence, it is imperative to analyze the propagation of logic soft errors in sequential logic, in order to accurately estimate the total SER. 57 5.1 Introduction As soft error vulnerability evaluation is an essential part of cost-effective robust circuit design, considerable research efforts have been invested in accurately characterizing and propagating soft errors in combinational circuits [RCBS07, RRV + 08, HW14, CDLQ14] and sequential circuits [MZM07, WA10, ESCT15, EET + 15]. However, the prior works have mainly focused on the accuracy of the SER estimation results, and the efforts to improve the runtime and scalability of the estimation process are limited to circuit par- titioning, where the circuit of interest is divided into sub-blocks for parallel processing, and the parallel fault simulation, in which multiple errors are injected and analyzed in parallel. Nevertheless, the SER estimation process has become computationally expen- sive and the reason is threefold: (i) the SER estimation process is required to process a large set of parameters, in order to describe various complex effects, (ii) runtime of the estimation process increases near-exponentially as the size and logic complexity of the circuit increases [LD16a], and (iii) the analysis of propagation of soft errors in sequen- tial circuits takes more than one clock cycle, as a single particle hit can affect outputs for several clock cycles. Moreover, the prior works have not considered the MCU effects during the soft error evaluation of radiation-hardened sequential elements in the circuit layer, which may result in over optimistic SER estimation results. In this chapter, we propose an efficient and comprehensive SER assessment frame- work for combinational and sequential circuits, which significantly improves the run- time of the SER estimation process and accurately accounts for the MCU effects. The proposed SER estimation framework is comprised of two phases: (i) characterization and (ii) computation. In the characterization phase, we consider single-event transients (SETs), i.e., single transient induced by one particle hit, in both combinational compo- nents and non-hardened sequential elements, and the MCU effect is modeled for redun- dant FF structures, such as FERST and DICE. Various two-dimensional lookup tables 58 (LUTs) are created to save the characterization results. The double exponential current pulse model described in Section 3.3.1 is adopted in this chapter. In the computation phase, the generation of soft errors at stage FFs is considered by two sources: (i) in combinational circuits, the SETs are generated at the particle striking sites, propagated through logically sensitized paths under various masking effects [HHW15], and cap- tured by the storage elements in the stage FFs, and (ii) in sequential elements, the soft errors are induced by direct particle hit at internal storage nodes. These generated soft errors in the stage FFs need to be further propagated through the subsequent clocks, in order to accurately evaluate the final SER at the primary outputs. 
Our main contributions can be summarized as follows: We carry out a detailed analysis on the soft error vulnerabilities, and determine the key parameters that need to be extracted during the characterization process for both combinational circuits and sequential elements in advanced CMOS tech- nologies. We propose a general schematic and layout co-simulation approach to accurately model the MCU effects in redundant storage structures [LD16a]. Unlike prior works [RCBS07, HW14, CDLQ14, MZM07, WA10, ESCT15, EET + 15] that mainly focus on accuracy, we aim to improve runtime and scal- ability in the computation phase and meanwhile achieving high accuracy in the characterization phase. This is because the characterization steps need to be per- formed only once for each technology library, whereas the computation phase needs to be performed for each circuit of interest. A novel top-down memoiza- tion algorithm is proposed to accelerate the computationally expensive process of transient pulse propagations [LD16a]. We jointly consider particle hits at both combinational circuits and sequential ele- ments, and propose an efficient and comprehensive analysis of combinational and 59 sequential circuits, where the time frame expansion method is used to efficiently calculate the SER contributed by the propagation of soft errors in the sequential logic. The proposed SER estimation framework can also accommodate new technolo- gies, such as FinFET [CLS + 16] and gate-all-around (GAA) devices [WSC + 15], using device-circuit cross-layer characterization for these devices and flexible LUT data structures to store the results. Besides, the proposed top-down memo- ization algorithm has the potential to be applied to multiple transient propagation, if the propagation of each single transient pulse can be separated and accelerated independently after considering the interaction of the SET pulses that are induced by a single particle hit. Experimental results on various ISCAS85 combinational benchmarks demonstrate that the proposed top-down memoization algorithm achieves up to 560.2X speedup with less than 3% SER difference compared to the baseline algorithm. Results on ISCAS89 combinational and sequential benchmarks demonstrate that MCU effects cannot be ignored in hardened FFs, and the runtime of the proposed SER estimation framework is of the order of hundreds of seconds, even for a relatively large scale combinational and sequential circuit (with more than 3,000 FFs and more than 17,000 gates). 5.2 Flowchart of the Proposed Framework In this chapter, we consider a sequential circuit that is comprised of combinational logic and sequential elements, where the stage FFs can be non-hardened FFs, e.g., transmis- sion gate FFs (TGFFs), or hardened FFs, such as FERSTs and DICEs. Figure 5.1 shows the system diagram of soft error generation and propagation in a circuit. The particle hit can occur in the combinational part or sequential elements. When particles strike 60 Figure 5.1: System diagram of soft error generation and propagation in a circuit. combinational cells, electron-hole pairs are generated inside these transistors, leading to parasitic transient current pulses at the striking nodes. If propagated through the sensi- tized paths towards the primary outputs, these transient pulses will become soft errors. 
The induced SET pulses may also get latched into the stage FFs and propagate through the combinational logic many times in the subsequent clock cycles, which may appear at the primary outputs more than once in the following clock cycles. If particles hit the storage nodes of the stage FFs with sufficient energy, it may flip the stored values in non-hardened FFs, and single particle strikes can affect the hardened FFs through the MCU effects. In the characterization phase, for each standard combinational cell in the technol- ogy library, a generation lookup table (G-LUT) is built to transform the generated SET current pulses into voltage pulses, whereas a propagation lookup table (P-LUT) is char- acterized to keep the mapping from voltage pulses at the input of a cell to the propagated voltage pulses at the output for a certain cell state and load capacitance [LD16a]. The 61 timing of each FF is also examined since the propagated SET pulses need to be wide enough to get latched into the FFs. For non-hardened FF structures, the possibilities of having soft errors when radiation particles hit internal nodes under certain cell state are evaluated, whereas the layout information of hardened FF structures is extracted and the vulnerability of cross-coupled storage nodes is evaluated by running HSPICE simulations. In the computation phase, we first evaluate the soft errors generated in the combina- tional logic. More specifically, for each primary output (or stage FF), we accumulate the probabilities of SETs, which are generated at vulnerable combinational nodes and prop- agated to this primary output (or latched into this stage FF). A top-down memoization algorithm is proposed to accelerate this process. Next, we analyze the soft errors that are directly generated in the FFs, where a general schematic and layout co-simulation method is proposed to model the MCU effects in hardened FF structures. After that, the sequential circuit is unrolled into several time frames for propagating the soft errors from the stage FFs through the combinational logic in the subsequent clocks, in order to accurately evaluate the final SER. For FinFET and GAA devices [LXW + 15, WSC + 15], device-circuit cross-layer sim- ulations and radiation experiments are required for characterization and validation of the models in the characterization phase, where the flexible LUT data structure can store the characterization results. On the other hand, the computation phase remains the same. As a typical technology library contains many corners under different supply voltages and temperatures, it is imperative to consider all the corners and re-characterize the FFs, G-LUTs and P-LUTs under each corner with the adjusted SET current pulse model. To be conservative, the circuit of interest should be evaluated under all the corners in order to find the worst case SER. Figure 5.2 shows the overall flow of the proposed SER estimation framework. 62 Figure 5.2: The flowchart of our SER estimation framework. 5.3 Combinational Logic Characterization This section provides the detailed characterization steps for combinational standard cells. 5.3.1 Generation and Propagation As mentioned in Section 5.1, the characterization results of soft error generation and propagation for combinational standard cells are saved in 2D G-LUTs and P-LUTs, respectively. 
G-LUTs record the mappings that transform the radiation-induced SET 63 current pulses into voltage pulses at the striking sites, whereas P-LUTs save the map- pings from SET voltage pulses at the input of a combinational standard cell to the propa- gated SET voltage pulses at the output of this cell. In order to make the characterization process general, we consider all possible input combinations, gate sizes, and a wide range of output capacitances as well as deposited charges for the given technology dur- ing the process of establishing G-LUTs and P-LUTs. More specifically, for each gate size and input combination of each standard cell, we establish (i) one 2D G-LUT, where deposited charge and output capacitance are the index keys, and (ii) one 2D P-LUT, that uses input SET pulse widths measured at 50% supply voltage and output capacitance as index keys. The simulation setup for building G-LUTs and P-LUTs is shown in Figure 5.3. The LUT structure is suitable for soft error characterization in new technologies. For example, in reference work [KOTN14], an LUT is used to record the number of generated electron-hole pairs in a FinFET device under strikings with different particle energies. With no loss of generality, the proposed LUT-based characterization method can accommodate more parameters that capture additional radiation-induced physical effects, e.g., SET pulse height. Note that detailed characterization results lead to high Figure 5.3: Simulation setup for characterizing a combinational standard cell consider- ing various factors. 64 Figure 5.4: Latching window masking mechanism. (a) an example where pulse A and C are masked (b) simulation setup for characterizing latching windows in FFs. accuracy of SER results but also increased workload for the computation phase. There- fore, the size of LUTs should be carefully decided, in order to avoid long LUT access time. 5.3.2 Latching Window Characterization After the SET pulses propagated through combinational logic under logic and electrical masking effects, only those pulses with enough strength positioned around the down- stream FF closing edge will be captured. According to reference work [WA10], a latch- ing window (t w ) is a duration bounded by the setup time (t setup ) and hold time (t hold ) around the active clock edge of a flip-flop, i.e.,t w =t setup +t hold . A flip-flop is insensi- tive to the arrival SET voltage pulses that fall outside the latching windowt w . Therefore, the latching window masking effects should be characterized for the FFs under different logic states. Figure 5.4 (a) and (b) provides an example of latching window masking mechanism and the required characterization steps, respectively. 5.4 Sequential Element Characterization This section describes the detailed characterization steps for sequential standard cells. 65 5.4.1 Non-hardened Flip-Flop (FF) Characterization For each non-hardened FF, a soft error occurs when its value is flipped by the deposited charge from a particle hit at the storage node. An accurate characterization should be conducted in the device level, which requires the simulation of the 3-D structure of the FF under different radiation particle energies. However, this step is computationally expensive and the 3-D structure of the non-hardened FF in the technology that is used in our experiments is not available. 
Therefore, in this chapter, an alternative approach is applied: we connect a double exponential current source to each storage node in the TGFF of interest and run a series of simulations to establish an FF LUT, which stores the mappings from different amounts of deposited charge at each storage node to a binary value indicating whether an SEU occurs or not, under a certain FF logic state and clock signal value.

5.4.2 Hardened-FF Characterization

Unlike non-hardened FFs, redundant FFs, such as FERST and DICE, have at least one duplicated node for each storage node, and a soft error can occur only when both the storage node and its duplicate are flipped simultaneously. The probability of two particles simultaneously striking both nodes is negligible. Nevertheless, the MCU effect, which is exacerbated by charge sharing and parasitic bipolar effects in advanced technology nodes, significantly increases the SER of hardened FF structures. In order to model the MCU effects, we need to characterize the hardened FFs in a different way from the non-hardened FFs. More specifically, in the technology library used in our experiments, each storage node of a radiation-hardened FF structure, i.e., FERST or DICE, is associated with a duplicated node. Figure 5.5 (a) shows the schematic of a FERST latch, where N_1 and N_3 are coupled with N_2 and N_4, respectively. In order to flip the stored value, two independent current sources I_{gen1} and I_{gen2} are attached to the cross-coupled nodes N_1 and N_2 with deposited charges Q_1 and Q_2, respectively. The stored value is flipped when Q_1 and Q_2 are large enough, and a Shmoo-like error map, as depicted in Figure 5.5 (b), is built for each cross-coupled pair of nodes, which records the mapping from the deposited charge values at the cross-coupled nodes to whether an error is induced, given a certain FF logic state and clock signal value. Similar steps are repeated for the other cross-coupled nodes, e.g., N_3 and N_4 in Figure 5.5 (a).

Figure 5.5: Hardened-FF characterization. (a) An example simulation setup for a FERST latch. (b) A sample error map.

5.5 Combinational SER Computation

In this section, we present the detailed steps of the combinational SER computation. Three important masking effects are considered during the computation phase: (i) logical masking, where at least one of the other inputs of the propagated gate has the controlling value, (ii) electrical masking, in which the transient voltage pulse is attenuated or completely disappears due to the electrical properties of the propagated gates, and (iii) latching window masking, where the propagated pulse cannot be latched into the FFs.

5.5.1 Combinational SER Computation Method

As shown in Figure 5.1, the outputs of the combinational logic are primary output (PO) nodes and next-stage FFs that store the next state (NS) values. The total combinational SER, which is the summation of the SER contributed by each combinational node n, is calculated as

SER_{comb}^{total} = \sum_{n=1}^{N_{comb}} SER_{comb}^{n}    (5.1)

where N_comb is the total number of combinational nodes in the circuit. Each combinational node n is susceptible to particle strikes with deposited charge over the range Q_min to Q_max.
Hence, SER_{comb}^{n} is calculated as

SER_{comb}^{n} = \int_{Q_{min}}^{Q_{max}} P(q) \left[ \sum_{i=1}^{N_{po}} P_{err}(n, i, q) + \sum_{i=1}^{N_{ns}} P_{err}(n, i, q) \right] dq    (5.2)

where P(q) is the deposited charge distribution function in [RRV+08]; N_po and N_ns denote the total numbers of PO nodes and NS nodes, respectively; and P_err(n, i, q) is the SER induced by the SETs that are generated at node n with deposited charge q, propagated through the circuit, and latched into node i, which is either an FF node or a PO node, under the three masking effects. Note that the errors that are propagated to the NS nodes need to be analyzed in the subsequent clocks. The term P_err(n, i, q) is calculated as

P_{err}(n, i, q) = P_{gen}(n, q) \cdot P_{log,ele}(n, i, q) \cdot P_{lat}    (5.3)

where P_gen(n, q) indicates the SET generation at node n with deposited charge q, P_{log,ele}(n, i, q) represents the total sensitized probability of the SET generated at node n with charge q propagating through the circuit from node n to node i along all paths with recognizable electrical strength under logical masking and electrical masking, and P_lat represents the latch window masking effect. The second term P_{log,ele}(n, i, q) in Equation (5.3) is further calculated by accumulating the logic probability (P_{side,ele}) of non-controlling values on all side-inputs along the paths under electrical masking effects:

P_{log,ele}(n, i, q) = \prod_{k \in n \to i} P_{side,ele}(k)    (5.4)

where k represents one node along the path from n to i. The logic probability of each node can be calculated using the correlation coefficient method [Ric89], which considers reconvergent paths. Alternatively, it can be estimated by simulating a large set of typical vectors (possibly obtained by running a large set of benchmark programs), where the reconvergent paths are considered implicitly. In this chapter, we adopt the second approach. The last term P_lat in Equation (5.3) can be calculated as

P_{lat} =
\begin{cases}
0, & d < t_w \\
\dfrac{d - t_w}{T_{clk}}, & t_w \le d \le t_w + T_{clk} \\
1, & d > t_w + T_{clk}
\end{cases}    (5.5)

where d is the duration of the SET pulse width propagated from node n to output i with enough electrical strength, t_w is the latching window mentioned in Section 5.3.2, and T_clk is the clock period. With the characterization results in the various LUTs, SER_{comb}^{n} is calculated by generating all possible current pulses at node n, transforming the SET current pulses to voltage pulses by accessing the G-LUTs, propagating the transient voltage pulses to the outputs along all possible paths, and latching the SETs. During this process, the three masking effects are inherently considered.
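A small sketch of how the latching-window term of Equation (5.5) and the per-path error probability of Equations (5.3)-(5.4) might be evaluated is given below. The function names and the flat list of side-input probabilities are illustrative simplifications; the actual computation traverses the circuit graph and applies electrical masking through the P-LUTs during propagation.

```python
def latching_probability(d, t_w, t_clk):
    """P_lat of Equation (5.5): a pulse narrower than the latching window t_w
    is never captured, a pulse wider than t_w + T_clk is always captured, and
    the capture probability grows linearly in between."""
    if d < t_w:
        return 0.0
    if d > t_w + t_clk:
        return 1.0
    return (d - t_w) / t_clk

def path_error_probability(p_gen, side_input_probs, d, t_w, t_clk):
    """P_err of Equation (5.3): generation probability, times the product of
    non-controlling side-input probabilities along the sensitized path
    (Equation (5.4)), times the latching-window term."""
    p_log_ele = 1.0
    for p in side_input_probs:
        p_log_ele *= p
    return p_gen * p_log_ele * latching_probability(d, t_w, t_clk)
```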
5.6 Sequential SER Computation using Time Frame Expansion

The evaluation of the vulnerability of different FFs under particle strikes is given in Section 4.4.2. In order to deal with the feedback paths among different FFs, the entire sequential circuit is unrolled into W stages, as shown in Figure 5.6, where each stage contains the same combinational logic and stage FFs. Because the probability of seeing two consecutive particle hits within a few clock cycles is extremely small, particle hits are only introduced in stage 1, and no particle strikes occur in the remaining stages. In stage 1, the radiation-induced transient pulses are generated in both the combinational circuits and the stage FFs. The generated SETs in the combinational block may appear at POs, be masked during the propagation, get latched into the stage FFs, or have a combination of the aforementioned outcomes. In stage 1, each stage FF also has a certain SER generation rate caused by particles striking its internal storage nodes. We denote the SER of FF i in the j-th stage by SER_seq(i, j). The SER of stage FF i after the 1-st stage is calculated as

SER_{seq}(i, 1) = SER_{seq}^{gen}(i) + \int_{Q_{min}}^{Q_{max}} P(q) \sum_{n=1}^{N_{comb}} P_{err}(n, i, q)\, dq    (5.6)

Figure 5.6: Time frame expansion for sequential circuits.

where the first term represents the soft errors generated in its internal nodes, and the second term is caused by the SETs propagated from the combinational circuit. The expanded frames are used to analyze the cumulative contribution of the errors induced in stage 1 to the overall circuit SER at the POs in the following stages. The error propagation in the following stages essentially propagates errors from the previous FFs through the combinational logic. The SER of FF i in the w-th stage (1 < w ≤ W) is calculated as

SER_{seq}(i, w) = \sum_{k=1}^{N_{ns}} P_{log}(k, i) \cdot SER_{seq}(k, w-1)    (5.7)

where P_log(k, i) is the logic probability of propagating errors from the upstream FF k to the downstream FF i considering all potential paths under logic masking. The contributed SER at stage w decreases drastically as the stage number w increases. This convergence is also observed in prior works [MZM07, WA10], where W is a fixed number. In this chapter, the ending stage W is set to ten since there is no significant increase (> 0.1%) in the total SER when W is increased to a very large value (e.g., 1000). Therefore, the overall SER is calculated as

SER_{seq}^{total} = \sum_{n=1}^{N_{comb}} \sum_{i=1}^{N_{po}} \int_{Q_{min}}^{Q_{max}} P(q) P_{err}(n, i, q)\, dq + \sum_{w=2}^{W} \sum_{i=1}^{N_{po}} \sum_{k=1}^{N_{ns}} P_{log}(k, i) \cdot SER_{seq}(k, w)    (5.8)

where the first term represents the total SER contributed by the combinational logic in stage 1, and the second term indicates the total SER contributed in the following stages. Note that P_log(k, i) in the second term indicates the logic probability of propagating errors from the upstream FF k to the primary output node i. Algorithm 3 provides the high-level flow of the proposed SER estimation framework for combinational and sequential circuits.

Algorithm 3: Overall SER Estimation Framework
  /* characterization phase */
  1  Read in the technology library
  2  Generate characterization results such as P-LUTs, G-LUTs and FF timing
  3  Evaluate and extract layout information of FFs
  /* computation phase */
  4  Initialize the circuit logic, signal probabilities and clock period
  5  Levelization()
  6  for level i ← 0 to L_max do
  7      foreach node x in this level do
  8          Generate(x)    /* generate HVV and LVV */
  9          Analyze(x)     /* calculate SER_comb^x */
  10         UpdateSER()    /* update the combinational term in SER_seq(i, 1) */
  11     end
  12 end
  13 foreach FF i do calculate SER_seq^gen(i)
  14 Time frame expansion(W)    /* unroll the circuit into W time frames */
  15 foreach FF i do calculate SER_seq(i, 1) by (5.6)
  16 for w ← 2 to W do
  17     Propagate()    /* do propagation in the next time frame */
  18     Update SER by (5.8)
  19 end
  20 return SER_seq^total

5.7 Experimental Results

In this section, we demonstrate the effectiveness of the proposed framework on a DDR3 controller (DDR3ctrl) operating at 625MHz for a Micron DDR3 SDRAM MT41J128M8 [ddr14] and a set of ISCAS89 benchmarks with feedback loops. In our experiments, the Nangate 45nm Open Cell Library [nan09], which is characterized with the Predictive Technology Model (PTM) [ptmel], is used to synthesize the aforementioned benchmark circuits using Synopsys Design Compiler.
The entire computation phase is implemented in C++ on a MAC laptop with an Intel Core i5 processor and 8GB RAM. In the charac- terization phase, G-LUTs and P-LUTs are established for each combinational standard 72 cell by running a number of HSPICE simulations with the assistance of several Perl and Python scripts. In this chapter, we consider one non-hardened FF, i.e., TGFF, and two hardened FFs, i.e., FERST and DICE. The layout of each FF is created in Cadence Vir- tuoso, and the areas of TGFF, FERST and DICE are 5:7m 2 , 18:15m 2 , and 10:45m 2 , respectively. For each FF, the latching window is measured and the distance information is extracted for the computation phase. The stage number W is set to 10 since no significant SER increase (> 0:1%) is observed even when W is increased to 1000. Table 5.1 concludes the experimental results. The information of each benchmark is provided, including the number of com- binational gates (#gates), the number of primary inputs (#PIs), the number of primary outputs (#POs), and the number of FFs (#FFs). The middle columns in Table 5.1 con- clude the SER comparison among designs that use different types of FFs. The average runtime of estimating designs with different FFs is also provided, and the scalability performance is evaluated using the metric runtime Lmax#gates in the last column. Note that for FERST and DICE, without considering the MCUs, the sequential SER is 0 among all the benchmark circuits, whereas the proposed method shows that these redundant FFs are still vulnerable to radiation-induced soft errors due to the MCUs in the 45nm CMOS technology. The ratio columns in FERST and DICE show that the overall SER of circuits using FERST and DICE is 9.15% and 6.02%, respectively, as a percentage of the SER of circuits using TGFF on average, showing significant SER improvement achieved by applying radiation-hardened FF structures. It can also be observed from the diff columns that DICE achieves more SER improvement than FERST consistently in all benchmark circuits, and the reason is that DICE has twice redundant nodes as FERST. The average runtime of the proposed SER estimation framework is 7.20s, and one can observe that the proposed framework is able to analyze a relatively large scale sequential circuit (e.g., DDR3ctrl that contains 3,214 FFs and 17,182 gates) in up to 119.23s. 
The 73 Table 5.1: Experimental Results of Various ISCAS89 Combinational and Sequential Benchmark Circuits circuit information TGFF FERST DICE time scalability circuit #gates #PIs #POs #FFs SER(FIT) SER(FIT) ratio(%) SER(FIT) ratio(%) (s) runtime(s) Lmax#gates s27 10 4 1 3 2.41E-03 2.14E-04 8.89 1.29E-04 5.35 0.001 1.1E-05 s349 161 9 11 15 4.43E-02 3.78E-03 8.53 2.15E-03 4.86 0.009 1.8E-06 s382 158 3 6 21 3.01E-02 2.58E-03 8.57 1.68E-03 5.59 0.006 1.8E-06 s400 164 3 6 21 3.44E-02 2.98E-03 8.64 1.97E-03 5.71 0.007 2.0E-06 s420 218 18 1 16 3.63E-03 3.82E-04 10.55 2.33E-04 6.42 0.015 2.1E-06 s444 181 3 6 21 3.11E-02 2.70E-03 8.71 1.73E-03 5.57 0.007 1.5E-06 s510 211 19 7 6 1.75E-02 1.81E-03 10.32 1.15E-03 6.57 0.027 7.5E-06 s641 379 35 24 19 2.14E-02 2.58E-03 12.04 1.89E-03 8.82 0.023 2.0E-06 s713 393 35 23 19 1.93E-02 2.41E-03 12.50 1.81E-03 9.38 0.028 2.4E-06 s820 289 18 19 5 4.56E-02 4.01E-03 8.78 2.00E-03 4.37 0.020 3.8E-06 s820a 289 18 19 5 5.43E-02 4.82E-03 8.87 3.24E-03 5.97 0.019 3.1E-06 s832 287 18 19 5 5.89E-02 5.04E-03 8.56 3.48E-03 5.92 0.027 4.4E-06 s1196a 529 14 14 18 7.68E-03 8.42E-04 10.97 6.50E-04 8.46 0.071 5.1E-06 s1238 508 14 14 18 1.39E-02 1.50E-03 10.74 9.91E-04 7.11 0.077 5.6E-06 s1238a 508 14 14 18 9.64E-03 1.03E-03 10.71 7.27E-04 7.54 0.075 4.6E-06 s1423 657 17 5 74 1.96E-02 1.84E-03 9.39 1.19E-03 6.08 0.093 2.8E-06 s1488 653 8 19 6 1.39E-01 1.40E-02 10.05 1.02E-02 7.33 0.069 4.1E-06 s5378 2779 35 49 179 6.14E-02 5.72E-03 9.32 4.00E-03 6.52 0.274 3.5E-06 s9234 5597 36 39 211 9.94E-03 1.12E-03 11.31 8.13E-04 8.18 0.050 2.4E-07 s9234a 5597 36 39 211 2.85E-02 2.27E-03 7.97 1.38E-03 4.84 0.156 6.1E-07 s13207 7951 62 152 638 8.68E-01 7.22E-02 8.32 4.35E-02 5.01 1.210 2.9E-06 s15850 9772 77 150 534 5.20E-01 4.36E-02 8.39 2.88E-02 5.53 2.553 3.6E-06 s35932 16065 35 320 1728 1.24E+00 1.02E-01 8.21 5.53E-02 4.47 25.605 6.4E-05 s38417 22179 28 106 1636 6.45E-01 4.88E-02 7.57 2.85E-02 4.43 30.429 2.2E-05 DDR3ctrl 17182 62 92 3214 4.98E-01 4.25E-03 0.85 2.94E-03 0.59 119.23 8.6E-05 average 1.77E-01 1.33E-02 9.15 8.02E-03 6.02 7.20 9.7E-06 peak memory usage reported by the largest benchmark DDR3ctrl is 916.1MB. Given the originality of our work in a cohesive SER analysis in advanced technologies, it is diffi- cult to quantitatively compare to prior work in terms of SER. As for scalability, one can observe from the last column of Table 5.1 that the scalability metric runtime Lmax#gates remains near the same level for s27 to s15850 but shows an increase for s35932, s38417, and DDR3ctrl. This indicates that (i) under the same level of #FFs, the average runtime of analysis per gate and per logic level is near the same for circuits of different scales, and (ii) the runtime per gate and per logic level will increase as the #FFs grows due to the MCU evaluation for FFs and more complicated soft error propagation in the subsequent clocks after particle strikes. 74 5.8 Conclusion In this chapter, a fast, scalable and comprehensive SER estimation framework for com- binational and sequential circuits was proposed, which was comprised of a characteriza- tion phase and a computation phase. In the characterization phase, various LUTs were built to characterize different combinational and sequential components, whereas in the computation phase, a top-down memoization algorithm was proposed to accelerate the SET propagation. 
In addition, a time frame expansion approach was presented for sequential logic analysis, and a general schematic and layout co-simulation approach was proposed to model the MCU effects in redundant FF structures. Experimental results on various ISCAS89 combinational and sequential benchmark circuits demonstrated the importance of considering MCU effects, as well as the significant SER improvement after replacing the TGFF with DICE or FERST. The runtime of the proposed algorithm was on the order of tens of seconds for the various ISCAS89 benchmark circuits.

Chapter 6
DRL-Cloud: Deep Reinforcement Learning-Based Resource Provisioning and Task Scheduling for Cloud Service Providers

While soft errors greatly affect systems based on conventional computing, recently advanced emerging computing systems, i.e., the Deep Neural Network (DNN) and the Deep Convolutional Neural Network (DCNN), can not only achieve record-breaking performance in many detection and recognition tasks but also completely tolerate soft errors. These emerging resilient systems show a new path for resilient computing. Nevertheless, there are two major limitations of such resilient systems: (i) the best performance is achieved mainly for detection and recognition tasks, which raises the question of how to apply such resilient systems to control problems with broader impact, for example, the cloud resource allocation problem; and (ii) such systems are resource-intensive, which limits their application in wearable/IoT devices with tight resource constraints [DLW+17, YKD+18, DLZ+17], so further efficiency improvement must be achieved to promote their adoption. Therefore, the rest of this thesis contributes to overcoming the aforementioned limitations. More specifically, this chapter is focused on the first limitation, and Chapters 7-10 are focused on the second limitation.

6.1 Introduction

In this chapter, we propose the DRL-Cloud framework, which is the first DRL-based, highly scalable and adaptable RP and TS system with the capability to handle large-scale data centers and changing user requests [CLN18]. In this chapter, a general type of realistic pricing policy comprised of time-of-use pricing (TOUP) and real-time pricing (RTP) [MRLG10, LWL+15] is used. In addition, a Pay-As-You-Go billing agreement (as in GAE and EC2) is used. All deadlines are hard deadlines, and a task will be rejected if its hard deadline is violated. DRL-Cloud is comprised of two major parts: (i) user request acceptance and decoupling into a job queue and a task ready queue; and (ii) energy cost minimization by our DRL-based two-stage RP-TS processor, where fast convergence is guaranteed by training techniques in deep Q-learning such as the target network and experience replay. The contributions of this work are as follows:

- Applying DRL to RP and TS. To the best of our knowledge, this is the first work to present a DRL-based RP and TS system to minimize energy cost for CSPs with large-scale data centers and large amounts of user requests with dependencies. A two-stage RP-TS processor based on DRL is designed to automatically generate the best actions to obtain the minimum long-term energy cost by learning from the changing environment, and its multi-stage structure gives the proposed DRL-Cloud high efficiency and high scalability.
- Semi-Markov Decision Process (SMDP) formulation. The cloud resource allocation and energy cost minimization problem is formulated as a semi-Markov decision process, because DRL-Cloud receives user requests that introduce randomness and the resource utilization status of the data centers can be formulated as an MDP. State and action spaces are defined in both stages of the RP-TS processor, and both are large but finite.

- Fast convergence and high adaptability. The training algorithm of the proposed DRL-Cloud is fully parallelizable, which empowers the system with robustness, efficiency, and the ability to evolve steadily. The use of training techniques such as experience replay and target networks makes DRL-Cloud converge in less than 0.5 seconds and gives it high adaptability and low runtime.

- Remarkably low runtime and low energy cost. Compared to FERPTS, one of the state-of-the-art methods that considers historical resource allocation and current server utilization, DRL-Cloud achieves up to 3X energy cost efficiency improvement while maintaining an up to 2X lower user request reject rate (hard deadline violation rate) and up to 92X runtime reduction. Compared to the Round-robin method, which is known for its remarkably low runtime, DRL-Cloud achieves up to 12X runtime reduction and 2X energy cost efficiency improvement, and rejects up to 15X fewer user requests.

6.2 System Model for DRL-Based Energy Cost Minimization in Cloud Computing

The system model (as shown in Figure 6.1) is introduced in this section; it includes the user workload model, cloud platform model, energy consumption model, and price model.

6.2.1 User Workload Model

In this chapter, the entire user workload is composed of a number of jobs (i.e., user requests), each of which contains several tasks with dependencies.

Job Characteristics

Directed Acyclic Graphs (DAGs) are used to model jobs. The entire user workload of U jobs is represented as a collection of U disjoint DAGs: {G_1(N_1, W_1), ..., G_U(N_U, W_U)}. A DAG G_u(N_u, W_u) (u ∈ [1, U]) contains N_u vertices and W_u edges. Each vertex (the n-th task of job u, n ∈ [1, N_u]) represents a single task, and each edge ^u w(i, j) represents the amount of data that needs to be delivered from parent task i to child task j of job u. Figure 6.1 presents an example of the user workload model with multiple jobs. Other cloud frameworks that use a similar task-graph-based workload model include Nephele [WK09] and Dryad [IBY+07].

Task Characteristics

For each task, the requested VM type is denoted by ^u_nK, and its estimated execution time is represented by ^u_nL, which can be derived from an approximation method in Nephele [WK09]. In addition, the user specifies a hard deadline ^u_nT_ddl for each task, and the start time scheduled by the CSP is ^u_nT_start. Based on the admission control policy, if a task cannot be completed before the given deadline even with infinite resources, i.e., the deadline is too tight, then this task is rejected immediately. Also, according to the SLA, one prerequisite must be met: ^u_nT_start + ^u_nL ≤ ^u_nT_ddl. The CSP supports V types of VMs {VM_1, ..., VM_V}. Each type VM_v (v ∈ [1, V]) is associated with a two-tuple parameter set {R^v_CPU, R^v_MEM}, which represents its amount of CPU and memory, respectively. Similarly, each task is associated with a two-tuple parameter set {^u_nD_CPU, ^u_nD_MEM}, which represents the amount of CPU and memory required by this task, respectively.

Figure 6.1: System model of the cloud platform and structure for DRL-Cloud. The system model is defined in Section 6.2. The structure and algorithm of the proposed DRL-Cloud are described in Section 6.3 and Section 6.4.
Successful task execution requires sufficient resources. Hence, if a task is allocated to VM_v, two prerequisites must be met: R^v_CPU ≥ ^u_nD_CPU and R^v_MEM ≥ ^u_nD_MEM.

6.2.2 Cloud Platform Model

As shown in Figure 6.1, a CSP owns M servers, and nearby servers are clustered into server farms. The number of server farms owned by the CSP is denoted by F. Server farm F_f has M_f servers, i.e., \sum_{f=1}^{F} M_f = M. Server farms are connected with each other through two-way high-speed channels, whereas servers within one server farm are connected through local channels. The cloud platform is modeled as an undirected graph, where each vertex represents a server and each edge embodies a communication channel [XLNB17]. The bandwidth of the channel between server m and server m' is represented by the weight of the edge B(m, m') (1 ≤ m, m' ≤ M). Note that data exchange within the same server does not incur extra delay, i.e., B(m, m) = ∞. Similar to tasks, each server m has a two-tuple parameter set {C^m_CPU, C^m_MEM}, which represents the available amount of CPU and memory on server m. During operation, the state of server m is represented as {n^m_1(t), n^m_2(t), ..., n^m_V(t)}, where n^m_v(t) denotes the number of type-v VMs hosted on server m at time t. One prerequisite for VM allocation is that the total utilized computing resources (CPU, memory) must not exceed the total available resources on each server during the entire operating process:

\forall t, \forall m: \quad \sum_{v=1}^{V} n^m_v(t) R^v_{CPU} \le C^m_{CPU} \quad \text{and} \quad \sum_{v=1}^{V} n^m_v(t) R^v_{MEM} \le C^m_{MEM}.

6.2.3 Energy Consumption Model

The total power of server m at time t is composed of static power Pwr^m_st(t) and dynamic power Pwr^m_dy(t). Both the static and dynamic power of server m depend on the CPU utilization rate Ur^m(t) at time t, which is calculated as

Ur^m(t) = \frac{\sum_{v=1}^{V} n^m_v(t) R^v_{CPU}}{C^m_{CPU}}    (6.1)

Pwr^m_st(t) is constant when Ur^m(t) > 0 and zero otherwise. Pwr^m_dy(t) increases linearly when Ur^m(t) is below the optimal utilization rate Ur^m_Opt, and non-linearly otherwise [GWGP13, LLY+17]. Additionally, different servers may have different energy efficiency even under identical utilization rates; this is captured by a server-specific parameter, denoted here by α_m, which measures the power consumption increase of server m. In this chapter, we adopt the dynamic power model from [GWGP13], and Pwr^m_dy(t) is calculated as

Pwr^m_{dy}(t) =
\begin{cases}
Ur^m(t) \cdot \alpha_m, & Ur^m(t) < Ur^m_{Opt} \\
Ur^m_{Opt} \cdot \alpha_m + \big(Ur^m(t) - Ur^m_{Opt}\big)^2 \cdot \alpha_m, & Ur^m(t) \ge Ur^m_{Opt}
\end{cases}    (6.2)

Note that the proposed method can accommodate other energy consumption models as well. The total power at time t is Pwr_{ttl}(t) = \sum_{m=1}^{M} \big( Pwr^m_{st}(t) + Pwr^m_{dy}(t) \big).
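To illustrate the server power model, the sketch below computes the utilization of Equation (6.1) and the piecewise dynamic power of Equation (6.2) for a single server. The function names and the idle-server handling are illustrative assumptions; alpha_m denotes the server-specific efficiency parameter introduced above.

```python
def cpu_utilization(vm_counts, r_cpu, c_cpu_m):
    """Equation (6.1): utilization of server m given the number of VMs of each
    type it hosts (vm_counts[v]) and the per-type CPU demand (r_cpu[v])."""
    return sum(n_v * r_v for n_v, r_v in zip(vm_counts, r_cpu)) / c_cpu_m

def server_power(ur, ur_opt, alpha_m, p_static):
    """Static plus dynamic power of one server; the dynamic part follows
    Equation (6.2): linear below the optimal utilization rate and with a
    quadratic penalty above it."""
    if ur <= 0.0:
        return 0.0  # an idle server contributes neither static nor dynamic power
    if ur < ur_opt:
        dynamic = ur * alpha_m
    else:
        dynamic = ur_opt * alpha_m + (ur - ur_opt) ** 2 * alpha_m
    return p_static + dynamic
```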
6.2.4 Realistic Price Model

In this chapter, we consider a realistic non-flat price model Price(t, Pwr_ttl(t)) that is composed of a time-of-use pricing (TOUP) component and a real-time pricing (RTP) component with inclining block rates (IBR) [MRLG10, LWC+14, ARG15, LLNP17]. The TOUP component TOUP(t) depends on the time of day and is usually higher in peak-usage time periods than in off-peak time, in order to incentivize users to shift loads towards off-peak periods. The RTP price with IBR, RTP(Pwr_ttl(t)), increases with the total quantity consumed, and is calculated as

RTP\big(Pwr_{ttl}(t)\big) =
\begin{cases}
RTP_l(t), & Pwr_{ttl}(t) < \theta(t) \\
RTP_h(t), & Pwr_{ttl}(t) \ge \theta(t)
\end{cases}    (6.3)

where θ(t) is a threshold, and RTP_l(t) and RTP_h(t) are the wholesale prices set by the utility company on-the-fly. The total energy cost during the whole RP and TS process is

TotalCost = \sum_{t=1}^{T} Price\big(t, Pwr_{ttl}(t)\big).    (6.4)

6.2.5 Problem Formulation

Given: the user workload model, cloud platform model, energy consumption model, and dynamic pricing model.
Find: the VM configuration, server allocation, and execution time slot for each user task.
Minimize: the total energy cost during the entire RP-TS operation:

TotalCost = \sum_{t=1}^{T} Price\Big(t, \sum_{m=1}^{M} \big( Pwr^m_{st}(t) + Pwr^m_{dy}(t) \big)\Big).    (6.5)

Subject to:

\sum_{v=1}^{V} n^m_v(t) R^v_{CPU} \le C^m_{CPU}, \quad \forall t, \forall m    (6.6)
\sum_{v=1}^{V} n^m_v(t) R^v_{MEM} \le C^m_{MEM}, \quad \forall t, \forall m    (6.7)
R^v_{CPU} \ge {}^u_nD_{CPU}, \quad \text{for every task}    (6.8)
R^v_{MEM} \ge {}^u_nD_{MEM}, \quad \text{for every task}    (6.9)
{}^u_nT_{start} + {}^u_nL \le {}^u_nT_{ddl}, \quad \text{for every task}    (6.10)

and the task dependency requirements.

6.3 DRL-Cloud: DRL-Based Cloud Resource Provisioning and Task Scheduling System

In this section, we present DRL-Cloud, which minimizes power consumption and electricity bills for CSPs with large-scale data centers. As shown in Figure 6.2, the proposed system first decorrelates the dependencies among tasks, and the decoupled tasks are then sent to the two-stage RP-TS processor. After this process, the final energy cost is calculated using the realistic price model (as proposed in Section 6.2.4).

6.3.1 Task Decorrelation

A CSP holds all jobs sent from users in the job queue Queue_G. Within one job, child tasks depend on parent tasks, and tasks that do not depend on other parent tasks, or whose dependencies are satisfied, are ready tasks. Queue_G takes feedback (the parent task completion signal and the data that need to be delivered to the child task) from the two-stage RP-TS processor, as shown in the task decorrelation part of Figure 6.2, which turns the corresponding child tasks into ready tasks. The job queue provides the ready tasks from each job as input to the task ready queue. The CSP then pops tasks from the task ready queue and sends them to the two-stage RP-TS processor.

6.3.2 Two-Stage RP-TS Processor Based on Deep Q-Learning

The admission control policy first filters out the jobs that cannot be accomplished before their hard deadlines even with infinite computing resources. Otherwise, the first stage (Stage_1) of the two-stage RP-TS processor allocates the task to one of the F server farms and determines the task start time ^u_nT_start, and then continues processing, or drops the job if the SLA described in Section 6.2.1 is violated. After allocating the task to server farm F_f (f ∈ [1, F]), the second stage (Stage_2) of the processor is responsible for choosing the exact server m to run the task. When the task is completed, Stage_2 sends the parent task completion signal and data to the job queue (as the feedback described in Section 6.3.1). The setup of the proposed deep Q-learning-based two-stage RP-TS processor is described as follows.

Figure 6.2: The structure of the DRL-Cloud framework: the details of task decorrelation are described in Section 6.3.1, and the details of the two-stage RP-TS processor are described in Section 6.3 and Section 6.4.

Action Space

At the server farm level, the DQN in Stage_1 is responsible for choosing a server farm from the F farms and determining the start time ^u_nT_start to be one of the possible time slots, so the action space of Stage_1's DQN can be represented as A_{Stage_1} = \{F_1^{T_1}, ..., F_F^{T_T}\}, i.e., the combinations of server farm and start time.
At the server level, the DQN in Stage_2 selects the exact server within server farm F_f, so the action space of Stage_2's DQN can be represented as A_{Stage_2} = \{1, ..., M_f\}, i.e., the M_f servers in the farm.

State Space

The optimal action is determined based on the current observation x, which is the combination of the current server observation x_server and the current task observation x_task. The current server observation x_server describes the available CPU and memory for the requested VMs on the servers, whereas x_task is comprised of the requested CPU R^v_CPU and memory R^v_MEM of several types of VMs and the task deadline. Therefore, a state is a sequence of actions and observations, s_t = (x_1, a_1, x_2, a_2, ..., a_{t-1}, x_t), and is the input to the proposed deep Q-learning-based two-stage RP-TS processor. The processor then learns optimal allocation strategies based on these sequences. All sequences are assumed to terminate in a finite number of steps, which leads to a large but finite semi-Markov decision process (SMDP) in which each sequence is a distinct state. The proposed processor chooses an action based on the current state, receives a reward from the environment, i.e., the energy cost, and changes the system state in the meantime; it then uses the Q-value of the action-reward pair to train the DQN to "maximize" the long-term reward, i.e., to minimize the long-term energy cost in our case.

Reward Function

The goal of the two-stage RP-TS processor is to minimize the long-term energy cost by taking a sequence of actions. After taking action a_t at the current state s_t, the system evolves into a new state s_{t+1} and receives a reward r_t from the environment, which is the energy cost increase of action a_t, i.e., the current energy cost minus the previous energy cost (at time T^{pre}_{start}). For Stage_1, the reward function can be calculated as the price increase:

r_{Stage_1} = Price\Big({}^u_nT_{start},\; Pwr^{F_f}_{ttl}({}^u_nT_{start}) - Pwr^{F_f}_{ttl}(T^{pre}_{start})\Big)    (6.11)

Similarly, the reward function in Stage_2 can be calculated as

r_{Stage_2} = Price\Big({}^u_nT_{start},\; Pwr^{m_f}_{ttl}({}^u_nT_{start}) - Pwr^{m_f}_{ttl}(T^{pre}_{start})\Big)    (6.12)

6.3.3 Semi-Markov Decision Process (SMDP) Formulation

Based on the aforementioned state transitions, we formulate the energy cost minimization problem of cloud resource allocation as an SMDP. In the Q-learning procedure, the DRL agent's ultimate goal is to maximize the value function Q(s, a), which specifies what is optimal in the long run. In other words, the value of a state is the total amount of reward that an agent can expect to accumulate in the future, starting from that state [Zac16]. The optimal action-value function Q*(s, a), after seeing sequence s and taking action a, is the maximum expected achievable reward by following any policy, and it obeys the Bellman equation:

Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \,\big|\, s, a \,\big]    (6.13)

Q-learning has been proven to achieve the optimal policy in an SMDP environment and can obtain strong results even in non-stationary environments [BD95]. In this work, we update the value estimates as follows:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a_{t+1}} Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \Big]    (6.14)

where the learning rate α ∈ (0, 1] and the discount factor γ ∈ [0, 1].

6.4 Deep Q-learning Algorithm for DRL-Cloud With Experience Replay

The algorithm for training the DQN is modified from standard Q-learning by using experience replay and a target network, making it suitable for training a large neural network with fast convergence.
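Before turning to the training details, the update rule of Equation (6.14) can be made concrete with a tabular sketch. In DRL-Cloud the table is replaced by the two deep Q-networks, and the reward comes from the energy-cost change of Equations (6.11)-(6.12); the function name, the defaultdict-based table, and the default hyper-parameter values are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular update following Equation (6.14); q_table is a
    defaultdict(float) keyed by (state, action) pairs."""
    best_next = max(q_table[(s_next, a2)] for a2 in actions) if actions else 0.0
    td_target = r + gamma * best_next
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]
```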
6.4.1 Training Details for Deep Q-Networks

Experience Replay: At each time step of the inner loop in Algorithm 5, the transition (s_{t+1}, a_t, r_t, s_t) is stored into a replay memory, and Q-learning updates are applied to minibatches of experience selected randomly from the pool of stored samples. This approach is superior to standard Q-learning in several ways. First, data efficiency is higher because each step of experience is potentially replayed many times in many weight updates. Second, learning from randomly selected experiences instead of sequential experience breaks the correlation of the learning data and reduces the variance of the updates, which makes the learning procedure more efficient. Third, the behavior distribution of a randomly selected minibatch of experience is averaged over previous states, which makes the learning procedure stable and filters out oscillations or divergence in the parameters.

Target Network: A separate neural network (the target network) is used in the DQN to generate the target Q-value in the Q-learning update. The target network has the same structure as the evaluation network but different parameters. More specifically, the parameters of the target network are cloned from the evaluation network every fixed number of steps. This modification improves over standard Q-learning by adding a delay between the time an update affects Q and the time it affects the target, which further suppresses divergence and oscillations.

Exploration & Exploitation: Exploration versus exploitation is the dilemma of whether to focus on exploring new, uncertain actions or to spend time exploiting the existing best-known policy. The strategy used in this work is ε-greedy with decreasing ε: with probability ε (starting from a large value) a random action is chosen, otherwise the action with the highest Q-value is chosen, and ε is decreased over time to a minimum value. This strategy explores actions randomly at the beginning and settles down to a smaller exploration rate, which makes the DRL agent exploit the existing known-good policy and return high rewards more often.
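The replay memory and the ε-greedy policy with decreasing ε can be sketched as follows. The class and function names, the uniform sampling, the default decay step, and the floor value of ε are illustrative assumptions rather than the exact implementation.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience buffer: the oldest transitions are evicted
    first, and minibatches are drawn uniformly at random to break the
    correlation between consecutive samples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):  # transition = (s_next, a, r, s)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def epsilon_greedy(q_values, epsilon):
    """Choose a random action with probability epsilon, otherwise the action
    with the highest predicted Q-value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, step=0.05, floor=0.05):
    """Decrease the exploration rate toward a minimum value."""
    return max(floor, epsilon - step)
```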
6.4.2 System Control Algorithm and the Two-Stage RP-TS Processor Algorithm with Experience Replay

The control algorithm in this work is shown in Algorithm 4, which implements the hierarchical structure described in Section 6.3.2, and the two-stage RP-TS processor algorithm is shown in Algorithm 5.

Algorithm 4: Control Algorithm for DRL-Cloud
  1  Initialize the realistic price model
  2  Initialize the environment and the deep Q-network DQN_Stage1
  3  Run DQN_Stage1 and store the user request allocation
  4  Initialize the environment and the deep Q-network DQN_Stage2
  5  for f = 1, ..., F do
  6      for t = 1, ..., T do
  7          Run DQN_Stage2 and store the user request allocation
  8      end
  9  end
  10 Calculate the final user request allocation matrix, i.e., Ur for every server
  11 Calculate the final energy consumption Pwr_ttl and electricity bill TotalCost
  12 return TotalCost

Algorithm 5: Deep Q-Learning for DRL-Based RP-TS With Experience Replay
  1  Initialize the replay memory to its capacity
  2  Initialize the action-value function Q with random weights θ
  3  Initialize the target action-value function Q̂ with weights θ' = θ
  4  for episode = 1, ..., E do
  5      Reset the cloud server environment to the initial state
  6      Initialize the sequence s_1 = {x_1}
  7      for t = 1, ..., T do
  8          With probability ε choose a random action a_t,
  9          otherwise choose a_t = argmax_a Q(s_t, a; θ)
  10         Execute action a_t and observe the next observation x_{t+1}, reward r_t, and reject signal
  11         if reject = 1 then
  12             Run the DQN again to get a new action a'_t
  13             if a_t ≠ a'_t then
  14                 Replace a_t with a'_t
  15             end
  16         end
  17         Set s_{t+1} = (s_t, a_t, x_{t+1})
  18         Store the transition (s_{t+1}, a_t, r_t, s_t) in the replay memory
  19         Sample a random minibatch of transitions (s_{j+1}, a_j, r_j, s_j) from the replay memory
  20         target_j = r_j if the episode terminates at step j+1; otherwise target_j = r_j + γ max_{a'} Q̂(s_{j+1}, a'; θ')
  21         Perform a gradient descent step on (target_j − Q(s_j, a_j; θ))²
  22         Every fixed number of steps, train the evaluation network and decrease ε
  23         Every fixed number of steps, copy Q to Q̂
  24     end
  25 end
  26 return all actions a_t

6.5 Experimental Results

6.5.1 Experiment Setup

Three baselines are used for comparison with DRL-Cloud:

- The Greedy Method: The CSP tries every option to find the assignment that yields the minimum energy cost increase. The CSP rejects tasks according to the SLA.
- The Round-Robin (RR) Method: The CSP assigns each task in circular order. If the current assignment violates the SLA, the scheduler tries the following options until there is no violation. A task is rejected by the CSP if it is rejected by all possible assignments.
- FERPTS [LLY+17]: a contemporary algorithm inspired by a negotiation-based heuristic [LLNP17, LWC+14, LWL+15], which is aware of historical decisions and the current scheduling of other tasks through the introduced congestion concept.

The comparison is based on three indicators: energy cost, runtime, and the number of rejected tasks during the whole RP-TS process. We conduct two sets of experiments: one set on small-scale problems with 3,000 to 5,000 user requests and 100 to 300 servers clustered into 10 server farms, and the other on a large-scale problem with 50,000 to 200,000 user requests and 500 to 5,000 servers in 10 to 100 clusters. Note that, in this chapter, we use a very large-scale configuration for the user workload model and cloud platform model (for comparison, prior works consider up to 180 servers and 10,000 tasks [GWGP13], up to 150 servers and 280 tasks [LLY+17], and up to 200 servers and 95,000 tasks [LLX+17]). We adopt the price data from [ele, Far10], where we consider that the utility announces the price 24 hours ahead. Note that the price is an input to the DRL-Cloud framework, which can accommodate on-the-fly price changes, e.g., a pricing policy that updates every 10 minutes. We use a real user workload trace from the Google cluster-usage traces [goota], which represents 29 days' worth of cell information from May 2011 on a cluster of about 12.5k machines. The task dependencies in the user workload model and the amounts of resources in the VMs and servers in the cloud platform model are randomly generated, whereas the information about the number of tasks in each job and the task requirements of CPU and memory is retrieved from the real data. For deep Q-learning, the learning rate is α = 0.1, the discount factor is γ = 0.9, and ε is decreased from 0.9 by 0.05 in each learning iteration. All parameters are chosen from commonly used ranges [MKS+15, SLW+06], giving relatively optimal performance.

Figure 6.3: Runtime and energy cost comparisons with baselines for the small-scale workload and platform configuration. Energy cost is normalized with regard to the energy cost of the Greedy method for 100 servers and 5,000 requests.
Energy cost is normalized with regard to the energy cost of Greedy method for 100 servers and 5; 000 requests. from commonly used range [MKS + 15, SLW + 06] by giving relatively optimal perfor- mance. All simulation experiments are conducted in Python environment with Tensor- Flow [AEGT16] on a MacBook Pro (2012) with 2:6 GHz Intel Core i7 processor, 8GB 1600MHz DDR3 memory. 6.5.2 Experiments on Small-Scale Workloads and Platforms Runtime and energy cost comparisons with three baselines are shown in Figure 6.3. One can observe that DRL-Cloud consistently outperforms the three baselines in different scenarios. Compared to the Greedy baseline, DRL-Cloud achieves up to 4X energy cost efficiency improvement and 480X runtime reduction. Compared to RR, DRL-Cloud outperforms with up to 3X less energy cost while rejects 3X less tasks on average, but uses 91:42% more runtime. This is because RR takes incoming task and allocates it immediately without decision making process. This gives RR remarkably low runtime when server number is small and decisions can be made in few tries. Compared to 91 Figure 6.4: (a) Convergence of DRL-Cloud. (b) Energy cost comparison with RR in long-run (29 days) on large scale workloads and platform configuration. FERPTS, DRL-Cloud achieves up to 3X energy cost efficiency improvement and 92X runtime reduction. To be noticed, RR outperforms FERPTS in terms of energy cost because RR allocates tasks evenly and rejects more tasks due to hard deadline violation, and rejected tasks are not calculated in energy cost. 6.5.3 Experiments on Large-Scale Workloads and Platforms Both FERPTS and Greedy use much more than 30 minutes for the large-scale experi- ment, so only RR is scalable to be used for large-scale comparison with DRL-Cloud. As shown in Table 6.1, compared to RR, DRL-Cloud achieves up to 225% energy cost improvement, and rejects 253% less tasks in the meantime. Compared with RR, DRL- Cloud results in higher runtime when server number is 500 to 2; 000, but results in lower runtime when server number is larger (when server number and task number are 5; 000 and 200; 000, DRL-Cloud achieves 144%, 218% and 249% runtime reduction, energy cost efficiency improvement and reject rate reduction, respectively). This is because in RR, tasks are dispatched evenly in servers; servers are tried in turn when current server is overloaded. This process does not need much time when server/task number are small and existing servers can accept or reject incoming tasks within few tries. However, it would take longer time when server/task number are large. 92 Table 6.1: Comparison of Energy Cost, Runtime and Reject Task Number between DRL-Cloud and Round-robin Problem Scale Runtime (s) Reject Rate Energy Cost Server Task RR DRL-Cloud Improvement Improvement 500 50,000 5 29 248.48% 222.97% 100,000 11 60 250.76% 200.79% 200,000 24 24 252.64% 199.71% 2,000 50,000 18 30 250.32% 225.20% 100,000 37 61 247.93% 225.04% 200,000 78 128 250.76% 213.26% 5,000 50,000 48 30 247.41% 220.60% 100,000 97 64 248.64% 218.71% 200,000 196 135 249.17% 218.16% 6.5.4 Long-Term Experiments and Convergence In this section, long-term comparison (within one month) of RR and DRL-Cloud with 5; 000 servers and 50; 000 tasks per day is experimented to test adaptability of changing user pattern in long term. The result is shown in Figure 6.4(b). Compare to RR, DRL- Cloud achieves 2X energy cost efficiency improvement, 2X runtime reduction and 3X reject rate reduction on average. 
The average runtimes of RR and DRL-Cloud are 58.40 s and 30.43 s, respectively. Notably, the runtime improvement and the reject rate improvement of DRL-Cloud over the 29 days reach up to 12X and 45X, respectively. The convergence behavior is shown in Figure 6.4(a): DRL-Cloud converges quickly, which is a result of the training techniques used in Section 6.4.

6.6 Conclusion

We presented DRL-Cloud, a novel DRL-based system with two-stage resource provisioning and task scheduling to reduce energy cost for cloud service providers with large-scale data centers and large numbers of user requests with dependencies. DRL-Cloud is highly scalable and highly adaptable compared to state-of-the-art methods, and its training algorithm converges fast thanks to the training techniques in our two-stage RP-TS processor, such as experience replay, target networks, and the exploration-and-exploitation strategy. Compared with FERPTS, a contemporary algorithm with excellent energy cost efficiency and a low task reject rate, DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining a lower task reject rate on average. For a large-scale problem with 5,000 servers and 200,000 tasks, and compared to the round-robin baseline, which offers extraordinary runtime, DRL-Cloud achieves up to 144% runtime reduction, 218% energy cost efficiency improvement, and 249% lower reject rate.

Chapter 7
Stochastic Computing (SC) Based Deep Convolutional Neural Network (DCNN) Block Design and Optimization

Besides the Deep Neural Network (DNN) that has been integrated into the Deep Reinforcement Learning framework (as discussed in Chapter 6), the Deep Convolutional Neural Network (DCNN) is another high-performance resilient system that tolerates soft errors. However, DCNNs require more computation resources, which greatly hinders their deployment in widespread IoT and wearable devices. Hence, the remainder of this thesis is dedicated to improving the energy, power, and area efficiency of DCNNs through the Stochastic Computing (SC) technique, in order to promote their adoption in fast-growing IoT and wearable devices.

7.1 Introduction

Machine learning technology is increasingly present in the Internet and consumer products, and it powers many fundamental applications such as speech-to-text transcription, selection of relevant search results, natural language processing, and object identification in images or videos [Ben09, LBH15]. Conventional machine learning techniques are limited in their ability to process data in raw form (e.g., the pixel values of an image) [LBH15]. As a result, considerable human engineering effort and domain expertise are required to transform raw data into suitable internal representations that can be understood and processed by the learning system [LBH15]. With the fast-growing amount of data and range of applications of machine learning methods, the ability to automatically extract powerful features is becoming increasingly important [Ben09]. Many representation learning methods have been proposed to automatically learn and organize the discriminative information in raw data [LBH15]. Deep learning is one of the most promising representation learning methods, which enables a system to extract representations automatically at multiple levels of abstraction and learn complex functions directly from data with very little engineering by hand [LBH15, Ben09].
With the self-learning ability to configure its intricate structure, a deep learning architecture can easily take advantage of increases in the amount of available computation and data [LBH15]. Recently, Deep Convolutional Neural Network (DCNN), which is one of most widely used types of deep neural networks, has achieved tremendous success in many machine learning applications, such as speech recognition [SMKR13], image classifica- tion [SZ14b], and video classification [KTS + 14]. DCNN is now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks [LBH15]. Nevertheless, compared with other machine learning techniques, DCNNs require more computations due to the deep layer architecture. Furthermore, the industrial and academic demands for better quality of results also tend to increase the depth and/or width of DCNNs [SLJ + 15], leading to complicated topologies and increased computation resources required for implementation. Therefore, a practical implementation of large-scale DCNNs is to use high performance server clusters with accelerators such as GPUs and FPGAs [MGAG16, RLC16]. A notable trend is that with 96 the astonishing advances on wearable devices and Internet-of-Things (IoT) [CLD + 17], machine learning has also been rapidly adopted in the widespread mobile and embedded systems. In order to bring the success of DCNNs to these resource constrained systems, designers must overcome the challenges of implementing resource-hungry DCNNs in embedded systems with limited area and power budget. Stochastic Computing (SC), which is the paradigm of logical computation on stochastic bit streams [LLQ + 12], has the potential to enable fully parallel and scal- able hardware-based DCNNs. Since SC provides several key advantages compared to conventional binary arithmetic, including low hardware area cost and tolerance to soft errors [LLQ + 12, LD16a, LD16b], considerable research efforts have been invested in the context of designing neural networks using SC in recent years [KKY + 16, SGMA15, JRML15, RLW + 16, LRL + 16]. Nevertheless, there lacks a comprehensive investigation of energy-accuracy trade-off for DCNN designs using different SC components. In this chapter, two hardware-based neuron structures using SC are introduced, i.e., Accumulative Parallel Counter (APC) based neuron and multiplexer (MUX) based neuron. We further investigate the trade-off among area, power, energy and (neuron cell) accuracy for these neuron structures using different input sizes and stochastic bit stream lengths. Then, from an architecture per- spective, the influence of inaccuracy of neurons in different layers on the overall DCNN accuracy is studied. Based on the results, a structure optimization method is proposed for a general DCNN architecture, in which neurons in different layers are implemented with the optimized SC components such that the overall DCNN area, power, and energy consumption are minimized while the DCNN accuracy is preserved. The contributions of this work are threefold. First, we introduce SC into the DCNNs, in order to make the footprints of DCNNs small enough for successful implementations in today’s wearable devices and embedded systems. Second, we carry out a detailed 97 analysis on the energy-accuracy trade-off for different SC-based neuron designs. 
Third, based on the analysis of the results, we propose a structure optimization method for a general DCNN architecture using SC, which jointly optimizes the area, power, energy, and accuracy for the entire DCNN. Experimental results on a LeNet 5 DCNN architec- ture demonstrate that compared with the conventional 8-bit binary implementation, the presented hardware-based DCNN using SC achieves 55X, 151X, and 2X improvement in terms of area, power and energy, respectively, while the error is increased by 2:86%. 7.2 Hardware-Based DCNN Design and Optimization using SC In this section, we first conduct a detailed investigation of the energy-accuracy trade- off among two hardware neuron designs using SC, i.e., APC-based neuron and MUX- based neuron, as shown in Figure 7.1 (a) and (b), respectively. Hardware-based pooling is provided afterward, and finally we present the structure optimization method for the overall DCNN architecture. 7.2.1 Approximate Parallel Counter (APC)-Based Neuron Figure 7.1 (a) illustrates the APC-based hardware neuron design, where the inner prod- uct is calculated using XNOR gates (for multiplication) and an APC (for addition). To be more specific, we denote the number of bipolar inputs and stochastic stream length by n andm, respectively. Accordingly,n XNOR gates are used to generaten products of inputs (x 0 i s) and weights (w 0 i s), and then the APC accumulates the sum of 1s in each col- umn of the products. Since the sum generated by APC is a binary number, theK-state FSM design mentioned in Section 2.2.3 cannot be applied here directly. Instead of an 98 FSM, a saturated up/down counter is used to perform the scaled hyperbolic tangent acti- vation functionBtanh() for binary inputs. Details and optimization of theBtanh() activation function using a saturated up/down counter for binary inputs can be found in reference work [KKY + 16]. For an APC-based neuron with the fixed bit stream length 1024, the accuracy, area, power, and energy performance with respect to the input size are shown in Figure 7.2 (a), (b), (c), and (d), respectively. To be more specific, as illustrated in Figure 7.2 (a), APC- based neuron shows a very slow accuracy degradation as input size increases. However, the area, power, and energy of the entire APC-based neuron cell increases near linearly as the input size grows, as shown in Figure 7.2 (b), (c), and (d), respectively. The reason is as follows: With the efficient implementation of Btanh() function, the hardware of Btanh() increases logarithmically as the input increases, since the input width of ... Parallel Counter Up/Down Counter w1 w2 w3 w4 wn x1 x2 x3 x4 xn ... } n ... binary number n stochastic bit-streams with m length } one column of n products one stochastic bit-stream with m length m log2 n 1 ... Stanh w1 w2 w3 w4 wn x1 x2 x3 x4 xn ... ... stochastic bit-streams n stochastic bit-streams with m length } m 1 n to 1 Mux (a) (b) 1 } n Figure 7.1: Various hardware neuron designs. (a) APC-based neuron, and (b) MUX- based neuron. 99 Btanh() islog 2 n. On the other hand, the number of XNOR gates and the size of the APC grow linearly as the input size increases. Hence, the inner product calculation part, i.e., XNOR array and APC, is dominant in an APC-based neuron, and the area, power, and energy of the entire APC-based neuron cell also increase at the same rate as the inner product part when the input size increases. 
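To make the data flow of the APC-based neuron concrete, the following is a small software-level Python sketch: bipolar stochastic streams are generated for the inputs and weights, multiplied bit-wise with XNOR, summed column-wise by a parallel counter, and fed to a saturated up/down counter that approximates Btanh(). It is a behavioral sketch under simplifying assumptions (an exact popcount rather than the approximate parallel counter, and a simple 2n-state counter instead of the state number derived in [KKY+16]), not the synthesized hardware.

```python
import random

def to_stream(value, length, rng):
    """Bipolar SC encoding: a value v in [-1, 1] becomes a bit-stream with P(1) = (v + 1) / 2."""
    p = (value + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a bipolar bit-stream back to a value in [-1, 1]."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def apc_neuron(xs, ws, length=1024, seed=0):
    """Behavioral sketch of an APC-based neuron for n bipolar inputs xs and weights ws."""
    rng = random.Random(seed)
    n = len(xs)
    x_streams = [to_stream(x, length, rng) for x in xs]
    w_streams = [to_stream(w, length, rng) for w in ws]
    k = 2 * n            # assumed state number for the saturated up/down counter (simplification)
    state = k // 2
    out_bits = []
    for t in range(length):
        # XNOR acts as bipolar multiplication; the parallel counter sums the 1s in this column.
        column_sum = sum(1 - (xb[t] ^ wb[t]) for xb, wb in zip(x_streams, w_streams))
        state += 2 * column_sum - n          # signed contribution of this clock cycle
        state = max(0, min(k - 1, state))    # saturation
        out_bits.append(1 if state >= k // 2 else 0)
    return from_stream(out_bits)

xs = [0.3, -0.5, 0.8, 0.1]
ws = [0.6, 0.2, -0.4, 0.9]
print(apc_neuron(xs, ws))   # roughly follows a scaled hyperbolic tangent of the inner product
```

Replacing the exact column sum with a true approximate parallel counter and choosing the counter's state number as in [KKY+16] would recover the hardware behavior more faithfully.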
Since the length of the stochastic bit stream is important, we investigate the accuracy of APC-based neurons using different stream lengths under different input sizes. As shown in Figure 7.3, a longer bit stream consistently outperforms a shorter bit stream in terms of accuracy for APC-based neurons with different input sizes. However, designers should consider the latency and energy overhead caused by long bit streams.

Figure 7.2: Using the fixed bit stream length 1024, the number of inputs versus (a) accuracy, (b) area, (c) power and (d) energy for an APC-based neuron.

Figure 7.3: The length of bit stream versus accuracy under different input numbers for an APC-based neuron.

7.2.2 Multiplexer (MUX)-Based Neuron

As shown in Figure 7.1 (b), a MUX-based neuron is comprised of XNOR gates, a MUX, and a $K$-state FSM, which compute the products of the bipolar inputs ($x_i$'s) and weights ($w_i$'s), the stochastic sum of all products, and the hyperbolic tangent activation function, respectively. As the inner product calculated by a MUX is a stochastic number, the $K$-state FSM design mentioned in Section 2.2.3 can be used here to implement the activation function $Stanh(\cdot)$. Nevertheless, two problems must be taken into consideration: (i) the inner product calculated by an $n$-input MUX is scaled down to $\frac{z}{n}$, assuming the correct result is $z$, and (ii) with the input $\frac{z}{n}$, the $K$-state FSM calculates $\tanh(\frac{Kz}{2n})$ instead of the desired value $\tanh(z)$. Hence, in order to obtain the correct activation, we would need to scale up the result of the MUX by $n$ times and multiply the stream by $\frac{2}{K}$ (or multiply by $\frac{2n}{K}$ directly). As opposed to the relatively simple and efficient data conversions on a software platform, such conversions in a hardware-based neuron incur significant hardware overhead, because the linear gain transformation needs one more FSM [BC01b], and the multiplication requires one XNOR gate as well as the generation of another bipolar stochastic stream.

In this chapter, considering an $n$-input neuron with inner product denoted by $z$, we select the state number $K$ such that $\frac{2n}{K} = 1$, and the final output of the FSM is calculated as

$$Stanh\left(K, \frac{z}{n}\right) = \tanh\left(\frac{Kz}{2n}\right) = \tanh(z) \qquad (7.1)$$

In this way, we achieve the desired activation result with no additional bit stream conversion (i.e., no hardware overhead).

We first investigate the performance of the MUX-based neuron with respect to its input size. Figure 7.4 (a), (b), (c), and (d) show the number of inputs versus accuracy, area, power, and energy, respectively, for a MUX-based neuron using a fixed bit stream length of 1024. It is important to achieve high accuracy for a neuron cell; however, as shown in Figure 7.4 (a), the accuracy of a MUX-based neuron significantly degrades as the input size increases. The reason is that MUX addition selects only one bit at a time and ignores the rest of the bits, leading to low accuracy when the input size is large. In addition, one can observe from Figure 7.4 (b), (c), and (d) that as the number of inputs increases, the area, power, and energy of the MUX-based neuron all tend to increase. This is because a MUX-based neuron with more inputs requires more XNOR gates and MUXes for the inner product calculation, and more states in the FSM ($K = 2n$) to compute the activation function; the additional hardware components result in more area, power, and energy for the neuron cell. A small software-level sketch of this MUX-based neuron datapath is given below, after which we investigate the relationship between bit stream length and accuracy under different numbers of inputs.
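The following Python sketch is a behavioral model of the MUX-based neuron path: XNOR multiplication, a MUX-based scaled adder that randomly selects one product bit per clock, and a $K$-state Stanh FSM with $K = 2n$ so that, per Eq. (7.1), no extra scaling stream is needed. As with the APC sketch above, this is an illustration under simplifying assumptions, not the synthesized design.

```python
import random

def to_stream(value, length, rng):
    """Bipolar SC encoding: P(1) = (v + 1) / 2 for v in [-1, 1]."""
    p = (value + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a bipolar bit-stream back to a value in [-1, 1]."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def mux_neuron(xs, ws, length=1024, seed=0):
    """Behavioral sketch of a MUX-based neuron with a K-state Stanh FSM, K = 2n."""
    rng = random.Random(seed)
    n = len(xs)
    x_streams = [to_stream(x, length, rng) for x in xs]
    w_streams = [to_stream(w, length, rng) for w in ws]
    k = 2 * n                      # state number chosen so that Stanh(K, z/n) = tanh(z), Eq. (7.1)
    state = k // 2                 # Stanh FSM state
    out_bits = []
    for t in range(length):
        sel = rng.randrange(n)                                      # MUX: pick one product per clock
        product_bit = 1 - (x_streams[sel][t] ^ w_streams[sel][t])   # XNOR = bipolar multiplication
        state += 1 if product_bit else -1                           # FSM walks up or down
        state = max(0, min(k - 1, state))
        out_bits.append(1 if state >= k // 2 else 0)                # upper half of the states outputs 1
    return from_stream(out_bits)

xs = [0.3, -0.5, 0.8, 0.1]
ws = [0.6, 0.2, -0.4, 0.9]
print(mux_neuron(xs, ws))   # approximates tanh(sum_i x_i * w_i) for small n and long streams
```

Because $K = 2n$, the FSM's gain exactly cancels the MUX's $\frac{1}{n}$ down-scaling, which is why no additional conversion stream is required.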
As shown in Figure 7.5, for a certain input size, longer bit stream results in higher accuracy, and the improvement of accuracy is more significant when input size is larger. Hence, when designing a MUX-based neuron, long bit stream can be applied to compensate the accuracy degradation for large input size. 102 Figure 7.4: Using the fixed bit stream length 1024, the number of inputs versus (a) accuracy, (b) area, (c) power and (d) energy for a MUX-based neuron. Figure 7.5: The length of bit stream versus accuracy under different input numbers for a MUX-based neuron. 7.2.3 Pooling Operation In a DCNN, down sampling steps are performed by the pooling layers, which sum- marize the outputs of neighboring groups of neurons in the same kernel map. Pooling operation achieves the invariance to input data (i.e., image, video, etc.) transformations 103 and better robustness to noise and clutter. Moreover, the inter-layer connections can be significantly reduced for a hardware DCNN by using pooling layers. Considering a pooling region consisting of k neurons: fa 1 ; ;a k g in a feature map, wherea i denotes the activation result of thei-th neuron, the pooling layer selects one activation a out at a time. In this chapter, we adopt the average pooling, where each activation resulta i has the same probability to be selected as output, i.e., a out = mean(a 1 a k ). For example, the stochastic arithmetic mean over a 2 2 region is provided in Figure 7.6, where three 2-to-1 MUXes are needed to implement the average pooling. 7.2.4 Structure Optimization for the Entire DCNN Architecture There are four performance metrics for the DCNN design, i.e., accuracy, area, power, and energy. In this chapter, we consider a general DCNN optimization problem, where the objective function is comprised of one or multiple metrics and the rest of the metrics are considered as constraints, e.g., energy, power, and accuracy as objective function with area as constraint. In addition, we introduce one more constraint that the accuracy of hardware-based DCNN cannot be significantly lower than the accuracy of software- based DCNN, so as to make the accuracy of the hardware-based DCNN competitive. Mux 1/2 bit stream Mux Mux a1 3 a2 a a4 aout 1/2 bit stream Figure 7.6: A 4-to-1 pooling example. 104 The DCNN architecture of interest shown in Figure 2.1 consists of two pooling lay- ers, two convolutional layers, and one fully-connected layer. The two pooling layers are implemented using MUX trees, as described in Section 7.2.3. As for the remaining two convolutional layers (referred to as layer 0 and layer 1) and one fully-connected layer (referred to as layer 2), they can be built using either APC-based neurons or MUX-based neurons with a certain bit stream length. We further investigate the influences of errors in layer 0, layer 1 and layer 2 on the overall test error of the entire DCNN, as shown in Figure 7.7, where the data values in each layer follow a normal distribution (as observed in the test benches) with various standard deviations representing the errors of the neurons in that layer. It is observed that a layer closer to the inputs has more impact on the overall accuracy of the DCNN than a layer closer to the output layer. The explanation is that inaccurate features captured near the inputs may affect all the following layers, whereas the errors occurring near the output layer can only disturb a few subsequent layers. 
Therefore, the intuition is that accurate neuron structures should be applied to the layers near inputs, and less accurate neurons can be used in the layers closer to the output layer to achieve better energy/power/area performance. Next, we compare the performance between APC-based neuron and MUX-based neuron using a fixed bit stream length equal to 1024 under different input sizes, as shown in Table 7.1. Clearly, APC-based neuron is more accurate but occupies more area than MUX-based neuron. Besides, as APC is much slower than MUX, the latency of APC- based neuron is larger than MUX-based neuron, which causes APC-based neuron to consume more energy than MUX-based neuron for one calculation. As for the power performance, an APC-based neuron has less switching (due to the long latency) and larger area than the MUX-based neuron, resulting in less dynamic power, more leakage power, and less overall power. 105 Figure 7.7: The impact of errors in different layers on the overall DCNN test error. The proposed structure optimization method for the overall DCNN architecture is given in Figure 7.8. As the bit stream length significantly affects the energy consumption and accuracy of the entire DCNN, the first step is to apply binary search to choose a suitable bit stream length for a DCNN configuration (i.e., neuron structure configuration in each layer). Note that the DCNN configuration used in step 1 is not important as the results will be refined in the following steps. In step 2, under the fixed bit stream length, all the promising configurations are explored, where some configurations can be ruled out, e.g., all layers using MUX-base neurons is highly inaccurate and can be ruled out. Based on the results of step 2, the configurations with desirable performance will be selected, and in the following step 3, for each configuration, we try other bit stream lengths to see if better performance can be achieved. The final configuration of the DCNN is decided based on the result of step 3, and several more iterations may be needed to further refine the result by exploring more configurations. 106 Table 7.1: Comparison between APC-Based Neuron and MUX-Based Neuron using 1024 Bit Stream APC-based neuron MUX-based neuron Ratio of APC/MUX (%) Input size 16 32 64 16 32 64 16 32 64 Absolute error 0.15 0.16 0.17 0.29 0.56 0.91 51.94 27.56 18.34 Area (m 2 ) 209.9 417.6 543.2 110.7 175.3 279.8 189.7 238.2 194.1 Power (W ) 80.7 95.9 130.5 206.5 242.9 271.2 39.1 39.5 48.1 Energy (fJ) 177.4 383.7 548.1 110.0 169.1 238.9 161.3 226.9 229.5 Yes No Step 1: find a suitable bit stream length using binary search Step 2: explore configurations using the fixed bit stream Step 3: explore other bit streams for the promising configurations Constraints satisfied? Performance satisfactory? End Figure 7.8: Structure optimization method for the entire DCNN. 7.3 Experimental Results The LeNet5 DCNN used in this experiment is built with a 7841152028803200 800 500 10 configuration. The MNIST handwritten digit image dataset [Den12] consists of 60,000 training data and 10,000 testing data with 28x28 grayscale image and 10 classes is used in the experiments, and the network is trained with 20 epochs (batch size =500). We use Synopsys Design Compiler to synthesize the DCNNs with the 45nm Nangate Open Cell Library [nan09]. Table 7.2 concludes the configurations and performance for all the explored hardware-based DCNNs (No. 115) using the proposed structure optimization method, the 8 bit conventional binary pipelined baseline (No. 
16) and software-based DCNNs using CPU (No. 17) or GPU (No. 18) for comparison. Note that the power for soft- ware is estimated using Thermal Design Power (TDP), and the energy is calculated by multiplying the run time and TDP. 107 Without any loss of generality, we set the desired accuracy to be 4:5% error rate. In the first step of the proposed structure optimization method, the bit stream length is set to 1024 using binary search. In step 2, using the fixed bit stream length, all configu- rations are explored, as shown in Table 7.2 (No. 1 8). DCNNs in No. 1 5 are ruled out due to the low accuracy, and in step 3, the remaining promising DCNNs in No. 68 are explored using the decreased bit stream length 512 bits, where the results are given as DCNNs in No. 9 11. Since DCNNs No. 10 11 satisfy the accuracy constraint, we further reduce the bit stream to 256 to improve energy performance, where DCNNs in No. 12 13 provide the results. This time, only DCNNs in No. 13 (all APC-based neurons) meet the accuracy constraint ( 4:5%). Hence, the bit stream length is further reduced for DCNN in No. 13 so as to find the configuration that achieves the minimum energy while satisfying the accuracy constraint. The DCNNs that use more MUX-based neurons provide smaller footprints, which are suitable for area-constraint embedded systems, whereas the DCNNs with more APC-based neurons achieve better accuracy, energy and power, which are good for power/energy-constraint embedded systems. Given the constraint(s), the proposed structure optimization method can provide the DCNN configurations with satisfactory performance. For instance, DCNNs in No. 10; 11; 13 15 are all promising configura- tions found by the proposed method, given the accuracy constraint. Compared with the conventional 8-bit binary implementation, the presented hardware-based DCNN using SC (No. 15) achieves 55X, 151X, and 2X improvement in terms of area, power and energy, respectively, while the error is increased by 2:86%. 108 Table 7.2: Comparison among Various Hardware-Based DCNNs and Software-Based DCNNs No. Bit Layer 0, 1, 2 Error Area Power Energy Stream (%) (mm 2 ) (W ) (J) 1 1024 MUX, MUX, MUX 21.66 6.62 3.3 4.4 2 1024 MUX, MUX, APC 11.89 7.42 1.3 6.7 3 1024 MUX, APC, MUX 16.25 8.05 1.5 6.9 4 1024 MUX, APC, APC 8.68 8.85 1.7 8.7 5 1024 APC, MUX, MUX 7.69 11.75 2.6 12.0 6 1024 APC, MUX, APC 2.49 12.56 2.7 13.8 7 1024 APC, APC, MUX 4.32 13.18 3.0 14.0 8 1024 APC, APC, APC 1.70 13.98 3.1 15.8 9 512 APC, MUX, APC 4.66 12.56 2.7 6.9 10 512 APC, APC, MUX 4.45 13.18 3.0 7.0 11 512 APC, APC, APC 1.70 13.98 3.1 7.9 12 256 APC, APC, MUX 5.20 13.18 3.0 3.5 13 256 APC, APC, APC 2.00 13.98 3.1 4.0 14 128 APC, APC, APC 2.34 13.98 3.1 2.0 15 64 APC, APC, APC 4.40 13.98 3.1 1.0 16 8 bit fixed point binary(pipelined) 1.54 769.30 470.0 2.0 17 CPU: two Intel Xeon W5580 1.54 263 130.0 198200 18 GPU: NVIDIA Tesla C2075 1.54 520 225.0 96443 7.4 Conclusion In this chapter, two hardware-based neuron structures using SC were analyzed, and the influence of inaccuracy of neurons in different layers on the overall DCNN accuracy was studied. A structure optimization method was proposed for a general DCNN archi- tecture, which jointly optimized the accuracy, area, power, and energy. Experimental results demonstrated that compared with the binary ASIC DCNNs, the area, power and energy of the hardware-based DCNN generated by the proposed structure optimization were significantly improved, whereas the accuracy performance was slightly degraded. 
109 Chapter 8 Highly-Scalable Stochastic Computing Based Deep Convolutional Neural Network (SC-DCNN) Design and Optimization In the recent decade, deep learning, or deep structured learning, has emerged as a new area of machine learning research, which enables a system to automatically learn com- plex information and extract representations at multiple levels of abstraction [DY14]. Deep Convolutional Neural Network (DCNN), one of the most promising types of arti- ficial neural networks taking advantage of deep learning, has been recognized as the dominant approach for almost all recognition and detection tasks [LBH15]. Specif- ically, DCNN has achieved significant success in a wide range of machine learning applications, such as image classification [SZ14b], natural language processing [CW08], speech recognition [SMKR13], and video classification [KTS + 14]. High-performance server clusters are usually required for executing software-based DCNNs since software-based DCNN implementations involve a large amount of com- putations so as to achieve outstanding performance. However, the use of server clusters implies high power (energy) consumptions and large hardware volumes, and is there- fore inappropriate for low-power applications in personal and mobile devices, which are playing an increasingly important role in our everyday life and exhibit a notable trend 110 of being “smart”. To overcome the limitation of low-power and low-hardware foot- print implementations of DCNNs, utilizing highly-parallel or dedicated hardware has attracted much academic and industrial attention in recent years, including the works utilizing General-Purpose Graphics Processing Units (GPGPUs), Field-Programmable Gate Array (FPGAs), and Application-Specific Integrated Circuit (ASICs) to imple- ment DCNNs [LSN12, SDS15, ZLS + 15, MGAG16, ACRB16, TTYYN15, ASC + 15, SSSW08, HLC + 14, XLT + 16, CLL + 14, HLM + 16]. Despites the performance and power (energy) efficiency gains, a large margin of improvement still exists due to the inherent inefficiency in implementing DCNNs using conventional computing methods or using general-purpose computing devices [JRML15, KKY + 16]. 8.1 Introduction Novel computing paradigms need to be investigated in order to provide the ultra-low hardware footprint and therefore the highest possible energy efficiency and scalability. Stochastic Computing (SC), which represents a probability number using a bit-stream [Gai67], has the potential to implement DCNNs with significantly reduced hardware resources and achieve high power (energy) efficiency, and therefore can potentially trig- ger a revolutionary reshaping of hardware design of large-scale deep learning systems. To be more specific, in SC, key arithmetic calculations such as multiplications and addi- tions can be implemented as simple as AND gates and multiplexers (MUX), respectively [BC01b]. Considering the large number of multiplications and additions in DCNN, the efficient implementations using stochastic computing save a large design space for fur- ther improvements on the parallelism degree. Inspired by the promising characteristics, in this chapter, we propose the first com- prehensive design and optimization framework of SC-based DCNNs (SC-DCNNs), 111 using a bottom-up approach. The proposed SC-DCNN fully utilizes the advantages of SC technology, and could achieve ultra-low hardware footprint, low power and energy consumption, while maintaining high network accuracy level. 
Besides the SC-DCNN architecture itself, key contributions in the proposed design and optimization framework are listed as follows: Basic function blocks and hardware-oriented max pooling. We first design and investigate the function blocks that perform the basic operations, i.e., inner product, pooling, and activation functions, in DCNN. More specifically, we present a novel hardware-oriented max pooling design for effectively implement- ing (approximate) max pooling in SC domain. We thoroughly investigate the pros and cons of different types of function block implementations. Joint optimizations for feature extraction blocks. We propose the optimal designs of four types of combinations of basic function blocks, named feature extraction blocks, which are in charge of extracting features from input feature maps. The function blocks inside the feature extraction block are jointly opti- mized through both analysis and experiments with respect to input bit-stream length, function block structure, and function block compatibilities. Weight storage schemes. We present effective designs and optimizations on weight storage to reduce the corresponding area and power (energy) consump- tions, including efficient filter-aware SRAM sharing, effective weight storage methods, and layer-wise weight storage optimizations. Overall SC-DCNN optimization. We conduct thorough optimizations on the overall SC-DCNN, with feature extraction blocks carefully selected, to minimize area and power (energy) consumption while maintaining a high network accuracy 112 level. The optimization procedure leverages the important observation that hard- ware inaccuracies in different layers in DCNN have different effects on the overall network accuracy, therefore different designs may be exploited to minimize area and power (energy) consumptions. Ultra-low hardware footprint and low power (energy) consumptions. Overall, the proposed SC-DCNN achieves the lowest hardware cost and energy consump- tion in implementing LeNet5 compared with reference works. 8.2 Design and Optimization for Function Blocks and Feature Extraction Blocks in SC-DCNN In this section, we first perform comprehensive designs and optimizations in order to derive the most efficient SC-based implementations for function blocks, including inner product/convolution, pooling, and activation function, in terms of power, energy, and hardware resource, meanwhile maintaining a high accuracy level. Based on the detailed analysis of pros and cons of each type of basic function block design, we propose the optimal designs of feature extraction blocks for SC-DCNNs through both analysis and experiments. 8.2.1 Inner Product/Convolution Block Design As shown in Figure 2.3 (a), an inner product/convolution block in DCNNs is composed of multiplication and addition operations. Since in SC-DCNNs, inputs are distributed in the range of [-1, 1], we adopt the bipolar multiplication implementation (the XNOR gate) for the inner product block design. The summation of all products is performed by 113 the adder(s). In the SC domain, the addition operation has various possible implemen- tations, such as OR gate-based, multiplexer (MUX)-based, APC-based, and two-line representation-based adders. Therefore, we present and investigate four types of inner product block structures by replacing the summation unit in Figure 2.3 (a) with various adder implementations shown in Figure 2.5. 
Their pros and cons are carefully analyzed for the subsequent step of developing feature extraction blocks in SC-DCNNs. OR Gate-Based Inner Product Block Design. The idea of utilizing OR gate to perform addition is straightforward. For instance, 3 8 + 4 8 can be performed by “00100101 OR 11001010”, which generates “11101111” ( 7 8 ). However, the first input bit-stream can also be “10011000”, which makes the output of OR gate as “11011010” ( 5 8 ) and results in inaccuracy. The accuracy loss comes from the fact that ”logic 1 OR logic 1” only generates a single logic 1 without a carry bit. To reduce the accuracy loss, the input streams should be pre-scaled to ensure that there are only very few 1’s in the bit-streams. For the unipolar format bit-streams, the scaling can be easily performed by dividing the original number by a scaling factor. Nevertheless, in the scenario of bipolar encoding format, there are about 50% 1’s in the bit-stream when the original value is close to 0, which renders the scaling ineffective in reducing the number of 1’s in the bit-stream. Table 8.1 displays the average inaccuracies of OR gate-based inner product block with different input sizes, in which the bit-stream length is fixed at 1024 and all average inaccuracy values are obtained with the most suitable pre-scaling. The experimental results suggest that the accuracy of unipolar calculations may be acceptable, but the accuracy is too low for bipolar calculations and becomes even worse with the increase of input size. Since it is almost impossible to have only positive input values and weights, the OR gate-based inner product block is not appropriate in SC-DCNNs. MUX-Based Inner Product Block Design. According to [BC01b], ann-to-1 MUX can sum all inputs together and generate an output with a scaling down factor 1 n . Since 114 Table 8.1: Inaccuracies of OR Gate-Based Inner Product Block Input Size 8 16 24 32 Unipolar inputs 0.16 0.33 0.47 0.66 Bipolar inputs 1.04 1.39 1.54 1.70 Table 8.2: Inaccuracies of MUX-Based Inner Product Block Input size Bit stream length 512 1024 1536 2048 8 0.27 0.19 0.16 0.14 16 0.54 0.39 0.31 0.28 32 1.18 0.77 0.64 0.56 only one bit is selected among all inputs to that MUX at one time, the probability of each input to be selected is 1 n . The selection signal is controlled by a randomly generated natural number between 1 andn. Taking Figure 2.3 (a) as an example, the output of the summation unit (MUX) is 1 n (x 0 w 0 +::: +x n1 w n1 ). As displayed in Table 8.2, the average inaccuracies of the MUX-based inner product block are measured with different input sizes and bit-stream lengths. The accuracy loss of MUX-based block mainly comes from the fact that only one input is selected at one time, and all the other inputs are not used. The increasing input size causes accuracy reduction because more bits are dropped, but good enough accuracy can still be obtained by increasing the bit-stream length. Two-Line Representation-Based Inner Product Block. As mentioned above, the MUX-based adder is a down-scaled adder and the down-scaling is a main source of accuracy loss. Accordingly, reference [TQF00] proposed a two-line representation- based SC scheme that can be used to construct a non-scaled adder. Figure 2.5 (d) illus- trates the structure of a two-line representation-based adder. SinceA i , B i , andC i are bounded as the element off1; 0; 1g, a carry bit may be missed. Therefore, a three-state counter is used here to store the positive or negative carry bit. 
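As a small numerical illustration of the two scaled-addition behaviors discussed above (before turning to the limitations of the two-line representation), the following Python sketch encodes two values as unipolar bit-streams and compares OR-gate addition, which saturates, against MUX-based addition, which produces a correctly scaled average. The helper names and the chosen example values are illustrative assumptions of the sketch.

```python
import random

def unipolar_stream(p, length, rng):
    """Unipolar SC encoding: a value p in [0, 1] becomes a bit-stream with P(1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def or_add(s1, s2):
    """OR-gate 'addition': every coincident pair of 1s is merged (no carry), so the sum saturates."""
    return [a | b for a, b in zip(s1, s2)]

def mux_add(s1, s2, rng):
    """MUX-based addition: randomly selects one input per clock, giving (p1 + p2) / 2."""
    return [a if rng.random() < 0.5 else b for a, b in zip(s1, s2)]

rng = random.Random(0)
length = 1024
p1, p2 = 3 / 8, 4 / 8
s1 = unipolar_stream(p1, length, rng)
s2 = unipolar_stream(p2, length, rng)

or_result = sum(or_add(s1, s2)) / length          # expected p1 + p2 - p1*p2 = 0.6875, not 0.875
mux_result = sum(mux_add(s1, s2, rng)) / length   # expected (p1 + p2) / 2 = 0.4375, correctly scaled
print(f"OR-gate sum:  {or_result:.3f}  (ideal unscaled sum = {p1 + p2:.3f})")
print(f"MUX half-sum: {mux_result:.3f} (ideal scaled sum  = {(p1 + p2) / 2:.3f})")
```

In the terms used above, the OR-gate result is only close to the true sum when the streams are pre-scaled so that 1s rarely coincide, which fails for bipolar encoding where streams near zero are about half 1s; the MUX-based adder instead pays for its correct scaling by discarding all but one input bit per clock.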
115 3 2 2 2 1 2 1 2 A0 B0 A1 B1 A2 B2 A3 B3 A4 B4 A5 B5 A6 A7 B7 FA FA FA FA B6 0100...1011 1001...1101 1110...0100 0100...1011 0100...1011 1110...0100 0100...1011 1110...0100 0100...1011 1110...0100 0100...1011 1110...0100 0100...1011 1110...0100 0100...1011 1110...0100 Figure 8.1: 16-bit Approximate Parallel Counter. However, there are two limitations in utilizing the two-line representation-based inner product block in hardware DCNNs: (i) because an inner product block gen- erally has more than two inputs, the overflow problem often occurs in the two-line representation-based inner product calculation due to its non-scaling characteristics, which incurs significant accuracy loss, and (ii) the area overhead is too high compared with other inner product implementation methods. APC-Based Inner Product Block. The structure of an 16-bit APC is shown in Figure 8.1. A 0 A 7 and B 0 B 7 are the outputs of XNOR gates, i.e., the products of inputsx i ’s and weightsw i ’s. Suppose the number of inputs isn and the length of a bit-stream is m, then the products of x i ’s and w i ’s can be represented by a bit-matrix of sizenm. The function of the APC is to count the number of ones in one column and represent the result in the binary format, thereby the number of outputs is log 2 n. Taking a 16-bit APC as an example, the output should be 4-bit to represent a number between 0 - 16. However, please notice that the weight of the least significant bit is 2 1 rather than 2 0 to represent 16. Therefore, the output of the APC is a bit-matrix with size of log 2 nm. 116 Table 8.3: Inaccuracies of the APC-Based Compared with the Conventional Parallel Counter-Based Inner Product Blocks Input size Bit stream length 128 256 384 512 16 1.01% 0.87% 0.88% 0.84% 32 0.70% 0.61% 0.58% 0.57% 64 0.49% 0.44% 0.44% 0.42% From Table 8.3, we know that the APC (approximate parallel counter) only results in less than 1% accuracy degradation when compared with the conventional accumulative parallel counter, but it can achieve about 40% reduction of gate count [KLC15]. This observation illustrates the significant advantage for the goal of implementing an efficient inner product block in terms of power, energy, and hardware resource. 8.2.2 Pooling Block Designs Pooling (or down-sampling) operations are performed by pooling function blocks in DCNNs to significantly reduce (i) inter-layer connections and (ii) the number of param- eters and computations in the network, meanwhile maintaining the translation invariance of the extracted features [cs216]. Average pooling and max pooling are two widely used pooling strategies. Average pooling is straightforward to implement in the SC scheme, while max pooling, which exhibits higher performance in general, requires high hard- ware resources. In order to overcome this limitation, we propose and investigate a novel hardware-oriented max pooling with high performance and high compatibility to SC scheme. Details are discussed next. Average Pooling Block Design. Figure 2.3 (b) shows how the feature map is average pooled with 2 2 filters. Since average pooling is used to calculate the mean value of entries in a small matrix, the inherent down-scaling property of the MUX can be utilized. 117 Therefore, the average pooling can be performed by the structure as shown in Figure 7.6 with a very small hardware footprint. Hardware-Oriented Max Pooling Block Design. The max pooling operation has recently shown higher performance in practice when compared with the average pooling operation [cs216]. 
However, in the stochastic domain, we can find out the bit-stream with the maximum value among four candidates only after counting the total number of 1’s through the whole bit-streams, which will inevitably result in a long latency and notable energy consumption. To overcome this limitation, we propose a novel SC-based hardware-oriented max pooling scheme. The idea behind this design is that once a set of bit-streams are sliced into segments, the globally largest bit-stream (among the four candidates) has the high- est probability to be the locally largest one in each set of bit-stream segments. This is because all 1’s are randomly distributed in the stochastic bit-streams. Accordingly, all input bit-streams of the hardware-oriented max pooling block are sliced into segments with a fixed lengthc, e.g., 16 bits, and one segment is selected from each set (one set has four segments) to be sent to the output. To determine the selected segment in a set, all segments in a set are counted on the number of 1’s in parallel, and the maximum counted result is utilized to determine the nextc-bit segment that to be sent to the output of the pooling block. In other words, the currently selectedc-bit segment is determined by the counting results of the previous set. Please notice that thec-bit segment from the first set of bit-stream segments is randomly chosen to reduce the latency. This strategy will incur zero extra latency and will only cause a negligible accuracy loss whenc is a sufficiently small value compared with the bit-stream length. Figure 8.2 illustrates the structure of the hardware-oriented max pooling block, where the output frommax output approximately equals to the largest bit-stream. The four input bit-streams sent to the multiplexer are also connected to four counters, and the 118 A 1 4 to 1 Mux A 2 A 3 A 4 Out Comp- arator Counter Counter Counter Counter Sel Figure 8.2: The Proposed Hardware-Oriented Max Pooling. Table 8.4: Relative Result Deviation of Hardware-Oriented Max Pooling Block Com- pared with Software-Based Max Pooling Input size Bit-stream length 128 256 384 512 4 0.127 0.081 0.066 0.059 9 0.147 0.099 0.086 0.074 16 0.166 0.108 0.097 0.086 outputs of the counters are connected to a comparator to determine the largest segment. Then the output of the comparator is used to control the selection of the four-to-one MUX. Suppose in the previous set of segments, the second line is the largest, then MUX will output the second bit-stream for the currentc-bit segment. Table 8.4 shows the result deviations of the hardware-oriented max pooling design compared with the software-based max pooling implementation. The length of a bit- stream segment is 16. In general, the proposed pooling block can provide a sufficiently accurate result even with a large input size. 119 Table 8.5: The Relationship Between State Number and Relative Inaccuracy of Stanh State Number 8 10 12 14 16 18 20 Relative Inaccuracy (%) 10.06 8.27 7.43 7.36 7.51 8.07 8.55 8.2.3 Activation Function Block Designs Stanh. Reference [BC01b] proposed aK-state FSM-based design (i.e., Stanh) in the SC domain for implementing the tanh function, and also describes the relationship between Stanh and tanh asStanh(K;x) =tanh( K 2 x). When the input streamx is distributed in the range [-1, 1], i.e., K 2 x is distributed in the range [ K 2 ; K 2 ], this equation works well, and higher accuracy can be achieved with the increased state numberK. However,Stanh cannot be applied directly in our framework for three reasons. 
First, as shown in Figure 8.3 and Table 8.5 (with bit-stream length fixed at 8192), when the input variable of Stanh (i.e., K 2 x) is distributed in the range of [-1, 1], the inaccuracy is quite notable and is not suppressed with the increasing ofK. Second, the equation works well whenx is precisely represented. However, when the bit-stream is not imprac- tically long (less than 2 16 according to our experiments), the equation should be adjusted with a consideration of bit-stream length. Third, in the practice of implementing SC- DCNNs, we usually need to proactively down-scale the inputs since a bipolar stochastic number cannot reach beyond the range [-1, 1]. Besides, the stochastic number may be sometimes passively down-scaled by certain components, like a MUX-based adder or an average pooling block. A scaling-back process is thus imperative to obtain an accurate result. Based on the above reasons, the design of Stanh needs to be optimized together with other function blocks to achieve high accuracy for different bit-stream lengths and meanwhile provide a scaling-back function, with more details in Section 8.2.4. Btanh. Btanh is specifically designed for the APC-based adder to perform a scaled hyperbolic tangent function. Instead of using FSM, a saturated up/down counter is used here to convert the binary outputs of the APC-based adder back to a bit-stream. The 120 Figure 8.3: Output comparison of Stanh vs tanh. implementation details and the determination of state number can be found in reference [KKY + 16]. 8.2.4 Design and Optimization for Feature Extraction Blocks In this section, we propose and investigate the optimal designs of feature extraction blocks, which are in charge of extracting features from input feature maps. Based on the above analysis results, the MUX-based and APC-based inner product/convolution blocks, average pooling and hardware-oriented max pooling blocks, Stanh and Btanh blocks are selected as candidates for constructing feature extraction blocks, which are in charge of extracting features from input feature maps in SC-DCNNs (as shown in Fig- ure 8.4). Instead of simply composing the basic function blocks, a series of joint opti- mizations are performed on each type of feature extraction block to achieve the optimal performance. In the SC domain, factors like input size, bit-stream length, and the inaccu- racy introduced by the previous connected block can make a significant difference on the overall performance of a feature extraction block. Therefore, separate optimizations on each individual basic function block cannot guarantee to achieve the best performance for the entire feature extraction block. For example, the most important advantage of 121 Σ } Σ Σ Σ Pooling Activation X0 W0 X1 W1 X2 W2 X3 W3 Figure 8.4: The structure of a feature extraction block. the APC-based inner product block is high accuracy and thus the bit-stream length can be reduced; and the most important advantage of MUX-based inner product block is low hardware footprint and the accuracy can be improved by increasing the bit-stream length. Accordingly, in this work, we design feature extraction blocks with a consider- ation of fully making use of the advantages of each of the building blocks. For the convenience of following discussions, we define that MUX/APC represents the MUX-based or APC-based inner product/convolution blocks; Avg/Max represents the average or hardware-oriented max pooling blocks; Stanh/Btanh represents the cor- responding activation function blocks. 
For instance, MUX-Avg-Stanh means that four MUX-based inner product blocks, one average pooling block, and one Stanh activation function block are cascade-connected. MUX-Avg-Stanh. As mentioned in Section 8.2.3, when Stanh is utilized, the state number needs to be carefully selected with a comprehensive consideration of the scaling factor, bit-stream length, and accuracy requirement. Below is the empirical equation that is extracted from our comprehensive experiments to obtain the approximately optimal state numberK to achieve a high accuracy: K =f(L;N) 2 log 2 N + log 2 LN log 2 N ; (8.1) 122 _ X S0 S1 SK/5-1 SK/5 SK-2 SK-1 X _ X X X X X _ X _ X _ X _ Z=0 Z=1 X X Figure 8.5: Structure of optimized Stanh for MUX-Max-Stanh. where the nearest even number of the result calculated by the above equation is assigned toK,N is the input size,L is the bit-stream length, and empirical parameter = 33:27. MUX-Max-Stanh. The hardware-oriented max pooling block shown in Figure 8.2 in most cases generates an output that is slightly less than the maximum value. In this design of feature extraction block, the inner products are all scaled down by a factor ofn (n is the input size), and the subsequent scaling back function of Stanh will enlarge the inaccuracy, especially when the positive/negative sign of the selected maximum inner product value is changed. For instance, 505/1000 is a positive number, and 1% under- counting will lead the output of the hardware-oriented max pooling unit to be 495/1000, which is a negative number. Thereafter, the obtained output of Stanh may be -0.5, but the expected result should be 0.5. Therefore, the bit-stream has to be long enough to diminish the impact of under-counting, and the Stanh needs to be re-designed to fit the correct (expected) results. As shown in Figure 8.5, the redesigned FSM for Stanh will output zero when the current state is at the left 1/5 of the diagram, otherwise output a one. The optimal state numberK is calculated through the following empirical equation derived from experiments: K =f(L;N) 2 (log 2 N + log 2 L) log 2 N log 5 L ; (8.2) where the nearest even number of the result calculated by the above equation is assigned toK,N is the input size,L is the bit-stream length, = 37, and empirical parameter = 16:5. 123 APC-Avg-Btanh. When the APC is used to construct the inner product block, con- ventional arithmetic calculation components, such as full adders and dividers, can be utilized to perform the averaging calculation, because the output of APC-based inner product block is a binary number. Since the design of Btanh initially aims at directly connecting to the output of APC, and an average pooling block is now inserted between APC and Btanh, the original formula proposed in [KKY + 16] for calculating the optimal state number of Btanh needs to be reformulated as: K =f(N) N 2 ; (8.3) from our experiments. In this equationN is the input size, and the nearest even number to N 2 is assigned toK. APC-Max-Btanh. Although the output of APC-based inner product block is a binary number, the conventional binary comparator cannot be directly utilized to per- form max pooling. This is because the output sequence of APC-based inner product block is still a stochastic bit-stream. If the maximum binary number is selected at each time, the pooling output is always greater than the actual maximum inner product result. 
Instead, the proposed hardware-oriented max pooling design should be utilized here, and the counters should be replaced by accumulators for accumulating the binary numbers. Benefited from the high accuracy provided by accumulators in selecting the maximum inner product result, the original Btanh design presented in [KKY + 16] can be directly utilized without adjustment. 8.3 Weight Storage Scheme and Optimization As introduced in Section 8.2, the main computing task of an inner product block is to calculate the inner products ofx i ’s andw i ’s. x i ’s are input by customers, butw i ’s are 124 weights obtained by training using software and should be stored in the hardware-based DCNNs. Static random access memory (SRAM) is the most appropriate circuit structure for weight storage due to its high reliability, high speed, and small area. And specifically optimized SRAM placement schemes and weight storage methods are imperative for further reductions of area and power (energy) consumptions. In this section, we present optimization techniques including efficient filter-aware SRAM sharing, weight storage method, and layer-wise weight storage optimizations. 8.3.1 Efficient Filter-Aware SRAM Sharing Scheme Since all receptive fields of a feature map share one filter (a matrix of weights), all weights functionally can be separated into filter-based blocks and each block of weights are shared by all inner product/convolution blocks using the corresponding fil- ter. Inspired by this fact, we propose an efficient filter-aware SRAM sharing scheme, with structure illustrated in Figure 8.6. The scheme divides the whole SRAM into small blocks to mimic filters. Besides, all inner product blocks can also be separated into fea- ture map-based groups, and each group of inner product blocks takes charge of extract- ing one feature map. Therefore, a local SRAM block is shared by all the inner product blocks of the corresponding group, and then the weights of the corresponding filter are stored into the local SRAM block of this group. This scheme can significantly reduce the routing overhead and wire delay. 8.3.2 Weight Storage Method Except for the reduction on routing overhead, the size of SRAM blocks can also be reduced by trading off accuracy and hardware resources. The trading off is implemented by eliminating certain least significant bits of a weight value to reduce the SRAM size. 125 SRAM SRAM SRAM ... X 1 Pooling X 3 X 0 X 2 Activ- ation W W W W Figure 8.6: Filter-Aware SRAM Sharing Scheme. Accordingly, we present a weight storage method for significantly reducing the SRAM size with little accuracy loss. Baseline: High Precision Weight Storage. In general, DCNN will be trained with single floating point precision. Thus on hardware, up to 64-bit SRAM is needed for storing one weight value in the fixed point format to maintain its original high precision. This scheme can provide high accuracy as there is almost no information loss of weights. However, it also brings about high hardware consumptions in that the size of SRAM and its related read/write circuits is increasing with the increasing of precision of the stored weight values. Low Precision Weight Storage Method. According to our software-level experi- ments, many least significant bits far from the decimal point only have a very limited impact on the overall network accuracy, thus the number of bits for weight representa- tion in the SRAM block can be significantly reduced. 
We propose a mapping equation that converts a weight in the real number format to the binary number stored in SRAM to eliminate the proper numbers of least significant bits. Suppose the weight value isx, and the number of bits to store a weight value in SRAM isw (which is defined as the precision of the represented weight value in this chapter), then the binary number to be stored for representingx is: y = Int( x+1 2 2 w ) 2 w ; (8.4) 126 where Int() means only keeping the integer part. Figure 8.8 illustrates the network error rates when the reductions of weights’ precision are performed at a single layer or all layers. The precision loss of weights at Layer0 (consisting of a convolutional layer and pooling layer) has the least impact, while the precision loss of weights at Layer2 (a fully connected layer) has the most significant impact. The reason is that Layer2 is the fully connected layer, which has the largest number of weights. On the other hand, when w is set equal to or greater than seven, the network error rates are low enough and almost not decreasing with the further increasing of precision. Therefore, our proposed weight storage method can significantly reduce the size of SRAMs and their read/write circuits through decreasing the precision. The area savings achieved by this method based on estimations from CACTI 5.3 [TMAJ08] is 10.3. 8.3.3 Layer-Wise Weight Storage Optimization As shown in Figure 8.7, the inaccuracies at different layers have different impacts on the overall accuracy of the network. Layer0 is the most sensitive to inaccuracies, and Layer2 is the least sensitive to inaccuracies. Interestingly, this is the opposite to the sensitivity to precision as shown in Figure 8.8. Combining the observations from Figure 8.7 and Figure 8.8, we propose a layer-wise weight storage scheme, which sets different weight precisions at different layers. More specifically, we set the weights at Layer2 to be a relatively low precision but higher than four, while setting the weights of the previous layers with a relatively high precision to compensate the accuracy loss, so as to maintain the overall high network accuracy. This method is effective to obtain savings in SRAM area and power (energy) consumptions because Layer2 has the most number of weights compared with the previous layers. For instance, when we set weights as 7-7-6 at the three layers of LeNet5, the network error rate is 1.65%, which has only 127 Figure 8.7: The impact of inaccuracies at each layer on the overall SC-DCNN network accuracy. Figure 8.8: The impact of precision of weights at different layers on the overall SC- DCNN network accuracy. 0.12% accuracy degradation compared with the error rate obtained on software. How- ever, 12 improvements on area and 11.9 improvements on power consumptions are achieved for the weight representations (from CACTI 5.3 estimations), comparing with the baseline without any reduction in weight representation bits. 128 Figure 8.9: Input size versus absolute inaccuracy for (a) MUX-Avg-Stanh, (b) MUX- Max-Stanh, (c) APC-Avg-Btanh, and (d) APC-Max-Btanh with different bit stream lengths. Figure 8.10: Input size versus (a) area, (b) path delay, (c) total power, and (d) total energy for four different designs of feature extraction blocks. 8.4 Overall SC-DCNN Optimizations and Results In this section, we first present optimizations of feature extraction blocks along with comparison results with respect to accuracy, area/hardware footprint, power (energy) consumption, etc. 
Based on the results, we perform thorough optimizations on the over- all SC-DCNN to construct LeNet5 structure, which is one of the most well-known large- scale deep DCNN structure, to minimize area and power (energy) consumption while maintaining a high network accuracy level. Comprehensive comparison results are pro- vided among SC-DCNN designs (with different target network accuracy levels) and with existing hardware platforms. The hardware performance of the SC-DCNNs regarding area, path delay, power and energy consumptions are obtained by: (i) synthesizing with the 45nm Nangate Open Cell Library [nan09] using Synopsys Design Compiler, (ii) estimating using CACTI 5.3 [TMAJ08] for the SRAM blocks. Key peripheral circuitry 129 in the SC domain, e.g., the random number generators, are developed using the design in [KLC16] and synthesized using Synopsys Design Compiler. 8.4.1 Optimization Results on Feature Extraction Blocks We present optimization results of feature extraction blocks under different structures, input sizes, and bit-stream lengths on accuracy, area/hardware footprint, power (energy) consumption, etc. Figure 8.9 illustrates the accuracy performance of four types of feature extraction blocks: MUX-Avg-Stanh, MUX-Max-Stanh, APC-Avg-Btanh, and APC-Max-Btanh. The horizontal axis represents the input size that increases logarith- mically from 16 (2 4 ) to 256 (2 8 ). The vertical axis represents the hardware inaccuracies of feature extraction blocks. Three bit-stream lengths are tested and their impacts are shown in the figure. Figure 8.10 illustrates the comparisons among four feature extrac- tion blocks with respect to area, path delay, power, and energy consumptions, and the horizontal axis represents the input size that increases logarithmically from 16 (2 4 ) to 256 (2 8 ). The bit-stream length is fixed at 1024. MUX-Avg-Stanh. From Figure 8.9-(a), we know that it has the worst accuracy performance among the four structures. Because MUX-based adder as mentioned in Section 8.2 is a down-scaling adder and incurs inaccuracy because of information loss. Besides, average pooling is performed with MUXes, thus the inner products are further down-scaled and more inaccuracies are incurred. As a result, this structure of feature extraction block is only appropriate for dealing with receptive fields with a small size. On the other hand, it also possesses advantages in that it is the most area and energy efficient design with the smallest path delay. Hence, it is appropriate for scenarios with tight limitations on area and delay. MUX-Max-Stanh. Figure 8.9-(b) shows that it has a better performance in terms of accuracy when compared with the MUX-Avg-Stanh. The reason is that the mean of 130 four numbers is generally closer to zero than the maximum value of the four numbers. As mentioned in Section 8.2, minor inaccuracies on the stochastic numbers near zero can cause significant inaccuracies on the outputs of feature extraction blocks. Thus the structures with hardware-oriented pooling are more resilient than the structures with average pooling. In addition, the accuracy can be significantly improved by increasing the bit-stream length, thus this structure can be applied for dealing with the receptive fields with both small and large sizes. With respect to area, path delay, and energy, its performance is just second to the MUX-Avg-Stanh and close enough. Despite its relatively high power consumption, the power can be remarkably reduced by trading-off with the path delay. APC-Avg-Btanh. 
Figure 8.9-(c) and 8.9-(d) illustrate the hardware inaccuracies of APC-based feature extraction blocks. The results imply that they significantly outper- form the MUX-based feature extraction blocks in terms of accuracy, since the APC- based inner product blocks maintain most information of inner products and thus gener- ate results with high accuracy, which is the drawback of the MUX-based inner product blocks. On the other hand, APC-based feature extraction blocks consume more hard- ware resources and result in much longer path delays as well as energy consumptions. The long path delay is also the reason that their power consumptions are lower than MUX-based designs. Therefore, the APC-Avg-Btanh is appropriate for DCNN imple- mentations that have a tight specification on the accuracy performance and have a rela- tive loose hardware resource constraint. APC-Max-Btanh. Figure 8.9-(d) indicates that this feature extraction block design has the best accuracy performance since: First, it is an APC-based design. Second, the average pooling in the APC-Avg-Btanh causes more information loss than the proposed hardware-oriented max pooling. To be more specific, the fractional part of the number after average pooling is dropped, e.g., the mean of (2, 3, 4, 5) is 3.5, but it will be 131 Table 8.6: Comparison among Various SC-DCNN Designs Implementing LeNet 5 No. Pooling Bit Configuration Performance Stream Layer 0 Layer 1 Layer 2 Inaccuracy (%) Area (mm 2 ) Power (W ) Delay (ns) Energy (J) 1 Max 1024 MUX MUX APC 2.64 19.1 1.74 5120 8.9 2 MUX APC APC 2.23 22.9 2.13 5120 10.9 3 512 APC MUX APC 1.91 32.7 3.14 2560 8.0 4 APC APC APC 1.68 36.4 3.53 2560 9.0 5 256 APC MUX APC 2.13 32.7 3.14 1280 4.0 6 APC APC APC 1.74 36.4 3.53 1280 4.5 7 Average 1024 MUX APC APC 3.06 17.0 1.53 5120 7.8 8 APC APC APC 2.58 22.1 2.14 5120 11.0 9 512 MUX APC APC 3.16 17.0 1.53 2560 3.9 10 APC APC APC 2.65 22.1 2.14 2560 5.5 11 256 MUX APC APC 3.36 17.0 1.53 1280 2.0 12 APC APC APC 2.76 22.1 2.14 1280 2.7 Table 8.7: Comparison with Existing Hardware Platforms Platform Dataset Network Year Platform Area Power Accuracy Throughput Area Efficiency Energy Efficiency Type Type (mm 2 ) (W) (%) (Images/s) (Images/s/mm 2 ) (Images/J) SC-DCNN (No.6) MNIST CNN 2016 ASIC 36.4 3.53 98.26 781250 21439 221287 SC-DCNN (No.11) 2016 ASIC 17.0 1.53 96.64 781250 45946 510734 2Intel Xeon W5580 2009 CPU 263 156 98.46 656 2.5 4.2 Nvidia Tesla C2075 2011 GPU 520 202.5 98.46 2333 4.5 3.2 Minitaur [NL14] ANN 1 2014 FPGA N/A 1.5 92.00 4880 N/A 3253 SpiNNaker [SNG + 15] DBN 2 2015 ARM N/A 0.3 95.00 50 N/A 166.7 TrueNorth [EAM + 15, EMA + 16] SNN 3 2015 ASIC 430 0.18 99.42 1000 2.3 9259 DaDianNao [CLL + 14] ImageNet CNN 2014 ASIC 67.7 15.97 N/A 147938 2185 9263 EIE-64PE [HLM + 16] CNN layer 2016 ASIC 40.8 0.59 N/A 81967 2009 138927 represented as 3 in binary format, thus some information is lost during the average pooling. Generally, the increase of input size will incur significant inaccuracies except for APC-Max-Btanh. The reason that APC-Max-Btanh performs better with more inputs is: more inputs will make the four inner products sent to the pooling function block more distinct from one another, i.e., more inputs result in higher accuracy in selecting the maximum value. The drawbacks of APC-Max-Btanh are also distinct. It has the highest area and energy consumptions, and its path delay is just second to and very close to the APC-Avg-Btanh. Besides, its power consumption is just second to and close to the MUX-Max-Stanh. 
Accordingly, this design is appropriate for the applications that have a very tight requirement on the accuracy performance. 132 8.4.2 Overall Optimizations and Results on SC-DCNNs Based on the results on feature extraction blocks, we perform thorough optimizations on the overall SC-DCNN to construct the LeNet 5 DCNN structure, to minimize area and power (energy) consupmtion while maintaining a high network accuracy. The four types of feature extraction blocks, the basic function blocks, and the weight storage schemes are carefully compared and selected in the procedure. The (max pooling-based or aver- age pooling-based) LeNet 5 is a widely-used DCNN structure [LeC15] with a config- uration of 784-11520-2880-3200-800-500-10. The SC-DCNNs are evaluated with the MNIST handwritten digit image dataset [Den12], which consists of 60,000 training data and 10,000 testing data. The baseline error rates of the max pooling-based and average pooling-based LeNet5 DCNNs using software implementations are 1.53% and 2.24%, respectively. In the optimization procedure, we set a threshold on the error rate difference as 1.5%, i.e., the network accuracy degradation of the SC-DCNNs cannot exceed 1.5% compared with the error rates when tested using software. We set the maximum bit-stream length as 1024 to avoid over-long delays. In the optimization procedure, for the configurations that achieve the target network accuracy, the bit-stream length is reduced by half in order to reduce energy consumptions. Configurations are removed if they fail to meet the network accuracy goal. The process is iterated until no configuration is left. Table 8.6 displays some selected typical configurations and their comparison results (including the consumptions of SRAMs and random number generators). Configura- tions No.1-6 are max pooling-based SC-DCNNs, and No.7-12 are average pooling- based SC-DCNNs. It can be observed that the configurations involving more MUX- based feature extraction blocks achieve less hardware footprint, and those involving more APC-based feature extraction blocks achieve higher network accuracy. For the 133 max pooling-based configurations, No.1 is the most area efficient as well as power effi- cient configuration, and No.5 is the most energy efficient configuration. With regard to the average pooling-based configurations, No.7, 9, 11 are the most area efficient and power efficient configurations, and No.11 is the most energy efficient configuration. Table 8.7 displays the comparison of our proposed SC-DCNNs with software imple- mentations using Intel Xeon Dual-Core W5580 or Nvidia Tesla C2075 GPU as well as existing hardware platforms. For example, EIE [HLM + 16]’s performance was eval- uated on a fully connected layer of AlexNet [KSH12]; the state-of-the-art platform DaDianNao [CLL + 14] proposed an ASIC “node” that could be connected in parallel to implement a large-scale DCNN; and other hardware platforms implement different types of hardware neural networks such as spiking neural network or deep-belief net- work. Configurations No.6 and No.11 are selected for comparison, since No.6 is the most accurate max pooling-based configuration and No.11 is the most energy efficient average pooling-based configuration. When comparing with software implementation on CPU server or GPU, the pro- posed SC-DCNNs are much more area efficient, with improvements up to 30.6 by comparing SC-DCNN (No.11) with Nvidia Tesla C2075. 
Besides, our proposed SC- DCNNs also have outstanding performance in terms of throughput, area efficiency, and energy efficiency. The proposed SC-DCNN (No.11) achieves 15625 throughput improvements and 159604 energy efficiency improvements, comparing with Nvidia Tesla C2075. Regarding the reference hardware platforms, although not directly com- parable to some extent due to the difference in neural network types and structures, the proposed SC-DCNN achieves the lowest hardware footprint, the highest throughput, 1 ANN: Artificial Neural Network; 2 DBN: Deep Belief Network; 3 SNN: Spiking Neural Network 134 and the highest energy efficiency. Despite the fact that the power performance of SC- DCNNs is not the best, it is still comparable with other ASIC platforms like DaDianDao, EIE and Minitaur. 8.5 Conclusion In this chapter, a comprehensive SC-DCNN architecture is explored to achieve high power (energy) efficiency and low hardware footprint. First, various function blocks involving inner product calculations, pooling operations, and activation functions are investigated. Then four types of feature extraction blocks, which are constructed with the carefully selected function blocks, are proposed and jointly optimized to achieve the optimal accuracy. And three weight storage optimization schemes are investigated for reducing the area and power (energy) consumptions of SRAM. Experimental results demonstrate that our proposed SC-DCNN achieves ultra-low hardware footprint and low energy consumptions. It achieves the throughput of 781250 images/s, area efficiency of 45946 images/s/mm2, and energy efficiency of 510734 images/J. 135 Chapter 9 Hardware-Driven Nonlinear Activation for Stochastic Computing Based Deep Convolutional Neural Networks Deep learning has achieved unprecedented success in solving problems that have resisted the best attempts of the artificial intelligence community for many years [Ben09, LBH15]. Artificial neural networks with deep learning can automatically extract a hierarchy of representations by composing nonlinear modules, which trans- form the representations at one level (starting from raw input, e.g., pixel values of an image) into representations at a more abstract and more invariant level [LBH15]. Arbi- trarily complex functions can be approximated in a sufficiently large neural network using nonlinear activation functions [CS10]. However, in practice the sizes of networks are finite, and the choice of nonlinearity affects both the learning dynamics and the network’s expressive power [AHSB14]. Recently, the Deep Convolutional Neural Network (DCNN) has achieved break- throughs in many fundamental applications, such as image/video classification [SZ14b, KTS + 14] and visual/text recognition [DJV + 14, LLQ16]. DCNN is now recognized as the dominant method for almost all recognition and detection tasks and surpasses human performance on certain tasks [LBH15]. Since the deep layered structures require a large amount of computation resources, from a practical standpoint, large-scale DCNNs are mainly implemented in high performance server clusters [ALPO + 15]. The huge 136 power/energy consumptions of software DCNNs prevent their widespread deployment in wearable and Internet of Things (IoT) devices, which emerge with repercussions across the industry spectrum [KKY + 16, MHC + 18]. Hence, there is a timely need to map the latest DCNNs to application-specific hardware, in order to achieve orders of magnitude improvement in performance, energy efficiency and compactness. 
Many existing works have explored hardware implementations of DCNNs using GPUs [KSH12] and FPGAs [RLC16, MGAG16]. Nevertheless, more efficient designs are required to resolve the conflict between resource-hungry DCNNs and resource-constrained IoT devices. Stochastic Computing (SC), which uses the probability of 1s in a random bit stream to represent a number, has the potential to enable massively parallel and ultra-low-footprint hardware-based DCNNs [LRL + 17a, BC01a, LRL + 16, KKY + 16, RLL + 17b]. In SC, arithmetic operations such as multiplication can be performed using simple logic elements, and SC provides better soft error resiliency [LD16b, AH13, BC01a, LD16a]. In this regard, considerable effort has been invested in designing Artificial Neural Networks (ANNs), Deep Belief Networks (DBNs), and DCNNs using SC components [KKY + 16, JRML15, SGMA15, LRL + 17a, LRL + 16, RLL + 17b, LRL + 17b].

One key challenge is designing an accurate nonlinear activation function. A small imprecision induced by the convolution and down-sampling operations can be significantly amplified by an inaccurate nonlinear activation function, propagated to the subsequent neurons, and further amplified by the activation functions in the following neurons. Hence, without an accurate activation, the network accuracy can easily decrease to an unacceptable level. Besides, while the type of activation has a significant impact on the performance of DCNNs, only two basic Finite State Machine (FSM) based hyperbolic tangent (tanh) activations have been designed, for ANNs and DBNs, in [BC01a, KKY + 16], whilst other possible activations have hardly been explored, especially the Rectified Linear Units (ReLUs), which are essential for state-of-the-art neural networks [HZRS15].

9.1 Introduction

In this chapter, we first propose three accurate SC-based neuron designs for DCNNs with the popular activation functions that are widely used in software, i.e., tanh, logistic (or sigmoid), and ReLU. The SC configuration parameters are jointly optimized considering the down-sampling operation, block size, and connection patterns in order to yield the maximum precision. Then we conduct a comprehensive comparison of the aforementioned neuron designs using different activations under different input sizes and stochastic bit stream lengths. After constructing the DCNNs using the proposed neurons, we further evaluate and compare the network performance of the DCNNs. Experimental results demonstrate that the proposed SC-based DCNNs have much smaller area and power consumption than the binary ASIC DCNNs, with up to 61X, 151X, and 2X improvement in terms of area, power, and energy, respectively, at the cost of 0.0001 to 0.03 accuracy degradation. Moreover, the SC approach achieves improvements of up to 21X and 41X in area, 41X and 72X in power, and 198200X and 96443X in energy, compared with the CPU and GPU approaches, respectively, while the test error is increased by less than 3.07%. ReLU activation is suggested for future SC-based DCNNs considering its superior accuracy, area, and energy performance under a small bit stream length.

9.2 Related Work

9.2.1 Activation Function Studies

The hyperbolic tangent has generally been shown to provide more robust performance than the logistic function in DBNs [MHN]. However, both suffer from the vanishing gradient problem, resulting in a slowed training process or convergence to a poor local minimum [MHN]. ReLUs do not have this problem, since an activated unit gives a constant gradient of 1.
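As a quick numerical illustration of this vanishing-gradient argument (a small sketch using the standard textbook derivatives, not code from this work), the gradients of tanh and the logistic function shrink toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input:

    import math

    def dtanh(x):
        return 1.0 - math.tanh(x) ** 2           # tanh'(x)

    def dlogistic(x):
        s = 1.0 / (1.0 + math.exp(-x))
        return s * (1.0 - s)                     # logistic'(x)

    def drelu(x):
        return 1.0 if x > 0 else 0.0             # ReLU'(x)

    for x in (0.5, 2.0, 5.0):
        print(x, dtanh(x), dlogistic(x), drelu(x))
    # At x = 5 the tanh and logistic gradients are about 1.8e-4 and 6.6e-3,
    # whereas the ReLU gradient is still 1, which is why ReLU avoids the
    # slowed training discussed above.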
Recently, the parametric ReLU proposed by K. He et al. was reported to surpass human-level performance on the ImageNet large scale visual recognition chal- lenge [HZRS15]. 9.2.2 Hardware-Based DCNN Studies In order to exploit the parallelism and reduce the area, power, and energy, many existing hardware-based DCNNs have come into existence, including GP-GPUs based DCNNs [CMM + 11] and FPGA-based DCNNs [MGAG16, RLC16]. GP-GPU was the most commonly used platform for accelerating DBNs around 2011 [CMM + 11]. As for FPGA, M. Motamedi et al. developed an FPGA-based accelerator to meet performance and energy-efficiency constraints of DCNNs [MGAG16]. Recently, SC becomes a very attractive candidate for implementing ANNs and DBNs. Y . Ji et al. applied SC to a radial basis function ANN and significantly reduced the required hardware in [JRML15], and hardware-oriented optimization for SC based DCNN was developed in [LRL + 16]. K. Kim et al. proposed hardware-based DBN using SC components, in which a SC based neuron cell was designed and optimized [KKY + 16]. A multiplier SC neuron and a structure optimization method were proposed in [LRL + 17a] for DCNN. A. Ren et al. 139 developed a DCNN architecture with weight storage optimization and a novel max pool- ing design in the SC domain [RLL + 17b]. Besides, Z. Li et al. explored eight different neuron implementations in DCNNs using SC [LRL + 17b]. However, no existing works have explored the ReLU and logistic activations (popu- lar activation choices in software) for hardware based DCNNs using SC. 9.3 Overview of Hardware-Based DCNN 9.3.1 General Architecture of DCNNs Figure 9.1 shows a general DCNN architecture that is composed of a stack of convolu- tional layers, pooling layers, and fully connected layers. In each convolutional layer, common patterns in local regions of inputs are extracted by convolving a filter over the inputs. The convolution result is stored in a feature map as a measure of how well the filter matches each portion of the inputs. After convo- lution, a subsampling step is performed by the pooling layer to aggregate statistics of these features, reduce the dimensions of data and mitigate over-fitting issues. A non- linear activation function is applied to generate the output of the layer. The stack of convolutional and pooling layers is followed by fully connected layers, which further Figure 9.1: A general DCNN architecture. 140 Figure 9.2: A hardware-based neuron cell in DCNN. aggregate the local information learned in the convolutional and pooling layers for class discrimination. By alternating the topologies of convolutional and pooling layers, powerful DCNNs can be built for specific applications, such as LeNet [LBBH98], AlexNet [KSH12], and GoogLeNet [SLJ + 15]. With no loss of generality, we use LeNet-5 (i.e., the fifth generation of LeNet for digits recognition) in our discussion and experiments throughout the chapter, and the proposed design methodology can accommodate other DCNNs as well. 9.3.2 Hardware-Based Neuron Cell As the basic building block in DCNNs, a neuron performs three fundamental operations, i.e., convolution, pooling, and activation, as shown in Figure 9.2. The convolution operation calculates the inner products of input (x 0 i s) and weights (w 0 i s), and the pooling operation performs sub-sampling for the inner product results, i.e., generating one result out of several inner products. There are two conventional choices for pooling: max and average. 
The former chooses the largest element in each pooling region, whereas the latter calculates the arithmetic mean. An activation function is applied before the output, with tanhf(x) = tanh(x), logistic functionf(x) = (1 + e x ) 1 , and ReLUf(x) =max(0;x) being popular choices. 141 From a hardware design perspective, convolution and pooling can be implemented efficiently using basic building blocks (described in Section 9.4.1) to achieve accurate results. The nonlinear activation function, on the other hand, is usually implemented using Look-Up Tables (LUTs), requiring large memories to achieve adequate precision. Moreover, without careful design and optimization, the nonlinear activation function can result in serious accuracy degradation as it directly affects the final output accuracy of the neuron cell and the imprecision can be amplified by the subsequent calculations and activations. Hence, it is imperative to design and optimize the activation function to achieve sufficient accuracy. 9.4 Proposed Hardware-Driven Nonlinear Activation for DCNNs 9.4.1 Stochastic Computing for Neuron Design Stochastic computing (SC) is an encoding scheme that represents a numeric value x2 [0; 1] using a bit stream X, in which the probability of ones is x [BC01a]. For instance, the bit stream “01000” contains a single one in a five-bit stream, thus it rep- resents x = P (X = 1) = 0:2. In addition to this unipolar encoding format, another commonly used format is bipolar format, where a numeric value x2 [1; 1] is pro- cessed byP (X = 1) = x+1 2 . Using bipolar format, the previous example 0:2 could be represented by “10110”. Next, we present the fundamental operations of a neuron using SC components. Convolution. Convolution performs multiplication and addition. An XNOR gate, as shown in Figure 9.3 (a), is used for multiplication with bipolar stochastic encoding as c = 2P (C = 1) 1 = 2(P (A = 1)P (B = 1) +P (A = 0)P (B = 0)) 1 = 142 (2P (A = 1) 1)(2P (B = 1) 1) = ab. Multiplexers (MUXes) can perform bipolar SC addition [BC01a], which randomly selects one input as output. When input size is large, summation using MUXes incurs significant accuracy loss since only one input is selected at a time and all the other inputs are not used. To achieve a good trade- off between accuracy and hardware cost in terms of area, power, and energy, we adopt the Approximate Parallel Counter (APC) developed in [KLC15] for addition instead of MUXes. Pooling. In the SC domain, average pooling can be implemented efficiently with a simple MUX for stochastic inputs. However, as the outputs of APCs (i.e., inputs of pool- ing) are binary, MUX cannot be applied here. Instead, we use a binary adder to calculate the sum and remove the last 2 bits of the sum as a division by 4 operation, as illustrated in Figure 9.3 (b). Note that 4-to-1 average pooling is used in the LeNet-5, whereas max pooling is used in other DCNNs, such as AlexNet [KSH12] and GoogLeNet [SLJ + 15]. The SC-based max pooling design is developed in [RLL + 17b]. Activation. Nonlinear activation is the key operation that enhances the represen- tation capability of a neuron. There are many types of activation functions that have been explored in software DCNNs. Nevertheless, due to the design difficulty, only tanh is designed in the SC domain (e.g., an FSM-based tanh functionStanh() proposed in [BC01a], as shown in Figure 9.3 (c)), and there is a serious lack of studies on other activation functions for SC-based DCNNs. 
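To make the bipolar encoding and XNOR-based multiplication described above concrete, the following behavioral Python sketch is provided (an illustration only; generating stream bits by comparing against uniform random samples is an assumption of the sketch, not the exact stochastic number generator hardware used in this work):

    import random

    def to_bipolar_stream(x, m):
        # Encode x in [-1, 1] as an m-bit stream with P(bit = 1) = (x + 1) / 2.
        p = (x + 1) / 2
        return [1 if random.random() < p else 0 for _ in range(m)]

    def from_bipolar_stream(bits):
        # Decode: x = 2 * P(bit = 1) - 1.
        return 2 * sum(bits) / len(bits) - 1

    def xnor_multiply(a_bits, b_bits):
        # Bit-wise XNOR implements bipolar multiplication: c = a * b.
        return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

    m = 1024
    a, b = 0.6, -0.5
    prod = from_bipolar_stream(xnor_multiply(to_bipolar_stream(a, m),
                                             to_bipolar_stream(b, m)))
    # prod fluctuates around a * b = -0.3; longer streams reduce the error.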
For the convenience of discussions, we use the naming conventions in Table 9.1. 9.4.2 Proposed Neuron Design and Nonlinear Activation We found that directly applying SC to DCNNs leads to severe accuracy degradation which is not acceptable in common cases. The main reason is that the calculation of each computation block (convolution, pooling, and activation) is inaccurate in the SC 143 Figure 9.3: Stochastic computing for neuron design: (a) XNOR gate for bipolar multi- plication, (b) binary adder for average pooling, and (c) FSM-based tanh for stochastic inputs. domain. The overall accuracy of a neuron is affected by number of input streams, bit stream length, hardware configurations, and most importantly the activation functions. In this chapter, we design and optimize the neuron structures as well as the activations by jointly considering all the aforementioned factors in order to reach the accuracy level that can be achieved by fixed point binary arithmetic. To be more specific, we adopt bipolar encoding scheme and average pooling. The proposed neuron designs use XNOR gates and APCs for addition and multiplication (as convolution operation), respectively. Average pooling is implemented using a binary adder (as shown in Figure 9.3 (b)). The nonlinear activations are designed for binary inputs since the output of APC and the pooling result are binary. The output of the proposed neuron, on the other hand, must be a stochastic bit stream, which will be fed into the neurons in the subsequent layers. The proposed neurons using hyperbolic tangent, stochastic logistic and ReLU activations are as follows: SC-tanh: SC-Based Hyperbolic Tangent Neuron. The authors in [KKY + 16] adopted a saturated up/down counter to implement a binary hyperbolic tangent acti- vation function. Nevertheless, this activation function is designed for DBNs and cannot 144 be applied directly for DCNNs. Algorithm 6 presents the proposed SC-based hyperbolic tangent neuron design SC-tanh() for DCNNs, where step 5-9 correspond to the inner product calculation using XNOR gates and APCs, step 11 is the average pooling with a binary adder, and step 12-18 are the tanh activation part, which is implemented with a saturated counter. The connections between two layers are based on the filters that exe- cute convolution operations only for portions of the inputs (i.e., receptive fields), which greatly reduce the connections between consecutive layers. The counted binary num- ber generated by APC is taken by the saturated up/down counter which represents the amount of increase and decrease. By setting half of the states’ output to 1 and the rest to 0 (i.e., setting boundary state toS bound = e 2 ), this neuron design accurately imitates the hyperbolic tangent function. SC-logistic: Proposed SC-Based Logistic Neuron. 
There are two important dif- ferences between the hyperbolic tangent neuron and the logistic neuron: (i) unlike the Table 9.1: Naming Conventions in a Stochastic Computing Based Neuron m the length of bipolar bit stream q q-to-1 average pooling n input size: the number of input bit streams (or the number of input and weight pairs) in each convolution block x j i thei-th input bit of thej-th convolution block i2 [0;n 1],j2 [0;q 1] w j i thei-th weight bit of thej-th convolution block i2 [0;n 1],j2 [0;q 1] j the set of all the input bits of thej-th convolution block j =fx j i jj2 [0;q 1];8i2 [0;n 1]g j the set of all the weight bits of thej-th convolution block j =fw j i jj2 [0;q 1];8i2 [0;n 1]g t j the sum calculated by thej-th convolution block, which is alog 2 n-bit binary number e the number of states in the state machine z k thek-th stochastic output bit for the SC-based hyperbolic tangent neuron 145 Algorithm 6: Proposed SC-tanh (; ;q;e) input :; are the input bit streams input :q indicates theq-to-1 average pooling input :e is internal FSM state number output:z k is thek-th stochastic output bit for the SC-tanh neuron 1 Smax e 1 ; / * max state * / 2 S bound e=2 ; / * boundary state * / 3 S S bound ; / * current state * / 4 fork 1 tom do / * processing each convolution block * / 5 forj 1 toq do / * inner product calculation * / 6 fori 1 ton do 7 p j i =x j i w j i ; / * XNOR multiplication * / 8 t j = 2 P n i=0 p j i n ; / * APC addition * / / * average pooling and tanh activation * / 9 S S + P q j=1 t j q 10 ifS< 0 then / * saturated counter * / 11 S 0; 12 else ifS>Smax then 13 S Smax 14 ifS>S bound then / * output logic * / 15 z k 1 16 else 17 z k 0 above-mentioned hyperbolic tangent neuron, the output of which is in the range of [- 1,1], the logistic neuron always outputs non-negative number (i.e., within [0,1]), and (ii) they-value of the logistic’s midpoint is 0:5 instead of 0 in hyperbolic tangent. In order to tackle the 1-st difference, we introduce a history shift register arrayH[0 : 1] using shift registers to record the last bits of the output stochastic bit streams. A shadow counter is used to calculate the sum of the stochastic bits in the history shift register array, which is denoted by , i.e., = P 1 i=0 H[i]. Hence, the proposed SC- logistic keeps tracking the last output bits and predicts the sign of the current value based on the sum calculated by the shadow counter. To be more specific, if the sum is less than half of the maximum achievable sum < 2 , the current value is predicted to be negative. Otherwise, it is predicted as positive (note that value 0 is half 1s and half 0s in bipolar format). As any negative output raises an error, the proposed activation 146 mitigates such errors by outputting 1 (as compensation) whenever the predicted current value is negative. As for the 2-nd difference, we need to move they-value of the logistic’s midpoint to 0:5. The probability of 1s in a stochastic bit streamX that represents the valuex = 0:5 isP (X = 0:5) = 3 4 . Hence, we design the logic of activation such that 1 4 of the states output 0s, whereas the other 3 4 portion output 1s. This is realized by setting the boundary state toS bound = e 4 , wheree represents the internal FSM state number (in the saturated counter). Note that the boundary value in hyperbolic tangent is S bound = e 2 since the midpoint ofy-value of tanh is 0 and 0 is represented by a stochastic stream where half of the bits is 0 and the other half is 1 (i.e.,P (X = 0) = 1 2 ). 
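A behavioral Python sketch of this saturated-counter activation (illustrative only, mirroring the state-update and output logic described above; the variable names are ours) makes the role of the boundary state explicit:

    def saturating_activation_bit(state, apc_sums, e, s_bound):
        # state: current FSM state in [0, e-1]; apc_sums: the signed APC results
        # (2 * ones - n) for the q pooling inputs at this bit position.
        state += sum(apc_sums) // len(apc_sums)   # average pooling of the binary sums
        state = max(0, min(state, e - 1))         # saturated up/down counter
        bit = 1 if state > s_bound else 0         # output logic
        return state, bit

    # SC-tanh uses s_bound = e // 2 (tanh is centered at y = 0); SC-logistic uses
    # s_bound = e // 4, which shifts the midpoint to y = 0.5, since 0.5 in bipolar
    # encoding is a stream with three quarters of its bits equal to 1.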
Algorithm 7 provides the pseudo code of the proposed SC-logistic neuron. SC-ReLU: Proposed SC-Based ReLU Neuron. According to [LBH15], the ReLU activation f(x) = max(0;x) becomes the most popular activation function in 2015. We apply the same design concept as in the SC-tanh neuron with the following modi- fications: First, we introduce a history shift register arrayH[0 : 1] and its shadow counter to predict the sign of the current value, and compensate the negative value in the same way SC-logistic neuron does. Second, unlike the SC-logistic neuron that changes boundary state to S bound = e 4 such that the entire waveform is moved up to y = 0:5, the SC-ReLU is centered at (0,0) so the boundary state needs to be kept as S bound = e 2 . Noticing the similarities between the SC-logistic and SC-ReLU, we intro- duce a configuration bit and combine these two neurons, and Algorithm 7 provides the pseudo code of the proposed SC-logistic and SC-ReLU neuron. Note that in the proposed neurons, the FSM state number e is generated using a simple binary search algorithm under different input sizes to yield the highest precision. Unlike the software implementation that has limited allowable degree of parallelism and high coordination overheads, the proposed specific hardware based neurons can execute 147 Algorithm 7: Proposed SC-logistic/ReLU (; ;;;q;e) input :; are the input bit streams input : is the number of registers in the temporary array input : is the configuration bit. 1:SC-logistic, 0:SC-ReLU input :q indicates theq-to-1 average pooling input :e is internal FSM state number output:z k is thek-th stochastic output bit for the SC-logistic/ReLU neuron 1 Smax e 1 ; / * max state * / 2 if == 1 then / * boundary state configuration * / 3 S bound e=4 ; / * SC-logistic * / 4 else 5 S bound e=2 ; / * SC-ReLU * / 6 S S bound ; / * current state * / 7 r 1 ; / * r is an iterator * / 8 H[0 : 1] 0 ; / * initialize history array * / 9 0 ; / * initialize shadow counter * / 10 fork 1 tom do 11 if< 2 then 12 z k 1 ; / * negative value compensation * / 13 else 14 forj 1 toq do 15 fori 1 ton do 16 p j i =x j i w j i ; / * XNOR multiplication * / 17 t j = 2 P n i=0 p j i n ; / * APC addition * / 18 S S + P q j=1 t j q 19 ifS< 0 then / * saturated counter * / 20 S 0 21 else if S>Smax then 22 S Smax 23 ifS>S bound then / * output logic * / 24 z k 1 25 else 26 z k 0 27 whiler 1 do 28 H[r] H[r 1]; / * update the history array * / 29 r r 1 30 H[0] z k 31 P 1 r=0 H[r]; / * update the shadow counter * / 32 r 1 the commands in Algorithm 6 and 7 fully parallelly. Figure 9.4 (a), (b), and (c) show that the proposed SC-tanh, SC-logistic, and SC-ReLU neurons and their corresponding software results are almost identical to each other. Detailed error analysis is provided in Section 9.5. 148 9.5 Experimental Results We now present (i) performance evaluation of the proposed SC neurons, (ii) comparison with binary ASIC neurons, and (iii) DCNN performance evaluation and comparison. The neurons and DCNNs are synthesized in Synopsys Design Compiler with the 45nm Nangate Library [nan09] using Verilog. 9.5.1 Performance Evaluation and Comparison among the Pro- posed Neuron Designs For each neuron, the accuracy is dependent on (i) the bit stream lengthm, and (ii) the input size n. Longer bit stream length yields higher accuracy, and the precision can be leveraged by adjusting the bit stream length without hardware modification. 
On the other hand, the input sizen, which is determined by the DCNN topology, affects both the accuracy of the neuron as well as the hardware footprint. Figure 9.4: The result comparison between the proposed SC neuron (bit stream m = 1024) and the corresponding original software neuron: (a) SC-tanh vs Tanh, (b) SC- logistic vs Logistic, and (c) SC-ReLU vs ReLU. 149 Figure 9.5: Input size versus absolute inaccuracy under different bit stream lengths for (a) SC-tanh neuron, (b) SC-logistic neuron, and (c) SC-ReLU neuron. Thus, considering the aforementioned factors, we evaluate the inaccuracy of the proposed SC-tanh, SC-logistic and SC-ReLU neurons under a wide range of bit stream lengths and input sizes, as shown in Figure 9.5 (a), (b) and (c), respectively. The cor- responding hardware costs of the proposed SC-tanh, SC-logistic and SC-ReLU neurons are shown in Figure 9.6 (a), (b) and (c), respectively. One can observe that the pre- cisions of SC-logistic and SC-ReLU neurons consistently outperform SC-tanh neurons under different combinations of input size and bit stream length, whereas the area, power and energy of SC-tanh neurons are slightly lower than the other two neurons. Moreover, SC-logistic and SC-ReLU neurons have better scalability in terms of accuracy than the SC-tanh neuron. Note that the inaccuracy here is calculated by comparing with the soft- ware neuron results. A lower inaccuracy only indicates that the hardware neuron with this type of activation is closer to its software version, but this does not indicate that a DCNN implemented with this neuron can yield a lower test error. 9.5.2 Comparison with Binary ASIC Neurons We further compare the proposed SC-based neurons with the binary ASIC hardware neurons. The input size is set to 25 since most neurons in LeNet-5 DCNN are con- nected to 5 5 receptive fields. The binary nonlinear activations logistic and ReLU are implemented using LUTs, whereas binary ReLU is built with a comparator and a MUX. 150 Figure 9.6: Input size versus (a) area, (b) total power, and (c) total energy for the neuron designs using tanh, logistic and ReLU activation functions. Clearly, the number of bits in fixed-point numbers affects both the hardware cost and the accuracy. To make the comparison fair, we use the minimum fixed point (8 bit) that yields a DCNN network accuracy that is almost identical to the software DCNN (with < 0:0003 difference in network test error). Table 9.2 shows the neuron cell performance comparison between the proposed SC neurons and the 8 bit binary ASIC neurons. Com- pared with binary ASIC neurons, the proposed SC neurons achieve up to 201X, 149X, and 61X improvement in terms of power, energy, and area, respectively, indicating sig- nificant hardware savings. 9.5.3 DCNN Performance Evaluation and Comparison To evaluate the network performance, we construct the LeNet-5 DCNNs using the pro- posed SC neurons as well as the 8 bit binary neurons in a pipelined manner. LeNet 5 Table 9.2: Neuron Cell Performance Comparison with 8 Bit Fixed Point Binary Imple- mentation whenn = 25 andm = 1024 8 bit binary ASIC proposed SC neuron improvement tanh logistic ReLU tanh logistic ReLU tanh logistic ReLU power 34760 34760 32176 173 223 231 201X 156X 139X (W ) energy 128267 128267 94346 858 1130 1147 149X 114X 82X (fJ) area 42319 42319 35506 699 916 916 61X 46X 39X (m 2 ) 151 Table 9.3: Comparison among Software DCNN, Binary ASIC DCNN, and Various SC Based DCNN Designs Implementing LeNet 5 activation approach bit stream valid. 
error test error area (mm2) power (W ) energy (J) tanh SC 1024 1.74% 1.41% 12.5 3.1 15.8 512 1.57% 1.65% 7.9 256 1.71% 1.61% 3.9 128 1.84% 2.13% 2.0 64 2.37% 2.34% 1.0 binary - 1.42% 1.34% 769.3 470.0 2.0 CPU - 1.41% 1.34% 263 130.0 198200 GPU - 1.41% 1.34% 520 225.0 96443 logistic SC 1024 3.98% 4.49% 15.8 3.9 20.1 512 4.35% 4.70% 10.0 256 4.24% 4.34% 5.0 128 5.23% 5.58% 2.5 64 6.30% 6.06% 1.3 binary - 2.88% 3.01% 769.3 585.7 2.4 CPU - 2.87% 2.99% 263 130.0 198200 GPU - 2.87% 2.99% 520 225.0 96443 ReLU SC 1024 1.69% 1.69% 15.8 3.9 20.3 512 1.67% 1.69% 10.1 256 1.67% 1.63% 5.1 128 1.65% 1.67% 2.5 64 1.67% 1.63% 1.3 binary - 1.65% 1.65% 664.9 557.5 1.8 CPU - 1.64% 1.64% 263 130.0 198200 GPU - 1.64% 1.64% 520 225.0 96443 is a widely-used DCNN structure with a configuration of 784-11520-2880-3200-800- 500-10. The DCNNs are evaluated with the MNIST handwritten digit image dataset [Den12], which consists of 60,000 training data and 10,000 testing data. We apply the same training time in software so as to make a fair comparison among different activa- tions. Table 9.3 concludes the performance of DCNNs using CPU, GPU, binary neurons and the proposed SC neurons. The CPU approach uses two Intel Xeon W5580, whereas the GPU approach utilizes NVIDIA Tesla C2075. Note that the power for software is estimated using Thermal Design Power (TDP), and the energy is calculated by multiply- ing the run time and TDP. On the other hand, the power, energy, and area for hardware are calculated using the synthesized netlists with Synopsys Design Compiler. One can 152 observe that for each type of activation, the proposed SC based DCNNs have much smaller area and power consumption than the corresponding binary DCNNs, with up to 61X, 151X, 2X improvement in terms of area, power, and energy, respectively. Note that though the the binary ASIC has a competitive energy performance, it is an ideal pipelined structure. The extremely large area (> 600mm 2 ) and power (> 400W ) makes binary ASIC unpractical for implementation. Moreover, Table 9.3 shows that the pro- posed SC approach achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198200X and 96443X of the energy, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. Among different activations, with a long bit stream (m> 128), SC-tanh is the most accurate. Otherwise (m 128), SC-ReLU has the highest precision. SC-logistic has the lowest precision due to the following reasons: (i) logistic activation in software has the worst accuracy performance, and (ii) the imprecision of activation (if larger than a certain threshold) amplifies the inaccuracy in DCNNs. The bit stream length can be reduced to improve energy performance. One important observation is that the proposed SC-ReLU has better scalability than SC-tanh (i.e., with bit stream length decreasing, the accuracy degradation of SC-ReLU is slower than SC-tanh). Hence, ReLU activation is suggested for future SC based DCNNs considering its superior accuracy, area, and energy performance under a small bit stream length (e.g., m = 64). Note that the small bit stream length leads to significant improvement in terms of delay and energy performance. 9.6 Conclusion In this chapter we presented three novel SC neurons designs using tanh, logistic, and ReLU nonlinear activations. LeNet-5 DCNNs were constructed using the proposed 153 neurons. 
Experimental results on the MNIST dataset demonstrated that compared to the binary ASIC DCNN, the proposed SC based DCNNs were able to significantly reduce the area, power and energy footprint with a small accuracy degradation. ReLU was suggested for future SC based DCNNs implementations. 154 Chapter 10 Softmax Regression Design for Stochastic Computing Based Deep Convolutional Neural Networks Nowadays, Deep Convolutional Neural Network (DCNN) is the dominant approach for classification and detection tasks for images, video, speech as well as audio [LBH15]. DCNNs implement the backpropagation algorithm, which points out the parameters that should be updated. These parameters are used to compute the representation in each layer from the output of the previous layer. Clearly, the huge amount of computation power of DCNNs prevents their widespread applications in wearable and Internet of Things (IoT) devices [KKY + 16, LRL + 17a, XLNB17]. Compared to the studies using conventional binary arithmetic computing, Stochastic Computing (SC) is a fascinating solution to the above issues due to its superior perfor- mance in terms of area and power consumption as well as high tolerance to soft errors [LD16a, BC01a, LD16b]. SC represents a number by the probability of 1s in a ran- dom bit stream. Many complex arithmetic operations can be implemented with very simple hardware logic in the SC framework, which alleviates the extensive computation complexity [BC01a, LRL + 16]. On this account, a mass of research efforts have been put into designing neural networks using SC [KKY + 16, BC01a, LRL + 16, RLL + 17a]. Both of the recent designs [KKY + 16, RLL + 17a] successfully implement the SC-based neuron cells and the layerwise structure of neural networks. Nevertheless, there is no 155 existing design flow for Softmax Regression (SR) function after the fully-connected layer for DCNNs. SR is the generalization of logistic regression function when multiple categories need to be classified. It is one of the most significant part in deep learning networks due to the fact that it directly affects the final result. 10.1 Introduction In this chapter, we first propose a SC-based Softmax Regression function block. The design parameters are optimized in order to achieve best performance for accuracy. After that we conduct a comprehensive comparison between binary ASIC SR function and SC-based SR function under different input sizes and stochastic bit stream lengths. Moreover, we further construct and investigate the network performances between the conventional binary SR design and the proposed SC-based SR design in a practical DCNN. 10.2 DCNN Architecture and Softmax Regression Function 10.2.1 Deep Convolutional Neural Network A general DCNN architecture consists of a stack of convolutional layers, pooling layers, and fully connected layers. A convolutional layer is associated with a set of learnable filters (or kernels), and common patterns in local regions of inputs are extracted by con- volving this kind of filter over the inputs [LRL + 17b]. A feature map is built to store the convolution result. After that, a subsampling step is applied to aggregate statistics of these features in the pooling layer for the sake of reducing the dimensions of data and 156 alleviating over-fitting issues. In addition, a nonlinear activation function is applied here to generate the output of the layer [LYL + 17a]. 
After several convolutional and pooling layers, the high-level reasoning fully connected layer is applied in order to further aggre- gate the local information learned in previous layers. After that, a Softmax Regression function should be applied for classification. 10.2.2 Stochastic Computing (SC) Stochastic computing is a technology that represents a numeric value x 2 [0; 1] by counting the number of 1s in a bit stream, e.g., the value of a 4-bit sequence “0100” is x =P (X = 1) = 0:25. In addition to this unipolar format, another widely used format is bipolar format. In this coding scheme, a valuex2 [1; 1] is processed byP (X = 1) = x+1 2 . With SC, addition, multiplication, and division can be implemented using significantly smaller circuits, compared to the conventional binary arithmetic [BC01a] as shown in Figure 1 (a), (b) and (c). To be more specific, multiplications are executed using XNOR gates in bipolar for- mat. The stochastic number of C is calculated as c = 2P (C) 1 = 4P (A)P (B) 2P (A) 2P (B) + 1 = (2P (A) 1)(2P (B) 1) = ab. Multiplexers (MUXes) are used to processed addition in SC [BC01a]. In order to achieve a better accuracy with little deficit in terms of power, area and energy, we adopt the Approximate Parallel Counter (APC) proposed in [KLC15] Division can only be accomplished in an approximate form in the stochastic number representation schemes [BC01a]. Given input X and Y , outputQ = Y X is represented as Q =(XQY ) (10.1) 157 where is a positive parameter which controls the rate change for the counter state. A SC-based unipolar division circuit is implemented by adopting the gradient descent technique with a saturated counter as an integrator. Division is implemented by incre- menting the counter when Y is 1 and decrementing the counter when both X and Q are 1s. 10.2.3 Softmax Regression (SR) Function Softmax Regression (SR) is a generalization of logistic regression for the sake of clas- sifying multiple mutually exclusive classes. SR is placed after fully connected layer in order to assign probabilities to an object being one of several different things. SR is composed of two parts, i.e., summation and softmax. Summation is used to add up the Figure 10.1: Stochastic computing for neuron design: (a) XNOR gate for bipolar multi- plication, (b) binary adder, and (c) unipolar division. 158 pixel intensities. It is quite similar to normal neuron cell operation except the activation function. Given inputx, outputZ in classi is calculated as Z i = X j W i;j x j +b i (10.2) whereW i;j is the weight andb i stands for extra parameter called bias of classi. These parameters are adjusted during the backpropagation process. The subsequent softmax step acts like an activation function which changes the linear function into different nonlinear shapes. In this scenario, the summation result is shaping into a probability distribution function over different classes. Given inputx, outputP for classi is defined as P i = exp(x i ) P j exp(x j ) (10.3) The exponential function means little increase in input x i will result in dramatically growth in resultexp(x i ) and indeed increase the probability in classi. This enables SR to distinguish among different categories and select the most similar result. 10.3 SC-Softmax Regression Design 10.3.1 Overall Structure The proposed structure of SC-SR is shown in Figure 10.2, which is composed of SC- exponential, SC-normalization, SC-comparator and counter blocks. Bipolar encoding scheme is employed. 
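For reference, the softmax regression of Eqs. (10.2)-(10.3) that these blocks jointly approximate can be written in a few lines of Python (a plain software baseline, not the stochastic implementation; the max-subtraction for numerical stability is a standard software trick and not part of the hardware design):

    import math

    def softmax_regression(x, W, b):
        # z_i = sum_j W[i][j] * x[j] + b[i]   (Eq. 10.2), followed by
        # P_i = exp(z_i) / sum_j exp(z_j)     (Eq. 10.3).
        z = [sum(wij * xj for wij, xj in zip(Wi, x)) + bi for Wi, bi in zip(W, b)]
        zmax = max(z)                          # subtract max for numerical stability
        exps = [math.exp(zi - zmax) for zi in z]
        total = sum(exps)
        return [e / total for e in exps]

    # The predicted class is the index of the largest probability, which is the
    # value the comparator stage of the SC-SR structure ultimately selects.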
The proposed SC-SR design adopt XNOR gates and APCs for addition and multiplication (same as convolution), respectively. Note that the outputs of APCs are binary. In addition, SC-exponential accomplishes a binary input to unipo- lar stochastic bit stream conversion. After that, the SC-normalization step converts the 159 w1 w2 w3 wn x1 x2 x3 xn ... Σ Inner Product Exponential SC-Exponential SC- Normalization Counter Unipolar Bipolar Binary w1 w2 w3 wn x1 x2 x3 xn ... Σ Inner Product Exponential SC-Exponential w1 w2 w3 wn x1 x2 x3 xn ... Σ Inner Product Exponential SC-Exponential SC- Normalization Counter Bipolar Binary SC- Normalization Counter Bipolar Binary Unipolar Unipolar Number of rows depend on number of classes Comparator Number of inputs depend on the number of output neurons in Fully-Connected layer. For different rows, the inputs are the same but the weights are unique. Figure 10.2: Structure for SC based Softmax Regression function. unipolar output from exponential block to bipolar stochastic bit stream. For the conve- nience of discussions, we follow the naming conventions in Table I. 10.3.2 SC-exponential The author in [KKY + 16] use a saturated up/down counter to implement a binary hyper- bolic tangent function. We adopt the saturated counter idea and implant in our design. Algorithm 1 presents the proposed SC-exponential function where step 8-11 corre- spond to convolution using XNOR gates and APCs and step 12-16 are the adopted 160 Table 10.1: Naming Conventions in a SC based SR m the length of bit stream q number of classes n input size: the number of input bit streams (or the number of input and weight pairs) in each convolution block x j i thei-th input bit of thej-th convolution block i2 [0;n 1],j2 [0;q 1] w j i thei-th weight bit of thej-th convolution block i2 [0;n 1],j2 [0;q 1] t j the sum calculated by thej-th convolution block, which is alog 2 n-bit binary number Figure 10.3: Input size versus (a) total power, (b) area, and (c) total energy for the proposed SC-SR. saturated counter. The counted binary number generated by APC is taken by the sat- urated up/down counter which represents the amount of increase and decrease. Besides, we also use a history shift register H[0:-1] in order to count the last bits of output stochastic bit stream. By using the sum of the bits, which is denoted by , we can predict the next stochastic output bit. To be more specific, if the sum of last bits falls in range of [0:3, 0:4], the output is predicted as 0 because the possibility of this value being large is small. On the other hand, if it falls into [0:6, 0:7], this output is pre- dicted to be a 1. Another reason for this is that the normalization block in our proposed design is created by adopting the unipolar division method in [BC01a]. Note that the outputs of our SC-exponential block are in unipolar encoding scheme. 
161 Algorithm 8: Designated Softmax Regression SC-Exponential (m;n;x j i ;w j i ;;e) input :q indicates number of classes input : is the number of registers in the temporary array input :e is internal FSM state number output:z k is thek-th stochastic output bit for the SC-exponential 1 Smax e 1 ; / * max state * / 2 S e 2 ; / * current state * / 3 r 1 ; / * r is an iterator * / 4 H[0 : 1] 0 ; / * initialize history array * / 5 2 ; / * initialize shadow counter * / 6 fork 1 tom do 7 fori 1 ton do 8 p j i =x j i w j i ; / * XNOR multiplication * / 9 t j = 2 P n i=0 p j i n ; / * APC addition * / 10 S S +t j 11 ifS< 0 then / * saturated counter * / 12 S 0 13 else if S>Smax then 14 S Smax 15 ifS< n 4 then / * output logic * / 16 if 0:4>> 0:3 then 17 z k 0 18 else ifS> S 2 then 19 z k 1 20 else 21 z k 0 22 else 23 if 0:7>> 0:6 then 24 z k 1 25 else ifS> S 2 then 26 z k 1 27 else 28 z k 0 29 whiler 1 do 30 H[r] H[r 1]; / * update the history array * / 31 r r 1 32 H[0] z k 33 P 1 r=0 H[r]; / * update the shadow counter * / 34 r 1 10.3.3 SC-normalization Since the output from previous SC-exponential block is in unipolar encoding format. In Algorithm 2, We adopt the unipolar division circuit from [BC01a]. Step5-17 correspond to the summation of all the previous outputs and determine the divisor value by using an AND gate. We create designated SC-exponential and SC-normalization for each class. Therefore, the dividend is just the output of previous SC-exponential block for 162 Table 10.2: Network Accuracy input size bit stream validation(%) test(%) software 256 - 99.03 99.10 512 - 98.96 99.04 binary 256 - 99.04 99.10 512 - 98.96 99.07 SC 256 16 59.04 59.75 32 78.26 77.23 64 98.80 98.91 256 99.02 99.08 1024 99.00 99.14 512 16 19.72 19.53 32 98.85 99.00 64 98.04 98.29 256 98.93 99.09 1024 98.78 99.06 corresponding class. We have 4 different conditions to control the amount of increase and decrease in the implanted saturated counter as shown in step19-30. Since the output is a non-negative bipolar stochastic number, a history shift register H[0:-1] is also implemented in this algorithm in order to eliminate negative value. To be more specific, if the sum is less than 2 , the current value is predicted to be negative. Since each negative value will raise an error, we mitigate such error by outputting a 1 instead of a 0. Figure 10.4: Input size versus absolute inaccuracy under different bit stream lengths for SC-SR. 
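Before the full listing, a behavioral sketch of the counter-based unipolar divider adopted in the SC-normalization step (increment when Y is 1, decrement when both X and the current output Q are 1) may be helpful; the counter width and the way output bits are drawn from the counter state are illustrative assumptions of this sketch:

    import random

    def sc_divide(y_prob, x_prob, m=4096, counter_bits=8):
        # Behavioral model of the unipolar stochastic divider: the counter value,
        # scaled to [0, 1], settles where P(X) * P(Q) = P(Y), i.e. Q = Y / X.
        max_count = 2 ** counter_bits - 1
        count = max_count // 2
        q_ones = 0
        for _ in range(m):
            x = random.random() < x_prob
            y = random.random() < y_prob
            q = random.random() < count / max_count   # output bit from counter state
            q_ones += q
            if y:
                count = min(count + 1, max_count)     # increment when Y is 1
            if x and q:
                count = max(count - 1, 0)             # decrement when X and Q are 1
        return q_ones / m

    # sc_divide(0.3, 0.6) fluctuates around 0.5, i.e. 0.3 / 0.6 (valid for Y <= X).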
163 Algorithm 9: Designated Softmax Regression SC-Normalization (m;x j ;;e) input :f is rate change for the FSM input :x j is the input bit of thej-th exponential block input :e is number of different classes output:z j k is thek-th stochastic output bit of classj for the SC-normalization 1 Algorithm 1 step 1-4 2 0 ; / * initialize shadow counter * / 3 fork 1 tom do 4 p = P e j=1 x j ; / * summation of output bits from exponential blocks * / 5 ifp> e 2 then / * nomralize summation result * / 6 h 1 7 else 8 h 0 9 ifS> e 2 then / * Divisor * / 10 X h 11 else 12 X 0 13 Y =x j ; / * Dividend * / 14 ifX == 1&&Y == 1 then / * Next state logic * / 15 S =Sp +fe 16 ifX == 0&&Y == 1 then 17 S =S +fe 18 ifX == 1&&Y == 0 then 19 S =Sp 20 ifX == 0&&Y == 0 then 21 S =S 22 ifS< 0 then / * saturated counter * / 23 S 0 24 else ifS>Smax then 25 S Smax 26 if< 2 then / * compensate for negative value * / 27 z j k 1 28 else 29 if 0:7>> 0:6 then / * output logic * / 30 z j k 1 31 else ifS> e 2 then 32 z j k 1 33 else 34 z j k 0 35 Algorithm 1 step 35-41 Table 10.3: Performance Comparison with 8 Bit Fixed Point Binary Design whenn = 800 andq = 10 SC-800bits binary 800bits improvement dynamic power(uW) 10981 3503800 319X leakage power(uW) 1078 58986 55X total power(uW) 12058 3562100 295X area(um2) 50083 3094968 62X delay(ns) 5.05 44.74 8.8X energy(pJ) 61 159368 2617X 164 10.4 Experimental Results 10.4.1 Performance analysis for SC-SR For SC-SR, the accuracy depends on the bit stream lengthm and input sizen. Hence, in order to consider the aforementioned factors, we create and analyze the inaccuracy of SC-SR using LeNet-5 [LBBH98] classification method based on the wide range of input size and bit stream length as shown in Figure 4. The corresponding hardware costs of proposed SC-SR are shown in Figure 3 (a), (b) and (c). The SRs and DCNNs are synthesized in Synopsys Design Compiler with the 45nm Nangate Library [nan09] using Verilog. Note that the inaccuracy here is calculated by comparing with the soft- ware results. It is obviously that SC-SR will be less accurate provided that the input size increases and it will be precision if the bit stream length increases. we further test the proposed design under different number of classes using AlexNet [KSH12] classifica- tion method, the classification accuracies for different input size and bit stream length are all 100 % which means there is no accuracy degradation. 10.4.2 Comparison with Binary ASIC SR We further compare the performance of the proposed SC-based Softmax Regresion block with the binary ASIC hardware SR. The input is set to 800 and the number of classes is set to 10. The binary exponential function is built using LUTs, whereas the normalization block is built using divider. Clearly, the number of bits in fixed-point numbers affect both the hardware cost and accuracy. To make the comparison fair, we adopt minimum fixed point (8 bit) that yields a DCNN network accuracy that is almost identical to the software DCNN (with< 0.0003 difference in network test error). Table 10.3 shows the performance comparison between binary SR and SC-SR. Compared with 165 binary ASIC SR, the proposed SC-SR achieves up to 295X, 62X, and 2617X improve- ment in terms of power, energy and area, respectively, indicating significant hardware savings. 10.4.3 DCNN Accuracy Evaluation To evaluate the network accuracy, we construct a LeNet-5 DCNN, which is a widely- used DCNN structure, by replacing the software Softmax function with the proposed SC-SR as well as binary SR. 
We evaluate two SR configurations, i.e., the LeNet-5 with configurations of 784-11520-2880-3200-800-256-10 and 784-11520-2880-3200-800-512-10. The DCNNs are evaluated using the MNIST handwritten digit image dataset [Den12]. We apply the same amount of software training time to these two DCNN architectures. Table 10.2 summarizes the accuracy of DCNNs using SC-SR and binary SR. One can observe that with a long input bit stream (m >= 64), SC-SR reaches the same precision level as binary SR and software SR. As discussed above, compared to binary SR, SC-SR has better performance in terms of power, area, and energy. Hence, binary SR is suggested for future DCNNs when only a short bit stream is available (e.g., m < 64), whereas SC-SR is recommended for DCNNs that can afford long bit streams.

10.5 Conclusion

In this chapter, we present a novel SC-based Softmax Regression function design. We test the proposed SC-SR under different input sizes, bit stream lengths, and numbers of output classes. In addition, we integrate the proposed design into a LeNet-5 DCNN. Experimental results on the MNIST dataset demonstrate that, compared to the binary SR, the proposed SC-SR with long input bit streams significantly reduces the area, power, and energy footprint with nearly no accuracy degradation.

Chapter 11
Normalization and Dropout for Stochastic Computing-Based Deep Convolutional Neural Networks

Deep Convolutional Neural Networks (DCNNs) have recently achieved unprecedented success in various applications, such as image recognition [KSH12], natural language processing [HLLC14], video recognition [SZ14a], and speech processing [SMKR13]. As DCNNs have broken several long-standing records on popular datasets, they are recognized as the dominant approach for almost all pattern detection and classification tasks [LBH15]. With the fast advancement and widespread deployment of the Internet of Things (IoT) and wearable devices [XLNB17], implementing DCNNs in embedded and portable systems is becoming increasingly attractive. As large-scale DCNNs may use millions of neurons, their intensive computation inhibits their deployment from cloud clusters to local platforms. To resolve this issue, numerous hardware-based DCNNs that use General-Purpose Graphics Processing Units (GPGPUs) [JSD+14, BBB+11], FPGAs [ZLS+15], and ASICs [CLL+14, CKES16] have been proposed to accelerate deep learning systems with large power, energy, and area reductions compared with software. Nevertheless, novel computing paradigms are required to make DCNNs compact enough for light-weight IoT and wearable devices with stringent power requirements.

11.1 Introduction

The recent advancements [LRL+17a, LYL+17a, LRL+16, RLL+17a, LRL+17b, YLL+17] demonstrate that Stochastic Computing (SC), as a low-cost and soft-error-resilient alternative to conventional binary computing [JRML15, LD17, KKY+16, LD16b], can radically simplify the hardware footprint of arithmetic units in DCNNs and has the potential to satisfy the stringent power requirements of embedded devices. Stochastic designs for fundamental operations (i.e., inner product, pooling, activation, and softmax regression) of DCNNs have been proposed in [LRL+17a, LYL+17a, LRL+16, LRL+17b, YLL+17], and medium-scale DCNNs have been implemented in the SC regime [RLL+17a].
Despite the power and area efficiency achieved by the existing SC approaches compared with the conventional binary approach, no prior work has investigated two software techniques that are essential for large-scale DCNNs: (i) Local Response Normalization (LRN) and (ii) dropout. LRN is used to form local maxima and increase sensory perception [KSH12], which is vital to state-of-the-art DCNNs, e.g., a 1.2-1.4% accuracy improvement was reported in AlexNet [KSH12]. Dropout mitigates the overfitting problem, which causes poor DCNN performance on held-out test data when the network is trained with a small set of training data [HSK+12, SHK+14]. Without a careful design of LRN and integration of dropout in the SC domain, the current SC frameworks developed in [KKY+16, LRL+17a, RLL+17a] cannot be extended to more advanced and larger-scale DCNNs like AlexNet [KSH12] without performance degradation.

In this chapter, we integrate the LRN layer and the dropout technique into the existing SC-based DCNN frameworks [LYL+17b]. Unlike the previous studies [KKY+16, LRL+17a, RLL+17a], we consider max pooling and Rectified Linear Units (ReLU) in DCNNs. This is because max pooling and ReLU are more commonly applied in state-of-the-art DCNNs and deliver better performance than average pooling and hyperbolic tangent activation, respectively. The basic building block in software DCNNs is the Feature Extraction Block (FEB), which extracts high-level features from the raw inputs or from previous low-level abstractions. Accordingly, we present an SC-based FEB design with an Approximate Parallel Counter (APC)-based inner product unit, hardware-oriented max pooling, and SC-based ReLU activation. Due to its inherent stochastic nature, the SC-based FEB exhibits a certain amount of imprecision. Hence, the entire FEB is carefully optimized to achieve sufficient accuracy. In addition, we propose an optimized SC-based LRN design that is composed of division, activation, and square-and-summation units. Finally, dropout is applied in the software training phase and the learned weights are adjusted for the hardware implementation.

The contributions of this work are threefold. First, we present the SC-based FEB design with max pooling and ReLU, which are widely used in most state-of-the-art DCNNs. The near-max pooling proposed in [RLL+17a] is improved to achieve better performance. Second, we are the first to propose a stochastic LRN design for SC-based DCNNs. Third, this is the first work to integrate the dropout technique into the existing SC-based DCNN frameworks, in order to reduce the training time and mitigate the overfitting issue. Experimental results on AlexNet with the ImageNet dataset show that the SC-based DCNN with the proposed LRN and dropout techniques achieves a 3.26% top-1 accuracy improvement and a 3.05% top-5 accuracy improvement compared with the SC-based DCNN without these two essential techniques, demonstrating the effectiveness of the proposed normalization and dropout designs.

Figure 11.1: APC-based inner product.

11.2 Proposed Stochastic Computing-Based Feature Extraction Block

The FEB considered in this work is composed of inner product, max pooling, and ReLU activation units, which are implemented by an APC-based inner product, hardware-oriented max pooling, and SC-based ReLU activation, respectively.
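Before detailing the stochastic blocks, the following minimal Python sketch (an illustration with our own function names, not part of the hardware design) spells out the software computation that one FEB performs: inner products of inputs and weights, a small max-pooling window, and ReLU. The SC-based units in the rest of this section approximate exactly this pipeline with bit-stream arithmetic.

    import numpy as np

    def feb_reference(x, w):
        """Software reference for one Feature Extraction Block (FEB):
        inner products -> 4-to-1 max pooling -> ReLU.
        x, w: arrays of shape (4, n) holding the inputs and weights of the
        four neurons feeding one 2x2 pooling window (for the 9-to-1 FEB,
        ReLU would be applied before the pooling instead)."""
        inner_products = np.sum(x * w, axis=1)  # one inner product per neuron
        pooled = np.max(inner_products)         # 4-to-1 max pooling
        return max(0.0, pooled)                 # ReLU activation

    # Example with 16 inputs per neuron.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(4, 16))
    w = rng.uniform(-1, 1, size=(4, 16))
    print(feb_reference(x, w))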
11.2.1 APC-Based Inner Product

Figure 11.1 illustrates the APC-based hardware inner product design, where the multiplication is calculated using XNOR gates and the addition is performed by an APC. We denote the number of bipolar inputs and the stochastic stream length by n and m, respectively. Accordingly, n XNOR gates are used to generate the n products of the inputs (x_i's) and weights (w_i's), and then the APC accumulates the number of 1s in each column of the products. Note that the output of the APC is a binary number that is more than one bit wide.

For a basic FEB design using an APC for the inner product, a MUX for average pooling, and the Btanh activation proposed in [KKY+16], the accuracy, area, power, and energy with respect to the input size are shown in Figure 11.2 (a), (b), (c), and (d), respectively, under a fixed bit stream length of 1024.

Figure 11.2: Using a fixed bit stream length of 1024, the number of inputs versus (a) accuracy, (b) area, (c) power, and (d) energy for an FEB using an APC for the inner product, a MUX for average pooling, and the Btanh activation proposed in [KKY+16].

As illustrated in Figure 11.2 (a), a very slow accuracy degradation is observed as the input size increases. However, the area, power, and energy of the entire FEB increase nearly linearly as the input size grows, as shown in Figure 11.2 (b), (c), and (d), respectively. The reason is as follows. With the efficient implementation of the Btanh() function, the hardware cost of Btanh() grows only logarithmically with the input size, since the input width of Btanh() is log2(n). On the other hand, the number of XNOR gates and the size of the APC grow linearly with the input size. Hence, the inner product part, i.e., the XNOR array and the APC, is dominant in an APC-based neuron, and the area, power, and energy of the entire APC-based neuron cell increase at the same rate as the inner product part when the input size increases.

Since the length of the stochastic bit stream also affects the accuracy, we investigate the accuracy of FEBs using different stream lengths under different input sizes. As shown in Figure 11.3, a longer bit stream consistently outperforms a shorter one in terms of accuracy for FEBs with different input sizes. However, designers should consider the latency and energy overhead caused by long bit streams.

Figure 11.3: The length of the bit stream versus accuracy under different input sizes for an FEB using an APC inner product, MUX-based average pooling, and Btanh activation.

As mentioned in Section 2.2.3, a MUX can perform addition as well. Table 11.1 compares FEBs using the APC-based inner product and the MUX-based inner product under a fixed bit stream length of 1024 with different input sizes. Clearly, the APC-based inner product is more accurate and more power efficient, but has larger area and latency than the MUX-based inner product.

Table 11.1: Comparison between the APC-based FEB and the MUX-based FEB, both using a MUX for average pooling and Btanh activation, under a 1024-bit stream

                   FEB with APC Inner Product | FEB with MUX Inner Product | Ratio of APC/MUX (%)
Input size            16      32      64      |    16      32      64      |   16      32      64
Absolute error       0.15    0.16    0.17     |   0.29    0.56    0.91     |  51.94   27.56   18.34
Area (um^2)         209.9   417.6   543.2     |  110.7   175.3   279.8     | 189.7   238.2   194.1
Power (uW)           80.7    95.9   130.5     |  206.5   242.9   271.2     |  39.1    39.5    48.1
Energy (fJ)         177.4   383.7   548.1     |  110.0   169.1   238.9     | 161.3   226.9   229.5
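The arithmetic behind the XNOR/APC combination can be checked with a short bit-level simulation. The Python sketch below is our own illustration (the names and the stream generator are assumptions, not the synthesized design): it encodes bipolar values as Bernoulli bit streams, multiplies them with XNOR, sums each column of product bits as an APC would, and decodes the result into an inner-product estimate.

    import numpy as np

    rng = np.random.default_rng(1)

    def to_bipolar_stream(value, m):
        """Encode value in [-1, 1] as an m-bit stream with P(bit = 1) = (value + 1) / 2."""
        return (rng.random(m) < (value + 1) / 2).astype(np.uint8)

    n, m = 16, 1024                        # number of inputs and bit stream length
    x = rng.uniform(-1, 1, n)
    w = rng.uniform(-1, 1, n)
    xs = np.stack([to_bipolar_stream(v, m) for v in x])   # n x m bit matrix
    ws = np.stack([to_bipolar_stream(v, m) for v in w])

    products = 1 - (xs ^ ws)               # XNOR performs bipolar multiplication bit by bit
    apc_out = products.sum(axis=0)         # APC: count of 1s per column (a log2(n)+1 bit number)

    # Each column count c estimates (sum_i x_i*w_i + n) / 2, so decode with 2*c - n.
    estimate = 2 * apc_out.mean() - n
    print(round(estimate, 3), round(float(np.dot(x, w)), 3))  # agree up to SC noise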
Since large-scale DCNNs contain many FEBs with more than 64 inputs, the large absolute error of FEBs using the MUX-based inner product would cause significant network performance degradation. Therefore, in this work, we only consider the APC for the inner product calculation.

11.2.2 Pooling Design

Average pooling calculates the mean of a small matrix and can be implemented efficiently by a stack of MUXes. For four bit streams representing the pixels of a 2x2 region in a feature map, we can use a 4-to-1 MUX to calculate the mean of the four bit streams, as shown in Figure 11.4 (a). Despite the simple hardware implementation of average pooling, we consider max pooling in this work, which is adopted in most state-of-the-art DCNNs due to its better performance in practice. The authors in [RLL+17a] proposed a hardware-oriented max pooling design, where the largest bit stream in the most recent segment is selected as the near-max output, as shown in Figure 11.4 (b). Different from the hardware-oriented near-max pooling design in [RLL+17a], we select the maximum among the current input bits as the output instead of predicting the maximum based on the most recent bits. Table 11.2 reports the precision of the improved hardware-oriented near-max pooling for representative pooling units in DCNNs, with up to 0.110 absolute error reduction compared with [RLL+17a]. For a LeNet-5 [LJB+95] DCNN on the MNIST dataset with a 1024-bit stream, the network accuracy is improved by 0.11% using the improved hardware-oriented near-max pooling compared with the DCNN using the max pooling in [RLL+17a].

Figure 11.4: Pooling design in SC: (a) average pooling and (b) near-max pooling.

Table 11.2: Precision of the improved max pooling for an FEB with 16-bit input size under a 1024-bit stream

DCNN                Pooling   Segment Size   Absolute Error   Absolute Error Reduction over [RLL+17a]
LeNet-5 [LJB+95]    4-to-1    16             0.3126           0.110
                              32             0.2609
AlexNet [KSH12]     9-to-1    16             0.2143           0.095
                              32             0.2341

Figure 11.5: Stochastic square circuit using a DFF and an XNOR gate.

11.2.3 Activation Design

The ReLU activation f(x) = max(0, x) has become the most popular activation function in state-of-the-art DCNNs [LBH15]. This necessitates the design of an SC-based ReLU block in order to accommodate the SC technique in state-of-the-art large-scale DCNNs, such as AlexNet for ImageNet applications. In this work, we adopt the SC-ReLU activation developed in [LYL+17a]. Unlike tanh, which generates output in the range [-1, 1], ReLU always outputs a non-negative number within [0, 1]. Accordingly, we use a shift register array to record the most recent bits of the output stochastic bit stream and a counter to calculate their sum. Hence, the SC-ReLU keeps tracking the last output bits and predicts the sign of the current value based on the sum calculated by the counter. To be more specific, if the sum is less than half of the maximum achievable sum, the current value is predicted to be negative; otherwise, it is predicted to be positive (note that the value 0 is half 1s and half 0s in the bipolar format). As the output of ReLU cannot be negative, the SC-ReLU activation mitigates such errors by outputting a 1 (as compensation) whenever the predicted current value is negative.
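The sign-prediction-and-compensation behavior of the SC-ReLU described above can be modeled in a few lines. The Python sketch below is our own illustrative model, not the circuit from [LYL+17a]: the history length and the pass-through behavior for predicted-positive values are assumptions made for the sketch.

    def sc_relu(input_bits, history_len=8):
        """Illustrative model of the SC-ReLU activation.
        A shift register keeps the last history_len output bits; if their sum
        predicts a negative bipolar value (fewer than half 1s), a 1 is emitted
        as compensation, otherwise the incoming bit is passed through."""
        history = [1] * history_len       # assume a non-negative starting estimate
        out = []
        for b in input_bits:
            if sum(history) < history_len // 2:
                z = 1                     # predicted negative: compensate with a 1
            else:
                z = b                     # predicted non-negative: keep the computed bit
            out.append(z)
            history = [z] + history[:-1]  # shift register update
        return out

    # Example: a stream whose bipolar value is negative is pushed back toward 0.
    negative_stream = [0, 0, 0, 0, 1, 0, 0, 0] * 4
    out = sc_relu(negative_stream)
    # The output fraction of 1s rises from 0.125 toward 0.5 (bipolar value 0).
    print(sum(out) / len(out), sum(negative_stream) / len(negative_stream))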
Figure 11.6: The overall stochastic normalization design. SC-square units and an adder collect the n adjacent neuron outputs, the normalization activation function (parameterized by k, alpha, and beta) produces the divisor, and the SC-division divides the neuron output (the dividend) by it to obtain the normalized bipolar stochastic output.

11.3 Proposed Normalization and Dropout for SC-based DCNNs

11.3.1 Proposed Stochastic Normalization Design

The overall stochastic normalization design is shown in Figure 11.6. The structure follows the Local Response Normalization (LRN) equation presented in [KSH12]. Let a^i_{x,y} denote the neuron computation result after applying kernel i at position (x, y) and the ReLU nonlinearity; the output b^i_{x,y} is calculated as

b^i_{x,y} = \frac{a^i_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^j_{x,y}\right)^2\right)^{\beta}}    (11.1)

where the summation runs over the outputs of the n neurons adjacent to a^i_{x,y} at the same position and N is the total number of neurons in this layer. k, n, \alpha, and \beta are parameters that affect the overall accuracy of the network. The values used in AlexNet, determined on its validation set, are k = 2, n = 5, \alpha = 10^{-4}, and \beta = 0.75 [KSH12].

The complex relationship in Eqn. (11.1) is decoupled into three basic operations: (i) square and summation, which calculates k + \alpha \sum_{j} (a^j_{x,y})^2, (ii) activation, which performs the (\cdot)^{\beta} operation, and (iii) division for the final output. Accordingly, the hardware structure of the stochastic normalization design is separated into three units for the above-mentioned operations: (i) Square and Summation, (ii) Activation, and (iii) Division.

Square and Summation

As shown in Figure 11.5, the stochastic square circuit consists of a D flip-flop (DFF) and an XNOR gate. Squaring a signal in the stochastic domain is similar to multiplying two signals together. As mentioned in Section 2.2.3, multiplication is performed by an XNOR gate. However, squaring a signal using an XNOR gate alone would always produce a 1, since the two input signals would be fully correlated. To avoid this, a DFF is inserted to make one of the inputs arrive one clock cycle late [BC01b]. After the one-cycle delay, the two inputs become uncorrelated, so the XNOR gate can perform the multiplication. For the summation part, we use the APC mentioned in Section 2.2.3 to add up all the squared terms. Note that the output of this adder is in binary format.

Activation Function for Normalization

An FSM can be used to build a stochastic approximation of an activation function. One challenge is that the maximum value a bipolar stochastic number can reach is 1, whereas the denominator of Eqn. (11.1) can easily exceed 1 in software normalization, e.g., k > 1 alone already makes it larger than 1. To resolve this issue, we reshape the ReLU function (introduced in Section 11.2.3) to imitate (k + x)^{\beta}. More specifically, we change the slope and intercept of the SC-ReLU activation to make its shape close to (k + x)^{\beta} by re-configuring the hardware components of the ReLU. During this process, the input range is set to x \in [0, 1] and the output is limited to y \in [0, 1]. Since x \in [0, 1], we set \alpha to a constant 1. The imprecision of the activation operation can be compensated by jointly optimizing the parameters of the activation unit and the following division unit, so that the final normalization result is accurate.

Division

Algorithm 10 outlines the proposed SC-division design, where steps 7-11 correspond to an AND operation between the divisor input and the previous stochastic output. Only when both the divisor input and the previous feedback output bit are 1 is the saturated counter allowed to decrease.
We have three different conditions to control the increment and decrement of the saturated counter, as shown in steps 13-21. Besides, we also use a history shift register H[0 : \tau-1] to count the last \tau bits of the output stochastic bit stream. Using the sum of these bits, denoted by \sigma, we can predict the next stochastic output bit. Since the output is a non-negative bipolar stochastic number, we need to eliminate all possible negative values. To be more specific, if the sum is less than \tau/2, the current value is predicted to be negative. Since each negative value raises an error, we mitigate such errors by outputting a 1 instead of a 0.

Algorithm 10: Designated Division Algorithm for Normalization (m, x_j, y_j, \tau, e)
  input:  m is the bit stream length
  input:  f is the rate change for the FSM
  input:  x_j is the input divisor bit of the j-th division block
  input:  y_j is the input dividend bit of the j-th division block
  input:  \tau is the number of registers in the temporary array
  input:  e is the number of adjacent neurons
  output: z^j_k is the k-th stochastic output bit of the SC-division
  1   Smax <- e - 1                        /* max state */
  2   S <- e/2                             /* current state */
  3   r <- \tau - 1                        /* r is an iterator */
  4   H[0 : \tau-1] <- 0                   /* initialize history array */
  5   \sigma <- 0                          /* initialize shadow counter */
  6   for k <- 1 to m do
  7       if S > e/2 then                  /* Divisor */
  8           X <- x_j
  9       else
  10          X <- 0
  11      Y <- y_j                         /* Dividend */
  12      if X == 1 then                   /* Next state logic */
  13          S <- S - f
  14      if X == 0 && Y == 1 then
  15          S <- S + 2f
  16      if X == 0 && Y == 0 then
  17          S <- S
  18      if S < 0 then                    /* saturated counter */
  19          S <- 0
  20      else if S > Smax then
  21          S <- Smax
  22      if \sigma < \tau/2 then          /* compensate for negative value */
  23          z^j_k <- 1
  24      else
  25          if S > e/2 then              /* output logic */
  26              z^j_k <- 1
  27          else
  28              z^j_k <- 0
  29      while r >= 1 do
  30          H[r] <- H[r-1]               /* update the history array */
  31          r <- r - 1
  32      H[0] <- z^j_k
  33      \sigma <- sum_{r=0}^{\tau-1} H[r]   /* update the shadow counter */
  34      r <- \tau - 1

11.3.2 Integrating Dropout into SC-Based DCNNs

A DCNN can automatically learn complex functions directly from raw data by extracting representations at multiple levels of abstraction [LBH15]. The deep architecture, together with advanced training algorithms, has significantly enhanced the self-learning capability of DCNNs. However, the learning process that determines the parameters of a DCNN is computationally intensive due to the large number of layers and parameters. The dropout technique was proposed to randomly omit feature detectors with a certain probability p (usually p = 0.5) on each training case [KSH12]. After applying dropout, a hidden neuron cannot rely on the presence of particular other hidden neurons and has to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Accordingly, complex co-adaptation of neurons on the training data is prevented. Applying dropout can also be viewed as training many different networks and averaging their predictions. The trained weights need to be multiplied by p as an approximation to taking the geometric mean of the prediction distributions produced by the dropout networks. The dropout technique affects both training and inference: the training time is shortened because a simplified network is used for each training input, whereas the performance is improved because overfitting is mitigated. Note that there is no hardware overhead for applying dropout, since only the learned weights need to be adjusted for the SC-based DCNN.
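The weight adjustment mentioned above is the only place where dropout touches the SC hardware flow. The Python sketch below is our own illustration of that convention, under the chapter's formulation in which feature detectors are omitted with probability p and the trained weights are scaled by p (for the usual p = 0.5 this coincides with scaling by the retention probability); the function names are assumptions, not part of the design.

    import numpy as np

    rng = np.random.default_rng(2)

    def dropout_train_step(activations, p=0.5):
        """Training-time dropout (software only): zero each activation with probability p."""
        mask = rng.random(activations.shape) >= p
        return activations * mask

    def scale_weights_for_inference(trained_weights, p=0.5):
        """Inference-time adjustment applied before the weights are encoded as bit
        streams: scale the learned weights by p so the expected pre-activation
        approximates the average over the dropout sub-networks.
        No extra SC hardware is required for this step."""
        return trained_weights * p

    w_trained = rng.uniform(-1, 1, size=(10, 16))
    w_for_sc = scale_weights_for_inference(w_trained)  # these values feed the stream generators
    a = rng.uniform(0, 1, size=16)
    a_masked = dropout_train_step(a)                   # used only during software training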
11.4 Experimental Results

In this section, we present (i) a performance evaluation of the SC FEBs, (ii) a performance evaluation of the proposed SC-LRN design, and (iii) the impact of SC-LRN and dropout on the overall DCNN performance. The FEBs and DCNNs are described in Verilog and synthesized in Synopsys Design Compiler with the 45nm Nangate Library [nan09].

11.4.1 Performance Evaluation of the Proposed FEB Designs

For the overall FEB designs, the accuracy depends on the bit stream length and the input size. In order to see the effects of these factors, we analyze the inaccuracy over a wide range of input sizes and bit stream lengths for FEBs using 4-to-1 pooling (used in LeNet-5 [LJB+95]) and 9-to-1 pooling (deployed in AlexNet [KSH12]), as shown in Table 11.3 and Table 11.4, respectively. Note that the inaccuracy here is calculated with respect to the software results. One can observe from Tables 11.3 and 11.4 that increasing the bit stream length tends to reduce the absolute error of an FEB under a given input size. Besides, with the improved near-max pooling design, the FEB tends to become more accurate as the input size increases under a fixed bit stream length. This means that for the larger neurons deployed in large-scale DCNNs (i.e., FEBs with larger input sizes), the proposed FEB will be more accurate.

Table 11.3: Absolute error of FEBs with 4-to-1 pooling (commonly used in LeNet-5 [LJB+95]) under different bit stream lengths and input sizes.

Input Size     Bit Stream Length
                128     256     512     1024
16             0.154   0.152   0.152   0.151
32             0.115   0.112   0.110   0.112
64             0.086   0.084   0.084   0.080
128            0.071   0.066   0.064   0.066

Table 11.4: Absolute error of FEBs with 9-to-1 pooling (commonly used in AlexNet [KSH12]) under different bit stream lengths and input sizes.

Input Size     Bit Stream Length
                128     256     512     1024
16             0.120   0.121   0.117   0.119
32             0.073   0.064   0.065   0.065
64             0.051   0.037   0.031   0.033
128            0.042   0.026   0.021   0.018

The corresponding hardware area, power, and energy of the proposed FEB with 4-to-1 pooling are shown in Figure 11.7 (a), (b), and (c), respectively. As for the 9-to-1 pooling FEB, the hardware area, power, and energy are shown in Figure 11.8 (a), (b), and (c), respectively. Despite the reduced absolute error, the area, power, and energy of both FEBs increase as the input size increases. Note that max pooling is placed before ReLU in the FEB with 4-to-1 pooling, whereas pooling is placed after ReLU in the FEB with 9-to-1 pooling. This order of the pooling and ReLU operations follows the software implementations of the original LeNet-5 [LJB+95] and AlexNet [KSH12].

Figure 11.7: Input size versus (a) total power, (b) area, and (c) total energy for the FEB design (4-to-1 pooling).

Figure 11.8: Input size versus (a) total power, (b) area, and (c) total energy for the FEB design (9-to-1 pooling).

11.4.2 Performance Evaluation of the Proposed SC-LRN Designs

As for the SC-LRN, several parameters may cause accuracy degradation, i.e., n, \beta, k, and the bit stream length. The parameter settings used in software cannot be applied directly here due to the imprecision induced by the stochastic components. Hence, we run experiments to find the best setting of these parameters for the proposed SC-LRN. The first experiment relates the inaccuracy of the SC-LRN to the bit stream length and the number of adjacent neurons, and the results are illustrated in Figure 11.9 (a). Clearly, as n increases, the accuracy decreases continuously regardless of the bit stream length.
In addition, a longer bit stream yields higher accuracy, which indicates that the precision can be tuned by adjusting the bit stream length without any hardware modification. In the second experiment, we evaluate the effect of \beta and k on the overall accuracy of the SC-LRN over a wide range of \beta \in [0, 2] and k \in [1, 12], and the results are provided in Figure 11.9 (b). Because the behavior of (k + x)^{\beta} is not monotone across these parameter ranges, we discuss the inaccuracy based on the ranges of k and \beta. When \beta > 0.75, increasing the k value results in imprecision. If \beta < 0.25, the accuracy increases as k increases. For 0.25 <= \beta <= 0.75, the accuracy of the SC-LRN is barely affected by k.

Figure 11.9: Performance of the proposed LRN: (a) number of adjacent neurons versus absolute inaccuracy under different bit stream lengths and (b) different k values versus absolute inaccuracy under different \beta.

In the third experiment, we further compare the proposed SC-based LRN with a binary ASIC hardware LRN design. The parameter values chosen for these tests are as follows: n = 5, k = 2, and \beta = 0.75. Clearly, the number of bits in the fixed-point numbers affects both the hardware cost and the accuracy. To make the comparison fair, we adopt the minimum fixed-point width (8 bits) that yields a DCNN network accuracy almost identical to that of the software DCNN. Table 11.5 shows the performance comparison between the binary LRN and the SC-LRN. Compared with the binary ASIC LRN, the proposed SC-LRN achieves up to 5X, 9X, and 6X improvement in terms of power, energy, and area, respectively, indicating significant hardware savings with only a small accuracy deficit.

Table 11.5: SC-LRN versus Binary-LRN hardware cost

Type          Area (um^2)   Power (uW)   Energy (fJ)   Delay (ns)   Inaccuracy
SC-LRN            290.2         64.6          77.5         1.2          0.06
Binary-LRN       1786.7        333.5         700.4         2.1          0.04

11.4.3 Impact of SC-LRN and Dropout on the Overall DCNN Performance

In this section, we re-train AlexNet [KSH12] models on the ImageNet challenge [DDS+09] with four different configurations: 1) the original AlexNet (with both LRN and dropout), 2) AlexNet without dropout, 3) AlexNet without LRN, and 4) AlexNet without dropout and LRN. Please note that we do not pre-process the ImageNet data with data augmentation as suggested in [KSH12], and we scale the input image pixel values from [0, 255] to [0, 1] so that the inputs fed into the network lie in [-1, 1] after processing. Different software-based model accuracies on the test set are obtained for the four configurations. We then evaluate the SC-based inference accuracy with the SC-based components (including the proposed LRN and dropout) under a bit stream length of 1024 to show the application-level degradation from the trained models.

Table 11.6: AlexNet accuracy results

No.   Configuration                           Model Accuracy            Inference Accuracy
                                              Top-1 (%)   Top-5 (%)     Top-1 (%)   Top-5 (%)
1     Original with LRN & Dropout [KSH12]       57.63       81.35         56.49       80.47
2     w/o Dropout                               55.83       79.94         54.75       78.86
3     w/o LRN                                   55.97       80.53         54.94       79.73
4     w/o Dropout & LRN                         54.25       78.37         53.23       77.42

As shown in Table 11.6, we observe that with the proposed SC-LRN and dropout, the trained model achieves top-1 and top-5 accuracies as high as 57.63% and 81.35%, and the hardware inference accuracy reaches 56.49% and 80.47%, respectively. Please note that top-1 accuracy counts a prediction as correct when the label with the highest predicted probability exactly matches the ground truth, while top-5 accuracy counts it as correct when the ground truth falls within the five predicted labels with the highest probabilities. The top-1 and top-5 accuracy degradations of the hardware-based designs are about 1%, which is a small degradation from the software DCNNs.
Furthermore, we can see from No.2 and No.4 that the network degrades by only 0:1% 0:2% with SC-LRN; from No.3 and No.4, Dropout nearly shows no degradation compared with software trained model accuracies. Finally, the No.1 configuration with the proposed SC-LRN and Dropout achieves 3.26% top-1 accuracy improvement and 3.05% top-5 accuracy improvement compared with the No.4 configuration without these two essen- tial techniques, demonstrating the effectiveness of proposed normalization and dropout designs. 11.5 Conclusion We presented hardware implementations of normalization and dropout for SC-based DCNNs. First, FEBs in DCNNs were built with APC for inner product, an improved near-max pooling block for pooling, and SC-ReLU for activation. Then, a novel SC- based LRN design was proposed, comprising square and summation unit, activation, 183 and division units. The dropout technique was integrated in the training phase and the corresponding weights were adjusted for the hardware implementation. Experimental results on AlexNet validated the effectiveness of the proposed LRN and dropout design. 184 Chapter 12 Conclusion The central theme of this dissertation is to improve efficiency to advance resilient com- puting from two aspects: (i) fast evaluation of the error rate caused by radiation-induced soft error, and (ii) cloud resource allocation by utilizing Deep Neural Networks (DNNs) as well as design and optimization of Deep Convolutional Neural Networks (DCNNs) using Stochastic Computing (SC). The proposed works have broad impact on (i) cost effective evaluation of Soft Error Rate (SER) for combinational and sequential circuits and (ii) bringing the success of DNNs and DCNNs to power management of cloud plat- forms and emerging IoT and wearable devices. Key contributions in this thesis are listed as follows: Accelerated SER estimation for combinational circuits, achieving up to 560.2X times speedup with less than 3% difference in terms of SER results compared with the baseline algorithm. Schematic and layout co-simulation approach for Multiple Cell Upset (MCU) modeling, which effectively captures the SER contributed by the radiation- hardened storage elements. Fast and comprehensive SER evaluation framework that takes 119.23s to evaluate the largest ISCAS89 benchmark circuit with 3,000 flip-flops and 17,000 gates. DRL-Cloud for power management in data centers that achieves up to 320% energy cost efficiency improvement, compared with the state-of-the-art energy efficient algorithms. 185 SC-Based DCNN block design, that achieves 55X, 151X, and 2X improvement in terms of area, power and energy, respectively, while the error is increased by 2.86%, compared with the conventional binary ASIC implementation. SC-DCNN with ultra-low hardware footprint and low power (energy) consump- tions: the LeNet5 implemented in SC-DCNN consumes only 17 mm 2 area and 1.53 W power, achieves throughput of 781,250 images/s, area efficiency of 45,946 images/s/mm 2 , and energy efficiency of 510,734 images/J. Non-linear activation for SC-DCNNs that achieves up to 21X and 41X of the area, 41X and 72X of the power, and 198,200X and 96,443X of the energy for the LeNet-5 implementation, compared with CPU and GPU approaches, respectively, while the error is increased by less than 3.07%. Softmax Regression (SR) for SC-DCNNs, that can reach the same level of accu- racy with the improvement of 295X, 62X, 2,617X in terms of power, area and energy, respectively, compared with a binary SR under long bit stream. 
A novel SC-based normalization design and the integration of dropout tech- nique that achieve 3.26% top-1 accuracy improvement and 3.05% top-5 accuracy improvement on AlexNet with the ImageNet dataset, compared with the SC-based DCNN without these two essential techniques. In what follows, I try to identify a few of the obstacles remaining in resilient com- puting. The first issue is Quality-of-Service (QoS) aware fine-grained online scheduling in DRL-cloud. The current DRL-cloud schedules tasks to different hours on the fly. If we want to have a fine-grained scheduling, e.g., schedule to different minutes, the DQN system needs to be divided into more stages with a cohesive training method. On the other hand, the framework needs to be aware of the priority of tasks on the fly, 186 where task migration technique can be applied. The second issue is the tapeout of the SC-DCNN. Currently, most of the SC-DCNN results are simulation results. We have a tapeout of several SC-based neurons, and the feasibility and field performance of the proposed SC-based DCNNs need larger scale of tapeout for testing and verification. 187 Reference List [ACRB16] Renzo Andri, Lukas Cavigelli, Davide Rossi, and Luca Benini. Yodann: An ultra-low power convolutional neural network accelerator based on binary weights. arXiv preprint arXiv:1606.05487, 2016. [AEGT16] JJ Allaire, Dirk Eddelbuettel, Nick Golding, and Yuan Tang. tensorflow: R Interface to TensorFlow, 2016. [AH13] Armin Alaghi and John P Hayes. Survey of stochastic computing. ACM Transactions on Embedded computing systems (TECS), 12(2s):92, 2013. [AHSB14] Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014. [ALPO + 15] Arash Ardakani, Franc ¸ois Leduc-Primeau, Naoya Onizawa, Takahiro Hanyu, and Warren J Gross. Vlsi implementation of deep neural network using integral stochastic computing. arXiv preprint arXiv:1509.08972, 2015. [ARG15] Jamal Abushnaf, Alexander Rassau, and Włodzimierz G´ ornisiewicz. Impact of dynamic energy pricing schemes on a novel multi-user home energy management system. Electric power systems research, 125:124– 132, 2015. [ASC + 15] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon Nam, Brian Taba, Michael Beakes, Bernard Brezzo, Jente B. Kuang, Rajit Manohar, William P. Risk, Bryan Jackson, and Dharmen- dra S. Modha. Truenorth: Design and tool flow of a 65 mw 1 mil- lion neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(10):1537– 1557, 2015. 188 [AvSE + 14] Hussam Amrouch, Victor M van Santen, Thomas Ebi, V olker Wenzel, and J¨ org Henkel. Towards interdependencies of aging mechanisms. In 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 478–485. IEEE, 2014. [AWM + 06] Oluwole Amusan, Arthur F Witulski, Lloyd W Massengill, Bharat L Bhuva, Patrick R Fleming, Michael L Alles, Andrew L Sternberg, Jef- frey D Black, Ronald D Schrimpf, et al. Charge collection and charge sharing in a 130 nm cmos technology. Nuclear Science, IEEE Transac- tions on, 53(6):3253–3258, 2006. [BBB + 11] James Bergstra, Fr´ ed´ eric Bastien, Olivier Breuleux, Pascal Lamblin, Raz- van Pascanu, Olivier Delalleau, Guillaume Desjardins, David Warde- Farley, Ian Goodfellow, Arnaud Bergeron, et al. Theano: Deep learning on gpus with python. 
In NIPS 2011, BigLearning Workshop, Granada, Spain. Citeseer, 2011. [BC01a] Bradley D Brown and Howard C Card. Stochastic neural computation. i. computational elements. IEEE Transactions on computers, 50(9):891– 905, 2001. [BC01b] Bradley D Brown and Howard C Card. Stochastic neural computation. ii. soft competitive learning. IEEE Transactions on Computers, 50(9):906– 920, 2001. [BCH13] Luiz Andr´ e Barroso, Jimmy Clidaras, and Urs H¨ olzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 8(3):1–154, 2013. [BD95] Steven J Bradtke and Michael O Duff. Reinforcement learning methods for continuous-time markov decision problems. In Advances in neural information processing systems, pages 393–400, 1995. [Ben09] Yoshua Bengio. Learning deep architectures for ai. Foundations and trends R in Machine Learning, 2(1):1–127, 2009. [CDLQ14] Shuming Chen, Yankang Du, Biwei Liu, and Junrui Qin. Calculating the soft error vulnerabilities of combinational circuits by re-considering the sensitive area. IEEE Trans Nucl Sci, 61:646–653, 2014. [CGG + 14] Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. Toward exascale resilience: 2014 update. Supercomputing frontiers and innovations, 1(1):5–28, 2014. 189 [CKES16] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neu- ral networks. IEEE Journal of Solid-State Circuits, 2016. [CLD + 17] Huimei Cheng, Ji Li, Jeffrey Draper, Shahin Nazarian, and Yanzhi Wang. Deadline-aware joint optimization of sleep transistor and supply voltage for finfet based embedded systems. In Proceedings of the on Great Lakes Symposium on VLSI 2017, pages 427–430. ACM, 2017. [CLL + 14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. Dadiannao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 609–622. IEEE Computer Society, 2014. [CLN + 16] Tiansong Cui, Ji Li, Shahin Nazarian, Massoud Pedram, et al. An explo- ration of applying gate-length-biasing techniques to deeply-scaled finfets operating in multiple voltage regimes. IEEE Transactions on Emerging Topics in Computing, 2016. [CLN18] Mingxi Cheng, Ji Li, and Shahin Nazarian. Drl-cloud: Deep reinforce- ment learning-based resource provisioning and task scheduling for cloud service providers. In Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018. [CLS + 16] Tiansong Cui, Ji Li, Alireza Shafaei, Shahin Nazarian, and Massoud Pedram. An efficient timing analysis model for 6t finfet sram using current-based method. In 2016 17th International Symposium on Qual- ity Electronic Design (ISQED), pages 263–268. IEEE, 2016. [CMM + 11] Dan C Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and J¨ urgen Schmidhuber. Flexible, high performance convolutional neu- ral networks for image classification. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1237, 2011. [CS10] Youngmin Cho and Lawrence K Saul. Large-margin classification in infi- nite neural networks. Neural computation, 22(10):2678–2697, 2010. [cs216] Stanford cs class, cs231n: Convolutional neural networks for visual recog- nition, 2016. [CW08] Ronan Collobert and Jason Weston. 
A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167. ACM, 2008. 190 [ddr14] 1Gb X4 X8 X16 DDR3 SDRAM, Micro Inc., 2014. [DDS + 09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. [Del14] Pierre Delforge. Americas data centers consuming and wasting growing amounts of energy. Natural Resource Defence Councle, 2014. [Den12] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. [DJV + 14] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, pages 647–655, 2014. [DLW + 17] Caiwen Ding, Ning Liu, Yanzhi Wang, Ji Li, Soroush Heidari, Jingtong Hu, and Yongpan Liu. Multisource indoor energy harvesting for non- volatile processors. IEEE Design & Test, 34(3):42–49, 2017. [DLZ + 17] Caiwen Ding, Ji Li, Weiwei Zheng, Naehyuck Chang, Xue Lin, and Yanzhi Wang. Algorithm accelerations for luminescent solar concentrator- enhanced reconfigurable onboard photovoltaic system. In Design Automa- tion Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pages 318–323. IEEE, 2017. [DY14] Li Deng and Dong Yu. Deep learning. Signal Processing, 7:3–4, 2014. [EACC13] Adrian Evans, Dan Alexandrescu, Enrico Costenaro, and Liang Chen. Hierarchical rtl-based combinatorial ser estimation. In On-Line Test- ing Symposium (IOLTS), 2013 IEEE 19th International, pages 139–144. IEEE, 2013. [EAM + 15] Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. Backpropagation for energy-efficient neuro- morphic computing. In Advances in Neural Information Processing Sys- tems, pages 1117–1125, 2015. [EET + 15] Mojtaba Ebrahimi, Adrian Evans, Mehdi B Tahoori, Enrico Costenaro, Dan Alexandrescu, Vikas Chandra, and Razi Seyyedi. Comprehensive analysis of sequential and combinational soft errors in an embedded pro- cessor. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 34(10):1586–1599, 2015. 191 [ele] Real-time hourly market electrical price. [EMA + 16] Steven K. Esser, Paul A. Merolla, John V . Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jef- frey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmen- dra S. Modha. Convolutional networks for fast, energy-efficient neuro- morphic computing. CoRR, abs/1603.08270, 2016. [ESCT15] Mojtaba Ebrahimi, Razi Seyyedi, Liang Chen, and Mehdi B Tahoori. Event-driven transient error propagation: A scalable and accurate soft error rate estimation approach. In Design Automation Conference (ASP- DAC), 2015 20th Asia and South Pacific, pages 743–748. IEEE, 2015. [Far10] Ahmad Faruqui. The ethics of dynamic pricing. The Electricity Journal, 23(6):13–27, 2010. [FCMG13] Veronique Ferlet-Cavrois, Lloyd W Massengill, and Pascale Gouker. Sin- gle event transients in digital cmos-a review. Nuclear Science, IEEE Transactions on, 60(3):1767–1790, 2013. [FPME07] Mahdi Fazeli, Ahmad Patooghy, Seyed Ghassem Miremadi, and Alireza Ejlali. 
Feedback redundancy: a power efficient seu-tolerant latch design for deep sub-micron technologies. In Dependable Systems and Net- works, 2007. DSN’07. 37th Annual IEEE/IFIP International Conference on, pages 276–285. IEEE, 2007. [FSK15] Jun Furuta, Eiji Sonezaki, and Kazutoshi Kobayashi. Radiation hardness evaluations of 65 nm fully depleted silicon on insulator and bulk processes by measuring single event transient pulse widths and single event upset rates. Japanese Journal of Applied Physics, 54(4S):04DC15, 2015. [Gai67] Brian R Gaines. Stochastic computing. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 149–156. ACM, 1967. [GBB + 12] Xavier Gili, Salvador Barcel´ o, Sebasti` a Bota, Jaume Segura, et al. Analyt- ical modeling of single event transients propagation in combinational logic gates. Nuclear Science, IEEE Transactions on, 59(4):971–979, 2012. [GBR + 12] Al Geist, Shekhar Borkar, Eric Roman, Bert Still, Robert Clay SNL, John Wu, Christian Engelmann, Nathan DeBardeleben, Larry Kaplan, Martin Schulz, et al. Us department of energy fault management workshop. In Workshop report submitted to the US Department of Energy, 2012. [goota] Google cluster data. 192 [GWGP13] Yue Gao, Yanzhi Wang, Sandeep K Gupta, and Massoud Pedram. An energy and deadline aware resource provisioning, scheduling and opti- mization framework for cloud systems. In Hardware/Software Codesign and System Synthesis, pages 1–10. IEEE, 2013. [HHW15] Ryan H-M Huang, Dennis K-H Hsu, and Charles H-P Wen. A determinate radiation hardened technique for safety-critical cmos designs. Journal of Electronic Testing, 31(2):181–192, 2015. [HLC + 14] Miao Hu, Hai Li, Yiran Chen, Qing Wu, Garrett S Rose, and Richard W Linderman. Memristor crossbar-based neuromorphic computing system: A case study. IEEE transactions on neural networks and learning systems, 25(10):1864–1878, 2014. [HLH15] Zhengfeng Huang, Huaguo Liang, and Sybille Hellebrand. A high perfor- mance seu tolerant latch. Journal of Electronic Testing, 31(4):349–359, 2015. [HLLC14] Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Advances in neural information processing systems, pages 2042–2050, 2014. [HLM + 16] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A Horowitz, and William J Dally. Eie: efficient inference engine on com- pressed deep neural network. arXiv preprint arXiv:1602.01528, 2016. [HSK + 12] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012. [HW14] Ryan H-M Huang and Charles H-P Wen. Advanced soft-error-rate (ser) estimation with striking-time and multi-cycle effects. In Design Automa- tion Conference (DAC), 2014 51st ACM/EDAC/IEEE, pages 1–6. IEEE, 2014. [HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classifi- cation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015. [IBY + 07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fet- terly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS operating systems review, volume 41, pages 59–72. ACM, 2007. 193 [JG03] Niraj K Jha and Sandeep Gupta. Testing of digital systems. Cambridge University Press, 2003. 
[JRML15] Yuan Ji, Feng Ran, Cong Ma, and David J Lilja. A hardware implemen- tation of a radial basis function neural network using stochastic logic. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pages 880–883. EDA Consortium, 2015. [JSD + 14] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014. [KKM + 14] Kaoru Kobayashi, K Kubota, Masahiro Masuda, Y Manzawa, J Furuta, S Kanda, and Hidetoshi Onodera. A low-power and area-efficient radiation-hard redundant flip-flop, dice acff, in a 65 nm thin-box fd-soi. Nuclear Science, IEEE Transactions on, 61(4):1881–1888, 2014. [KKY + 16] Kyounghoon Kim, Jungki Kim, Joonsang Yu, Jungwoo Seo, Jongeun Lee, and Kiyoung Choi. Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks. In Proceedings of the 53rd Annual Design Automation Conference, page 124. ACM, 2016. [KLC15] Kyounghoon Kim, Jongeun Lee, and Kiyoung Choi. Approximate de- randomizer for stochastic circuits. Proc. ISOCC, 2015. [KLC16] Kyounghoon Kim, Jongeun Lee, and Kiyoung Choi. An energy-efficient random number generator for stochastic circuits. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pages 256–261. IEEE, 2016. [KMH12] Smita Krishnaswamy, Igor L Markov, and John P Hayes. Design, Analy- sis and Test of Logic Circuits Under Uncertainty, volume 115. Springer Science & Business Media, 2012. [KOTN14] Saman Kiamehr, Thomas Osiecki, Mehdi Tahoori, and Sani Nassif. Radiation-induced soft error analysis of srams in soi finfet technology: A device to circuit approach. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6. ACM, 2014. [KR11] Natalja Kehl and Wolfgang Rosenstiel. An efficient ser estimation method for combinational circuits. Reliability, IEEE Transactions on, 60(4):742– 747, 2011. 194 [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas- sification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. [KTS + 14] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with con- volutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014. [LBBH98] Yann LeCun, L´ eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient- based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. [LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. [LD16a] Ji Li and Jeffrey Draper. Accelerating soft-error-rate (ser) estimation in the presence of single event transients. In Proceedings of the 53rd Annual Design Automation Conference, page 55. ACM, 2016. [LD16b] Ji Li and Jeffrey Draper. Joint soft-error-rate (ser) estimation for com- binational logic and sequential elements. In VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on, pages 737–742. IEEE, 2016. [LD17] Ji Li and Jeffrey Draper. Accelerated soft-error-rate (ser) estimation for combinational and sequential circuits. ACM Transactions on Design Automation of Electronic Systems (TODAES), 22(3):57, 2017. [LeC15] Yann LeCun. Lenet-5, convolutional neural networks. 
URL: http://yann. lecun. com/exdb/lenet, 2015. [len16] Convolutional neural networks (lenet), 2016. [LJB + 95] Yann LeCun, LD Jackel, Leon Bottou, A Brunot, Corinna Cortes, JS Denker, Harris Drucker, I Guyon, UA Muller, Eduard Sackinger, et al. Comparison of learning algorithms for handwritten digit recognition. In International conference on artificial neural networks, volume 60, pages 53–60. Perth, Australia, 1995. [LKMO06] Daniel Larkin, Andrew Kinane, Valentin Muresan, and Noel OConnor. An efficient hardware architecture for a neural network activation function generator. In International Symposium on Neural Networks, pages 1319– 1327. Springer, 2006. 195 [LLNP17] Ji Li, Xue Lin, Shahin Nazarian, and Massoud Pedram. Cts2m: concurrent task scheduling and storage management for residential energy consumers under dynamic energy pricing. IET Cyber-Physical Systems: Theory & Applications, 2017. [LLQ + 12] Peng Li, David J Lilja, Weikang Qian, Kia Bazargan, and Marc Riedel. The synthesis of complex arithmetic computation on stochastic bit streams using sequential logic. In Proceedings of the International Conference on Computer-Aided Design, pages 480–487. ACM, 2012. [LLQ16] Yilan Li, Zhe Li, and Qinru Qiu. Assisting fuzzy offline handwriting recognition using recurrent belief propagation. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8, Dec 2016. [LLX + 17] Ning Liu, Zhe Li, Jielong Xu, Zhiyuan Xu, Sheng Lin, Qinru Qiu, Jian Tang, and Yanzhi Wang. A hierarchical framework of cloud resource allocation and power management using deep reinforcement learning. In Distributed Computing Systems (ICDCS), 2017 IEEE 37th International Conference on, pages 372–382. IEEE, 2017. [LLY + 17] Hongjia Li, Ji Li, Wang Yao, Shahin Nazarian, Xue Lin, and Yanzhi Wang. Fast and energy-aware resource provisioning and task schedul- ing for cloud systems. In Quality Electronic Design (ISQED), 2017 18th International Symposium on, pages 174–179. IEEE, 2017. [LR12] Daniel B Limbrick and William H Robinson. Characterizing single event transient pulse widths in an open-source cell library using spice. In IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), 2012. [LRL + 16] Zhe Li, Ao Ren, Ji Li, Qinru Qiu, Yanzhi Wang, and Bo Yuan. Dscnn: Hardware-oriented optimization for stochastic computing based deep con- volutional neural networks. In Computer Design (ICCD), 2016 IEEE 34th International Conference on, pages 678–681. IEEE, 2016. [LRL + 17a] Ji Li, Ao Ren, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, and Yanzhi Wang. Towards acceleration of deep convolutional neural networks using stochastic computing. In Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pages 115–120. IEEE, 2017. [LRL + 17b] Zhe Li, Ao Ren, Ji Li, Qinru Qiu, Bo Yuan, Jeffrey Draper, and Yanzhi Wang. Structural design optimization for deep convolutional neural net- works using stochastic computing. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 250–253. IEEE, 2017. 196 [LSN12] Endre L´ aszl´ o, P´ eter Szolgay, and Zolt´ an Nagy. Analysis of a gpu based cnn implementation. In 2012 13th International Workshop on Cellular Nanoscale Networks and their Applications, pages 1–5. IEEE, 2012. [LWC + 14] Ji Li, Yanzhi Wang, Tiansong Cui, Shahin Nazarian, and Massoud Pedram. Negotiation-based task scheduling to minimize users electric- ity bills under dynamic energy prices. In OnlineGreencomm, pages 1–6. IEEE, 2014. 
Abstract
This thesis is dedicated to improving the efficiency of resilient computing through two parallel approaches: a classic approach for conventional circuits and a novel approach involving emerging resilient systems.
❧ The first part of this thesis focuses on one of the most important problems in resilient computing: evaluating the impact of radiation-induced soft errors, a major threat to the resilience of modern electronic systems. A fast and comprehensive Soft Error Rate (SER) evaluation framework is developed for conventional computing circuits in three steps. The first step is an accelerated SER estimation algorithm for combinational logic, which speeds up the most computationally expensive process of the framework, the propagation of Single-Event Transient (SET) pulses, by using dynamically maintained lookup tables (LUTs). Simulation results demonstrate a 560.2X speedup with less than 3% difference in SER results compared with the baseline algorithm. With the aggressive down-scaling of process technology, a single particle strike can induce multiple upsets due to charge sharing and the parasitic bipolar effect, a phenomenon known as Multiple Cell Upsets (MCUs). Hence, the second step integrates MCU modeling into the framework through a schematic and layout co-simulation method. The third step introduces an efficient time frame expansion method for analyzing feedback loops in sequential logic. Simulation results show that the presented SER evaluation framework can analyze the largest ISCAS89 benchmark circuit, with more than 3,000 flip-flops and 17,000 gates, in 119.23 s.
❧ The second part of the thesis focuses on emerging resilient Deep Convolutional Neural Network (DCNN) and Deep Neural Network (DNN) systems, which completely tolerate radiation-induced soft errors. The thesis proposes a deep reinforcement learning-based framework (DRL-Cloud) that uses a resilient hierarchical DNN architecture to solve the cloud resource allocation problem, which traditional methods cannot solve efficiently when the problem scale is large. With training techniques such as a target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement compared with state-of-the-art energy-efficient algorithms. The thesis then adopts Stochastic Computing (SC) technology to achieve significantly improved area, power, and energy efficiency, in order to bring the resilient DCNN architecture to resource-constrained Internet-of-Things (IoT) and wearable devices. Basic operational blocks in DCNNs are first designed and optimized; the entire DCNN is then designed with joint optimizations of the feature extraction blocks and optimized weight storage schemes. The LeNet-5 implemented as an SC-based DCNN achieves 55X, 151X, and 2X improvements in area, power, and energy, respectively, while the error increases by 2.86%, compared with a conventional binary ASIC implementation. A nonlinear activation unit is designed for SC-DCNNs, achieving up to 21X and 41X improvements in area, 41X and 72X in power, and 198,200X and 96,443X in energy for the LeNet-5 implementation, compared with CPU and GPU approaches, respectively, while the error increases by less than 3.07%.
Finally, a softmax regression function is designed for SC-DCNNs that reaches the same level of accuracy as the binary version under long bit streams, with improvements of 295X, 62X, and 2,617X in power, area, and energy, respectively.
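To make the LUT-based acceleration described in the first part more concrete, the following is a minimal Python sketch of the general idea of caching SET pulse-propagation results in a dynamically maintained lookup table. It is not the dissertation's algorithm: the toy netlist, the per-gate attenuation values, and the simplified pulse model are illustrative assumptions only.

# Minimal sketch (not the dissertation's algorithm): caching SET pulse
# propagation results in a dynamically maintained lookup table so that
# repeated propagations through the same gate with the same pulse width
# are not recomputed. Gate topology, attenuation values, and the pulse
# model are illustrative assumptions only.

from functools import lru_cache

# Toy netlist: gate name -> (gate type, list of fanout gate names)
NETLIST = {
    "g1": ("NAND", ["g3"]),
    "g2": ("NOR",  ["g3"]),
    "g3": ("INV",  []),        # primary output
}

# Illustrative per-gate electrical attenuation of a transient pulse (ps)
ATTENUATION_PS = {"NAND": 8.0, "NOR": 10.0, "INV": 5.0}

@lru_cache(maxsize=None)          # the dynamically grown lookup table
def propagate(gate, pulse_width_ps):
    """Return the set of (endpoint gate, surviving pulse width) pairs
    reachable from `gate` for an injected pulse of the given width."""
    gtype, fanouts = NETLIST[gate]
    out_width = pulse_width_ps - ATTENUATION_PS[gtype]
    if out_width <= 0:            # pulse electrically masked
        return frozenset()
    if not fanouts:               # reached a primary output / flip-flop
        return frozenset({(gate, out_width)})
    reached = set()
    for nxt in fanouts:
        reached |= propagate(nxt, out_width)
    return frozenset(reached)

if __name__ == "__main__":
    print(propagate("g1", 40.0))  # first call computes and caches
    print(propagate("g1", 40.0))  # second call is a table hit

The first call computes and stores each (gate, pulse width) result; later propagations with the same arguments become table hits, which is the general mechanism behind the speedup reported in the abstract.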
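The DRL-Cloud training techniques named in the abstract (experience replay, a target network, and epsilon-greedy exploration/exploitation) follow the standard DQN recipe. The toy sketch below shows only how those pieces fit together; the random environment, the small table standing in for the hierarchical DNN, and all hyperparameters are illustrative assumptions, not the dissertation's implementation.

# Toy sketch of DQN-style training: experience replay, a periodically
# synced target network, and epsilon-greedy exploration/exploitation.
# The environment, the tabular Q stand-in, and the hyperparameters are
# illustrative only.

import random
from collections import deque
import numpy as np

N_STATES, N_ACTIONS = 8, 3
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1
BUFFER_SIZE, BATCH_SIZE, SYNC_EVERY = 1000, 32, 50

q_online = np.zeros((N_STATES, N_ACTIONS))   # stands in for the online network
q_target = q_online.copy()                   # stands in for the target network
replay = deque(maxlen=BUFFER_SIZE)           # experience replay buffer
rng = random.Random(0)

def act(state):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if rng.random() < EPSILON:
        return rng.randrange(N_ACTIONS)
    return int(np.argmax(q_online[state]))

def step(state, action):
    """Toy environment transition: random next state, random reward."""
    return rng.randrange(N_STATES), rng.random()

state = 0
for t in range(1, 2001):
    action = act(state)
    next_state, reward = step(state, action)
    replay.append((state, action, reward, next_state))
    state = next_state

    if len(replay) >= BATCH_SIZE:
        for s, a, r, s2 in rng.sample(list(replay), BATCH_SIZE):
            # Bootstrap target uses the frozen target copy, not q_online
            target = r + GAMMA * np.max(q_target[s2])
            q_online[s, a] += ALPHA * (target - q_online[s, a])

    if t % SYNC_EVERY == 0:                  # periodic target-network sync
        q_target = q_online.copy()

print(np.round(q_online, 2))

Sampling updates from the replay buffer breaks the correlation between consecutive experiences, and bootstrapping against a frozen target copy keeps the regression target stable between syncs; both are standard reasons these techniques aid convergence.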
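The stochastic computing blocks mentioned above rest on two standard identities: in the unipolar format, multiplying two independent bitstreams is a bitwise AND, and scaled addition ((a+b)/2) is a multiplexer driven by a 0.5-probability select stream. The sketch below illustrates only these textbook building blocks, not the dissertation's optimized designs; the stream length and seed are arbitrary choices.

# Minimal sketch of stochastic computing (SC) building blocks in the
# unipolar format, where a value x in [0, 1] is encoded as the fraction
# of 1s in a random bitstream. Stream length and seed are illustrative.

import random

def to_stream(x, length, rng):
    """Encode x in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < x else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a unipolar bitstream back to a probability estimate."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Unipolar SC multiply: AND of two independent streams."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def sc_scaled_add(a_bits, b_bits, sel_bits):
    """Scaled add via MUX: output approximates (a + b) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, sel_bits)]

if __name__ == "__main__":
    rng = random.Random(1)
    n = 4096
    a, b = to_stream(0.6, n, rng), to_stream(0.5, n, rng)
    sel = to_stream(0.5, n, rng)
    print("a*b      ~", from_stream(sc_multiply(a, b)))         # ~0.30
    print("(a+b)/2  ~", from_stream(sc_scaled_add(a, b, sel)))  # ~0.55

Because each arithmetic operation reduces to a single gate or multiplexer on serial bitstreams, SC implementations trade longer latency (bit-stream length) for the large area and power savings quoted in the abstract.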
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Hardware techniques for efficient communication in transactional systems
A framework for soft error tolerant SRAM design
Multi-level and energy-aware resource consolidation in a virtualized cloud computing system
Optimal defect-tolerant SRAM designs in terms of yield-per-area under constraints on soft-error resilience and performance
SLA-based, energy-efficient resource management in cloud computing systems
Introspective resilience for exascale high-performance computing systems
Stochastic dynamic power and thermal management techniques for multicore systems
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Thermal analysis and multiobjective optimization for three dimensional integrated circuits
Integration of energy-efficient infrastructures and policies in smart grid
A logic partitioning framework and implementation optimizations for 3-dimensional integrated circuits
Energy-efficient shutdown of circuit components and computing systems
Towards efficient edge intelligence with in-sensor and neuromorphic computing: algorithm-hardware co-design
Variation-aware circuit and chip level power optimization in digital VLSI systems
Advanced cell design and reconfigurable circuits for single flux quantum technology
Resiliency-aware scheduling
Energy-efficient computing: Datacenters, mobile devices, and mobile clouds
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Towards green communications: energy efficient solutions for the next generation cellular mobile communication systems
Adaptive and resilient stream processing on cloud infrastructure
Asset Metadata
Creator
Li, Ji (author)
Core Title
Improving efficiency to advance resilient computing
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
02/02/2018
Defense Date
11/21/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
algorithm, cloud computing, deep learning, deep reinforcement learning, OAI-PMH Harvest, resilience, soft error, stochastic computing
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Draper, Jeffrey (committee chair), Nazarian, Shahin (committee chair), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member)
Creator Email
jiangsuliji@gmail.com,jli724@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-466859
Unique identifier
UC11266934
Identifier
etd-LiJi-5980.pdf (filename),usctheses-c40-466859 (legacy record id)
Legacy Identifier
etd-LiJi-5980.pdf
Dmrecord
466859
Document Type
Dissertation
Rights
Li, Ji
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA